Chess GM Jan Ludvig Hammer begins to use Leela Chess Zero here:
https://www.youtube.com/watch?v=TxiNUPK ... gs=pl%2Cwn

I find this video painful to watch because Hammer is struggling with the software. People are helping him, but he is plainly frustrated.
One thing that he complains about is that he is unable (at least at the moment) to teach Leela Chess Zero about the game he is analyzing by entering lines of play, something that he does with other chess engines. In particular, there is a move that Leela does not find, but when he plays it, Leela realizes that it is better than Leela's top choice. But when he backs the game up, Leela Chess Zero does not change its evaluation of the previous position. I suppose that this is a feature of Leela Chess Zero, and I am not going to complain about it myself.
However, in the midst of his explorations of the software, he makes an observation that resonates with me. He does not care whether the software plays better than other software (I do, though); he wants to use it for analysis and review. If you are trying to understand a particular game or variation, and your software does not learn along with you, that limits its value for that purpose.
As far as go bots are concerned, I think that they still have a lot of room for improvement, and getting them to play as well as they can is an important goal. At some time we are likely to reach a point of diminishing returns, but we do not seem to be near that point yet. Let us forge ahead.

But people are starting to use bots for review and analysis, tasks for which they were not designed. One feature that Hammer wants is for the evaluation of plays or variations that the human enters to propagate up the game tree. We know that life and death is a relative weakness in current go bots. If a human, even an SDK, shows the program that at a certain node in the game tree there is a play that the program missed that kills or saves a group, then that fact ought to affect the program's earlier decisions. If the new evaluation does not propagate up the tree, that will not happen.
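To make the idea concrete, here is a minimal sketch in Python of what "propagating up the tree" could mean: after a human overrides the evaluation of a node, every ancestor is re-derived from its children. The node class and the negamax-style backup rule are my own illustration, not how Leela or any other bot actually represents its search.
Code:
# Hypothetical sketch: push a human-supplied evaluation back up a game tree.
# Winrates are stored from the point of view of the side to move at each node.

class Node:
    def __init__(self, move, parent=None, winrate=0.5):
        self.move = move
        self.parent = parent
        self.children = []
        self.winrate = winrate

    def add_child(self, move, winrate):
        child = Node(move, parent=self, winrate=winrate)
        self.children.append(child)
        return child

def propagate_up(node):
    """After a node's winrate is corrected (say, the human shows a kill the
    bot missed), re-derive each ancestor's winrate from its children."""
    while node.parent is not None:
        node = node.parent
        # The side to move picks its best child; from its point of view the
        # winrate of a child position is 1 minus the child's stored winrate.
        node.winrate = max(1.0 - c.winrate for c in node.children)

root = Node("root", winrate=0.55)                      # bot's current view
a = root.add_child("bot's top choice", winrate=0.45)   # opponent to move
b = root.add_child("human's move", winrate=0.60)       # bot thinks this is worse
b.winrate = 0.10                                       # human shows the group dies
propagate_up(b)
print(root.winrate)                                    # 0.9: the correction reaches the root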
Currently these programs are used in reviews to compare different plays, to show people where they made a mistake. The bots use winrate percentages to evaluate positions and plays. How much worse, in percentage terms, does a human's play have to be than the bot's top pick for it to count as a mistake? (OC, we cannot be sure that the bot's top pick is best, but that's another story, for now.) For some people, it seems that a difference of less than 1% is enough; for others it takes a difference of 1%, for others 2%, for some 4%. But we are all guessing.
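For illustration, the criterion people are applying comes down to something like the sketch below; the function and the threshold values are just the guesses described above, not anything that an actual review tool is known to implement.
Code:
# Hypothetical sketch of the usual review criterion: a play is called a
# mistake when its winrate is more than some threshold below the bot's top pick.

def is_mistake(bot_best_winrate, human_play_winrate, threshold=0.02):
    """Return True if the human's play counts as a mistake under the threshold."""
    return (bot_best_winrate - human_play_winrate) > threshold

# The verdict depends entirely on which threshold we happen to pick:
for t in (0.005, 0.01, 0.02, 0.04):
    print(t, is_mistake(0.52, 0.495, threshold=t))   # True, True, True, False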

What we would like to know are the error rates and ranges of the evaluations. Bots are trained on millions of self-play games. Those games should provide enough data to generate error terms for the winrates. But the error terms are not generated, because accurate evaluation is not the goal of the programs; winning games is. And simply picking the play with the best evaluation is not how modern bots work. They are more complicated than that. Changes that you might think would help a bot play better may actually make it play worse. But like Hammer, when I am analyzing a game or position, I am not concerned with how well the software plays in general; I am concerned with evaluating a specific game, position, or play.
Recently I saw a position at the end of the game where Zen7 evaluated a pass by Black as giving White a 61% chance of winning the game. (Edit: See viewtopic.php?p=233790#p233790 ) It seemed obvious to me that the pass was correct, indeed the only winning choice, since it was a 0.5 pt. win. It was easy to show that Black could defend against White's threat, at least for an amateur dan player, and probably for many SDKs as well. At the 10 kyu level, Zen's evaluation might be right. But in analyzing a game I do not need 10 kyu help, thank you very much.

In a position where play was essentially over, a top bot's evaluation was off by 61%.

Obviously, Zen7 does not go around giving 10 kyu advice. But in this specific case it was horribly wrong. And in doing a review or analysis, it is specific cases that we are interested in. Evaluations made by a program whose goal is accurate evaluation may still be wrong, but such a program could at least tell us how much confidence to place in each of its evaluations.
Jan.van.Rongen wrote:
That also means that the "precision" which you defined elsewhere as being proportional to 1/sqrt(N), where N is the number of simulations for that move, is incorrect. These simulations are not at all independent.
Lacking error estimates, I can at least compare the precision of evaluations in terms of 1/sqrt(Playouts). For want of anything better. If anyone would like to provide good error estimates for winrates, that would be great! Meanwhile, I'll make do.
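Spelled out, the stopgap is just this (and, as Jan points out, treating playouts as independent samples is precisely what is questionable about it):
Code:
from math import sqrt

def precision(playouts):
    """Crude precision guess for a winrate: +/- 1/sqrt(N), N = playouts.
    Playouts are not independent samples, so this is at best a rough
    lower bound on the real uncertainty."""
    return 1.0 / sqrt(playouts)

print(precision(10000))   # 0.01, i.e. about +/- 1% after 10,000 playouts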
Jan.van.Rongen wrote:
Assessing this situation after move 50 with Leela Zero network #155 (180 seconds) gives a remarkable result.
Quote:
Run 1:
M17 -> 36697 (V: 38.45%) (N: 22.47%)
L17 -> 592 (V: 34.83%) (N: 13.83%)
K16 -> 580 (V: 36.40%) (N: 7.79%)
B13 -> 472 (V: 34.26%) (N: 12.73%)
J14 -> 448 (V: 34.04%) (N: 12.64%)

Run 2:
M17 -> 36773 (V: 39.23%) (N: 7.07%)
J14 -> 1382 (V: 34.37%) (N: 42.50%)
K16 -> 348 (V: 36.12%) (N: 6.87%)
B13 -> 223 (V: 33.85%) (N: 7.59%)
E13 -> 152 (V: 33.49%) (N: 5.53%)

Run 3:
M17 -> 39535 (V: 38.66%) (N: 8.24%)
J14 -> 1635 (V: 34.52%) (N: 41.35%)
B13 -> 371 (V: 34.40%) (N: 9.66%)
E13 -> 258 (V: 33.41%) (N: 8.28%)
H9 -> 202 (V: 33.60%) (N: 6.25%)
OK, Leela Zero, after thinking for 3 minutes, a long time for it, figures the best play to be M-17. It's a better player than I am, so I'll say OK. With 36697 playouts for M-17 in the first run, I'll guess the precision of its evaluation as ± 0.5%, yielding 38.4% ± 0.5%. A likely win for White.
But what about L-17? With only 592 playouts, I'll guess its precision as ± 4.1%. But its evaluation is only about 3.6% worse than that of M-17. So there is a chance that L-17 is a better play than M-17. (L-17 does not even show up in the top 5 choices in the other runs, however, so it is out of the running.)
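Putting the first run's numbers together with that crude 1/sqrt(N) guess (again, only a stopgap, and the figures are just the ones quoted above):
Code:
from math import sqrt

# First-run numbers from the quote above; precision is the crude 1/sqrt(N) guess.
candidates = {"M17": (36697, 0.3845), "L17": (592, 0.3483)}

for move, (playouts, winrate) in candidates.items():
    err = 1.0 / sqrt(playouts)
    print(f"{move}: {winrate:.2%} +/- {err:.2%}  ({playouts} playouts)")

# M17: 38.45% +/- 0.52%, L17: 34.83% +/- 4.11%. The two intervals overlap,
# so on this crude measure alone L17 cannot be ruled out in this position.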
Now, Leela Zero's evaluations are quite good enough for it to play well, in general. M-17 is a good move. But whether L-17 or K-14 is better than M-17 in this specific position (with this komi) is a different question. At the very least we would like to evaluate other options to the same degree of precision as the top choice and have an error term to indicate the degree of confidence in the comparison. And, given that people are using software for review and analysis and are asking whether particular plays are mistakes, some software should be designed for that purpose.
Edit: I do not mean to disparage Lizzie or Go Review Partner or any other analysis or review program, but it would be better if they were able to use software developed for evaluation, not play.
