Chess GM Jan Ludvig Hammer begins to use Leela Chess Zero here:
https://www.youtube.com/watch?v=TxiNUPK ... gs=pl%2Cwn

I find this video painful to watch because Hammer is struggling with the software. People are helping him, but he is plainly frustrated.
One thing that he complains about is that he is unable (at least at the moment) to teach Leela Chess Zero about the game he is analyzing by entering lines of play, something that he does with other chess engines. In particular, there is a move that Leela does not find, but when he plays it, Leela realizes that it is better than Leela's top choice. But when he backs the game up, Leela Chess Zero does not change its evaluation of the previous position. I suppose that this is a feature of Leela Chess Zero, and I am not going to complain about it myself.
However, in the midst of his explorations of the software, he makes an observation that resonates with me. He does not care whether the software plays better than other software (I do, though); he wants to use it for analysis and review. If you are trying to understand a particular game or variation, and your software does not learn along with you, that limits its value for that purpose.
As far as go bots are concerned, I think that they still have a lot of room for improvement, and getting them to play as well as they can is an important goal. At some time we are likely to reach a point of diminishing returns, but we do not seem to be near that point yet. Let us forge ahead.

But people are starting to use bots for review and analysis, tasks for which they were not designed. One feature that Hammer wants is for the evaluation of plays or variations that the human enters to propagate up the game tree. We know that life and death is a relative weakness in current go bots. If a human, even an SDK, shows the program that at a certain node in the game tree there is a play that the program missed that kills or saves a group, then that fact ought to affect the program's earlier decisions. If the new evaluation does not propagate up the tree, that will not happen.
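To make the idea concrete, here is a minimal sketch in Python of what "propagating up the tree" could mean: after a human overrides the evaluation of a node, every ancestor is re-derived from its children. The node class and the negamax-style backup rule are my own illustration, not how Leela or any other bot actually represents its search.
Code:
# Hypothetical sketch: push a human-supplied evaluation back up a game tree.
# Winrates are stored from the point of view of the side to move at each node.

class Node:
    def __init__(self, move, parent=None, winrate=0.5):
        self.move = move
        self.parent = parent
        self.children = []
        self.winrate = winrate

    def add_child(self, move, winrate):
        child = Node(move, parent=self, winrate=winrate)
        self.children.append(child)
        return child

def propagate_up(node):
    """After a node's winrate is corrected (say, the human shows a kill the
    bot missed), re-derive each ancestor's winrate from its children."""
    while node.parent is not None:
        node = node.parent
        # The side to move picks its best child; from its point of view the
        # winrate of a child position is 1 minus the child's stored winrate.
        node.winrate = max(1.0 - c.winrate for c in node.children)

root = Node("root", winrate=0.55)                      # bot's current view
a = root.add_child("bot's top choice", winrate=0.45)   # opponent to move
b = root.add_child("human's move", winrate=0.60)       # bot thinks this is worse
b.winrate = 0.10                                       # human shows the group dies
propagate_up(b)
print(root.winrate)                                    # 0.9: the correction reaches the root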
Currently these programs are used in reviews to compare different plays, to show people where they made a mistake. The bots use winrate percentages to evaluate positions and plays. How much worse, in percentage terms, does a human's play have to be than the bot's top pick for it to count as a mistake? (OC, we cannot be sure that the bot's top pick is best, but that's another story, for now.) For some people, it seems that a difference of less than 1% is enough; for others it takes a difference of 1%, for others 2%, for some 4%. But we are all guessing.
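For illustration, the criterion people are applying comes down to something like the sketch below; the function and the threshold values are just the guesses described above, not anything that an actual review tool is known to implement.
Code:
# Hypothetical sketch of the usual review criterion: a play is called a
# mistake when its winrate is more than some threshold below the bot's top pick.

def is_mistake(bot_best_winrate, human_play_winrate, threshold=0.02):
    """Return True if the human's play counts as a mistake under the threshold."""
    return (bot_best_winrate - human_play_winrate) > threshold

# The verdict depends entirely on which threshold we happen to pick:
for t in (0.005, 0.01, 0.02, 0.04):
    print(t, is_mistake(0.52, 0.495, threshold=t))   # True, True, True, False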

What we would like to know are the error rates and ranges of the evaluations. Bots are trained on millions of self-play games. Those games should provide enough data to generate error terms for the winrates. But the error terms are not generated, because accurate evaluation is not the goal of the programs; winning games is. And simply picking the play with the best evaluation is not how modern bots work. They are more complicated than that. Changes that you might think would help a bot play better may actually make it play worse. But like Hammer, when I am analyzing a game or position, I am not concerned with how well the software plays in general; I am concerned with evaluating a specific game, position, or play.
Recently I saw a position at the end of the game where Zen7 evaluated a pass by Black as giving White a 61% chance of winning the game. (Edit: See viewtopic.php?p=233790#p233790 ) It seemed obvious to me that the pass was correct, indeed the only winning choice, since it was a 0.5 pt. win. It was easy to show that Black could defend against White's threat, at least for an amateur dan player, and probably for many SDKs as well. At the 10 kyu level, Zen's evaluation might be right. But in analyzing a game I do not need 10 kyu help, thank you very much.

In a position where play was essentially over, a top bot's evaluation was off by 61%.

Obviously, Zen7 does not go around giving 10 kyu advice. But in this specific case it was horribly wrong. And in doing a review or analysis, it is specific cases that we are interested in. Evaluations made by a program whose goal is accurate evaluation may still be wrong, but such a program could at least tell us how much confidence to place in each of its evaluations.
Jan.van.Rongen wrote:
That also means that the "precision" which you defined elsewhere as being proportional to 1/sqrt(N), where N is the number of simulations for that move, is incorrect. These simulations are not at all independent.
Lacking error estimates, I can at least compare the precision of evaluations in terms of 1/sqrt(Playouts). For want of anything better. If anyone would like to provide good error estimates for winrates, that would be great! Meanwhile, I'll make do.
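Spelled out, the stopgap is just this (and, as Jan points out, treating playouts as independent samples is precisely what is questionable about it):
Code:
from math import sqrt

def precision(playouts):
    """Crude precision guess for a winrate: +/- 1/sqrt(N), N = playouts.
    Playouts are not independent samples, so this is at best a rough
    lower bound on the real uncertainty."""
    return 1.0 / sqrt(playouts)

print(precision(10000))   # 0.01, i.e. about +/- 1% after 10,000 playouts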
Jan.van.Rongen wrote:
Assessing this situation after move 50 with Leela Zero network #155 (180 seconds) gives a remarkable result.
Quote:
Run 1:
M17 -> 36697 (V: 38.45%) (N: 22.47%)
L17 -> 592 (V: 34.83%) (N: 13.83%)
K16 -> 580 (V: 36.40%) (N: 7.79%)
B13 -> 472 (V: 34.26%) (N: 12.73%)
J14 -> 448 (V: 34.04%) (N: 12.64%)

Run 2:
M17 -> 36773 (V: 39.23%) (N: 7.07%)
J14 -> 1382 (V: 34.37%) (N: 42.50%)
K16 -> 348 (V: 36.12%) (N: 6.87%)
B13 -> 223 (V: 33.85%) (N: 7.59%)
E13 -> 152 (V: 33.49%) (N: 5.53%)

Run 3:
M17 -> 39535 (V: 38.66%) (N: 8.24%)
J14 -> 1635 (V: 34.52%) (N: 41.35%)
B13 -> 371 (V: 34.40%) (N: 9.66%)
E13 -> 258 (V: 33.41%) (N: 8.28%)
H9 -> 202 (V: 33.60%) (N: 6.25%)
OK, Leela Zero, after thinking for 3 minutes, a long time for it, figures the best play to be M-17. It's a better player than I am, so I'll say OK. With 36697 playouts for M-17 in the first run, I'll guess the precision of its evaluation as ± 0.5%, yielding 38.4% ± 0.5%. A likely win for White.
But what about L-17? With only 592 playouts, I'll guess its precision as ± 4.1%. But its evaluation is only about 3.6% worse than that of M-17. So there is a chance that L-17 is a better play than M-17. (L-17 does not even show up in the top 5 choices in the other runs, however, so it is out of the running.)
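Putting the first run's numbers together with that crude 1/sqrt(N) guess (again, only a stopgap, and the figures are just the ones quoted above):
Code:
from math import sqrt

# First-run numbers from the quote above; precision is the crude 1/sqrt(N) guess.
candidates = {"M17": (36697, 0.3845), "L17": (592, 0.3483)}

for move, (playouts, winrate) in candidates.items():
    err = 1.0 / sqrt(playouts)
    print(f"{move}: {winrate:.2%} +/- {err:.2%}  ({playouts} playouts)")

# M17: 38.45% +/- 0.52%, L17: 34.83% +/- 4.11%. The two intervals overlap,
# so on this crude measure alone L17 cannot be ruled out in this position.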
Now, Leela Zero's evaluations are quite good enough for it to play well, in general. M-17 is a good move. But whether L-17 or K-14 is better than M-17 in this specific position (with this komi) is a different question. At the very least we would like to evaluate other options to the same degree of precision as the top choice and have an error term to indicate the degree of confidence in the comparison. And, given that people are using software for review and analysis and are asking whether particular plays are mistakes, some software should be designed for that purpose.
Edit: I do not mean to disparage Lizzie or Go Review Partner or any other analysis or review program, but it would be better if they were able to use software developed for evaluation, not play.
