AlesCieply wrote:
I have already tried to analyze two games with AQ but found Leela's winrate estimates much more reliable/stable.
bugsti wrote:
Leela is trained to fit human data, so it is not surprising that you found more similarity between Leela and human middle-dan games than with other software. Leela's first purpose is to predict human middle-dan moves, not to find superhuman moves.
As bugsti indicates, stability does not mean reliability. Remember the old joke about a stopped clock being right twice a day? Very stable, very unreliable.

In the games you analyzed you found several human moves that were better (according to Leela) than Leela's first choices. One was even estimated as more than 10% better.

OC, Leela does provide an estimate of the "winrate" of a play, but an imperfect estimate. Since we have tools such as AQ, Leela Zero and Leela Zero Elf that provide better estimates, there is no reason not to use one of them.

Even at chess, humans sometimes find better moves than the superhuman chess engines. How do we know? The evaluation after the human move is better than the evaluation of the engine's top choice.

At this point I think that it is worth talking about precision and accuracy. As the stopped clock illustrates, the two are not the same. But our estimates cannot be more accurate than their precision. As you know, we can increase both the accuracy and precision of our evaluations by increasing the number of playouts. Accuracy is hard to determine, particularly as we do not know exactly what "winrate" means. But we do have a pretty good idea of the precision of our evaluations.
Bojanic's RSGF file for the Metta-Ben David game provides a good example. For move 30, Leela 11 looks at only two options. Its top choice yields a winrate for White of 49.80% with 9818 playouts; its second choice yields a winrate of 42.96% with 1318 playouts. Considering their precision, those estimates would better be presented as 50% ± 1% and 43% ± 3%. That degree of precision is enough for us to be confident that Leela prefers the first play to the second, given the 7% difference. However, it is not enough for us to be confident that the first play is 7% better than the second (according to Leela). To get precise differences in winrate estimates we require more playouts than we need just to compare two plays. In general we want the error of our estimates to be 1/10 of the precision that we report. To reach that precision for these estimates would probably require around 1,000,000 playouts each. (Precision probably increases as the square root of the playouts.) For our purposes we probably do not need that degree of precision; maybe around 100K playouts for each choice that we compare is good enough.
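To show where those ± figures come from, here is a rough back-of-the-envelope check in Python. It treats each playout as an independent win/loss sample and takes roughly two standard errors of a proportion as the margin; Leela's tree search does not really satisfy that independence assumption, so this is only an order-of-magnitude sketch, not how Leela itself computes anything.

```python
from math import sqrt

def margin(winrate, playouts, z=2.0):
    """Approximate margin of error: about two standard errors of a proportion."""
    return z * sqrt(winrate * (1.0 - winrate) / playouts)

# Move 30 figures from Bojanic's file:
print(round(100 * margin(0.4980, 9818), 1))   # -> 1.0, i.e. 50% +/- 1%
print(round(100 * margin(0.4296, 1318), 1))   # -> 2.7, i.e. 43% +/- 3%

# The error shrinks as 1/sqrt(playouts), so cutting a +/-1% margin to +/-0.1%
# takes about 100 times the playouts, on the order of 1,000,000:
print(round(100 * margin(0.50, 1_000_000), 2))  # -> 0.1
```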
IIRC, for the top choice to be stable you used 200,000 playouts. That number addresses accuracy as well as precision, since more playouts mean that a larger search tree is being built, IIUC.
For move 31, Bojanic's Leela looked at several plays. Its top choice had a winrate of 51% ± 1% with 8221 playouts. Metta played its 7th choice, which had a winrate of 48% ± 3% with 1054 playouts. Since the difference between winrates is only 3%, that isn't really enough to say that the top choice is better than the 7th choice, is it?
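The same rough model gives a sanity check on that last point: the question is whether the 3% gap is large compared with the combined uncertainty of the two estimates. Again, the independence assumption is only an approximation, and the 51%/48% figures are rounded.

```python
from math import sqrt

def stderr(winrate, playouts):
    # One standard error, treating playouts as independent win/loss samples.
    return sqrt(winrate * (1.0 - winrate) / playouts)

top_choice   = (0.51, 8221)   # Leela's first choice for move 31
metta_choice = (0.48, 1054)   # Metta's actual play, Leela's 7th choice

gap = top_choice[0] - metta_choice[0]
combined = sqrt(stderr(*top_choice) ** 2 + stderr(*metta_choice) ** 2)
print(round(gap / combined, 1))  # -> 1.8, i.e. under two standard errors,
                                 # so the 3% gap is not clearly significant
```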
Now, when we look at the winrate after move 31 was played, we get 51.22%. That is clearly wrong. All that happened is that the winrate for the top choice was copied into the winrate for the actual play. Bug in the program. (Not Leela itself, surely.)
But when we look at the estimate for the top choice for move 32, we get a winrate of 48.5% ± 0.5% with 35153 playouts. That's consistent with the earlier estimate, but more precise. The total number of playouts was 50,908. I do not know how Leela estimates winrates for a position and player to play, if it does at all, but if we take the estimate for the top choice, then we would want around 3 times as many playouts to reach our desired precision. Your choice of a total of 200K playouts seems reasonable.
Ben David played Leela's 4th choice, with a winrate (for Metta, not Ben David) of 55% ± 9% with 126 playouts. It seems to have been a blunder, with a loss of 6.5% versus best play, but with an error range of 9%, who knows? As I said, Leela (and other bots) are not optimized for evaluation. The winrate for Leela's top choice for move 33 is 57% ± 1% with 7004 playouts. So Ben David's play does seem to have been a blunder.
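The two deeper evaluations can be compared the same way. Assuming both figures are from Metta's side, as the numbers above suggest, the gap between 57% after Ben David's actual move and 48.5% for the best play dwarfs the combined uncertainty, unlike the noisy 126-playout estimate. Same rough independence assumption as before.

```python
from math import sqrt

def stderr(winrate, playouts):
    # One standard error, treating playouts as independent win/loss samples.
    return sqrt(winrate * (1.0 - winrate) / playouts)

best_play    = (0.485, 35153)  # deep estimate for Leela's top choice (winrate for Metta)
after_actual = (0.57, 7004)    # deep estimate after Ben David's actual move

gap = after_actual[0] - best_play[0]
combined = sqrt(stderr(*after_actual) ** 2 + stderr(*best_play) ** 2)
print(round(100 * gap, 1), round(gap / combined, 1))
# -> 8.5 13.1 : an 8.5% loss at about 13 standard errors, far more
#    convincing than the 55% +/- 9% figure from only 126 playouts.
```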
Anyway, one lesson is that we cannot just take the difference between the estimate for Leela's top choice and the estimate for the human's play if we want to reach the precision we are after. We have to make the human's play and evaluate the resulting position. Edit: In fact, for a fair comparison we should probably also make Leela's top choice and recalculate.
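A sketch of that procedure in Python, for what it's worth. The evaluate callable stands in for whatever engine front end you use to score a position at a fixed playout budget; it and the other names here are hypothetical, not a real Leela interface.

```python
from typing import Callable

def winrate_loss(position,
                 engine_move,
                 human_move,
                 evaluate: Callable[..., float],
                 playouts: int = 200_000) -> float:
    """Return how much winrate the human's play gave up versus the engine's.

    Both moves are played and the *resulting* positions are evaluated with
    the same playout budget, so the two estimates have the same precision,
    rather than taking the engine's in-search numbers, where the human's
    move may have received only a few hundred playouts.
    evaluate(position, move, playouts) is assumed to return the winrate of
    the position reached by playing move, from a fixed side's point of view.
    """
    after_engine = evaluate(position, engine_move, playouts)
    after_human = evaluate(position, human_move, playouts)
    return after_engine - after_human  # positive means the human move lost winrate
```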
