On the accuracy of winrates

Bill Spight · #1

By now it is well understood that the winrates calculated by today's top bots are not actually win rates of anything known. For instance, if at a certain point in a game White is said to have a winrate of 60%, that does not mean that if we play out the game 10,000 times under certain known conditions, White will win approximately 6,000 times. So we cannot test estimated winrates against actual play and determine the accuracy (error rates) of those estimates.

But we do need to be able to measure the accuracy of those estimates. If at some point in the game Black is estimated to have a winrate of 55%, how confident are we that Black is really ahead? Then if White makes a play that increases Black's estimated winrate by 3%, how confident are we that White has made a mistake? If we compare two plays and one of them has a winrate 1.2% better than the other, how confident are we that it is the better play?

The purpose of this thread is to address these questions.

gowan · #2

I'm not well versed in how the bot-calculated winrates are done. Say the AI says that a move has p% winrate for White. Does that mean White won p% of the games the AI played in its playouts? I assume that number depends on the depth of "reading" and the number of playouts. Given the known flaws in evaluation by AI (ladders, for example), how much can we believe the winrate? I guess that's what Bill is asking.

Bill Spight · #3

We may not be able to verify winrates by the winning percentages of actual games. (Although that might be possible with empirical research.) But we can compare winrates with winrates, and use those differences to tell us something about the accuracy of those winrates. The more we explore the game tree, the more accurate we expect the winrate estimates to become. We may measure the exploration in terms of playouts or time (which, OC, is machine dependent, and subject to contamination by multitasking). In any event, we want to find out the variability of our winrate estimates for different positions at different levels of exploration. We can do that by setting the parameters of exploration and generating multiple winrate estimates.

If we then find, for instance, that with 10,000 playouts we get an average deviation from the mean estimate of 2.1%, then we won't place much stock in a winrate difference between two plays of only 1.4%.
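As a minimal sketch of that comparison, assume we have already collected several winrate estimates for one position at a fixed playout setting. The numbers below are made up for illustration (they are not bot output), and the "exceeds the average deviation" rule is only a crude rule of thumb, not a formal test.

```python
from statistics import mean

def average_deviation(estimates):
    """Mean absolute deviation of the estimates from their mean."""
    m = mean(estimates)
    return mean(abs(x - m) for x in estimates)

def difference_is_notable(winrate_a, winrate_b, avg_dev):
    """Crude rule of thumb (an assumption, not a formal test):
    only trust a difference that exceeds the run-to-run deviation."""
    return abs(winrate_a - winrate_b) > avg_dev

# Hypothetical winrates (in %) from repeated runs on one position.
runs = [51.5, 53.25, 55.0, 56.75, 58.5]
dev = average_deviation(runs)
print(f"mean = {mean(runs):.1f}%, average deviation = {dev:.2f}%")

# A 1.4% gap between two plays versus the run-to-run deviation.
print(difference_is_notable(55.0, 56.4, dev))
```

With real data, `runs` would hold the estimates from repeated analyses of the same position at the same setting.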

OC, comparing these estimates only puts limits on their accuracy. The estimates can be precise, but still wrong. Perhaps greater exploration will reveal mistakes.

Bill Spight · #4

gowan wrote:

I'm not well versed in how the bot-calculated winrates are done. Say the AI says that a move has p% winrate for White. Does that mean White won p% of the games the AI played in its playouts? I assume that number depends on the depth of "reading" and the number of playouts. Given the known flaws in evaluation by AI (ladders, for example), how much can we believe the winrate? I guess that's what Bill is asking.

Hi, gowan.

I don't believe the winrates, as applied to actual games by humans.

But winrates are still evaluations, and we can learn something by comparing them to themselves.

Bill Spight · #5

What about the best move?

The way winrates are currently estimated, IIUC, they are based upon the winrates of the bots' top choices of play. However, the bots do not pick their top choices based solely on estimated winrates. That is somewhat problematical. If the bots do not trust the winrates, why should we? But the bots are not trained to produce the best evaluations, or even pick the best play in any given position, they are trained to win. It's a different goal. It would be nice to have a program which is trained to make evaluations, but we use what we have.

That said, when we run a program multiple times on a given position with a certain player to play, we expect that we will get more than one play that is picked at least once as a top choice. The runs that produce different top choices may also produce different average winrates for each such top choice. We can probably learn from those differences, but how to do so is another question.

dfan · #6

In MCTS as originally designed, the win rate for a move is the fraction of playout games won when making that move. Since the playout "policy" is quite bad, the moves are much more random, and the "win rates" don't reflect very good play; they're generally closer to 50% than "reality" (because the player who's ahead has a non-zero probability of giving away a group or something).

In AlphaGo Zero, the win rate really is supposed to be "the fraction of games that would be won by AlphaGo Zero if it played against itself starting from this position"; that is exactly what the network is trained to produce. This isn't done by playing out individual positions a thousand times and totaling up the number of wins; it's done by generating lots of self-play games (including tree search) and asking the network to generate a 1 if the player won from some position and a 0 if it lost. That sounds extremely non-robust, but since the learning rate is small, things get smoothed out, and a position that would win 2/3 of the time, and would get trained towards 1.0 2/3 of the time and towards 0.0 1/3 of the time, will end up having a value of around 2/3. (In reality, of course, it doesn't see the same position lots of times, but it sees similar positions and generalizes.)
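The smoothing argument can be illustrated with a toy model. This is only a sketch under strong assumptions: a single number stands in for the network's value output on one repeated position, and plain SGD on squared error stands in for training. It is not AlphaGo Zero's actual procedure, and the function name and parameters are hypothetical.

```python
import random

def train_scalar_value(p_win=2/3, learning_rate=0.0005, steps=200_000, seed=0):
    """Nudge a single value estimate toward 1 on a win and toward 0 on a loss.

    This is plain SGD on squared error for one parameter; a stand-in for
    the value output, not the real training procedure.
    """
    rng = random.Random(seed)
    value = 0.5  # start with no opinion
    for _ in range(steps):
        target = 1.0 if rng.random() < p_win else 0.0
        # Gradient of (value - target)^2 / 2 with respect to value.
        value -= learning_rate * (value - target)
    return value

print(round(train_scalar_value(), 3))
```

Because each step moves the value only slightly, the individual 0 and 1 targets average out and the estimate settles near the underlying win probability.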

None of this helps much with evaluating accuracy, of course. And we all know about the differences between a 50% that means "it's still early in the game, anything could happen" and a 50% that means "this capturing race will decide the game but I can't read it out".

Bill Spight · #7

dfan wrote:

In AlphaGo Zero, the win rate really is supposed to be "the fraction of games that would be won by AlphaGo Zero if it played against itself starting from this position"; that is exactly what the network is trained to produce. This isn't done by playing out individual positions a thousand times and totaling up the number of wins; it's done by generating lots of self-play games (including tree search) and asking the network to generate a 1 if the player won from some position and a 0 if it lost. That sounds extremely non-robust, but since the learning rate is small, things get smoothed out, and a position that would win 2/3 of the time, and would get trained towards 1.0 2/3 of the time and towards 0.0 1/3 of the time, will end up having a value of around 2/3. (In reality, of course, it doesn't see the same position lots of times, but it sees similar positions and generalizes.)

That's a plausible theory. But that theory still needs to be tested.

Quote:

None of this helps much with evaluating accuracy, of course.

IMHO, good science in general means knowing about the errors.

mitsun · #8

Naively we would expect the variance in the winrate estimate to be p(1-p)/N, where N is the number of trials leading to the winrate estimate p. Unfortunately this does not help much. N cannot simply be taken as the number of playouts, as they are probably highly correlated.
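To put rough numbers on this, here is a small sketch comparing the naive standard error with one under correlated trials. The correlated version uses the textbook equicorrelation result, Var(mean) = p(1-p)/N * (1 + (N-1)*rho). The pairwise correlation `rho` is a hypothetical knob; nothing in this thread says what its value is for real playouts.

```python
import math

def naive_standard_error(p, n):
    """Standard error of a proportion from n independent trials."""
    return math.sqrt(p * (1 - p) / n)

def correlated_standard_error(p, n, rho):
    """Standard error when every pair of trials shares correlation rho.

    rho is an assumed, illustrative parameter, not a measured quantity.
    """
    return math.sqrt(p * (1 - p) / n * (1 + (n - 1) * rho))

p, n = 0.6, 10_000
print(f"independent: {100 * naive_standard_error(p, n):.2f}%")
for rho in (0.001, 0.01, 0.1):
    print(f"rho={rho}: {100 * correlated_standard_error(p, n, rho):.2f}%")
```

Even a small positive correlation inflates the error considerably, which is why N cannot simply be taken as the playout count.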

I am no expert, but I imagine a network could be trained to provide additional outputs, besides the choice of next move, including for example the winrate and variance for the selected move. Building the desired outputs directly into the network, subject to the same training which goes into choosing the best move, seems like a relatively small addition to the program. These additional outputs would not improve play, but would make a better teaching tool.

Getting a little off topic, it would be even better for the network to output the numerical value of the game result for the selected move, along with an error estimate. This would provide a more useful teaching tool than winrate. It would be useful even in the limit where the network is infinitely strong, where every move has winrate exactly 1 or 0, with no error.

moha · #9

Bill Spight wrote:

By now it is well understood that the winrates calculated by today's top bots are not actually win rates of anything known. For instance, if at a certain point in a game White is said to have a winrate of 60%, that does not mean that if we play out the game 10,000 times under certain known conditions, White will win approximately 6,000 times.

This would OC only be true in an ideal situation, and is affected by the fact that in most (but not all) positions the evaluations are only NN approximations. It is also affected by how the simplest MCTS works (the resulting winrate averages evaluations from all lines following a move, and the bot picks the "best" move only indirectly). But I'm not sure which of these is behind your strong statement.

In practice, during training value nets are actually pulled towards such percentages (towards 0% or 100% depending on actual outcome, but NN has lowest loss/error values if it outputs the correct percentage for seen samples). OC, this also depends on generalization. There is no separate set of positions where the % value would be applied to, as "similar" positions are fuzzy. And training a sufficiently large net on a limited set of training games to the extreme will only output 0% or 100% - loss function ->0, overfitting to the samples completely, memorizing all outcomes. But the rough direction is still the correct percentage for seen samples (as dfan pointed out).

However, even if the winrates themselves may be inaccurate, their ORDERING should be reasonably good, since the bot's strength directly depends on this. Even 1% differences can be meaningful - after all, the bot gets ahead by accumulating those 1% advantages, according to its own value evaluations (with the help of the opponent's errors, of course).

Also, pre-zero bots used an average of NN evaluations and actual percentages from rollouts (with a non-NN-based policy, unfortunately, because of time constraints). If NN passes somehow became very fast in the future, rollouts would probably return and would give almost accurate winrates.

jlt · #10

If bots played perfectly, there wouldn't be any notion of winrate (the winrate would be 0% or 100%). The notion of winrate depends on the bot, and is not an absolute measure of quality of play. The only absolute measure would be "score of the game starting from this position, with both players playing perfectly in order to maximize their score". Unfortunately we don't have access to that, but we hope, without being able to prove it, that the stronger the bot is, the better its calculated winrate correlates with absolute quality of play.

On the other hand, except for very strong players, I am not convinced by statements like "AlphaZero says this joseki is bad so it shouldn't be played". For a 1d player, asking LeelaZero 10d if move A is slightly better than move B amounts to a 9k player asking a 1d if move C is slightly better than move D. However, moves that are good for 1-dans may be too risky for 9-kyus. So while strong bots may be useful in indicating which moves are blunders, I am not sure that humans should necessarily prefer moves with 52% winrate over moves with 50% winrate.

moha · #11

moha wrote:

There is no separate set of positions where the % value would be applied to, as "similar" positions are fuzzy.

Actually it seems better to think about this as the estimated probability of winning, even in training, with the bot trying to minimize its bad guesses, as this is closer to reality. Taking a lot of time and playing a lot of selfplay games from the position would likely result in values closer to 100% and 0%, rarely 50% (esp. with limited randomness of bot play). However, early positions with only a few stones are different - the winrates there can be almost real percentages from the millions of selfplay games, for the exact position instead of "similar" ones.

dfan · #12

Bill Spight wrote:

dfan wrote:

In AlphaGo Zero, the win rate really is supposed to be "the fraction of games that would be won by AlphaGo Zero if it played against itself starting from this position"; that is exactly what the network is trained to produce. This isn't done by playing out individual positions a thousand times and totaling up the number of wins; it's done by generating lots of self-play games (including tree search) and asking the network to generate a 1 if the player won from some position and a 0 if it lost. That sounds extremely non-robust, but since the learning rate is small, things get smoothed out, and a position that would win 2/3 of the time, and would get trained towards 1.0 2/3 of the time and towards 0.0 1/3 of the time, will end up having a value of around 2/3. (In reality, of course, it doesn't see the same position lots of times, but it sees similar positions and generalizes.)

That's a plausible theory. But that theory still needs to be tested.

The theory has been tested in other machine learning contexts, but I don't know about AlphaGo in particular (other than the fact that in the end it plays very strongly). If you train on multiple copies of the same input with different targets using a standard loss function, the total loss is minimized when the network outputs the expected value of the target (both in theory and in practice).
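A quick numerical check of that statement, as a sketch: for one repeated input with 0/1 targets, search a grid of candidate outputs and see which one minimizes the average loss. Squared error and cross-entropy are used here as examples of "standard" losses; the labels and function names are hypothetical.

```python
import math

def squared_error(output, targets):
    """Average squared error of a constant output against the targets."""
    return sum((output - t) ** 2 for t in targets) / len(targets)

def cross_entropy(output, targets):
    """Average binary cross-entropy of a constant output against the targets."""
    return -sum(
        t * math.log(output) + (1 - t) * math.log(1 - output) for t in targets
    ) / len(targets)

def best_output(loss, targets, grid_size=999):
    """Grid search over outputs strictly between 0 and 1."""
    candidates = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return min(candidates, key=lambda q: loss(q, targets))

# Hypothetical labels for one repeated position: two wins, one loss.
targets = [1, 1, 0]
print(best_output(squared_error, targets), best_output(cross_entropy, targets))
```

Both losses bottom out near the mean of the targets, here about 2/3, rather than at 0 or 1.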

If similar inputs produce very different outputs, then the sensitivity of the network to the input depends on the architecture of the network. Of course we actually want our Go engine to be very sensitive to the input, since changing the position of one stone by one line can make a gigantic difference! In a way we are pretty lucky that the network somehow learns to generalize in a way that doesn't overreact to any individual position, rather than just memorizing the data that's been fed to it, especially since it's been shown in experiments that neural networks are capable of doing a heck of a lot of memorization (e.g., you can give them a digit-recognition task with completely made-up answers that have nothing to do with the actual images and they will still learn to get them almost all "correct"). Why exactly it is that neural networks are able to generalize so well is still an area of very active research.

I am currently working professionally on a technique to allow neural networks to output an amount of confidence in their results as well as the results themselves, but I'm not sure when it will be public. I actually think it could work fairly well in something like Leela Zero. It could both provide interesting data to humans and also improve the program's strength (in theory) by guiding the tree search.

Bill Spight · #13

Leela Elf applied to the Metta vs. Ben David game

I think that the first steps to take are as indicated above. But Ales Cieply has graciously made two analyses (rsgf files) of the now infamous Metta - Ben David game by Leela Elf available. ( viewtopic.php?p=234293#p234293 ) One file is set for 100k rollouts for the whole game; the other is set for 200k rollouts, starting with move 30. I'm not sure what the rollout number means, as I have not been able to make the reported rollouts in the files add up to either number. But the 200k rollout file does, I guess, twice as much exploration as the 100k rollout file.

Now, since the rollout settings differ, we cannot do the basic comparison I have suggested above. But, on the assumption that the 200k rollout winrates are more accurate than the 100k rollout winrates, we can take the difference between the winrates for each position as an estimate of the error of the 100k rollout winrate.

Well, it certainly is plausible that the 200k winrates are more accurate than the 100k winrates, but do we have any evidence of that? Indeed, we do.

Suppose that a position has features that lead the 100k rollout winrate to overestimate the probability that Black wins. One move is not likely to change the position so much that the 100k rollout winrate will be correct or an underestimate. Its winrate for the next position is likely to also be an overestimate. Thus, if the 200k winrates are sufficiently more accurate than the 100k winrates, the sign of the difference between the two should usually stay the same between successive plays. (OC, you can flip the argument. The same would hold true if the 200k rollout winrates were underestimates. But still, if persistent features of the board lead to misestimates, the signs of consecutive differences should tend to remain the same.)

I calculated the differences for moves 30 - 166. (The last play was 165, but Elf calculated estimates for move 166.) That yields 135 consecutive differences. If the signs of the differences change randomly, there should be approximately 67.5 sign changes. There were two zero differences. One fell in the middle of a change from plus to minus (plus-zero-minus); the other went plus-zero-plus. If we count the latter as one half of a double sign change, we get 45 sign changes. If not, we get 44. That's at least 22.5 fewer sign changes than expected if they were random.

Edit: I have made a PDF summarizing these results and posted it at the end of the current topic. viewtopic.php?p=237612#p237612 I made a better comparison by subtracting the median difference from the actual differences; we get 50 sign changes instead of the expected 68. Also, there were 11 differences of 3% or greater out of 137. I think we can take smaller differences as falling within the normal range at a setting of 100K.
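The sign-change count can be sketched in a few lines. Assumptions are labeled in the comments: zeros are simply skipped (one possible convention), the number of changes under random signs is approximated as Binomial(pairs, 1/2), and the sample differences are invented for illustration. Only the 45 and 135 figures come from the post.

```python
import math
from statistics import median

def count_sign_changes(differences):
    """Count sign flips between consecutive nonzero differences.

    Zeros are skipped here; counting them as half changes is another option.
    """
    signs = [1 if d > 0 else -1 for d in differences if d != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

def sign_change_z_score(changes, pairs):
    """Normal approximation, assuming changes ~ Binomial(pairs, 1/2)."""
    expected = pairs / 2
    std_dev = math.sqrt(pairs) / 2
    return (changes - expected) / std_dev

def center_on_median(differences):
    """Subtract the median so a constant offset does not hide sign changes."""
    m = median(differences)
    return [d - m for d in differences]

# Figures quoted in the post: 45 changes against an expected 67.5.
print(f"z = {sign_change_z_score(45, 135):.2f}")

# Hypothetical differences (200k winrate minus 100k winrate, in %).
sample = [1.2, 0.8, 0.3, -0.4, -1.1, -0.2, 0.5, 0.9, 0.0, 0.4]
print(count_sign_changes(sample), count_sign_changes(center_on_median(sample)))
```

A strongly negative z-score says there are fewer sign changes than random signs would produce, which is the pattern described above.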

Bill Spight · #14

dfan wrote:

Bill Spight wrote:

dfan wrote:

In AlphaGo Zero, the win rate really is supposed to be "the fraction of games that would be won by AlphaGo Zero if it played against itself starting from this position"; that is exactly what the network is trained to produce. This isn't done by playing out individual positions a thousand times and totaling up the number of wins; it's done by generating lots of self-play games (including tree search) and asking the network to generate a 1 if the player won from some position and a 0 if it lost. That sounds extremely non-robust, but since the learning rate is small, things get smoothed out, and a position that would win 2/3 of the time, and would get trained towards 1.0 2/3 of the time and towards 0.0 1/3 of the time, will end up having a value of around 2/3. (In reality, of course, it doesn't see the same position lots of times, but it sees similar positions and generalizes.)

That's a plausible theory. But that theory still needs to be tested.

The theory has been tested in other machine learning contexts, but I don't know about AlphaGo in particular (other than the fact that in the end it plays very strongly). If you train on multiple copies of the same input with different targets using a standard loss function, the total loss is minimized when the network outputs the expected value of the target (both in theory and in practice).

I bow to your superior knowledge, but aren't the bots trained only on complete games? Furthermore, the bots do not choose their plays based solely on winrates. Convergence of winrates may be guaranteed in infinite time, but, while not a side effect, it is not the main effect, or goal of training. Pardon me if I am skeptical of finite results.

Quote:

I am currently working professionally on a technique to allow neural networks to output an amount of confidence in their results as well as the results themselves, but I'm not sure when it will be public. I actually think it could work fairly well in something like Leela Zero. It could both provide interesting data to humans and also improve the program's strength (in theory) by guiding the tree search.

That's terrific! Bonne chance! :clap:

dfan · #15

Bill Spight wrote:

I bow to your superior knowledge, but aren't the bots trained only on complete games?

Yes. (Resigning is allowed.)

Quote:

Furthermore, the bots do not choose their plays based solely on winrates.

Right, there's sort of a chicken and egg thing going on. The bot's network is trying to emulate the thinking of a stronger bot (which consists of itself armed with tree search). Its value output is trying to predict the result of that stronger bot's play. So it is always a little behind itself, in some sense, as is usually the case in reinforcement learning.

Quote:

Convergence of winrates may be guaranteed in infinite time, but, while not a side effect, it is not the main effect, or goal of training. Pardon me if I am skeptical of finite results.

I think what you are challenging me to show (and are rightfully skeptical of) is different from what I thought I was showing.

I didn't intend to claim that a win probability of 62.4% actually means that the engine playing itself would win 624 games out of 1000. I just mean that although 1) the win probabilities that are being trained are moving targets, and 2) they are targets that the network is trying to learn to hit, not its actual outputs (both big caveats), it is still true that the units of those targets really are "fraction of times this bot would win against itself". In your opening post of this thread I thought you were trying to argue that the win rates were much more abstract than that (although, rereading your post now, I may have been putting words in your mouth).

moha · #16

dfan wrote:

I am currently working professionally on a technique to allow neural networks to output an amount of confidence in their results as well as the results themselves

To an extent a binary output does this in itself. If the net had no idea, it would output 0.5, as this minimizes the loss. For some classification tasks, if the outputs are kept separate with individual activations, this behaviour is quite noticeable. A board evaluation is somewhat similar (70% winrate -> +0.4 on the asymmetry scale -> the confidence that we are winning).

Bill Spight · #17

dfan wrote:

Bill Spight wrote:

I bow to your superior knowledge, but aren't the bots trained only on complete games?

Yes. (Resigning is allowed.)

Quote:

Furthermore, the bots do not choose their plays based solely on winrates.

Right, there's sort of a chicken and egg thing going on. The bot's network is trying to emulate the thinking of a stronger bot (which consists of itself armed with tree search). Its value output is trying to predict the result of that stronger bot's play. So it is always a little behind itself, in some sense, as is usually the case in reinforcement learning.

Quote:

Convergence of winrates may be guaranteed in infinite time, but, while not a side effect, it is not the main effect, or goal of training. Pardon me if I am skeptical of finite results.

I think what you are challenging me to show (and are rightfully skeptical of) is different from what I thought I was showing.

I didn't intend to claim that a win probability of 62.4% actually means that the engine playing itself would win 624 games out of 1000. I just mean that although 1) the win probabilities that are being trained are moving targets, and 2) they are targets that the network is trying to learn to hit, not its actual outputs (both big caveats), it is still true that the units of those targets really are "fraction of times this bot would win against itself". In your opening post of this thread I thought you were trying to argue that the win rates were much more abstract than that (although, rereading your post now, I may have been putting words in your mouth).

Thanks for your post. Very helpful.

Just think of me as being 25 years behind the times.

People are responding to my winrate claims, which may be incorrect or poorly expressed. Fine.

I am still willing to take winrate estimates of Leela (but not Leela Zero or certainly not Elf) of an ongoing game at move 100 between two evenly matched amateur dans or pros as the basis for a moderate bet, if I can bet on the projected winner. I think that they are too close to 50%.

What I am hoping is for people to generate multiple winrate estimates for games and to come up with error estimates. I think that information would be very helpful to people using bots to review games and joseki.

Gomoto · #18

I manage without any problems to reach around 70% in 6 tournament games in a row and lose all six games with one silly tactical mistake. So at least for me a 70% winrate does not have much meaning. :oops: Or it means I have to stop playing thousands of games and start doing thousands of problems if I want to improve any further reliably. Pros tell me you have a good feeling, you will win the next one. They tell me this 6 times in a row. :lol:

I will enjoy go anyway. Whether I win or lose does not matter.

moha · #19

Following up on binary outputs: even if we interpret NN value outputs as confidence values, it seems quite easy to get a precise measurement of their accuracy/relevancy.

Just take the last few million selfplay records from LZ, where each move/position is labeled by search results (and hopefully value net evaluation as well, though I'm not sure - maybe they only record visit counts for each move? would be a pity). Then create a graph with a few dozen data points, like actual game win% of cases where B had 50%, 51%, 52% and so on (maybe subdivide further for game phase, move ranges 0-20, 40-60 ...) The resulting graph should tell you everything about accuracy in LZ<>LZ games (not in LZ<>nonLZ or nonLZ<>nonLZ games though, OC).
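The graph described above is essentially a calibration table. Here is a minimal sketch of how one could be built. It assumes the records have already been reduced to (predicted winrate for Black, whether Black won) pairs; the actual LZ selfplay record format is not modeled, and the data below is synthetic and well calibrated by construction.

```python
import random
from collections import defaultdict

def calibration_table(records, bins=20):
    """Group (predicted_winrate, black_won) pairs into equal-width bins.

    Returns {bin_lower_edge: (games_in_bin, observed_win_fraction)}.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for predicted, black_won in records:
        index = min(int(predicted * bins), bins - 1)  # put 1.0 in the top bin
        games[index] += 1
        wins[index] += int(black_won)
    return {
        index / bins: (games[index], wins[index] / games[index])
        for index in sorted(games)
    }

# Synthetic stand-in for selfplay records: outcome drawn with probability p.
rng = random.Random(1)
bins = 20
records = []
for _ in range(50_000):
    p = rng.random()
    records.append((p, rng.random() < p))

for lower, (n, observed) in calibration_table(records, bins).items():
    print(f"predicted {lower:.2f}-{lower + 1 / bins:.2f}: {n:5d} games, Black won {observed:.1%}")
```

If the winrates were accurate in the selfplay sense, the observed fraction in each bin would sit close to the predicted range; systematic gaps would show where they are too optimistic or too cautious. Subdividing by move range, as suggested above, would mean building one such table per game phase.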

Javaness2 · #20

In chess I believe the evaluations are done in terms of centipawns. This can be translated into actual pieces on the board. The classic values are Pawn=100, Bishop/Knight=300, Rook=500, Queen=900. The evaluation has a material basis.

In go, the evaluation (winrate) has no material basis, or cannot be translated to one. This differs completely from the human approach to evaluation. As a result, most of us must have a hard time understanding what the hell a computer is spitting out at the terminal. Dropping 4% doesn't correspond to a fixed point value on the board. Can the AIs of today ever translate their winrates into material values, or can they co-display material value estimates in their output?

I suspect that they cannot, thus I personally struggle to trust the accuracy of their winrates in early parts of the game.
I also feel that AI is going to lack value in terms of instruction until such an approach really exists.
