EdLee wrote:
Quote:
under perfect play the position must either be entirely won or entirely lost, so the true winrate will either be 100% or it will be 0%.
Probably matters little, if at all, to this point: but how do we know perfect play doesn't always lead to no-result (e.g. triple ko, etc.) ?
With area scoring, superko, and half-integer komi (which is the only kind of ruleset that most current bots use), the game must always terminate in a win or a loss. And yes, this is a bit of a distraction from the actual issue.
Bill Spight wrote:
Quote:
That's pretty easy to quantify, but it doesn't seem like an ideal metric.
True enough. But when you see a winrate estimate with 700 playouts and after the next play, which is the bot's first choice, the new winrate estimate differs by 2% with 12,000 playouts, you have to suspect that the margin of error with 700 playouts is at least 2%.
Note that this is still tricky. Consider the case where two moves differ by less than 2%, so you don't trust that difference, but the two estimates are actually highly correlated because they lead into almost the same variations, differing only in one forcing move that changes the territory slightly but doesn't tactically matter. In that case, while the "error" (whatever that means) in each of the two moves is at least 2%, the "error" (whatever that means) in their difference could be far less than 2%, since whatever part of it is correlated will cancel out in the difference.
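To make the correlation point concrete, here is a toy numerical sketch (pure illustration, nothing to do with how any actual bot computes its estimates): give both moves a large shared noise component and a small independent one, and the spread of the difference comes out far smaller than the spread of either estimate on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Both moves lead into almost the same variations, so most of the estimation
# noise is shared between the two evaluations; only a small part is independent.
shared  = rng.normal(0.0, 0.02, n)    # ~2% standard deviation, common to both
noise_a = rng.normal(0.0, 0.003, n)   # small independent part for move A
noise_b = rng.normal(0.0, 0.003, n)   # small independent part for move B

est_a = 0.55 + shared + noise_a       # winrate estimate for move A
est_b = 0.54 + shared + noise_b       # winrate estimate for move B

print("std of estimate A:     %.4f" % est_a.std())            # ~0.020
print("std of estimate B:     %.4f" % est_b.std())            # ~0.020
print("std of the difference: %.4f" % (est_a - est_b).std())  # ~0.004
# Var(A - B) = Var(A) + Var(B) - 2*Cov(A, B): the shared part cancels out.
```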
Bill Spight wrote:
lightvector wrote:
Bill, regarding winrates specifically, when you say you want a margin of error, presumably you are talking about the error in the bot's estimate relative to something. What precisely is that something?
"That's not my department, says Wernher Von Braun." — Tom Lehrer
Color me old-fashioned, but when I come up with an approximate measure, I am interested in its error function.
The winrate is a prediction of the binary outcome of win/loss, as seen statistically in the self-play game data. The problem is that treating the error of a binary outcome prediction (win/loss) as a separate, independent quantity that is intrinsic to the prediction itself is dangerously close to being mathematically incoherent. So you have to tread carefully, because unlike some other areas, where human intuition usually points at something genuinely meaningful even if it may be fantastically hard to make precise and quantify, in this specific area it is sometimes human intuition itself that is the problem.
The straightforward and perhaps-unhelpful answer to your question is that so long as the probability prediction is
well-calibrated* with respect to a player population, then whenever a bot predicts 80%, the "error function" is that 80% of the time it will be predicting too low by 20%, because the game actually was won, and 20% of the time it will be predicting too high by 80%, because the game actually was lost. And then the straightforward answer would say that's it, that's all there is to know regarding the error of that prediction. The percentage itself IS the expression of uncertainty about the game outcome!
(* "well-calibrated" means that among all times of the time the bot says, e.g. 80% in positions randomly drawn from games by those players, indeed about 80% of time the game is then won and 20% of the time the game is then lost. Bot winrates are obviously not well-calibrated with respect to human player populations, but if you have enough games from the desired player population, it is very possible to make it well-calibrated. You just plot the bot winrates among all the positions within those games against the empirical outcome of the games, fit a curve, and then have the bot report what the curve says instead of what it would have said originally).
--------------------------------
To give another analogy - imagine a well-calibrated weather station predicts a 70% chance of rain today in city A, and suppose the city is small enough, or the potential rainclouds big enough, that there is no appreciable chance of rain hitting only part of the city rather than the whole. What is the "error function" on this prediction?
Now, in reality, it either does rain in A or it doesn't. So the 70% isn't a fact about the world, it's a fact about the weather station's own uncertainty about the world. The weather station is not making a prediction of some platonic probability "70%" out there in the world, where that prediction itself would have some additional uncertainty; rather, it's making a prediction of rain or no rain, and "70%" is the expression of its uncertainty about "rain or no rain". So the error in the prediction will be 30% on the 70% of occasions when it rains, and 70% on the 30% of occasions when it doesn't.
In what cases would 80% have been a "better" prediction? If it did in fact rain in A, then it would be better. If it didn't actually rain in A, then it would be worse. A better model might indeed make a prediction of 80% because it recognizes features, which the original model didn't see, that more strongly suggest rain. Or it might make a prediction of 10% or 0% because it sees features, which the original model missed, that make rain extremely unlikely, such cases presumably falling among the 30% of times the original model would have been wrong. In either case, again, the percentage itself already is the expression of the uncertainty and error of that particular model (so long as the model is well-calibrated).
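If you want to see the "the percentage already is the error" claim numerically, here is a toy simulation of the rain example (the numbers are purely illustrative): for a calibrated 70% forecast, the absolute error is 30% on the roughly 70% of days it rains and 70% on the rest, and nothing beyond the 70% itself is needed to know that.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.70                                   # the station's calibrated forecast
rained = rng.random(1_000_000) < p         # it really does rain on ~70% of such days

abs_error = np.where(rained, 1.0 - p, p)   # |forecast - outcome| for each day
print("error on rainy days:", 1.0 - p)     # 0.30, occurs ~70% of the time
print("error on dry days:  ", p)           # 0.70, occurs ~30% of the time
print("mean absolute error: %.3f" % abs_error.mean())   # ~2*p*(1-p) = 0.42
```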
-----------------------------------------
So, when a bot rates a move at 60%, (note: major caveats regarding differences between match play and self play, let's suppose the bot has been well-calibrated to the population of its own *match* play games rather than self-play games) the bot is saying "I'm 60% certain that I will win if I play here, but 40% uncertain about winning if I do so". There's no further "error" to talk about regarding that 60% number. The 60% itself is ALREADY saying that the bot expects to be "wrong" (i.e. in
error) about winning 40% of the time.
So that's sort of what I'm getting at. When you are trying to predict a binary outcome, there is no further error inherent to the prediction itself, since the percentage itself is already the expression of what the probable error is. This is not always intuitive for humans, and even experienced statisticians can get tripped up by it.
Now, while there is no further inherent notion of error, you DO get other notions of "error" when you start talking about comparing the prediction to OTHER statistical averages (like proportions of games won/lost by humans, etc.), or about the way that successive predictions may change over time. Then there IS plenty more to speak of. And of course, much of the above does not apply to score, which is not binary. But for winrate, what notion of error you get is *entirely* a function of what other thing you choose to compare against. And which other thing to compare against, in order to give humans what they want for review, is not completely a scientific question, but in part also a psychology question, a user-education question, and a question of "what do you actually want, it's your free choice what statistics you would personally find most useful", which is why it's difficult to approach.
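As one concrete illustration of "the notion of error depends entirely on what you compare against", here is a small sketch (my own naming and framing, not statistics any bot actually reports) of two such external comparisons: a Brier score of the winrates against the outcomes of whatever game population you choose, and the size of the move-to-move swings in the bot's own estimates over a game.

```python
import numpy as np

def brier_score(winrates, outcomes):
    """Mean squared gap between the bot's winrates and the observed results.

    Only meaningful relative to the population the games came from: the same
    bot scores differently against self-play, bot-vs-bot, or human games.
    """
    winrates = np.asarray(winrates, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((winrates - outcomes) ** 2))

def successive_swings(winrates_by_move):
    """A different, equally legitimate notion of "error": how much the bot's
    own estimate jumps from one move to the next within a single game."""
    w = np.asarray(winrates_by_move, dtype=float)
    return np.abs(np.diff(w))
```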
I hope this helps clear up some of the mathematical trickiness regarding what winrates "mean". Or maybe it makes people more confused.