Frankly, I'm entirely unconvinced that the AI's "winrates" and probability have anything to do with one another. There's been a lot of fuzzy talk about it in this thread, and since I'm confused I'd like to descend to the nitty-gritty. If it's not too dry, I'd like to try to define rigorously, and without probability, what Bill is talking about, so that he or anyone else can show me precisely where we differ in opinion (if we do). Feel free to skip this post if you hate math.
I think it's fair to model every AI as an effectively reproducible function f: X -> Y, where X is the set of all legal board states [note that this is finite] and Y is roughly the interval [0,1] (call 0 an opponent win and 1 an AI win for the sake of discussion). If f(x) > 0.5, the AI believes that with perfect play it wins; if f(x) < 0.5, the AI believes its opponent wins. Beyond that, the value is some sort of metric of the AI's confidence, which seems to give more extreme values in positions where we humans are confident as well. We'd like to be able to take two different confidence evaluations and call the magnitude of their difference significant, so that choosing to play the lower-rated variation over the other is certainly a mistake.
Define g: X -> X, g(x) = x U k, where k is the legal move maximizing f(x U k). This function just plays the move that the AI says is best.
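To make this concrete, here's a minimal Python sketch of g. Everything in it is invented for illustration: the "game" is a tiny hand-made tree and the table F stands in for the AI's evaluation f; nothing comes from a real engine.

```python
# Each position maps to its legal successor positions (a toy tree).
MOVES = {
    "root": ["a", "b"],
    "a": [], "b": [],
}

# F stands in for f: the AI's evaluation (1 = AI win, 0 = opponent win).
F = {"root": 0.6, "a": 0.7, "b": 0.4}

def g(x):
    """Play the move the AI rates highest: argmax of f over legal successors."""
    return max(MOVES[x], key=lambda k: F[k])

print(g("root"))  # the AI prefers "a" (0.7) over "b" (0.4)
```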
Then define h: X -> X as perfect play against the AI - I'm going to assume perfect understanding of the AI, which would be possible given instant access to f(x) for every x. Let h(x) = x U j, where j is a legal move [not necessarily noticed by the AI] chosen so that the worst boardstate x' the AI would then play towards - worst by the AI's own assessment f(x') - is reachable by playing j. Since I trust the AI to always count points correctly after the endgame, and any working trap must be sprung eventually, there is no other mysterious better sequence that beats the AI further. Therefore, for any position x, the alternating compound g(h(g(h(...(x))))) produces the game that leads the AI to encounter, at some point, the position it views most dimly but would still play into repeatedly from x. Obviously, if the AI is losing, this state will have value 0; but if the AI is winning, the value may only ever dip to 0.6, 0.8, etc. before eventually climbing back up to 1.
Consider the function w: X -> X that outputs the exact boardstate where f(w(x)) is the minimal assessment reachable through the compound g(h(g(h(...(x))))). If we're only interested in the next "stage" of the game, we could limit ourselves to some finite number of applications of g(h(...)) and redefine accordingly. We could even limit ourselves to exactly the depth of the AI's search tree. I'm not sure exactly what Bill wants here.
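Here's a toy Python sketch of h and w under the same kind of made-up setup: a small hand-built tree with the opponent (h) and the AI (g) alternating, and a table F standing in for f. All names and numbers are invented, and I've arbitrarily let the opponent move first in this toy.

```python
MOVES = {
    "x0": ["x1", "x2"],          # opponent (h) to move
    "x1": ["x3"], "x2": ["x4"],  # AI (g) to move
    "x3": [], "x4": [],
}
F = {"x0": 0.8, "x1": 0.7, "x2": 0.75, "x3": 0.6, "x4": 0.9}

def g(x):
    """AI plays the successor it rates highest."""
    return max(MOVES[x], key=lambda k: F[k])

def lowest_on_ai_line(x):
    """Worst f the AI will reach from x if left to play itself out."""
    worst = F[x]
    while MOVES[x]:
        x = g(x)
        worst = min(worst, F[x])
    return worst

def h(x):
    """Adversary: steer toward the line whose AI continuation bottoms out lowest."""
    return min(MOVES[x], key=lowest_on_ai_line)

def w(x):
    """The position of minimal f along the alternating h/g game from x."""
    worst = x
    to_move = h  # opponent moves first in this toy
    while MOVES[x]:
        x = to_move(x)
        to_move = g if to_move is h else h
        if F[x] < F[worst]:
            worst = x
    return worst

print(w("x0"), F[w("x0")])  # the worst point of the forced line is x3 at 0.6
```

Note that h picks x1 even though the AI rates x2 slightly lower right now, because the continuation through x1 bottoms out at 0.6 while x2 recovers to 0.9 - which is exactly the "trap outside the AI's evaluation" idea.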
Our goal is to find some number M > 0 that we'll call a "margin of error", independent of boardstate, such that for any two positions x, y in X with f(x) > 0.5 and f(y) > 0.5, [f(x) - f(y) > M] => [f(w(x)) >= f(w(y))].
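One way to read this definition operationally: any pair with f(x) > f(y) but f(w(x)) < f(w(y)) is a counterexample for every M below its gap, so the smallest workable M is the largest such gap. A standalone sketch, assuming we've already tabulated f(x) and f(w(x)) for a few winning positions (all numbers invented):

```python
EVALS = {            # position: (f(x), f(w(x)))
    "p1": (0.95, 0.80),
    "p2": (0.90, 0.50),
    "p3": (0.70, 0.55),
    "p4": (0.60, 0.40),
}

def smallest_margin(evals):
    """Smallest M such that f(x) - f(y) > M forces f(w(x)) >= f(w(y))."""
    gaps = [
        fx - fy
        for fx, fwx in evals.values()
        for fy, fwy in evals.values()
        if fx > fy and fwx < fwy   # ordering of f disagrees with ordering of f(w(.))
    ]
    return max(gaps, default=0.0)

print(round(smallest_margin(EVALS), 2))  # here the p2/p3 pair forces M >= 0.2
```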
The intent of this statement is that the margin of error distinguishes the 'comfortability' of winning moves. Restricting attention to 'winning' positions is, I think, sensible: knowing how to be less likely to lose is opponent-based, as opposed to knowing how to be more likely to win. A difference above the margin of error tells us that, against a 'perfect' opponent, either the AI can't handle either position, or the better position is winning while the other is not, or the position with the higher value goes less wrong when well handled.
It's likely that this value M isn't constant, and actually depends on the values of f(x) and f(y). Bill appears to suggest that in the range [0.5, 0.75] a reasonable value for M is around 0.04, at least for one particular AI that he studied in depth.
I'm not quite so sure about that. The very position this thread started with showed a position x for which f(x) and f(h(x)) already differed by more than 0.04. It seems likely that a game of the AI against our elusive perfect player h would involve many similar traps being laid for the AI, most of them outside its own evaluation. Perhaps what we'd really like is to consider a strategic game, where the AI is never surprised by h(x), merely a little outplayed.
To do this, we can consider an augmented f'(x) such that for any position where the AI misses a tesuji that h(x) finds [that is: f(w(g(x))) < f(w(h(x)))], the AI is forced to analyze the tesuji until it reevaluates (so that f(h(x)) > f(g(x)), redefining g(x) for that x). Then the normal search process is extrapolated to all possible preceding boardstates, so that the AI no longer misses any tesuji, but retains its signature fluid evaluation.
If we similarly define a margin of error for this augmented AI - M' > 0 such that for all x, y in X with f'(x) > 0.5 and f'(y) > 0.5, [f'(x) - f'(y) > M'] => [f'(w(x)) >= f'(w(y))] - then M' = 0.04 actually seems pretty reasonable to me. The AI is bound to make some strategic errors, but not an overwhelming number of them. This seems excessively difficult to actually evaluate, though - perhaps a method such as my process for finding the move the AI missed could assist for any given position x, but that seems really arduous.
Talking about a move-by-move 'margin of error' [more precisely, max(f'(g(x)) - f'(x)) over x in X] seems quite difficult to pin down, as it's almost entirely dependent on the AI missing something. Perhaps you could do it for positions in some subset of X lacking the critical sequences that the AI ignores? I got the impression that this is what you're interested in, but it seems very difficult to quantify.
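For what it's worth, the quantity itself is trivial to compute once you have the evaluations in hand. A toy sketch, assuming the AI's evaluation and chosen move have been tabulated for a few positions (all names and numbers invented):

```python
F = {"x0": 0.60, "x1": 0.68, "x2": 0.66, "x3": 0.70}
G = {"x0": "x1", "x2": "x3"}   # x -> the AI's chosen successor g(x)

# A jump in the AI's own evaluation immediately after its own move is
# exactly the "the AI missed something a move ago" signal: with a
# self-consistent evaluation, f'(g(x)) would equal f'(x).
move_margin = max(F[G[x]] - F[x] for x in G)
print(round(move_margin, 2))  # the x0 -> x1 jump of 0.08 dominates
```

The hard part, of course, is everything this sketch assumes away: producing F and G, and certifying that the chosen subset of X really lacks the sequences the AI ignores.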
Note that I still haven't talked about probability once, and at this point I don't see a need for it. If you can show that my model is incomplete without probability, then please, show me. But as far as I understand, this talk of Bayesian logic and whatnot is conflating the concept of AI 'winrate' - which is mostly just an arbitrary number that its programmers at Google once gave bounds of 0 and 1 - with actual player-vs-player winrates, which to me are entirely separate things.