EdLee wrote:
Quote:
under perfect play the position must either be entirely won or entirely lost, so the true winrate will either be 100% or it will be 0%.
Probably matters little, if at all, to this point: but how do we know perfect play doesn't always lead to no-result (e.g. triple ko, etc.) ?
With area scoring, superko, and half-integer komi (which is the only kind of ruleset that most current bots use), the game must always terminate in a win or a loss. And yes, this is a bit of a distraction from the actual issue.
Bill Spight wrote:
Quote:
That's pretty easy to quantify, but it doesn't seem like an ideal metric.
True enough. But when you see a winrate estimate with 700 playouts and after the next play, which is the bot's first choice, the new winrate estimate differs by 2% with 12,000 playouts, you have to suspect that the margin of error with 700 playouts is at least 2%.
Note that this is still tricky. Consider the case where two moves differ by less than 2%, so you don't trust that difference, but the two estimates are actually highly correlated because they lead into almost the same variations, differing only in one forcing move that changes the territory slightly but doesn't tactically matter. In that case, while the "error" (whatever that means) in each of the two moves is at least 2%, the "error" (whatever that means) in their difference could be far less than 2%, since whatever part of it is correlated will cancel out in the difference.
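To make the correlation point concrete, here is a toy numerical sketch (pure illustration, nothing to do with how any actual bot computes its estimates): give both moves a large shared noise component and a small independent one, and the spread of the difference comes out far smaller than the spread of either estimate on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Both moves lead into almost the same variations, so most of the estimation
# noise is shared between the two evaluations; only a small part is independent.
shared  = rng.normal(0.0, 0.02, n)    # ~2% standard deviation, common to both
noise_a = rng.normal(0.0, 0.003, n)   # small independent part for move A
noise_b = rng.normal(0.0, 0.003, n)   # small independent part for move B

est_a = 0.55 + shared + noise_a       # winrate estimate for move A
est_b = 0.54 + shared + noise_b       # winrate estimate for move B

print("std of estimate A:     %.4f" % est_a.std())            # ~0.020
print("std of estimate B:     %.4f" % est_b.std())            # ~0.020
print("std of the difference: %.4f" % (est_a - est_b).std())  # ~0.004
# Var(A - B) = Var(A) + Var(B) - 2*Cov(A, B): the shared part cancels out.
```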
Bill Spight wrote:
lightvector wrote:
Bill, regarding winrates specifically, when you say you want a margin of error, presumably you are talking about the error in the bot's estimate relative to something. What precisely is that something?
"That's not my department, says Wernher Von Braun." — Tom Lehrer
Color me old-fashioned, but when I come up with an approximate measure, I am interested in its error function.
The winrate is a prediction of the binary outcome of win/loss, as seen statistically in the self-play game data. The problem is that treating the error of a binary outcome prediction (win/loss) as a separate, independent quantity that is intrinsic to the prediction itself is dangerously close to being mathematically incoherent. So you have to tread carefully, because unlike some other areas, where human intuition usually points at something genuinely meaningful even if it may be fantastically hard to make precise and quantify, in this specific area it is sometimes human intuition itself that is the problem.
The straightforward and perhaps-unhelpful answer to your question is that so long as the probability prediction is
well-calibrated* with respect to a player population, then whenever a bot predicts 80%, the "error function" is that 80% of the time it will be predicting too low by 20%, because the game actually was won, and 20% of the time it will be predicting too high by 80%, because the game actually was lost. And then the straightforward answer would say that's it, that's all there is to know regarding the error of that prediction. The percentage itself IS the expression of uncertainty about the game outcome!
(* "well-calibrated" means that among all times of the time the bot says, e.g. 80% in positions randomly drawn from games by those players, indeed about 80% of time the game is then won and 20% of the time the game is then lost. Bot winrates are obviously not well-calibrated with respect to human player populations, but if you have enough games from the desired player population, it is very possible to make it well-calibrated. You just plot the bot winrates among all the positions within those games against the empirical outcome of the games, fit a curve, and then have the bot report what the curve says instead of what it would have said originally).
--------------------------------
To give another analogy - imagine a well-calibrated weather station predicts a 70% chance of rain today in city A, and suppose the city is small enough, or the potential rainclouds big enough, that there is no appreciable chance of rain hitting only part of the city rather than the whole. What is the "error function" on this prediction?
Now, in reality, it either does rain in A or it doesn't. So the 70% isn't a fact about the world, it's a fact about the weather station's own uncertainty about the world. The weather station is not making a prediction of some platonic probability "70%" out there in the world, where that prediction itself would have some additional uncertainty; rather, it's making a prediction of rain or no rain, and "70%" is the expression of its uncertainty about "rain or no rain". So the error in the prediction will be 30% on the 70% of occasions when it rains, and 70% on the 30% of occasions when it doesn't.
In what cases would 80% have been a "better" prediction? If it did in fact rain in A, then it would be better. If it didn't actually rain in A, then it would be worse. A better model might indeed make a prediction of 80% because it recognizes features, which the original model didn't see, that more strongly suggest rain. Or it might make a prediction of 10% or 0% because it sees features, which the original model missed, that make rain extremely unlikely, such cases presumably falling among the 30% of times the original model would have been wrong. In either case, again, the percentage itself already is the expression of the uncertainty and error of that particular model (so long as the model is well-calibrated).
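If you want to see the "the percentage already is the error" claim numerically, here is a toy simulation of the rain example (the numbers are purely illustrative): for a calibrated 70% forecast, the absolute error is 30% on the roughly 70% of days it rains and 70% on the rest, and nothing beyond the 70% itself is needed to know that.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.70                                   # the station's calibrated forecast
rained = rng.random(1_000_000) < p         # it really does rain on ~70% of such days

abs_error = np.where(rained, 1.0 - p, p)   # |forecast - outcome| for each day
print("error on rainy days:", 1.0 - p)     # 0.30, occurs ~70% of the time
print("error on dry days:  ", p)           # 0.70, occurs ~30% of the time
print("mean absolute error: %.3f" % abs_error.mean())   # ~2*p*(1-p) = 0.42
```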
-----------------------------------------
So, when a bot rates a move at 60%, (note: major caveats regarding differences between match play and self play, let's suppose the bot has been well-calibrated to the population of its own *match* play games rather than self-play games) the bot is saying "I'm 60% certain that I will win if I play here, but 40% uncertain about winning if I do so". There's no further "error" to talk about regarding that 60% number. The 60% itself is ALREADY saying that the bot expects to be "wrong" (i.e. in
error) about winning 40% of the time.
So that's sort of what I'm getting at. When you are trying to predict a binary outcome, there is no further error inherent to the prediction itself, since the percentage itself is already the expression of what the probable error is. This is not always intuitive for humans, and even experienced statisticians can get tripped up by it.
Now, while there is no further inherent notion of error, you DO get other notions of "error" when you start talking about comparing the prediction to OTHER statistical averages (like proportions of games won/lost by humans, etc.), or about the way that successive predictions may change over time. Then there IS plenty more to speak of. And of course, much of the above does not apply to score, which is not binary. But for winrate, what notion of error you get is *entirely* a function of what other thing you choose to compare against. And which other thing to compare against, in order to give humans what they want for review, is not completely a scientific question, but in part also a psychology question, a user-education question, and a question of "what do you actually want, it's your free choice what statistics you would personally find most useful", which is why it's difficult to approach.
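As one concrete illustration of "the notion of error depends entirely on what you compare against", here is a small sketch (my own naming and framing, not statistics any bot actually reports) of two such external comparisons: a Brier score of the winrates against the outcomes of whatever game population you choose, and the size of the move-to-move swings in the bot's own estimates over a game.

```python
import numpy as np

def brier_score(winrates, outcomes):
    """Mean squared gap between the bot's winrates and the observed results.

    Only meaningful relative to the population the games came from: the same
    bot scores differently against self-play, bot-vs-bot, or human games.
    """
    winrates = np.asarray(winrates, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((winrates - outcomes) ** 2))

def successive_swings(winrates_by_move):
    """A different, equally legitimate notion of "error": how much the bot's
    own estimate jumps from one move to the next within a single game."""
    w = np.asarray(winrates_by_move, dtype=float)
    return np.abs(np.diff(w))
```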
I hope this helps clear up some of the mathematical trickiness regarding what winrates "mean". Or maybe it makes people more confused.