Suppose you play the bot against itself 100 times and you find that on average it loses by 20 points in some position (winning a few games barely, losing most games by a lot). Suppose that 20 points was precisely what the bot had given as its "final score difference estimate" in that position. Great, right?
Suppose you dig further into the example and determine that actually, if the bot had just played move X, it would lose by only about 4 points - the resulting endgame is stable, and although it's not clear how to play it exactly optimally, it's highly clear that it's not going to vary by more than +/- 1 point under any reasonable line of play. If you had 4 more points, then you'd have 50-50 winning chances playing move X. And the bot also agrees. The *reason* why the bot did not play move X and instead chose Y was that X led to an easy and predictable loss, whereas move Y is a complex and uncertain move that gives some slim winning chances instead of zero, but on average seems to lead to a much bigger loss.
So we have the following state of affairs:
A: In the sense of self-play games -> you are on average down by 20 points (since the bot plans for move Y).
B: In the sense of points you'd need to have 50-50 chances -> you are "down" by only 4 points (since if only you had 4 more points, the position would become fair).
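To make the two numbers concrete, here's a toy sketch in Python - this is not KataGo code, and the probabilities and scores are invented just to match the example above:

    # Move X: stable endgame, lose by 4 +/- 1 (uniform).
    # Move Y: gamble, win by 2 with probability 0.10, else lose by 22.4,
    # so Y averages about -20.

    def win_prob(move, shift):
        # Winning chances if we were handed 'shift' extra points.
        if move == "X":
            # score = -4 + uniform(-1, 1) + shift: ramps from 0 to 1 as
            # shift goes from 3 to 5, crossing 50% exactly at shift = 4
            return min(max((shift - 3) / 2, 0.0), 1.0)
        else:  # "Y": flat 10% until the shift covers the big loss
            return 0.10 if shift < 22.4 else 1.0

    def mean_score(move):
        return -4.0 if move == "X" else 0.10 * 2 + 0.90 * (-22.4)

    # The bot maximizes winning chances, so with no shift it plans Y.
    plan = max(["X", "Y"], key=lambda m: win_prob(m, 0.0))

    # A: average selfplay score difference under the bot's actual plan.
    A = mean_score(plan)  # about -20

    # B: minus the shift at which the best available move reaches 50-50.
    lo, hi = 0.0, 40.0
    for _ in range(60):  # bisect on the shift
        mid = (lo + hi) / 2
        if max(win_prob(m, mid) for m in ("X", "Y")) < 0.5:
            lo = mid
        else:
            hi = mid
    B = -(lo + hi) / 2  # about -4

    print(plan, A, B)  # -> Y, roughly -20 and -4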
If you had to choose just one of A or B to be reported to you, which value would you prefer to have?
I think B is more useful.
Consider: suppose you discovered you had made a macroendgame blunder a couple moves earlier that led to you getting precisely 4 points less than you could have gotten in some particular area, with no other differences - no lingering aji or ko threat differences, same player gets sente, etc. So there is a very intuitive sense in which that blunder loses precisely 4 points. Now, if you had asked the bot prior to that mistake, it would have said:
A: In the sense of self-play games -> you are on average down by 0 points (because now it plans to choose move X, not move Y).
B: In the sense of points you need to have 50-50 chances -> you are down by 0 points (because the game is fair as it is).
If you're using A, then you might get the impression that the blunder "loses 20 points", since before the mistake the estimated difference is 0 and afterward it is -20. If you're using B, then the difference before and after is 4 points, as expected. And if you really did want A, you could just wait until after move Y is played - then B will join A in saying you are down by 20.
So generally it seems to me that B should be more useful, and that differences or changes in B are more likely to be "objective" and consistent with other measures. For example, B should increase/decrease by precisely 1 whenever komi increases/decreases by 1, whereas A in general would not. Passing in the opening should result in B decreasing by precisely 2 * (the bot's believed fair komi), whereas A could decrease by some other value.
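Incidentally, the komi property also suggests one concrete way B could be defined and computed: if you have a value function conditioned on komi (KataGo's net does take komi as an input), then B is just the distance from the current komi to the "fair" komi where the winrate curve crosses 50%, which shifts point-for-point with komi by construction. A hypothetical sketch - "winrate_fn" is a made-up stand-in for such a net, not an actual KataGo API:

    def estimate_B(winrate_fn, position, komi, lo=-150.0, hi=150.0):
        # winrate_fn(position, k) -> Black's winning chances at komi k,
        # assumed to decrease as komi rises. Bisect for the fair komi
        # where the curve crosses 50%.
        for _ in range(60):
            mid = (lo + hi) / 2
            if winrate_fn(position, mid) > 0.5:
                lo = mid  # Black still favored, so fair komi is higher
            else:
                hi = mid
        fair_komi = (lo + hi) / 2
        # Black's B-style lead: raising the actual komi by 1 changes
        # this by exactly 1, which is the consistency property above.
        return fair_komi - komi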
Now, just like A, it is not possible to always estimate B perfectly - the neural net will still be imperfect. And, as I mentioned in my previous post, even if the neural net is trained to estimate B, the fact that MCTS is layered on top will introduce some "A-like" behavior into the result regarding short-term plans. So in the above case, if the MCTS actually sees far enough to want to play Y, rather than the neural net merely anticipating a "Y-like" move further in the future, then even a B-trained KataGo will say -20 points.
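For a picture of how that happens: the score a search reports at the root is (roughly) a visit-weighted average of the children's estimates, and the search piles visits onto whichever move it likes best for winrate. A simplified sketch, not KataGo's actual tree code, reusing the numbers from the example:

    # Each child carries the net's (winrate, B-style score) estimates.
    children = {
        "X": {"winrate": 0.00, "score": -4.0},   # predictable 4-point loss
        "Y": {"winrate": 0.10, "score": -20.0},  # slim hope, big avg loss
    }

    # The search strongly prefers Y for winrate, so Y soaks up most visits.
    visits = {"X": 50, "Y": 950}
    total = sum(visits.values())

    root_score = sum(visits[m] * children[m]["score"] for m in children) / total
    print(root_score)  # -19.2: nearly Y's -20, even though every leaf
                       # estimate was a "B"-style number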
But it seems to me that moving towards estimating B should be more useful in general. Even if MCTS will be "A-like" in the short term, it should be helpful to get rid of the "A-like" behavior in the part of the estimate that the neural net anticipates long into the future, beyond what MCTS can see. The long-term part is usually the part that's actually having the impact, particularly in the opening. For example, KataGo says passing in the opening is -20 points instead of -14 points (14 ~= 2x KataGo's believed fair komi) not because the MCTS actually sees a short-term plan to lose 6 points in the future to improve its winning chances, but because the neural net anticipates giving up on average 6 points to improve winning chances way off in the future, presumably during midgame fighting.
So, my thought is to try to make KataGo estimate B instead. I could also continue estimating A, but it would be extra overhead in the search to carry both around, so my inclination is to just drop A once we have B. Unless people think it should keep reporting both? Thoughts?
xela wrote:
"At the cost of several percent of selfplay efficiency' -- you mean it will take slightly longer to train this model, but you don't expect a significant impact on playing strength either way?
Well, taking longer to train *is* an impact on playing strength - it means that for any fixed amount of training, it would end up weaker by the amount corresponding to having trained a few percent less long. But yeah, so long as my setup works well at all, I don't think it will cost too much more than that.