Suppose you play the bot against itself 100 times and you find that on average it loses by 20 points in some position (winning a few games barely, losing most games by a lot). Suppose that 20 points was precisely what the bot had given as its "final score difference estimate" in that position. Great, right?
Suppose you dig further into the example and determine that actually, if the bot had just played move X, it would lose by only about 4 points - the resulting endgame is stable, and although it's not clear how to play it exactly optimally, it's highly clear that it's not going to vary by more than +/- 1 point under any reasonable line of play. If you had 4 more points, then you'd have 50-50 winning chances playing move X. And the bot also agrees. The *reason* why the bot did not play move X and instead chose Y was that X led to an easy and predictable loss, whereas move Y is a complex and uncertain move that gives some slim winning chances instead of zero, but on average seems to lead to a much bigger loss.
So we have the following state of affairs:
A: In the sense of self-play games -> you are on average down by 20 points (since the bot plans for move Y).
B: In the sense of points you'd need to have 50-50 chances -> you are "down" by only 4 points (since if only you had 4 more points, the position would become fair).
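To make the two numbers concrete, here's a toy sketch in Python - this is not KataGo code, and the probabilities and scores are invented just to match the example above:

    # Move X: stable endgame, lose by 4 +/- 1 (uniform).
    # Move Y: gamble, win by 2 with probability 0.10, else lose by 22.4,
    # so Y averages about -20.

    def win_prob(move, shift):
        # Winning chances if we were handed 'shift' extra points.
        if move == "X":
            # score = -4 + uniform(-1, 1) + shift: ramps from 0 to 1 as
            # shift goes from 3 to 5, crossing 50% exactly at shift = 4
            return min(max((shift - 3) / 2, 0.0), 1.0)
        else:  # "Y": flat 10% until the shift covers the big loss
            return 0.10 if shift < 22.4 else 1.0

    def mean_score(move):
        return -4.0 if move == "X" else 0.10 * 2 + 0.90 * (-22.4)

    # The bot maximizes winning chances, so with no shift it plans Y.
    plan = max(["X", "Y"], key=lambda m: win_prob(m, 0.0))

    # A: average selfplay score difference under the bot's actual plan.
    A = mean_score(plan)  # about -20

    # B: minus the shift at which the best available move reaches 50-50.
    lo, hi = 0.0, 40.0
    for _ in range(60):  # bisect on the shift
        mid = (lo + hi) / 2
        if max(win_prob(m, mid) for m in ("X", "Y")) < 0.5:
            lo = mid
        else:
            hi = mid
    B = -(lo + hi) / 2  # about -4

    print(plan, A, B)  # -> Y, roughly -20 and -4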
If you had to choose just one of A or B to be reported to you, which value would you prefer to have?
I think B is more useful.
Consider: suppose you discovered you had made a macroendgame blunder a couple moves earlier that led to you getting precisely 4 points less than you could have gotten in some particular area, with no other differences - no lingering aji or ko threat differences, same player gets sente, etc. So there is a very intuitive sense in which that blunder loses precisely 4 points. Now, if you had asked the bot prior to that mistake, it would have said:
A: In the sense of self-play games -> you are on average down by 0 points (because now it plans to choose move X, not move Y).
B: In the sense of points you need to have 50-50 chances -> you are down by 0 points (because the game is fair as it is).
If you're using A, then you might get the impression that the blunder "loses 20 points", since before the mistake the estimated difference is 0 and afterward it is -20. If you're using B, then the difference before and after is 4 points, as expected. And if you really did want A, you could just wait until after move Y is played - then B will join A in saying you are down by 20.
So generally it seems to me that B should be more useful, and that differences or changes in B are more likely to be "objective" and consistent with other measures. For example, B should increase/decrease by precisely 1 whenever komi increases/decreases by 1, whereas A in general would not. Passing in the opening should result in B decreasing by precisely 2 * (the bot's believed fair komi), whereas A could decrease by some other value.
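Incidentally, the komi property also suggests one concrete way B could be defined and computed: if you have a value function conditioned on komi (KataGo's net does take komi as an input), then B is just the distance from the current komi to the "fair" komi where the winrate curve crosses 50%, which shifts point-for-point with komi by construction. A hypothetical sketch - "winrate_fn" is a made-up stand-in for such a net, not an actual KataGo API:

    def estimate_B(winrate_fn, position, komi, lo=-150.0, hi=150.0):
        # winrate_fn(position, k) -> Black's winning chances at komi k,
        # assumed to decrease as komi rises. Bisect for the fair komi
        # where the curve crosses 50%.
        for _ in range(60):
            mid = (lo + hi) / 2
            if winrate_fn(position, mid) > 0.5:
                lo = mid  # Black still favored, so fair komi is higher
            else:
                hi = mid
        fair_komi = (lo + hi) / 2
        # Black's B-style lead: raising the actual komi by 1 changes
        # this by exactly 1, which is the consistency property above.
        return fair_komi - komi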
Now, just like A, it is not possible to always estimate B perfectly - the neural net will still be imperfect. And, as I mentioned in my previous post, even if the neural net is trained to estimate B, the fact that MCTS is layered on top will introduce some "A-like" behavior into the result regarding short-term plans. So in the above case, if the MCTS actually sees far enough to want to play Y, rather than the neural net merely anticipating a "Y-like" move further in the future, then even a B-trained KataGo will say -20 points.
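For a picture of how that happens: the score a search reports at the root is (roughly) a visit-weighted average of the children's estimates, and the search piles visits onto whichever move it likes best for winrate. A simplified sketch, not KataGo's actual tree code, reusing the numbers from the example:

    # Each child carries the net's (winrate, B-style score) estimates.
    children = {
        "X": {"winrate": 0.00, "score": -4.0},   # predictable 4-point loss
        "Y": {"winrate": 0.10, "score": -20.0},  # slim hope, big avg loss
    }

    # The search strongly prefers Y for winrate, so Y soaks up most visits.
    visits = {"X": 50, "Y": 950}
    total = sum(visits.values())

    root_score = sum(visits[m] * children[m]["score"] for m in children) / total
    print(root_score)  # -19.2: nearly Y's -20, even though every leaf
                       # estimate was a "B"-style number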
But it seems to me that moving towards estimating B should be more useful in general. Even if MCTS will be "A-like" in the short term, it should be helpful to get rid of the "A-like" behavior in the part of the estimate that the neural net anticipates long into the future, beyond what MCTS can see. The long-term part is usually the part that's actually having the impact, particularly in the opening. For example, KataGo says passing in the opening is -20 points instead of -14 points (14 ~= 2x KataGo's believed fair komi) not because the MCTS actually sees a short-term plan to lose 6 points in the future to improve its winning chances, but because the neural net anticipates giving up on average 6 points to improve winning chances way off in the future, presumably during midgame fighting.
So, my thought is to try to make KataGo estimate B instead. I could also continue estimating A, but it would be extra overhead in the search to carry both around, so my inclination is to just drop A once we have B. Unless people think it should keep reporting both? Thoughts?
xela wrote:
"At the cost of several percent of selfplay efficiency' -- you mean it will take slightly longer to train this model, but you don't expect a significant impact on playing strength either way?
Well, taking longer to train *is* an impact on playing strength - it means that for any fixed amount of training, it would end up weaker by the amount corresponding to having trained a few percent less long. But yeah, so long as my setup works well at all, I don't think it will cost too much more than that.