John Fairbairn wrote:
The most significant point to cover first is the komi. AI bots are generally trained on a 7.5-point komi, and this badly affects the reliability of their assessments of no-komi games. I personally don't know why, but the Elf team, when they were adding AI commentary to all the GoGoD games, told me this was a major point; results in the early fuseki, however, are probably not too badly affected.
Before looking at the Elf commentaries on old no-komi games I had the same impression as the Elf team about the early fuseki. I figured that at that stage of the game the lack of komi would affect the winrate estimates, but probably not the ranking of plays. But when I took a look, I was struck by how often White took winrate losses right off the bat. OC, this was in line with what I had been taught, that White would often make objectively inferior plays in order to complicate the play. The prime example was approaching a Black corner stone at move 4, instead of playing in the open corner, to prevent Black from making a good enclosure. What surprised me was how low a winrate White would accept. Sometimes White would get a winrate of 25% or less in the opening. Since the winrate was predicated on a 7.5 komi, that meant that White was, in effect, adding significantly to the expected number of points by which Black was ahead.
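To give a rough sense of what a 25% winrate means in point terms, here is a toy sketch. It assumes a simple logistic relation between expected point margin and winrate; the shape of the curve and the SCALE constant are my own assumptions for illustration, not anything Elf or Golaxy actually uses.
[code]
# Toy model only: assume winrate is a logistic function of the expected
# point margin for White (komi already included). SCALE is a guess.
import math

SCALE = 6.0  # assumed points per logistic unit; purely illustrative

def winrate_from_margin(margin_for_white):
    """Toy winrate for White given an expected margin (White minus Black), komi included."""
    return 1.0 / (1.0 + math.exp(-margin_for_white / SCALE))

def margin_from_winrate(winrate_for_white):
    """Invert the toy model: effective point margin implied by a winrate."""
    return SCALE * math.log(winrate_for_white / (1.0 - winrate_for_white))

implied = margin_from_winrate(0.25)  # about -6.6 pts under SCALE = 6.0
print(f"A 25% winrate implies about {implied:+.1f} pts for White (komi counted),")
print(f"i.e. roughly {implied - 7.5:+.1f} pts on the board before the 7.5 komi.")
[/code]
Under this toy model, accepting a 25% winrate while receiving a 7.5 komi is like conceding something on the order of 14 pts. on the board, which is the sense in which I mean White was adding to Black's expected lead.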
Quote:
The overall picture was that Jowa did not make any serious booboos. Intetsu made a couple, but most moves that were not rated best by the computer were close to the best or could be adjudged either simply slack moves or deliberately risky moves - in both cases (as Ohashi takes pains to demonstrate) based on positional evaluations and explainable by psychology. That does not mean the human evaluations were correct, but they were at least rational.
I think one factor is the shared assumptions of the human players. We have seen that a bot trained on self-play can have blind spots because, over the course of its evolution, its opponent, a slight variant of itself, made the same misjudgements it did. Over infinite time these blind spots will disappear, but meanwhile they exist. The same thing happens with a community of human players. Even though the players are not clones of each other, they share assumptions about plays and the evaluation of positions. This is apparent in the Elf commentaries, where one top player will make a play that loses, say, 10% in Elf's winrate estimate, and then the opponent turns around and returns the favor on the next move, losing, say, 11%.

Another factor, I think, is the difference in skill between players. If a 19th-century pro 8 dan took White against a pro 5 dan, their chances were roughly equal, or perhaps White had a slight advantage. Estimating Black to be 6 pts. ahead would be way off. Furthermore, the 8 dan would probably have a good idea of the kinds of mistakes the 5 dan would make. As yet, bots have not been trained with these factors in mind. (But, as I have heard, a neural net chess engine has been trained specifically against Stockfish, and does well against it. The future of AI will be interesting.)
Quote:
(Incidentally there was a case a little later where Black's winning ratio was 61.1% but territorially White was ahead by 0.2 points. Ohashi admits this is hard to understand.)
IMO, not enough research has been done into the error functions of bots. Meanwhile, the ordering of winrate estimates and territory estimates is probably more reliable than the actual figures. It is plain from the above example that Golaxy does not derive one estimate from the other. One or the other is wrong, probably both. By how much, we don't know. (Just my guess, but with 5 million playouts per move I would trust the territory estimate more. Why? Because, as the program gets better, its winrate swings become greater. In the limit they approach 100%, since a slight error in territorial terms could make the difference between winning and losing. {For instance, in one of my problems a mistake of 1/64 pt. loses the game.}) The opposite is true for the territory estimates, which should converge as the program improves its evaluations.
Edit: I take back my confidence in the territory estimate, because [lightvector] has pointed out, below, that the data could be skewed by large wins by White. IMHO the proper territory estimate should not be the average of the data but the median, i.e., the value which divides the scores closest to 50-50, like komi. The 0.2 pt. territory value for White may well be the average, not the median, which could be a few points for Black.
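To make the mean-versus-median point concrete, here is a minimal sketch with invented margins (from White's point of view). The only claim is the statistical one; the numbers are made up and say nothing about Golaxy's actual data.
[code]
# Sketch of why the mean and the median of a score distribution can disagree,
# assuming (hypothetically) that we had final margins from many playouts.
import statistics

# Mostly narrow Black wins, plus a few very large White wins
# (e.g. lines in which a Black group dies). Invented for illustration.
margins_for_white = [-3, -2, -2, -1, -1, -1, 0.5, 1, 40, 55]

mean_margin = statistics.mean(margins_for_white)     # pulled up by the big wins
median_margin = statistics.median(margins_for_white) # the komi-like "fair" value

print(f"mean   = {mean_margin:+.2f}")   # +8.65: says White is ahead on average
print(f"median = {median_margin:+.2f}") # -1.00: the typical outcome favors Black
[/code]
A small average lead for White is thus perfectly compatible with Black being ahead in the typical continuation, which is why I would want the median rather than the mean as the territory estimate.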
Quote:
It has already been pointed out on this forum that there are several cases where a bot does not even list a particular move in its top N moves, but when that move is actually played the win ratio barely changes. The same thing seems to happen with Golaxy.
With weaker bots or fewer playouts we see the same thing. We even see cases where the new play is an improvement on the bot's top choice. That is why I would like an analyst bot that makes a broader search than a player bot does. Since we do not have any error function for winrate estimates, the number of playouts behind each option is a good proxy. It is plain that with today's bots a winrate estimate based on only 1,000 playouts for that option (versus many more playouts for the move actually chosen) is unreliable.
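As a rough illustration of why playout counts are a useful proxy, the sketch below treats each playout as an independent coin flip at the option's true winrate. Real MCTS playouts are correlated and guided by the policy net, so this only gives an order-of-magnitude feel for the sampling noise, not an actual error function.
[code]
# Rough sketch: sampling noise of a winrate estimate from N playouts,
# under the simplifying assumption of independent Bernoulli playouts.
import math

def winrate_standard_error(winrate, playouts):
    """Approximate one-standard-error noise of a winrate from N playouts."""
    return math.sqrt(winrate * (1.0 - winrate) / playouts)

for n in (100, 1_000, 100_000, 5_000_000):
    se = winrate_standard_error(0.5, n)
    print(f"{n:>9} playouts: winrate 50% +/- {100 * se:.2f}% (1 std. err.)")
[/code]
On this crude reckoning an option searched with only 1,000 playouts carries noise on the order of 1-2%, easily enough to mis-order it relative to moves that received hundreds of times more playouts.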
Quote:
As an example of human evaluation vs computer evaluation consider this position:
{snip}
However, Jowa and Golaxy differed in their choice of reply. Jowa chose to force at A and then lived in the corner. Golaxy preferred to fight the ko with B and came up with the following line of play that gave the position where it thought it had gained 3.6 points. For a human even to just feel that White had made a gain here is surely problematical - just too much left up in the air. Jowa's (slightly) inferior move may well be regarded as correct for a human.
In all the comments I have read on the new AI style of play, many people have pointed out the new kinds of moves (e.g. high shoulder hits), there have been insightful characterisations of the style (e.g. an emphasis, very early on, on causing the opponent to be overconcentrated), and there have been new words (e.g. the tiger enclosure). But nowhere have I seen anything that suggests that humans have even begun to get a grip on how to evaluate positions such as the second one above. Everything seems to indicate humans are still satisfied (because they have to be) with Jowa's kind of response.
Well, White has gained around 13 pts. in the bottom right, plus something on the left side and some outside strength in the top right. In addition, White did not have a secure position in the top right corner to start with. An influence-oriented human might well be satisfied with the outcome, or even prefer it.
As for human evaluations, it goes back to shared assumptions. Tomorrow's top pros, who will have grown up with the bots, will share more assumptions with the bots than even today's top humans do. I think that the next 20 years will be exciting times for human go.
