Life In 19x19
http://www.lifein19x19.com/

“Decision: case of using computer assistance in League A”
http://www.lifein19x19.com/viewtopic.php?f=10&t=15538
Page 7 of 36

Author:  Uberdude [ Mon Apr 02, 2018 8:01 am ]
Post subject:  Re: “Decision: case of using computer assistance in League A

Just to go back to Carlo, I thought I'd work out his performance rating for this season's PGETC. He had great results for a 4d:
- beat Andrey Kulkov 6d (Russia) by 1.5
- beat Ondrej Kruml 5d (Czechia) by 2.5
- beat Dragos Bajenaru 6d (Romania) by resign
- beat Reem Ben David 4d (Israel) by resign *** the famous 98% game
- lost to Mero Csaba 6d (Hungary) by 2.5
- beat Mijodrag Stankovic "5d" 3d by resign
- lost to Andrij Kravets 7d/1p by 7.5

At the start of the season (1st September) Carlo's rating was 2381 [very similar to mine]; this was after picking up 50 points at the EGC. Of course his true strength could have been higher than that, and could have grown since then too, but his rating lagged. His performance rating (using the EGD GoR calculator), taking current ratings of opponents, is 2629, or +248.

How does that compare to other good performances?

Forum regulars may remember I beat Victor Chow 7d from South Africa a few years ago. UK were in league C for the 2014/15 season and my initial rating was 2361. My results were:
- beat Petrauskas 3d (Lithuania) by resign
- beat Chow 6/7d (South Africa) by 0.5
- beat Ganeyev 3k (Kazakhstan) by resign.
As I had no losses, my performance rating with the "adjust until input = output" method is infinite; anchoring with a loss to a 2700 gives 2666, and anchoring with a loss to a 2800 gives 2719. So +300-ish, with big uncertainty as there were no losses and few games; the only useful information is that I beat a 2616 in one game, and how flukey was that?
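For anyone curious, the "adjust until input = output" method is easy to sketch in code. The win-probability curve below is a generic Elo-style logistic, not the exact EGD GoR formula, and the opponent ratings in the example are placeholders (apart from the 2616 mentioned above), so the numbers are purely illustrative.

Code:
# Rough sketch of an "adjust until input = output" performance rating.
# NOTE: the win-probability model is a generic Elo-style logistic curve,
# not the real EGD GoR formula, so treat the output as illustrative only.

def expected_score(rating, opponent, scale=400.0):
    """Expected score (0..1) of `rating` against `opponent` (assumed logistic model)."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / scale))

def performance_rating(results, lo=1000.0, hi=3500.0, tol=1e-3):
    """results: list of (opponent_rating, score), score 1 = win, 0 = loss.
    Bisect for the rating whose expected total score equals the actual score.
    Returns None for an all-win (or all-loss) record, where it is unbounded."""
    actual = sum(score for _, score in results)
    if actual == 0 or actual == len(results):
        return None  # infinite / undefined, as in the unbeaten case above
    while hi - lo > tol:
        mid = (lo + hi) / 2
        expected = sum(expected_score(mid, opp) for opp, _ in results)
        if expected < actual:
            lo = mid  # candidate rating too low: expected score below actual
        else:
            hi = mid
    return (lo + hi) / 2

# Example: anchoring an unbeaten record with a hypothetical loss to a 2700.
# Opponent ratings other than the 2616 are placeholders, not real EGD values.
results = [(2300, 1), (2616, 1), (1800, 1), (2700, 0)]
print(round(performance_rating(results)))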

Last season Daniel on the UK team had no losses; this season he had just one:
- beat Rasmusson 4d (Denmark)
- beat Karadaban 5d (Turkey)
- beat Welticke 6d (Germany)
- lost to Lin 6d (Austria)
Initial rating was 2402. Performance rating 2616 (+214).
If you include the wins from the previous season as well (which included some 5ds, and for which his initial rating was 2262, though he probably wasn't much weaker then than he is now), you get a performance rating of 2677 (+415).

Author:  Uberdude [ Mon Apr 02, 2018 12:18 pm ]
Post subject:  Re: “Decision: case of using computer assistance in League A

Time for more Leela similarity data! I analysed Carlo's first game against Andrey Kulkov. His ranking is 6d, but his rating is only 2538 these days, a strong 5d. He won the European Championship in 2001 aged 19, peaked around 2650 in the early 2000s, and has been stable around 2550 for the last few years.

Using Leela 0.11 at 50k nodes, for similarity with the top 3 choices* Carlo scored 43/50 matches = 86% and Andrey scored 40/50 = 80%. Using the stricter top 1 choice, the similarity was 31/50 = 62% for Carlo and 34/50 = 68% for Andrey. So Carlo beat a rusty 6d while playing less Leela-like than a 3d I beat, and matching fewer of Leela's #1 moves than his opponent. To me this is not indicative of cheating with Leela, but simply of him playing well.

* and within 5% of the top 1; this was relevant a few times when they played very bad 2nd choices. E.g. Leela thinks Carlo's a9 was a huge mistake despite being her 2nd choice: she wanted to capture at a5 and take sente to invade the right side instead of saving a few more stones in gote along the first line.
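For reference, the counting itself is trivial once the candidates are extracted from the engine; something like the rough sketch below (the per-position data structure and the example numbers are made up, not Leela's actual output format).

Code:
# Sketch of the matching metrics used above, assuming the engine analysis has
# already been extracted into a list of candidates per analysed position.
# Each entry: (move_played, [(candidate_move, winrate_percent), ...]) with
# candidates sorted best-first. Data structure and numbers are hypothetical.

def match_rates(positions, top_n=3, within=5.0):
    """Return (top_n_and_within_rate, top_1_rate) over the given positions."""
    top_n_hits = top_1_hits = 0
    for played, candidates in positions:
        best_winrate = candidates[0][1]
        if played == candidates[0][0]:
            top_1_hits += 1
        # the "top 3 and within 5% of the top choice" rule
        good = {m for m, wr in candidates[:top_n] if best_winrate - wr <= within}
        if played in good:
            top_n_hits += 1
    n = len(positions)
    return top_n_hits / n, top_1_hits / n

# Toy example with made-up data for two positions:
positions = [
    ("Q16", [("Q16", 52.1), ("R16", 51.8), ("C3", 49.0)]),
    ("K4",  [("C7", 55.0), ("K4", 48.5), ("D7", 54.2)]),  # K4 is top 3 but >5% worse
]
print(match_rates(positions, top_n=3, within=5.0))  # -> (0.5, 0.5)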

Author:  BlindGroup [ Mon Apr 02, 2018 2:29 pm ]
Post subject:  Re: “Decision: case of using computer assistance in League A

Bartleby wrote:
I still think 98 percent is really high. Although confirmatory evidence may be weak in general, at some point that becomes no longer true. If a player had a 100 percent match rate over an entire game would this not be highly suspect? 98 per cent is quite close to 100 per cent.


Uberdude wrote:
Yup. It's difficult. If we were already in a position where it's accepted 10% or something of people are cheating online then I'd be happier with much weaker evidence to convict someone, "on the balance of probabilities" level (as for civil cases in English law). But if we are still in the cheating is rare world (maybe I'm being naive) then stronger "beyond reasonable doubt" (criminal law) evidence is needed.


I think that there are two issues missing from this discussion:

1. I completely agree that our priors on the number of cheaters should factor into the analysis, but from a statistical perspective, an equally crucial element is the variance (or better yet, distribution) of match rates. For example, if most high-level dans match consistently enough with Leela within a very narrow band (say 85-90%), then even a small deviation from the band could be meaningful. Unfortunately, the only way to calculate the necessary information is to run the analysis on a fairly large database of games at that level. The necessary size would obviously depend on the observed distribution, but that could be anywhere from a few hundred to a few thousand games. That would be quite a bit of work, but it may be necessary if the goal is to better understand the distribution of Leela match rates.

2. The decision about guilt or innocence is always going to be nothing more than an inference. All evidence can do is shift the probabilities around. Even if we had a video of someone entering scores from Leela during a match, there is always a chance that the video was doctored or that the player had an unknown twin... One builds evidence until the probability of "innocent" explanations becomes sufficiently small that one is comfortable with the resulting number of mistakes. Innocent people will unfortunately always be convicted thanks to the law of large numbers. The question for a given decision process is whether or not it (A.) is sufficiently likely to convict people who did cheat and (B.) is sufficiently unlikely to falsely convict those who didn't. One can choose to prioritize A or B, but one always comes at the expense of the other.

And as Bill noted, the priority placed on these two ends can depend on the penalties involved -- criminal cases in some legal traditions prioritize B with the idea that punishing innocent people should be stringently avoided, while in civil cases (that do not involve jeopardy to life or liberty) the preference shifts slightly to A. In go, it may be that the "cost" of being falsely punished for cheating in largely anonymous online play by having one's account closed could be considered sufficiently low that the community might be willing to accept a high emphasis on A (e.g. accepting that a number of people will be wrongly convicted in order to ensure that a large number of cheaters are caught). While in tournaments where players are known and reputations are in jeopardy, we might want to prioritize B quite a bit more.
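To put toy numbers on both points (the prior on cheating and the distribution of match rates), a back-of-the-envelope Bayes calculation might look like the sketch below. Every probability in it is invented for illustration; the real ones are exactly what we don't have.

Code:
# Toy Bayesian update: how suspicious is a >=98% top-3 match rate?
# Every number below is an assumption made up for illustration.

p_cheat = 0.01                # prior: fraction of players cheating online (guess)
p_high_given_honest = 0.002   # chance an honest strong player hits >=98% in a game (guess)
p_high_given_cheat = 0.30     # chance a Leela cheater hits >=98% in a game (guess)

# Bayes' rule: P(cheat | high match) = P(high | cheat) * P(cheat) / P(high)
p_high = p_high_given_cheat * p_cheat + p_high_given_honest * (1 - p_cheat)
p_cheat_given_high = p_high_given_cheat * p_cheat / p_high
print(f"P(cheat | >=98% match) = {p_cheat_given_high:.2f}")
# With these made-up numbers the posterior is only ~0.60: far from
# "beyond reasonable doubt", and it moves a lot if any input changes.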

Author:  Bill Spight [ Mon Apr 02, 2018 4:30 pm ]
Post subject:  Re: “Decision: case of using computer assistance in League A

BlindGroup wrote:
I think that there are two issues missing from this discussion:

1. I completely agree that our priors on the number of cheaters should factor into the analysis, but from a statistical perspective, an equally crucial element is the variance (or better yet, distribution) of match rates. For example, if most high-level dans match consistently enough with Leela within a very narrow band (say 85-90%), then even a small deviation from the band could be meaningful. Unfortunately, the only way to calculate the necessary information is to run the analysis on a fairly large database of games at that level. The necessary size would obviously depend on the observed distribution, but that could be anywhere from a few hundred to a few thousand games. That would be quite a bit of work, but it may be necessary if the goal is to better understand the distribution of Leela match rates.


That's an important point. But also, if we are going to decide the question of cheating by match rates, we need to know the distribution of matches, given cheating. That we do not have, and nobody has proposed, except, perhaps, 100% of matches with one of Leela's top three choices.

If we are going to convict somebody based upon matches to Leela's choices, we are always (Edit: almost always) going to have more evidence that the player plays like Leela than that the person is cheating.

As for the CIT incident, I argue that :w44: is evidence that Triton did not cheat, but it is Leela's second choice. If you accept my argument, then you cannot simply take agreement with Leela as evidence of cheating. But it seems that we currently have no statistical definition of cheating except playing like some bot, Leela in this case. That's not acceptable.

Author:  BlindGroup [ Mon Apr 02, 2018 5:46 pm ]
Post subject:  Re: “Decision: case of using computer assistance in League A

Bill Spight wrote:
That's an important point. But also, if we are going to decide the question of cheating by match rates, we need to know the distribution of matches, given cheating. That we do not have, and nobody has proposed, except, perhaps, 100% of matches with one of Leela's top three choices.


That is indeed the "rub". As someone noted above regarding the sumo wrestling paper, there is a large economics literature on identifying "cheating" in various contexts -- sumo matches, standardized tests, attendance taking at school, penalty calling in basketball games, etc. In all of these instances, it is impossible to identify cheaters ex ante, and so it's impossible to construct a model of observed behavior conditional on someone cheating. The way this literature gets around this is by identifying relationships that are very unlikely to exist absent some sort of deliberate cheating. The more of these relationships one identifies, the more the evidence "piles up", making it much more likely that cheating is involved.

The problem with this approach is that it can identify whether or not, on average, cheating is occurring, but it cannot tell us whether or not cheating occurred in any specific instance. It would even be difficult to estimate the odds of cheating for a given instance. Unfortunately, it is also very difficult to identify these relationships. (That's why they get published in top academic journals!)

In the context of go, one would need to identify specific points in a game in which one would expect a cheater to correspond more closely with Leela than someone who wasn't. So, this is consistent with the discussion about eliminating joseki play and forcing moves -- even a non-cheater will correspond with Leela there. To get separation in behavior you need a situation in which a non-cheating player is unlikely to correspond to Leela while a cheating player would be very likely to correspond. So, if you were looking at the games of 4d's, an example might be a move that 4d's are unlikely to identify on their own, but where Leela estimates a very significant difference in winning probabilities between her first and second best moves. Essentially looking for instances in which there is the 4d equivalent of the "god move" or the "ear reddening move". (In the previous discussion, this is analogous to the debate about whether or not a move was "normal" for a given person's level of play.)

Having said this, it occurs to me that we do have a wealth of information available on how non-cheaters play in all of the pre-Leela game records. So, in principle, one could search those games and try to identify the odds of different moves conditional on rank. (I'm loosely thinking of how estimating the probability of moves served as the first stage for deciding which moves to evaluate in AlphaGo.) For a given ranked player, one could then identify the "odds" of a specific move using that prior data, looking for instances in which Leela finds a significantly better move than a player of that rank is likely to find on their own. I feel like this is the best we will be able to do, but the other big issue is the claim that the availability of Leela has changed the playing styles of non-cheaters. If so, then our store of past games would be less useful.
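As a rough sketch of both ideas combined, one could filter for positions where Leela sees a big winrate gap between her first and second choices while her policy network (standing in, for now, for a proper rank-conditioned move model built from old game records) gives the top move a low probability. The data structure and thresholds here are made up.

Code:
# Sketch: find "separating" positions where a cheater and a non-cheater are
# likely to diverge. Assumes each analysed position has been extracted as
# (move_number, [(move, winrate_percent, policy_prob), ...]) sorted best-first.
# Thresholds, and the use of the policy prior as a stand-in for "moves a player
# of that rank would find", are assumptions, not an established method.

def separating_positions(positions, min_gap=8.0, max_policy=0.05):
    """Return (move_number, move) pairs where the engine's best move is both
    much better than the second choice (at least min_gap winrate points) and
    unlikely for a human to find unaided (policy probability <= max_policy)."""
    hits = []
    for move_no, candidates in positions:
        if len(candidates) < 2:
            continue
        (best, best_wr, best_policy), (_, second_wr, _) = candidates[0], candidates[1]
        if best_wr - second_wr >= min_gap and best_policy <= max_policy:
            hits.append((move_no, best))
    return hits

# A player who keeps finding these moves is more interesting than one who
# simply matches Leela on ordinary shape moves and forcing sequences.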

Bill Spight wrote:
If we are going to convict somebody based upon matches to Leela's choices, we are always (Edit: almost always) going to have more evidence that the player plays like Leela than that the person is cheating.


Agreed. Unless one can distinguish between someone playing "like" Leela and someone copying Leela, we're stuck. My sense though is that while we can't know how someone cheating with Leela would play, we can characterize the play of people we know not to be cheating. Although this may require generating new data rather than just using what we already have available. At one extreme, one could have a large number of people play serious games while being observed. It's an expensive option, but it proves that the problem is at least tractable.

If I had to guess, I think these kinds of numerical solutions are going to be best suited to policing regular online games because (a) the amount of data available to predict "normal" play is large and (b) the consequences of an automated rule-based system are limited to forcing someone to create a new account. For in-person games or online tournaments, I think this sort of system will never work given the severity of the consequences, and so the focus will have to be on preventing cheating by making it more difficult.

Author:  Javaness2 [ Tue Apr 03, 2018 7:23 am ]
Post subject:  Re: “Decision: case of using computer assistance in League A

I think you have to be careful about qualifying that as evidence.
We already know that Leela was not used to choose ranked suggestions 1 to 3 all the time.
Is there a suggestion that Leela was used to pick moves only within a certain number of goodness-metric units? (%win is not really %win afaik)
Or is the suggestion that Leela was just used a lot, in some manner that hasn't been specified? That is an accusation which is harder to defend against.
Reading reddit, I had the feeling that there might be some degree of "this move looks weird, are you using a bot?" accusation at play.

Sensationalist though this whole CIT scandal is, it is remarkably low on detail.

Author:  Calvin Clark [ Tue Apr 03, 2018 9:00 am ]
Post subject:  Re: “Decision: case of using computer assistance in League A

Bill Spight wrote:
As for the CIT incident, I argue that :w44: is evidence that Triton did not cheat, but it is Leela's second choice.


Just for yuks, I checked CrazyStone Pro on my Pixel. It would have played :w44: with 15 seconds of thinking and 46804 playouts.

Author:  Bill Spight [ Tue Apr 03, 2018 9:25 am ]
Post subject:  Re: “Decision: case of using computer assistance in League A

If you are going to use a bot to cheat, then you are going to play more like that bot than you usually do. It does not follow that you are going to make a lot of plays like the bot, however. If we are going to use bot-like play as evidence of cheating, we need to know how cheaters use bot-like play. But we don't.

I don't think that anybody has taken a look at cheating by losers. Just because you cheat doesn't mean that you will win. ;) In fact, if you have gotten yourself into dire straits in a game, then you might well be tempted to cheat to try to get yourself out of the hole you are in. (To mix metaphors. ;)) I eagerly await cheating allegations against the player who lost a game. {Tongue in cheek. ;)} But seriously, if we are going to study cheating -- and I think we should --, we should not confine ourselves to looking at the winners.

Edit: Actually, some chess players have looked at losses for evidence of cheating. The main method seems to be to find bonehead losing plays that are one space away from correct plays. The idea, OC, is that the cheater made a small error in the location of the play. The key, also, is that the losing play is not just an error, but one that is uncharacteristic of the player's play when supposedly cheating. Again, this is disconfirmatory evidence. :)

Author:  Uberdude [ Tue Apr 03, 2018 12:42 pm ]
Post subject:  Re: “Decision: case of using computer assistance in League A

I also analysed Carlo's 3rd game, a win by resign against Dragos Bajenaru 6d (2589) of Romania, the strongest player he beat. Dragos was an insei and peaked around 2670 in the late 2000s, and has been stable just under 2600 for the last few years. Intro up to move 50: Carlo as white played double 3-3s to Dragos's double 4-4s; I didn't like Dragos's shoulder-hit joseki from move 5; Carlo 3-3 invaded the top right AlphaGo style (Leela doesn't even consider it); all very natural moves from Carlo until f15 surprised me (Leela #11). Dragos tried to make territory on the lower side with an n3 e3 combo, but it didn't work as there were two open doors in. Carlo's k4 was the only move in the first 50 which could be considered slightly suspicious: that area is clearly the place to play, but a few nearby points were plausible and k4 was Leela's #1. By move 50 Leela gives white a 69% win and I agree black's opening was bad: he made no moyo with his 4-4s and white's territory strategy is working well.

Using the same Leela top 3 and within 5% metric for moves 50-149, Carlo scored 39/50 = 78% and Dragos scored 37/50 = 74%. Using the stricter top 1 matching, Carlo scored 30/50 = 60% and Dragos 25/50 = 50%. They both matched most of the moves up to 96, but at that point Dragos started the endgame and Carlo followed him, whereas Leela saw a powerful shape attack at p7/m6 that Carlo never played nor Dragos defended against, so their matching went down for a while until play moved to that area over 30 moves later.


Author:  Uberdude [ Tue Apr 03, 2018 12:57 pm ]
Post subject:  Re: “Decision: case of using computer assistance in League A

Here is a summary of my findings in a handy table (does our phpBB support tables?). Carlo's numbers are highlighted with an asterisk.

Code:
+-----------------+------+----------------+------+---------+---------+---------+---------+
|      Black      | Rank |     White      | Rank | B top 3 | W top 3 | B top 1 | W top 1 |
+-----------------+------+----------------+------+---------+---------+---------+---------+
| Carlo Metta     | 4d   | Reem Ben David | 4d   |    * 98 |      80 |    * 72 |      54 |
| Andrey Kulkov   | 6d   | Carlo Metta    | 4d   |      80 |    * 86 |      68 |    * 62 |
| Dragos Bajenaru | 6d   | Carlo Metta    | 4d   |      74 |    * 78 |      50 |    * 60 |
| Andrew Simons   | 4d   | Jostein Flood  | 3d   |      80 |      88 |      54 |      62 |
+-----------------+------+----------------+------+---------+---------+---------+---------+


My hope is the referees would have something like this in their report with many more rows, but for now this is all we have to go on (plus the claimed 70-80% for top 3 in an unspecified number of Carlo's offline games). Doesn't look strong evidence of Leela cheating to me.

Author:  Bonobo [ Tue Apr 03, 2018 3:05 pm ]
Post subject:  Re: “Decision: case of using computer assistance in League A

Uberdude wrote:
My hope is the referees would have something like this in their report with many more rows, but for now this is all we have to go on (plus the claimed 70-80% for top 3 in an unspecified number of Carlo's offline games). Doesn't look strong evidence of Leela cheating to me.
Are you in contact with Pandanet and/or EGF about this, especially with the referee Jonas Egeberg? I imagine that they could appreciate your expertise …

Author:  bernds [ Tue Apr 03, 2018 6:39 pm ]
Post subject:  Re: “Decision: case of using computer assistance in League A

Uberdude wrote:
Doesn't look strong evidence of Leela cheating to me.
I'd still say the 98% case is a very odd outlier, although the metric "matches top three moves" is not a very good one. In chess they have things like "average centipawn loss" which would correspond to an average drop in win rate, and number of blunders (moves which are worse than a computer move by a certain amount). The problem with both of these is that they become meaningless in a winning position, which is achievable in Go if you have the computer play your opening, or if you've studied enough with the computer that you know how to wreck a human in the fuseki.
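For what it's worth, the go analogue of those chess metrics is straightforward to compute once you have per-move winrates out of the engine; a rough sketch (with a hypothetical input format) is below, with the same caveat that a big lead compresses all the numbers.

Code:
# Sketch of "average winrate loss" and a blunder count, the go analogues of
# average centipawn loss. Input is a hypothetical list of the engine's winrate
# (for the player to move) before and after each of that player's moves.

def winrate_loss_stats(move_winrates, blunder_threshold=10.0):
    """move_winrates: list of (winrate_before_move, winrate_after_move) in percent,
    both from the moving player's point of view. Returns (average loss, blunders)."""
    losses = [max(0.0, before - after) for before, after in move_winrates]
    blunders = sum(1 for loss in losses if loss >= blunder_threshold)
    return sum(losses) / len(losses), blunders

# Example with made-up numbers: two small losses and one 12-point blunder.
print(winrate_loss_stats([(50.0, 48.0), (48.0, 47.5), (52.0, 40.0)]))  # -> ~(4.8, 1)
# Caveat from above: once a player is at 90%+ the losses are tiny whatever
# they play, so the metric says little about a game decided in the opening.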

I've now run the game through both Leela and her sister Zero, and they disagree fairly significantly about what's going on (which kind of invalidates the theory that there were only ordinary moves that everyone would play). Leela does indeed think this game was played nearly perfectly from a certain point - very few disagreements from a winrate-loss standpoint. However - and this I find significant - there were some disagreements in the fuseki phase, which is where I'd expect people to be able to cheat most easily without attracting attention. For example, the 3-3 invasion at C17 is not something Leela suggests. Leela Zero does want to play it, but it wants to play it a move earlier, instead of R6, and then it has some disagreements about the exact sequence. So this would suggest it's the choice of a human who has seen computers play 3-3 early and often but is not currently analyzing the position with a machine.

In the latter half you can find moments where choice A from Leela is something a human wouldn't play, and the human did not play it.

So at this point I don't know what to think. The agreement with Leela in the latter phase of the game is really odd, but if the player was using an engine, he wasn't doing it to get an advantage out of the opening. Zero doesn't think the game is going in Black's favour until move 70 (and actually thinks N16 at move 118 was catastrophic and gives White the advantage again, which I'm not sure I believe). Leela sees Black's win rate drop steadily from move 16 onwards, but thinks White made a big mistake at move 33. Then the win rate for Black drops back below 50% at move 38 with the kind of safety-first slack move that I recognize from reviewing my own play with the computer. Leela also dislikes N16, but only rates it a 4%-ish drop, which might be acceptable if one knew one was winning anyhow. But the move played in the game is a much more natural shape move which I would expect a human to play in that position.

Both engines really want to play N9 as a forcing move at some point after move 141 (differing slightly on timing), and I would find it very hard to resist making that move if I saw it suggested by an engine. When it's pointed out to you it's clearly an important move. Leela doesn't make much of it, but Zero thinks not playing it is a big blunder, for several moves in a row.

Given that only one game out of four was odd, and there is no clear trend in the first 40 moves (which there very likely would be if an engine was playing), I am inclined to give the benefit of the doubt on this one, going against my initial reaction.

Author:  Javaness2 [ Tue Apr 03, 2018 10:13 pm ]
Post subject:  Re: “Decision: case of using computer assistance in League A

bernds wrote:
I'd still say the 98% case is a very odd outlier, although the metric "matches top three moves" is not a very good one. In chess they have things like "average centipawn loss" which would correspond to an average drop in win rate, and number of blunders (moves which are worse than a computer move by a certain amount). The problem with both of these is that they become meaningless in a winning position, which is achievable in Go if you have the computer play your opening, or if you've studied enough with the computer that you know how to wreck a human in the fuseki.


I think using a top 3 metric is perhaps pointless; I don't know why a cheat would be imagined to use just the top 3. Can you produce a winrate-metric or goodness-metric report for each game? I seem to remember that Leela had such a chart. By the by, the fuseki used by Carlo seems to have caught some of the players off guard.

Author:  Uberdude [ Tue Apr 03, 2018 10:18 pm ]
Post subject:  Re: “Decision: case of using computer assistance in League A

Working on the premise a 98% top 3 match is good evidence of Leela cheating (which I don't believe), do the results from Carlo's other 2 games mean we accept he wasn't using Leela then? In that case we are saying he can play well enough to beat two 6 dans without Leela in rounds 1 and 3, and then used Leela to help him beat a 4d in round 4. That's odd.

A possible suggestion/implication when this story broke was he used Leela in his other games this season (in which he had more impressive wins against stronger opponents than Reem) and they would have high match rates too. I, for one, considered this. That they don't is important relevant information the referees did not release (maybe they didn't know, but this should have been part of the investigation).

Author:  Bill Spight [ Wed Apr 04, 2018 2:55 am ]
Post subject:  Re: “Decision: case of using computer assistance in League A

Uberdude wrote:
Working on the premise a 98% top 3 match is good evidence of Leela cheating (which I don't believe), do the results from Carlo's other 2 games mean we accept he wasn't using Leela then? In that case we are saying he can play well enough to beat two 6 dans without Leela in rounds 1 and 3, and then used Leela to help him beat a 4d in round 4. That's odd.


Passing strange. OC, one could claim that he didn't need to cheat as much in the other games, for some reason. The real point, however, is that matching Leela's choices is not a good metric for cheating.

Or if you think that it is the right metric, you should assume that Carlo cheated in all his games against 4 dans and higher-rated players in the tournament, and measure matching over all of them, not just one.

Quote:
A possible suggestion/implication when this story broke was he used Leela in his other games this season (in which he had more impressive wins against stronger opponents than Reem) and they would have high match rates too. I, for one, considered this. That they don't is important relevant information the referees did not release (maybe they didn't know, but this should have been part of the investigation).


To quote myself:
Bill Spight wrote:
The decision rests upon one game?????

That may be enough to require a replay or throw the result out. But confirmatory evidence is weak. For disciplinary action I would want evidence from at least 10 games.


As Regan points out, the key question for accusations of cheating is whether the accused player played better than he did without cheating. Uberdude's research on the other games against stronger players bears on that point. Carlo was playing well in that tournament without matching Leela more than usual. If the referees had had a clue about how to go about investigating allegations of cheating they would have looked for evidence of cheating in the other games that Carlo played in the same tournament. Those game records were readily available. That may sound harsh, but in their defense, investigating cheating in online go is terra incognita. Still, drawing conclusions from one game is absurd. And if they did look at the other games in the tournament, matching Carlo's play with Leela's choices, they should have drawn the opposite conclusion: not guilty.

Author:  Bill Spight [ Wed Apr 04, 2018 3:10 am ]
Post subject:  Re: “Decision: case of using computer assistance in League A

As a Bayesian, I suppose that I should be pleased that so many people believe in confirmatory evidence. Bayesians do, frequentists do not. This evening I took a look at a review of some chess games from a chess scandal. https://www.youtube.com/watch?v=cx0nurp-mpM There were, to me, some awesome tactics in those games. I was a bit dismayed, having read about Regan's work, that the reviewer was obsessed with the similarity of the accused player's play to a particular version of Houdini, which he was running as he reviewed the games. Sometimes he had to wait for a while until Houdini's search elevated the play in question to the top one or two choices. But it just goes to show that most people's belief in confirmatory evidence is too strong. They don't know how weak it is. {sigh}

bernds wrote:
Uberdude wrote:
Doesn't look strong evidence of Leela cheating to me.
I'd still say the 98% case is a very odd outlier, although the metric "matches top three moves" is not a very good one. In chess they have things like "average centipawn loss" which would correspond to an average drop in win rate, and number of blunders (moves which are worse than a computer move by a certain amount). The problem with both of these is that they become meaningless in a winning position, which is achievable in Go if you have the computer play your opening, or if you've studied enough with the computer that you know how to wreck a human in the fuseki.

I've now run the game through both Leela and her sister Zero, and they disagree fairly significantly about what's going on (which kind of invalidates the theory that there were only ordinary moves that everyone would play).


I don't know if that's the theory, but the disagreement between the Leela sisters shows that, at present, we do not have enough agreement between AI players about the value of specific plays to rate them, as they do in chess. Once we can rate plays, then we can say, oh, this player chose a play that is above his normal skill level. Or we could say, in this game he played many fewer blunders than usual. We are not matching his play to the AI's choices; we are comparing it to his usual play. It's disconfirmatory evidence, which is what we want. :)
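In code terms, that disconfirmatory comparison would look at a suspect game against the same player's own baseline rather than against the bot. A crude sketch, with made-up data and an arbitrary threshold:

Code:
# Sketch of "compare him to his usual play": per-game average winrate loss for
# the suspect game versus the same player's earlier games. The crude z-score is
# purely illustrative; with so few games per player a real test would need
# something more careful.

from statistics import mean, stdev

def unusually_good(suspect_game_loss, baseline_losses, z_cutoff=2.5):
    """baseline_losses: average winrate losses from the player's other games.
    Returns True if the suspect game is improbably clean for this player."""
    mu, sigma = mean(baseline_losses), stdev(baseline_losses)
    if sigma == 0:
        return False
    z = (mu - suspect_game_loss) / sigma
    return z >= z_cutoff

# Example with made-up numbers: a player who usually loses ~3 winrate points
# per move suddenly losing only 0.4 per move.
print(unusually_good(0.4, [2.8, 3.5, 2.9, 3.3, 4.1]))  # -> True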

Author:  Pio2001 [ Wed Apr 04, 2018 4:03 am ]
Post subject:  Re: “Decision: case of using computer assistance in League A

I believe that some players review their games with Leela. I sometimes do, just to look for blunders.

If this kind of review is done by one or two players for every one of their games, then eventually, they are going, one day or another, to find some examples of sequences (moves 50 to 150, for example) that are not far from the software choice (among the three top choices, for example).

This would obviously be "cherry picking".

Author:  Uberdude [ Wed Apr 04, 2018 4:29 am ]
Post subject:  Re: “Decision: case of using computer assistance in League A

Another Leela similarity analysis, this time with a player who has studied with Leela, Daniel from the UK team. I chose his game against Dutch 6/5d Geert Groenen from last year, during his Leela period (see his journal entry on the game at forum/viewtopic.php?p=215737#p215737). I thought he was trying to play a solid Leela style so would match a lot, but he didn't. That was probably because the endgame started really early (about move 100) and there were a lot of non-matches, sometimes because even choice #10 has a similar win% to #1, but often because Leela identified mistakes I agree with, e.g. both players playing slack lines. Something I found notable in this analysis was how often the humans, particularly Geert, played moves Leela considered mistakes but which had the highest policy network probability; i.e. Leela is really good at predicting moves humans will play, often locally good shape (the old Dutch high dans are known for this style) but missing that something else, often a tenuki, is better (at least in Leela's view; she's not a super-strong oracle to believe unquestioningly, but I tended to agree with her).

Using the Leela top 3 and within 5% metric for moves 50-149, Daniel (white) scored 33/50 = 66% and Geert scored 37/50 = 74%. Using the stricter top 1 matching, Daniel scored 23/50 = 46% and Geert 20/50 = 40%. Daniel won by 4.5. Added to the table, with the winner in brackets:

Code:
+-----------------+------+----------------+------+---------+---------+---------+---------+
|      Black      | Rank |     White      | Rank | B top 3 | W top 3 | B top 1 | W top 1 |
+-----------------+------+----------------+------+---------+---------+---------+---------+
| [Carlo Metta]   |  4d  | Reem Ben David |  4d  |    * 98 |      80 |    * 72 |      54 |
| Andrey Kulkov   |  6d  | [Carlo Metta]  |  4d  |      80 |    * 86 |      68 |    * 62 |
| Dragos Bajenaru |  6d  | [Carlo Metta]  |  4d  |      74 |    * 78 |      50 |    * 60 |
| [Andrew Simons] |  4d  | Jostein Flood  |  3d  |      80 |      88 |      54 |      62 |
| Geert Groenen   |  5d  | [Daniel Hu]    |  4d  |      74 |      66 |      40 |      46 |
| [Ilya Shikshin] |  1p  | Artem Kachan.  |  1p  |      56 |      76 |      38 |      60 |
| [Andrew Simons] |  4d  | Victor Chow    |  7d  |      84 |      76 |      44 |      44 |
| Cornel Burzo    |  6d  | [A. Dinerstein]|  3p  |      74 |      66 |      40 |      48 |
| Jonas Welticke  |  6d  | [Daniel Hu]    |  4d  |      54 |      64 |      34 |      42 |
| [Park Junghwan] |  9p  | Lee Sedol      |  9p  |      74 |      64 |      64 |      38 |
| Lothar Spiegel  |  5d  | [Daniel Hu]    |  4d  |      66 |      58 |      48 |      42 |
| Gilles v.Eeden  |  6d  | [Viktor Lin]   |  6d  |      82 |      70 |      56 |      46 |
+-----------------+------+----------------+------+---------+---------+---------+---------+


P.S. Whilst it's interesting to analyse with Leela Zero (and comparing differences between bots is valuable), at the point the game in question against Reem was played (November 2017) she was still a kyu player (my my, they grow up so fast!) so not much good for cheating.

Edit: I also analysed the El Clasico of European go, Ilya Shikshin 1p vs Artem Kachanovskyi 1p. These players are quite possibly stronger than Leela 0.11 on 50k nodes. So not matching could mean they are playing better rather than worse moves than Leela. As expected the more territorial and orthodox Artem was more similar than creative fighter Ilya. This was also, I think, the first game I analysed to feature a ko (which makes a lot of obvious matches for taking the ko, but also threats can differ). Top 3 match was 38/50 = 76% for Artem and 28/50 = 56% for Ilya, top 1 was 30/50 = 60% for Artem and 19/50 = 38% for Ilya.

Edit 2: Also did my game vs Victor Chow 7d from a few years ago as another example of a weaker player scoring an upset against a stronger one with a solid style. I played well in the opening and middlegame and got a good lead (but only won by half a point when he turned on super endgame and I was under time pressure after move 150). For over 50 moves of the game Leela really wanted me to invade the left side at c7, which I was aware of, but as I was leading against a 7d I knew was a strong fighter, I didn't invade there, to avoid complications I could well mess up. This was responsible for a lot of my failed matches with Leela's top 1 (often still top 3, but a few times not), plus of course some straight-out mistakes from both of us.

Edit 3: And Cornel Burzo 6d vs Alexander Dinerstein 3p. Cornel has an elegant honte style, whilst Dinerstein is territorial and led the whole time with a territory lead and ways into Cornel's flaky centre. As with the Kulkov and Groenen games, the player with the highest top 3 match wasn't the same as the one with the highest top 1 match.

Edit 4: And Daniel vs Jonas Welticke. Jonas is known for crazy openings and a weird style, which he showed here by opening on the sides, with only a 25% win after 50 moves. As expected his wacky moves didn't match much. Daniel played solidly and matched a lot, except that Leela got confused by a simple semeai and so wanted to play stupidly. Also, despite having won the semeai already, in calm positions Leela wanted to keep playing the semeai rather than some profitable move elsewhere (but Daniel was winning by so much that maybe he could essentially pass and still win).

Edit 5: First pro game. My expectation was pros might score lower matching against Leela than us mid-high amateur dans as they are much stronger and could be playing unexpected better moves. I chose Park Junghwan and Lee Sedol's last game at some festival. Park is a fairly conventional player, whilst Lee is more creative, so I expected Park to match more. Park did match more, but they were both similar to us amateurs. Maybe Leela is stronger than I realised. Leela did not expect the moves which made me feel "Wow, cool pro moves" (often tenuki), but she did better than I did (with brief thinking) in predicting the contact fighting.

Edit 6: Another of Daniel's from last year, vs Lothar Spiegel 5d from Austria, who is a fairly sensible player. Lots of matching during long but joseki-ish middle-game invasions, but also misses from mistakes, and both players overlooking an important sente exchange for a while (f11/g10).

Edit 7: Gilles van Eeden 6d (classic good shape Dutch 6d) vs Viktor Lin 6d. Most mismatches were due to a ko fight, and a few disagreements in early yose. Going into yose Leela gave Gilles 77% win, but this looks like a misunderstanding of his dead group at top left: if I played out a few more moves to make it clearly dead then the win% collapsed to 57%. In the end he lost by 2.5.

Author:  sybob [ Wed Apr 04, 2018 2:53 pm ]
Post subject:  Re: “Decision: case of using computer assistance in League A

Bill Spight wrote:
Still, drawing conclusions from one game is absurd.

Huh?
The issue is: did he cheat IN THIS GAME.
That's what he is accused of. What happened in other games does not matter.

Author:  jeromie [ Wed Apr 04, 2018 3:11 pm ]
Post subject:  Re: “Decision: case of using computer assistance in League A

sybob wrote:
Bill Spight wrote:
Still, drawing conclusions from one game is absurd.

Huh?
The issue is: did he cheat IN THIS GAME.
That's what he is accused of. What happened in other games does not matter.


That’s only true if you consider the likelihood of cheating in one game to be independent of cheating in other games AND you think there is nothing to learn from a player’s performance in other games. But that’s probably not true.

At the very least, a person’s general level of play adds some important data. If I were to suddenly start beating dan level players on KGS after a long period of stable play as a 3 kyu, you’d have good grounds to be suspicious of my improvement.
