“Decision: case of using computer assistance in League A”

Uberdude · Post by **Uberdude** » Mon Apr 02, 2018 8:01 am

Just to go back to Carlo, I thought I'd work out his performance rating for this season's PGETC. He had great results for a 4d:
- beat Andrey Kulkov 6d (Russia) by 1.5
- beat Ondrej Kruml 5d (Czechia) by 2.5
- beat Dragos Bajenaru 6d (Romania) by resign
- beat Reem Ben David 4d (Israel) by resign *** the famous 98% game
- lost to Mero Csaba 6d (Hungary) by 2.5
- beat Mijodrag Stankovic "5d" 3d by resign
- lost to Andrij Kravets 7d/1p by 7.5

At the start of the season in (1st) September Carlo's rating was 2381 [very similar to me], this was after picking up 50 points at the EGC. Of course his true strength could have been more than that and grown since then too but his rating lagged. His performance rating (using EGD GoR calculator), using current ratings of opponents is 2629, or +248.

How does that compare to other good performances?

Forum regulars may remember I beat Victor Chow 7d from South Africa a few years ago. UK were in league C for the 2014/15 season and my initial rating was 2361. My results were:
- beat Petrauskas 3d (Lithuania) by resign
- beat Chow 6/7d (South Africa) by 0.5
- beat Ganeyev 3k (Kazakhstan) by resign.
As I had no losses my performance rating with the "adjust until input = output" method is infinite, anchoring with a loss to 2700 gives 2666, anchoring with loss to 2800 gives 2719. So +300 ish with big uncertainty as no losses and few games, the only useful information is I beat a 2616 in one game, how flukey was that?

Last season Daniel on the UK team had no losses, this season he had just 1:
- beat Rasmusson 4d (Denmark)
- beat Karadaban 5d (Turkey)
- beat Welticke 6d (Germany)
- lost to Lin 6d (Austria)
Initial rating was 2402. Performance rating 2616 (+214).
If you include the wins (included some 5ds) from the previous season (for which his initial rating was 2262 but he probably wasn't much weaker than he is now) as well then you get performance rating of 2677 (+415).

Uberdude · Post by **Uberdude** » Mon Apr 02, 2018 12:18 pm

Time for more Leela similarity data! I analysed Carlo's first game against Andrey Kulkov. His ranking is 6d, but rating only 2538 these days, a strong 5d. He won the European Championship in 2001 aged 19 and peaked around 2650 in the early 2000s, stable around 2550 for last few years.

Using Leela 0.11 at 50k nodes the similarity with top 3 choices* Carlo scored 43/50 matches = 86% and Andrey scored 40/50 = 80%. Using the stricter top 1 choice the similarity was 31/50 = 62% for Carlo and 34/50 = 68% for Andrey. So Carlo beat a rusty 6d playing less Leela-like than a 3d I beat, and fewer of Leela's #1 moves than his opponent. This is not indicative of cheating with Leela to me, but that he played well.

* and within 5% of top 1, this was relevant a few times with them playing very bad 2nd choices. e.g. Leela thinks Carlo's a9 was a huge mistake despite being 2nd choice: it wanted to a5 capture and take sente to invade right side instead of save a few more stones in gote along the first line.

BlindGroup · Post by **BlindGroup** » Mon Apr 02, 2018 2:29 pm

Bartleby wrote: I still think 98 percent is really high. Although confirmatory evidence may be weak in general, at some point that becomes no longer true. If a player had a 100 percent match rate over an entire game would this not be highly suspect? 98 per cent is quite close to 100 per cent.

Uberdude wrote: Yup. It's difficult. If we were already in a position where it's accepted 10% or something of people are cheating online then I'd be happier with much weaker evidence to convict someone, "on the balance of probabilities" level (as for civil cases in English law). But if we are still in the cheating is rare world (maybe I'm being naive) then stronger "beyond reasonable doubt" (criminal law) evidence is needed.

I think that there are two issues missing from this discussion:

1. I completely agree that our priors on the number of cheaters should factor into the analysis, but from a statistical perspective, an equally crucial element is the variance (or better yet distribution) of match rates. For example, if most high level dans match consistently enough with Leela within a very narrow band (say 85-90%), then even a small deviation from the band could be meaningful. Unfortunately, the only way to do calculate the necessary information is to run the analysis on a fairly large database of games at that level. The necessary size would obviously depend on the observed distribution, but that could be anywhere from a few hundred to a few thousand games. That would be quite a bit of work, but it may be necessary if the goal is to better understand the distribution of Leela match rates.

2. The decision about guilt or innocence is always going to be nothing more than an inference. All evidence can do is shift the probabilities around. Even if we had a video of someone entering scores from Leela during a match, there is always a chance that the video was doctored or that the player had an unknown twin... One builds evidence until the probability of "innocent" explanations becomes sufficiently small that one is comfortable with the resulting number of mistakes. Innocent people will unfortunately always be convicted thanks to the law of large numbers. The question for a given decision process is whether or not it (A.) is sufficiently likely to convict people who did cheat and (B.) is sufficiently unlikely to falsely convict those who didn't. One can choose to prioritize A or B, but one always comes at the expense of the other.

And as Bill noted, the priority placed on these two ends can depend on the penalties involved -- Criminal cases in some legal traditions prioritize B with the idea that punishing innocent people should be stringently avoided, while in civil cases (that do not involve jeopardy to life or liberty) the preference shifts slightly to A. In go, it may be that the "cost" of being falsely punished for cheating in general largely anonymous online play by having one's account closed could be considered sufficiently low that the community might be willing to accept a high emphasis on A (e.g. accepting that a number of people will be wrongly convicted in order to ensure that a large number of cheaters are caught). While in tournaments where players are known and reputations are in jeopardy, we might want to prioritize B quite a bit more.

Bill Spight · Post by **Bill Spight** » Mon Apr 02, 2018 4:30 pm

BlindGroup wrote: I think that there are two issues missing from this discussion:

1. I completely agree that our priors on the number of cheaters should factor into the analysis, but from a statistical perspective, an equally crucial element is the variance (or better yet distribution) of match rates. For example, if most high level dans match consistently enough with Leela within a very narrow band (say 85-90%), then even a small deviation from the band could be meaningful. Unfortunately, the only way to do calculate the necessary information is to run the analysis on a fairly large database of games at that level. The necessary size would obviously depend on the observed distribution, but that could be anywhere from a few hundred to a few thousand games. That would be quite a bit of work, but it may be necessary if the goal is to better understand the distribution of Leela match rates.

That's an important point. But also, if we are going to decide the question of cheating by match rates, we need to know the distribution of matches, given cheating. That we do not have, and nobody has proposed, except, perhaps, 100% of matches with one of Leela's top three choices.

If we are going to convict somebody based upon matches to Leela's choices, we are always (Edit: almost always) going to have more evidence that the player plays like Leela than that the person is cheating.

As for the CIT incident, I argue that

is evidence that Triton did not cheat, but it is Leela's second choice. If you accept my argument, then you cannot simply take agreement with Leela as evidence of cheating. But it seems that we currently have no statistical definition of cheating except playing like some bot, Leela in this case. That's not acceptable.

BlindGroup · Post by **BlindGroup** » Mon Apr 02, 2018 5:46 pm

Bill Spight wrote:That's an important point. But also, if we are going to decide the question of cheating by match rates, we need to know the distribution of matches, given cheating. That we do not have, and nobody has proposed, except, perhaps, 100% of matches with one of Leela's top three choices.

That is indeed the "rub". As someone noted above regarding the sumo wrestling paper, there is a large economics literature on identifying "cheating" in various contexts -- sumo matches, standardized tests, attendance taking at school, penalty calling in basketball games, etc. In all of these instances, it is impossible to identify cheaters ex ante, and so, it's impossible construct a model of observed behavior conditional on someone cheating. The way this literature gets around this is by identifying relationships that are very unlikely to exist absent some sort of deliberate cheating. The more of these relationships one identifies, the more the evidence "piles up" making it much more likely the cheating is involved.

The problem with this approach is that it can identify whether or not on average cheating is occurring, but it cannot tell us whether or not cheating occurred in any specific instance. It would even be difficult to estimates odds of cheating for a given instance. Unfortunately, It is also very difficult to identify these relationships. (That's why they get published in top academic journals!)

In the context of go, one would need to identify specific points in a game in which one would expect a cheater to correspond more closely with Leela than someone who wasn't. So, this is consistent with the discussion about eliminating joseki play and forcing moves -- even a non-cheater will correspond with Leela there. To get separation in behavior you need a situation in which a non-cheating player is unlikely to correspond to Leela while a cheating player would be very likely to correspond. So, if you were looking at the games of 4d's, an example might be a move that 4d's are unlikely to identify on their own, but where Leela estimates a very significant difference in winning probabilities between her first and second best moves. Essentially looking for instances in which there is the 4d equivalent of the "god move" or the "ear reddening move". (In the previous discussion, this is analogous to the debate about whether or not a move was "normal" for a given person's level of play.)

Having said this, it occurs to me that we do have a wealth of information available on how non-cheaters play in all of the pre-Leela game records. So, in principle, one could search those games and try to identify the odds of different moves conditional on rank. (I'm loosely thinking about how identifying the probability of moves as the first stage for determining which moves to evaluate for Alphago.) For a given ranked player, one could then identify the "odds" of a specific move using that prior data looking for instances in which Leela finds a significantly better move than that rank of player is likely to find on her own. I feel like this is the best we will be able to do, but the other big issue is that one of the claims is that the availability of Leela has changed the playing styles of non-cheaters. If so, then our store of past games would be less useful.

Bill Spight wrote:If we are going to convict somebody based upon matches to Leela's choices, we are always (Edit: almost always) going to have more evidence that the player plays like Leela than that the person is cheating.

Agreed. Unless one can distinguish between someone playing "like" Leela and someone copying Leela, we're stuck. My sense though is that while we can't know how someone cheating with Leela would play, we can characterize the play of people we know not to be cheating. Although this may require generating new data rather than just using what we already have available. At one extreme, one could have a large number of people play serious games while being observed. It's an expensive option, but it proves that the problem is at least tractable.

If I had to guess, I think these kinds of numerical solutions are going to be best suited to policing regular online games because (a.) the amount of available data is large to predict "normal" play is large and (b.) the consequences of an automated rule-based system is limited to forcing someone to create a new account. For in person games or online tournaments, I think this sort of system will never work given the severity of the consequences, and so the focus will have to be on preventing cheating by making it more difficult.

Javaness2 · Post by **Javaness2** » Tue Apr 03, 2018 7:23 am

I think you have to be careful about qualifying that as evidence.
We know already that the Leela was not used to choose ranked suggestions 1 to 3 all the time.
Is there a suggestion that Leela was used to pick moves only within a certain number goodness-metric-units? (%win is not really %win afaik)
Or is the suggestion that Leela was just used a lot in some manner that hasn't been precised? An accusation which is harder to defend against.
Reading reddit, I had the feeling that there might be some degree of "This move looks weird, are using a bot" accusation at play.

Sensationalist though this all CIT-scandal is, it is remarkably low on detail.

Calvin Clark · Post by **Calvin Clark** » Tue Apr 03, 2018 9:00 am

Bill Spight wrote: As for the CIT incident, I argue that is evidence that Triton did not cheat, but it is Leela's second choice.

Just for yuks, I checked CrazyStone Pro on my Pixel. It would have played

with 15 seconds of thinking and 46804 playouts.

Bill Spight · Post by **Bill Spight** » Tue Apr 03, 2018 9:25 am

If you are going to use a bot to cheat, then you are going to play more like that bot than you usually do. It does not follow that you are going to make a lot of plays like the bot, however. If we are going to used bot-like play as evidence of cheating, we need to know how cheaters use bot-like play. But we don't.

I don't think that anybody has taken a look at cheating by losers. Just because you cheat doesn't mean that you will win.

In fact, if you have gotten yourself into dire straits in a game, then you might well be tempted to cheat to try to get yourself out of the hole you are in. (To mix metaphors.

) I eagerly await cheating allegations against the player who lost a game. {Tongue in cheek.

} But seriously, if we are going to study cheating -- and I think we should --, we should not confine ourselves to looking at the winners.

Edit: Actually, some chess players have looked at losses for evidence of cheating. The main method seems to be to find bonehead losing plays that are one space away from correct plays. The idea, OC, is that the cheater made a small error in the location of the play. The key also, is that the losing play is not just an error, but one that is uncharacteristic of the player's play when supposedly cheating. Again, this is disconfirmatory evidence.

Uberdude · Post by **Uberdude** » Tue Apr 03, 2018 12:42 pm

I also analysed Carlo's 3rd game, a win by resign against Dragos Bajenaru 6d (2589) of Romania, the strongest player he beat. Dragos was an insei and peaked around 2670 in late 2000s, stable for last few years just under 2600. Intro up to move 50: Carlo as white played double 3-3s to Dragos's double 4-4s, I didn't like Dragos's shoulder hit joseki from move 5, Carlo 3-3 invaded top right AlphaGo style (Leela doesn't even consider it), all very natural moves from Carlo until f15 surprised me (Leela #11). Dragos tried to make territory on lower side with n3 e3 combo but didn't work as 2 doors in. Carlo's k4 was the only move in first 50 which could be considered slightly suspicious: that area is clearly the place to play but a few nearby points plausible and k4 was Leela's #1. By move 50 Leela gives white 69% win and I agree black's opening was bad: he made no moyo with his 4-4s and white's territory strategy is working well.

Using the same Leela top 3 and within 5% metric for moves 50-149, Carlo scored 39/50 = 78% and Dragos scored 37/50 = 74%. Using the stricter top 1 matching Carlo scored 30/50 = 60% and Dragos 25/50 = 50%. They both matched most of the moves to 96 but at that point Dragos started endgame and Carlo followed him, whereas Leela saw a powerful shape attack at p7/m6 that Carlo neither attacked not Dragos defended, so their matching went down for a while until play moved to that area over 30 moves later.

Uberdude · Post by **Uberdude** » Tue Apr 03, 2018 12:57 pm

Here is a summary of my findings in a handy table (does our phpbb support tables?). Carlo's numbers highlighted with an asterisk.

Code: Select all

+-----------------+------+----------------+------+---------+---------+---------+---------+
|      Black      | Rank |     White      | Rank | B top 3 | W top 3 | B top 1 | W top 1 |
+-----------------+------+----------------+------+---------+---------+---------+---------+
| Carlo Metta     | 4d   | Reem Ben David | 4d   |    * 98 |      80 |    * 72 |      54 |
| Andrey Kulkov   | 6d   | Carlo Metta    | 4d   |      80 |    * 86 |      68 |    * 62 |
| Dragos Bajenaru | 6d   | Carlo Metta    | 4d   |      74 |    * 78 |      50 |    * 60 |
| Andrew Simons   | 4d   | Jostein Flood  | 3d   |      80 |      88 |      54 |      62 |
+-----------------+------+----------------+------+---------+---------+---------+---------+

My hope is the referees would have something like this in their report with many more rows, but for now this is all we have to go on (plus the claimed 70-80% for top 3 in an unspecified plural amount of Carlo's offline games). Doesn't look strong evidence of Leela cheating to me.

Bonobo · Post by **Bonobo** » Tue Apr 03, 2018 3:05 pm

Uberdude wrote:My hope is the referees would have something like this in their report with many more rows, but for now this is all we have to go on (plus the claimed 70-80% for top 3 in an unspecified plural amount of Carlo's offline games). Doesn't look strong evidence of Leela cheating to me.

Are you in contact with Pandanet and/or EGF about this, especially with the referee Jonas Egeberg? I imagine that they could appreciate your expertise …

bernds · Post by **bernds** » Tue Apr 03, 2018 6:39 pm

Uberdude wrote:Doesn't look strong evidence of Leela cheating to me.

I'd still say the 98% case is a very odd outlier, although the metric "matches top three moves" is not a very good one. In chess they have things like "average centipawn loss" which would correspond to an average drop in win rate, and number of blunders (moves which are worse than a computer move by a certain amount). The problem with both of these is that they become meaningless in a winning position, which is achievable in Go if you have the computer play your opening, or if you've studied enough with the computer that you know how to wreck a human in the fuseki.

I've now run the game through both Leela and her sister Zero, and they disagree fairly significantly about what's going on (which kind of invalidates the theory that there were only ordinary moves that everyone would play). Leela does indeed think this game was played nearly perfectly from a certain point - very little disagreements from a winrate loss standpoint. However - and this I find significant - there were some disagreements in the fuseki phase, which is where I'd expect people to be able to cheat most easily without attracting attention. For example, the 3-3 invasion at C17 is not something Leela suggests. Leela Zero does want to play it, but it wants to play it a move earlier, instead of R6, and then it has some disagreements about the exact sequence. So this would suggest it's the choice of a human who has seen computers play 3-3 early and often but is not currently analyzing the position with a machine.

In the latter half you can find moments where choice A from Leela is something a human wouldn't play, and the human did not play it.

So at this point I don't know what to think. The agreement with Leela in the latter phase of the game is really odd, but if the player was using an engine, he wasn't doing it to get an advantage out of the opening. Zero doesn't think the game is going in Black's favour until move 70 (and actually thinks N16 at move 118 was catastrophic and gives White the advantage again, which I'm not sure I believe). Leela sees Black's win rate drop steadily from move 16 onwards, but thinks White made a big mistake at move 33. Then the win rate for Black drops back below 50% at move 38 with the kind of safety-first slack move that I recognize from reviewing my own play with the computer. Leela also dislikes N16, but only rates it a 4%-ish drop, which might be acceptable if one knew one was winning anyhow. But the move played in the game is a much more natural shape move which I would expect a human to play in that position.

Both engines really want to play N9 as a forcing move at some point after move 141 (differing slightly on timing), and I would find it very hard to resist making that move if I saw it suggested by an engine. When it's pointed out to you it's clearly an important move. Leela doesn't make much of it, but Zero thinks not playing it is a big blunder, for several moves in a row.

Given that only one game out of four was odd, and there is no clear trend in the first 40 moves (which there very likely would be if an engine was playing), I am inclined to give the benefit of the doubt on this one, going against my initial reaction.

Javaness2 · Post by **Javaness2** » Tue Apr 03, 2018 10:13 pm

bernds wrote:I'd still say the 98% case is a very odd outlier, although the metric "matches top three moves" is not a very good one. In chess they have things like "average centipawn loss" which would correspond to an average drop in win rate, and number of blunders (moves which are worse than a computer move by a certain amount). The problem with both of these is that they become meaningless in a winning position, which is achievable in Go if you have the computer play your opening, or if you've studied enough with the computer that you know how to wreck a human in the fuseki.

I think using a top 3 metric is perhaps pointless, I don't know why a cheat would be imagined to use just the top 3. Can you produce a winrate metric or goodness metric report for each game, I seem to remember that Leela had such a chart. By the by, the fuseki used by Carlo seems to have caught some of the players off guard.

Uberdude · Post by **Uberdude** » Tue Apr 03, 2018 10:18 pm

Working on the premise a 98% top 3 match is good evidence of Leela cheating (which I don't believe), do the results from Carlo's other 2 games mean we accept he wasn't using Leela then? In that case we are saying he can play well enough to beat two 6 dans without Leela in rounds 1 and 3, and then used Leela to help him beat a 4d in round 4. That's odd.

A possible suggestion/implication when this story broke was he used Leela in his other games this season (in which he had more impressive wins against stronger opponents than Reem) and they would have high match rates too. I, for one, considered this. That they don't is important relevant information the referees did not release (maybe they didn't know, but this should have been part of the investigation).

Bill Spight · Post by **Bill Spight** » Wed Apr 04, 2018 2:55 am

Uberdude wrote:Working on the premise a 98% top 3 match is good evidence of Leela cheating (which I don't believe), do the results from Carlo's other 2 games mean we accept he wasn't using Leela then? In that case we are saying he can play well enough to beat two 6 dans without Leela in rounds 1 and 3, and then used Leela to help him beat a 4d in round 4. That's odd.

Passing strange. OC, one could claim that he didn't need to cheat as much in the other games, for some reason. The real point, however, is that matching Leela's choices is not a good metric for cheating.

Or if you think that it is the right metric, you should assume that Carlo cheated in all his games against 4 dans and higher rated players in the tournament and match over all of them, not just one.

A possible suggestion/implication when this story broke was he used Leela in his other games this season (in which he had more impressive wins against stronger opponents than Reem) and they would have high match rates too. I, for one, considered this. That they don't is important relevant information the referees did not release (maybe they didn't know, but this should have been part of the investigation).

To quote myself:

Bill Spight wrote: The decision rests upon one game?????

That may be enough to require a replay or throw the result out. But confirmatory evidence is weak. For disciplinary action I would want evidence from at least 10 games.

As Regan points out, the key question for accusations of cheating is whether the accused player played better than he did without cheating. Uberdude's research on the other games against stronger players bears on that point. Carlo was playing well in that tournament without matching Leela more than usual. If the referees had had a clue about how to go about investigating allegations of cheating they would have looked for evidence of cheating in the other games that Carlo played in the same tournament. Those game records were readily available. That may sound harsh, but in their defense, investigating cheating in online go is terra incognita. Still, drawing conclusions from one game is absurd. And if they did look at the other games in the tournament, matching Carlo's play with Leela's choices, they should have drawn the opposite conclusion: not guilty.

Life In 19x19

“Decision: case of using computer assistance in League A”

Re: “Decision: case of using computer assistance in League A

Re: “Decision: case of using computer assistance in League A

Re: “Decision: case of using computer assistance in League A

Re: “Decision: case of using computer assistance in League A

Re: “Decision: case of using computer assistance in League A

Re: “Decision: case of using computer assistance in League A

Re: “Decision: case of using computer assistance in League A

Re: “Decision: case of using computer assistance in League A

Re: “Decision: case of using computer assistance in League A

Re: “Decision: case of using computer assistance in League A

Re: “Decision: case of using computer assistance in League A

Re: “Decision: case of using computer assistance in League A

Re: “Decision: case of using computer assistance in League A

Re: “Decision: case of using computer assistance in League A

Re: “Decision: case of using computer assistance in League A