“Decision: case of using computer assistance in League A”

bernds
Lives with ko
Posts: 259
Joined: Sun Apr 30, 2017 11:18 pm
Rank: 2d
GD Posts: 0
Has thanked: 46 times
Been thanked: 116 times

Re: “Decision: case of using computer assistance in League A

Post by bernds »

Uberdude wrote:I checked the games didn't end in an early resignation before analysing (discarded le Calve vs Bajenaru for this reason). I think I should do some 5d games next.
If you want some of mine (3D OGS/IGS) to help fill the data set, I probably still have a bunch of Leela .rsgf files from analyzing my OGS correspondence games.
Uberdude
Judan
Posts: 6727
Joined: Thu Nov 24, 2011 11:35 am
Rank: UK 4 dan
GD Posts: 0
KGS: Uberdude 4d
OGS: Uberdude 7d
Location: Cambridge, UK
Has thanked: 436 times
Been thanked: 3718 times

Re: “Decision: case of using computer assistance in League A

Post by Uberdude »

Javaness2 wrote:Might not a metric such as "Average distance from Leela's Goodness Value for its first choice move" be a more interesting metric?
Yes, but these win% figures can't be trusted much, even as Leela's evaluation of a position, when there aren't many simulations. The protocol I have been using is to load the game for analysis into Leela, run the analysis until it reaches around 50k nodes, and then use the analysis window (which is sorted by number of simulations, not win %) to check whether the move played was in the top 3. See the example screenshot below. At this point Leela wants to play d15 with a 51.3% win rate; the move played in the game was d14, which is #3 at 47.6%, so still within the 5% band, though with so few simulations that number is not very reliable. This counts as a match; if d15 had fluctuated up another 1.3%, the difference would be over 5% and it wouldn't count as a match.
[Attachment: d15#1.PNG]
Go forward one move and let Leela analyse the position with d14 on the board for 50k nodes, then go back to the previous position and you get the screenshot below: it now thinks d14 is the best move, with the highest win%! This seems to indicate that Leela's algorithm isn't tuned to be exploratory enough. Another analysis protocol would be to ensure the actual move played also gets 50k (or however many) nodes of exploration; this could lead to higher match rates if Leela changes its mind and decides the real game move is better than it anticipated while focusing on its #1 move. I chose my method because it's how I imagine a Leela cheater wondering what to play next would operate.
[Attachment: d14#1.PNG]
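In code terms, the matching rule described above works roughly like this (a sketch only; the candidate numbers are made up for illustration, since the real check was done by eye in Leela's analysis window):

[code]
def is_match(candidates, played_move, band=5.0, top_n=3):
    """candidates: list of (move, simulations, winrate_percent) from one
    ~50k-node analysis run. The played move counts as a match if it is
    among the top_n candidates by simulation count and its win rate is
    within `band` percentage points of the #1 candidate's win rate."""
    by_sims = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_winrate = by_sims[0][2]
    for move, _sims, winrate in by_sims[:top_n]:
        if move == played_move and best_winrate - winrate <= band:
            return True
    return False

# Roughly the situation in the first screenshot (numbers approximate):
candidates = [("D15", 30000, 51.3), ("C16", 12000, 49.0), ("D14", 6000, 47.6)]
print(is_match(candidates, "D14"))  # True: #3 by simulations, within the 5% band
[/code]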
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: “Decision: case of using computer assistance in League A

Post by Bill Spight »

BlindGroup wrote:
Uberdude wrote:I only have 10 data points, but fitting them to a normal distribution (dubious: too small sample, could be different shape, plus 100 is a hard max) I get a mean of 80 and standard deviation of 8. So then you might say 98 is 2.2 sds from the mean, what's the chance of that? Look up your normal distribution probability tables and you get 1.2%. That's small, an inept statistician would say, less than the oft used 0.05 significance level, he must be guilty! But that's the chance a randomly selected game has that value (based on the false assumption the metric is normally distributed with those parameters). But this game was not randomly selected, it was chosen to be examined precisely because it has a high similarity. So such a probability is invalid. As Feynman eloquently said:
You know, the most amazing thing happened to me tonight. I was coming here, on the way to the lecture, and I came in through the parking lot. And you won’t believe what happened. I saw a car with the license plate ARW 357. Can you imagine? Of all the millions of license plates in the state, what was the chance that I would see that particular one tonight? Amazing!
Uberdude, your taking the time to go through even these 10 games seems to be more than we've seen anyone else doing to systematically assess these decisions. A few thoughts to contribute:

1. As you note, a sample size of 10 data points is VERY small. I think even "inept statisticians" would be uncomfortable moving forward with only these data. That said, this is not meant to criticize your efforts, but rather to argue that you are on the right track and that your efforts should be extended by some organization with significantly greater access to computational resources.

2. I think you have the logic of the hypothesis-testing framework slightly twisted, and it affects the interpretation of the 1.2 percent error rate (the "Type I" rate). You are right that we chose the game with the 98 percent top-3 match rate deliberately -- it was the game under question.
Having looked at a number of videos about cheating at online chess, I think I'll second Uberdude here. It appears that a lot of online chess cheating is done by playing every move chosen by a superhuman chess engine. The match, except for the occasional mistake or blunder, is to the top three plays rather than to a single best move, because the engine used is unknown, as is how long it ran and on what hardware; the metric seems to have been chosen to give almost 100% matches. Almost 100% matches to a superhuman chess engine, then, seems to be part of the theory of online cheating at chess. It looks like this game came into question because of the near 100% matches with Leela. (A 4 dan losing to a 4 dan is not enough to question the game.) If so, a Fisherian cannot use that game to prove cheating, because it was not randomly chosen. A Bayesian can, but, speaking as one, I think that all the game does is to raise suspicion. Confirmatory evidence is weak.
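(As an aside, the 1.2% figure in the quoted analysis is just the upper tail of the fitted normal and is easy to reproduce; checking the arithmetic does nothing to rescue it from the selection problem discussed above.)

[code]
# Reproducing the quoted tail probability (illustrative only).
from statistics import NormalDist

mean, sd, observed = 80.0, 8.0, 98.0
z = (observed - mean) / sd                       # about 2.25 standard deviations
tail = 1.0 - NormalDist(mean, sd).cdf(observed)  # upper-tail probability
print(f"z = {z:.2f}, P(X >= {observed:.0f}) = {tail:.1%}")  # roughly 1.2%
[/code]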
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
John Fairbairn
Oza
Posts: 3724
Joined: Wed Apr 21, 2010 3:09 am
Has thanked: 20 times
Been thanked: 4672 times

Re: “Decision: case of using computer assistance in League A

Post by John Fairbairn »

A Fisherian cannot use that game to prove cheating, because it was not randomly chosen. A Bayesian can, but, speaking as one, I think that all the game does is to raise suspicion. Confirmatory evidence is weak.
Even without understanding the statistical nuances I can easily agree that what has happened to Carlo has been highly unsatisfactory, and the bulk of opinion in this thread seems to be of like mind.

But I think it is also important to try to see it all from the point of view of the organisers and referees.

Kasparov championed a form of chess in which a pro played with a machine to help him (just like the alleged involvement of Leela here). Despite his fame it didn't get any traction. There have been a tiny handful of such games in go (mainly in Taiwan) and they sank without trace as far as I could see. I never even saw a report on how often the pro had recourse to the machine. That, plus the number of comments on chess cheating I've seen, leads me to believe that people see cheating in chess/go, just as people see drugs in athletics, very much in black-and-white terms. Halfway houses and discreet averting of the eyes are not tolerated by the vast majority. Cheating must be stamped out - even if some people suffer wrongly, it seems.

Now, given that this can only be done on the basis of some sort of probabilistic assumptions, is it possible to lend support to organisers and referees (and through them to the overwhelming majority of players) by using statistics in the same way that seems to be accepted elsewhere? What I have in mind is something like the "significance" factor which is often mentioned in connection with 95% probability. How can such a metric be devised and accepted?

Acceptance doesn't seem to be a problem to me because humans are used to running their entire lives on the basis of probability. But it could perhaps be made easier to accept if the first "punishment" was not so swingeing. E.g. a player could be put on notice that he is suspected of cheating. The arbiters could also indicate what measures need to be taken to satisfy them in future (e.g. a player could video himself while playing an important on-line game and use that to show that he is not consulting a machine).
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: “Decision: case of using computer assistance in League A

Post by Bill Spight »

John Fairbairn wrote: But I think it is also important to try to see it all from the point of view of the organisers and referees.

{snip}

people see cheating in chess/go, just as people see drugs in athletics, very much in black-and-white terms. Halfway houses and discreet averting of the eyes are not tolerated by the vast majority. Cheating must be stamped out - even if some people suffer wrongly, it seems.
There was a cheating scandal in an IGS tournament in the 1990s in which Sprint, a strong Chinese amateur, was discovered to have gotten help from a Chinese pro. That may have made Pandanet sensitive to accusations of cheating in their tournaments. AFAICT, nothing written here criticizing the treatment of the evidence in this case, or the CIT case, condones cheating.
Now, given that this can only be done on the basis of some sort of probabilistic assumptions,
That may be so in these cases, but not in general, as Regan has pointed out. (Unless you are a Bayesian. ;)) Even in the case of casual online chess cheating, the cheaters typically put down their opponents, a form of behavioral evidence. (In itself weak, OC, but not just statistical.)
is it possible to lend support to organisers and referees (and through them to the overwhelming majority of players) by using statistics in the same way that seems to be accepted elsewhere?
That was not done in this case. I have argued in Bayesian terms, first, because I am a Bayesian, and second, because Bayesians, like most of the public, and like the organizers and referees, believe in confirmatory evidence. But, unlike most of the public, we know that it is very, very weak. The use of confirmatory evidence is not generally accepted statistical practice.
What I have in mind is something like the "significance" factor which is often mentioned in connection with 95% probability. How can such a metric be devised and accepted?
Regan addresses that in chess, not with the question of whether a player plays like Houdini or another top engine (confirmatory evidence), but with whether the player plays better than he does without cheating (disconfirmatory evidence). Regan can make use of individual moves, because he is able to rate them. Thus an obvious play, even though every engine would play it, does not count against the player, because it is what he would play without cheating. In go we are not able to do that yet; give us a few years. What we have to do instead is rely upon the judgement of strong players. For instance, in the Reem vs. Metta game, consider the sequence :b87: - :w96:, where Black secures the bottom right corner. Black has options for :b87:, but given that play and White's responses, the four plays :b89: - :b95: would be played not only by Carlo Metta, but also by weaker dan players who were not cheating. Even in cases of suspected online cheating at chess, accusers look at the plays of suspected cheaters and point out plays that are unlike human plays, or unlike human plays at the level of the suspect. That is, the accusers look for disconfirmatory evidence, not confirmatory evidence, or at least not just confirmatory evidence. The four Black plays, :b89: - :b95:, are confirmatory evidence of the proposition "He plays like Leela", but are not evidence of cheating. The problem is not reliance upon statistical evidence alone, but reliance upon the wrong statistical evidence.

Now, if one is using a bot to cheat, then one's play will resemble that of the bot, to some extent. Therefore, as Blindgroup points out, given enough games, the number of plays that are matches to Leela's choices but not because of cheating should even out, on average. But that is not the case for a single game. You need to look at a number of games in which Carlo is suspected of cheating, such as all of his games in this tournament, and compare them with other games in which he is not suspected of cheating. That is, we must look for disconfirmatory evidence: Carlo plays differently in one set of games from how he plays in the other set of games. If you suspect him of cheating in all games, then you compare his play against the play of other players of similar ability. OC, in that case the similarity of his play to Leela's may simply be evidence, not of cheating, but of intensive training with Leela for a couple of years.

So you could, if you gloss over the question of randomization, set up a significance test using some metric of similarity to Leela's play. But doing so would involve the use of a large number of games, and any statistically significant result would not be a 98% match in a single game.
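A minimal sketch of what such a many-games comparison might look like, assuming one already had a per-game match-rate metric (the numbers below are invented purely to illustrate the shape of the test):

[code]
import random

# Hypothetical top-3 match rates: games under suspicion vs. games not under suspicion.
suspected   = [0.98, 0.85, 0.82, 0.90]
unsuspected = [0.78, 0.81, 0.74, 0.83, 0.79, 0.76]

def mean(xs):
    return sum(xs) / len(xs)

observed_diff = mean(suspected) - mean(unsuspected)

# Permutation test: how often does randomly relabelling the games produce a
# difference in mean match rate at least as large as the one observed?
pooled, n = suspected + unsuspected, len(suspected)
rng, trials, hits = random.Random(0), 20000, 0
for _ in range(trials):
    rng.shuffle(pooled)
    if mean(pooled[:n]) - mean(pooled[n:]) >= observed_diff:
        hits += 1
print(f"observed difference {observed_diff:.3f}, permutation p ~ {hits / trials:.3f}")
[/code]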
Acceptance doesn't seem to be a problem to me because humans are used to running their entire lives on the basis of probability. But it could perhaps be made easier to accept if the first "punishment" was not so swingeing. E.g. a player could be put on notice that he is suspected of cheating. The arbiters could also indicate what measures need to be taken to satisfy them in future (e.g. a player could video himself while playing an important on-line game and use that to show that he is not consulting a machine).
As I have said, the evidence in that one game is enough to raise suspicion. And that would justify the organizers in treating Carlo like Caesar's wife, requiring him to be above suspicion, and in requiring that his future games be monitored. It would also justify looking at the plays in the questioned game to see whether the result might be voided. It might even be possible to find further evidence of cheating by analyzing that game.
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: “Decision: case of using computer assistance in League A

Post by Bill Spight »

Here is one way to set up a significance test, one that does not require a lot of games. :)

Recruit a panel of three 6 dans who are unfamiliar with Carlo's games in this tournament. (To give Carlo the benefit of the doubt with regard to his level, since he did beat 6 dans in this tournament.) Have each of them, without consultation, record what they would play wherever Carlo had the move in the range of moves 51-100, for each game he played. Then eliminate from comparison the moves where Carlo matched both one of Leela's three choices and one of the panel's choices, as not being evidence of cheating. This gives you disconfirmatory evidence: not like the panel, whom we know not to be cheating. :) The remaining plays are potentially cheating plays. Those that do not match Leela's choices are, by presumption, not cheating plays; those that do match are still potentially cheating plays. But we still do not have a null hypothesis.

However, we know that the judges on the panel are not cheating. Treat their plays in like manner. For each of them, put Carlo on their panel and eliminate those plays where their play matches both Leela and one of the other three players. This process yields a 2x4 matrix, where we have cells labeled Carlo-like-Leela, Carlo-unlike-Leela, Judge1-like-Leela, Judge1-unlike-Leela, etc. Is Carlo's play significantly different from that of the Judges? (The null hypothesis is that they are not different.)
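In bookkeeping terms, the filtering step might look something like this (the moves are invented; the point is just which plays get discarded and which get counted):

[code]
def classify(plays):
    """plays: list of dicts with keys 'played', 'panel' (set of panel choices)
    and 'leela' (set of Leela's top-3 choices). Returns (like_leela,
    unlike_leela) counts over the informative moves only."""
    like = unlike = 0
    for p in plays:
        matches_leela = p["played"] in p["leela"]
        matches_panel = p["played"] in p["panel"]
        if matches_leela and matches_panel:
            continue  # a non-cheater would play it too: not evidence either way
        if matches_leela:
            like += 1
        else:
            unlike += 1
    return like, unlike

carlo_plays = [  # hypothetical moves from the range 51-100 of one game
    {"played": "D15", "panel": {"D15", "C16"}, "leela": {"D15", "D14", "C16"}},  # discarded
    {"played": "Q10", "panel": {"R10"},        "leela": {"Q10", "Q11", "P10"}},  # like Leela
    {"played": "K3",  "panel": {"K4"},         "leela": {"J3", "K4", "L3"}},     # unlike Leela
]
print(classify(carlo_plays))  # (1, 1)
[/code]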

:D
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
mhlepore
Lives in gote
Posts: 390
Joined: Sun Apr 22, 2012 9:52 am
GD Posts: 0
KGS: lepore
Has thanked: 81 times
Been thanked: 128 times

Re: “Decision: case of using computer assistance in League A

Post by mhlepore »

Bill Spight wrote:...This process yields a 2x4 matrix, where we have cells labeled Carlo-like-Leela, Carlo-unlike-Leela, Judge1-like-Leela, Judge1-unlike-Leela, etc. Is Carlo's play significantly different from that of the Judges? (The null hypothesis is that they are not different.)

:D
This null hypothesis could be rejected if one of the judges plays differently than Carlo and the other two judges. The null is rejected, but we couldn't draw a negative inference about Carlo.

The quest for a statistical test that answers what we all want to know seems truly impossible. We are always ending up with a few moves that "could be called suspicious" but we can't agree on much more than that. I bet the same could be said of most randomly selected games between two 5 dans.

If, as Uberdude points out, he were using Leela Zero to cheat, then it would be an easier problem to solve, because he would be playing highly effective but unintuitive moves.

Not guilty because there is massive reasonable doubt. Time to move on.
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: “Decision: case of using computer assistance in League A

Post by Bill Spight »

mhlepore wrote:
Bill Spight wrote:...This process yields a 2x4 matrix, where we have cells labeled Carlo-like-Leela, Carlo-unlike-Leela, Judge1-like-Leela, Judge1-unlike-Leela, etc. Is Carlo's play significantly different from that of the Judges? (The null hypothesis is that they are not different.)

:D
This null hypothesis could be rejected if one of the judges plays differently than Carlo and the other two judges. The null is rejected, but we couldn't draw a negative inference about Carlo.
I think that you are thinking of the null hypothesis that all of the players play alike. That's not the same thing. For starters, Carlo's play has to be closer to Leela's than each of the Judges' play is. We can reject the hypothesis that all of the players play alike, but if any of them plays more like Leela than Carlo does, then we cannot reject the null hypothesis that Carlo's play is no more like Leela's than the Judges' play. His play is within the fold. (Yes, I did not quite express the null correctly, or precisely.)
The quest for a statistical test that answers what we all want to know seems truly impossible.
We can't use, say, a simple Chi-Squared test, but there are statistical tests that handle this kind of situation, with multiple comparisons. :)

As Regan points out, without other kinds of evidence besides similarity to Leela's play, we need very strong evidence to convict someone of cheating. So with this test we might compare Carlo's play with each judge's and require a p-value of less than 0.66% for each comparison.
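For instance, each Carlo-versus-judge comparison could be run as a one-sided Fisher exact test on the like-Leela/unlike-Leela counts, with the stricter per-comparison threshold (a sketch with invented counts; other tests could certainly be argued for):

[code]
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]:
    row 1 = Carlo (like Leela, unlike Leela), row 2 = one judge.
    Returns P(Carlo's like-Leela count >= a) under the null of equal rates,
    conditioning on the margins (hypergeometric distribution)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    return sum(comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)
               for k in range(a, min(row1, col1) + 1))

carlo = (30, 10)                          # hypothetical like/unlike counts for Carlo
judges = [(18, 22), (20, 20), (16, 24)]   # hypothetical counts for the three judges
alpha = 0.0066                            # the suggested per-comparison threshold
for i, (j_like, j_unlike) in enumerate(judges, 1):
    p = fisher_one_sided(carlo[0], carlo[1], j_like, j_unlike)
    verdict = "below" if p < alpha else "above"
    print(f"Carlo vs Judge{i}: one-sided p = {p:.4f} ({verdict} {alpha:.2%})")
[/code]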
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
Dmytro
Beginner
Posts: 2
Joined: Wed Apr 11, 2018 3:06 am
GD Posts: 0

Re: “Decision: case of using computer assistance in League A

Post by Dmytro »

Uberdude wrote: Yes, but these win% figures can't be trusted much, even as Leela's evaluation of a position, when there aren't many simulations. The protocol I have been using is to load the game for analysis into Leela, run the analysis until it reaches around 50k nodes, and then use the analysis window (which is sorted by number of simulations, not win %) to check whether the move played was in the top 3.
Did you check whether Leela evaluates the same moves when the game is loaded for analysis and when the game is rebuilt from scratch (so the engine doesn't know the next move)?
Uberdude
Judan
Posts: 6727
Joined: Thu Nov 24, 2011 11:35 am
Rank: UK 4 dan
GD Posts: 0
KGS: Uberdude 4d
OGS: Uberdude 7d
Location: Cambridge, UK
Has thanked: 436 times
Been thanked: 3718 times

Re: “Decision: case of using computer assistance in League A

Post by Uberdude »

Dmytro wrote:
Uberdude wrote: Yes, but these win% figures can't be trusted much, even as Leela's evaluation of a position, when there aren't many simulations. The protocol I have been using is to load the game for analysis into Leela, run the analysis until it reaches around 50k nodes, and then use the analysis window (which is sorted by number of simulations, not win %) to check whether the move played was in the top 3.
Did you check whether Leela evaluates the same moves when the game is loaded for analysis and when the game is rebuilt from scratch (so the engine doesn't know the next move)?
Although I load the whole game sgf into Leela, when I ask it what it wants to play for move X I haven't yet done any analysis of the moves after X (I used a separate sgf replayer to see what the human played), so I don't think Leela makes use of the fact that the sgf contains that information, but I will check with a truncated sgf. (It's a manual position-by-position analysis rather than a bulk analysis of the game like Go Review Partner does.) If you go forward from X and do analysis, then those simulations of the game tree are reused if you move back to X and continue analysing.
Dmytro
Beginner
Posts: 2
Joined: Wed Apr 11, 2018 3:06 am
GD Posts: 0

Re: “Decision: case of using computer assistance in League A

Post by Dmytro »

Uberdude wrote: Although I load the whole game sgf into Leela, when I ask it what it wants to play for move X I haven't yet done any analysis of the moves after X (I used a separate sgf replayer to see what the human played), so I don't think Leela makes use of the fact that the sgf contains that information, but I will check with a truncated sgf. (It's a manual position-by-position analysis rather than a bulk analysis of the game like Go Review Partner does.) If you go forward from X and do analysis, then those simulations of the game tree are reused if you move back to X and continue analysing.
I do not know much about the Leela interface, but logically your approach to game analysis looks sound. Still, I would prefer to use a truncated sgf to be 100% sure that there is no influence from the next moves.
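For what it's worth, producing such a truncated sgf can be scripted. A minimal sketch, assuming the third-party sgfmill Python library and a hypothetical file name (in practice you would also want to copy komi, handicap and any setup stones across):

[code]
from sgfmill import sgf

def truncate_sgf(sgf_bytes, n_moves):
    """Return a new SGF containing only the first n_moves of the main line,
    so the engine cannot see anything played after that point."""
    game = sgf.Sgf_game.from_bytes(sgf_bytes)
    out = sgf.Sgf_game(size=game.get_size())
    kept = 0
    for node in game.get_main_sequence()[1:]:  # skip the root node
        colour, move = node.get_move()
        if colour is None:
            continue  # not a move node
        if kept >= n_moves:
            break
        out.extend_main_sequence().set_move(colour, move)
        kept += 1
    return out.serialise()

with open("game.sgf", "rb") as f:           # hypothetical input file
    truncated = truncate_sgf(f.read(), 50)  # keep moves 1-50 only
with open("game_first50.sgf", "wb") as f:
    f.write(truncated)
[/code]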
HermanHiddema
Gosei
Posts: 2011
Joined: Tue Apr 20, 2010 10:08 am
Rank: Dutch 4D
GD Posts: 645
Universal go server handle: herminator
Location: Groningen, NL
Has thanked: 202 times
Been thanked: 1086 times

Re: “Decision: case of using computer assistance in League A

Post by HermanHiddema »

Uberdude wrote: Yes, but these win% figures can't be trusted much, even as Leela's evaluation of a position, when there aren't many simulations. The protocol I have been using is to load the game for analysis into Leela, run the analysis until it reaches around 50k nodes, and then use the analysis window (which is sorted by number of simulations, not win %) to check whether the move played was in the top 3. See the example screenshot below. At this point Leela wants to play d15 with a 51.3% win rate; the move played in the game was d14, which is #3 at 47.6%, so still within the 5% band, though with so few simulations that number is not very reliable. This counts as a match; if d15 had fluctuated up another 1.3%, the difference would be over 5% and it wouldn't count as a match.

[[snip image]]

Go forward one move and let Leela analyse the position with d14 on the board for 50k nodes, then go back to the previous position and you get the screenshot below: it now thinks d14 is the best move, with the highest win%! This seems to indicate that Leela's algorithm isn't tuned to be exploratory enough. Another analysis protocol would be to ensure the actual move played also gets 50k (or however many) nodes of exploration; this could lead to higher match rates if Leela changes its mind and decides the real game move is better than it anticipated while focusing on its #1 move. I chose my method because it's how I imagine a Leela cheater wondering what to play next would operate.

[[snip image]]
So, given that Leela's preferred moves are non-deterministic like this, is it possible that the same move might on one run be Leela's top choice, and on another be outside the top 3 or outside the 5% margin?

If so, I wonder what the following test would yield:

Given one of your test games, for every position between moves 50 and 150, let Leela analyse the position five times, independently (i.e. close and reopen the position between runs). Then record whether the human move played was ever Leela's top choice.
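A sketch of how that test might be driven, with analyse_position as a placeholder for whatever engine interface is used (it is assumed to start from a clean engine state on every call and to return the engine's single top move):

[code]
def analyse_position(moves_so_far, nodes=50000):
    """Placeholder: restart the engine, replay moves_so_far, analyse to the
    given node count and return the engine's top move."""
    raise NotImplementedError("hook this up to the engine of your choice")

def ever_top_choice(game_moves, start=50, end=150, runs=5):
    """For each position in [start, end), run the engine `runs` times from
    scratch and record whether the move actually played was ever its #1 choice."""
    results = {}
    for i in range(start, min(end, len(game_moves))):
        played = game_moves[i]
        tops = [analyse_position(game_moves[:i], nodes=50000) for _ in range(runs)]
        results[i] = played in tops
    return results
[/code]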
John Fairbairn
Oza
Posts: 3724
Joined: Wed Apr 21, 2010 3:09 am
Has thanked: 20 times
Been thanked: 4672 times

Re: “Decision: case of using computer assistance in League A

Post by John Fairbairn »

There is a fascinating parallel to this case that has just been revived. A play, Who Wants To Be A Millionaire, with the same basic theme, has just opened in London. It is based on a true event in which a contestant on a major UK TV quiz programme won a million pounds and the organisers later challenged that win because they alleged he had help from his wife in the audience. He had to choose one answer from the options read out, and they alleged she coughed just after what she believed to be the correct answer (as I understand it, it was only her opinion - she did not have access to the answers, and as this was 2001 she could not easily look the answers up online, and certainly not away from the gaze of other audience members).

Both denied cheating but the contestant ended up in court and was convicted by a jury. Winning a million pounds was not unusual - it was the coughing that was allegedly unusual.

The play changes the personae a little and does not follow the true ending (in real life the convicted contestant was given a suspended gaol sentence) but instead requires the audience to vote electronically on whether or not cheating took place.

Maybe we could try an electronic vote here, too.

The fuller story (and maybe corrections for anything I've mis-stated) can be found at http://www.bbc.co.uk/news/entertainment-arts-43700097. Two comments from me: (1) I can't see anything in it that would make this a typically British crime; (2) the show's presenter apparently thought the contestant was as "guilty as sin." That presenter was a hugely popular figure at the time - did his celebrity influence how the jury voted?
Kirby
Honinbo
Posts: 9553
Joined: Wed Feb 24, 2010 6:04 pm
GD Posts: 0
KGS: Kirby
Tygem: 커비라고해
Has thanked: 1583 times
Been thanked: 1707 times

Re: “Decision: case of using computer assistance in League A

Post by Kirby »

My view on this whole thing...

* Cheating in online tournaments can't be prevented. Even before Leela, people could use online resources, etc., to cheat.
* Requiring a webcam, for example, could mitigate (but not solve) the issue.
* I don't think punitive measures can fairly be taken without absolute proof of cheating.
* For important tournaments, sponsors should realize the potential for cheating, and try to reduce the risk of cheating as much as possible. For example, it's probably harder to cheat in an in-person tournament.

It's mathematically interesting to analyze probabilities of moves and all that stuff, but at the end of the day, I think you can't fairly punish someone just because their moves seem like a computer's.
be immersed
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: “Decision: case of using computer assistance in League A

Post by Bill Spight »

John Fairbairn wrote:There is a fascinating parallel to this case that has just been revived. A play, Who Wants To Be A Millionaire, with the same basic theme, has just opened in London. It is based on a true event in which a contestant on a major UK TV quiz programme won a million pounds and the organisers later challenged that win because they alleged he had help from his wife in the audience. He had to choose one answer from the options read out, and they alleged she coughed just after what she believed to be the correct answer (as I understand it, it was only her opinion - she did not have access to the answers, and as this was 2001 she could not easily look the answers up online, and certainly not away from the gaze of other audience members).

Both denied cheating but the contestant ended up in court and was convicted by a jury. Winning a million pounds was not unusual - it was the coughing that was allegedly unusual.
I wonder if they were bridge players. ;) To quote myself from earlier in this thread:
When I was in high school a couple of little old ladies told me about some ways to cheat at bridge. At that time some people would open One Club with fewer than four cards in the suit. (Actually, a lot of people played that system.) The cheaters would politely cough before bidding One Club with only three cards in the suit. ;) OC, everybody at the table was in on the secret, so it was not exactly cheating.
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.