statistical analysis of player performance

AlesCieply · Post by **AlesCieply** » Mon Jul 09, 2018 8:43 am

This topic should serve as an easy-to-link reference to my analysis of player performance as measured by the available bots, in the current version mostly Leela 0.11. I started working on it in relation to the PGETC case in which an Italian player, Carlo Metta, was accused of using Leela in his internet games. After an original analysis based on matching the played moves to Leela's top 3 suggestions proved inconclusive, I decided to try a more detailed analysis, comparing the accused player's performance (mistake histograms) in his games played on the internet and at regular (real-life) tournaments. The analysis is inspired by Ken Regan's work on measuring in-game player ratings and catching those who cheat with AI in chess, see e.g. a review article. The idea is to look at the frequencies/probabilities with which players make mistakes of a given size (moves that lower the probability of winning the game, e.g. by 1-2%, 3-4% etc.). This should form a histogram (or pattern) reflecting the player's performance. If a player makes significantly fewer mistakes in his internet games than in his regular tournament games, it could be an indication that the player used outside help.

The analysis is presented in the form of spreadsheet files, with each sheet containing the analysis of one game. For each move the bot is used to estimate the probability of winning the game (winrate) before and after the move is played. The difference delta (set to 0 by definition if the top bot choice is played) gives the size of the mistake the player makes. For each game, the results can be seen in the histogram tables at the top right of the corresponding sheet. The tables show (separately for Black and White) how many moves were played with delta falling into a specific interval. The percentages of good moves (the played move had a winrate within 1% of the top move suggested by the AI, or even bettered it) and bad moves (causing a drop of the winrate by at least 10%) are also shown there.
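To make the bookkeeping concrete, here is a minimal sketch of how the per-move delta and the histogram could be computed. The bin edges, the function names and the choice of "delta below 1" as the good-move cutoff are assumptions made for illustration; the spreadsheets may use different intervals.

```python
from collections import Counter

# Hypothetical bin edges in winrate percentage points; the actual
# intervals used in the spreadsheets may differ.
BIN_EDGES = [1, 2, 4, 6, 10]  # bins: <1, 1-2, 2-4, 4-6, 6-10, >=10


def move_delta(best_winrate, played_winrate, played_is_top_choice):
    """Winrate drop caused by the played move, in percentage points."""
    if played_is_top_choice:
        return 0.0  # defined as 0 when the bot's top choice is played
    return best_winrate - played_winrate


def bin_label(delta):
    """Return the label of the interval a delta falls into."""
    lower = None
    for edge in BIN_EDGES:
        if delta < edge:
            return f"<{edge}" if lower is None else f"{lower}-{edge}"
        lower = edge
    return f">={BIN_EDGES[-1]}"


def summarize(deltas):
    """Histogram plus percentages of good and bad moves for one player."""
    hist = Counter(bin_label(d) for d in deltas)
    n = len(deltas)
    # Assumption: "good" means delta below 1 (includes moves that beat
    # the bot's top choice); "bad" means a drop of at least 10.
    good = 100 * sum(d < 1 for d in deltas) / n
    bad = 100 * sum(d >= 10 for d in deltas) / n
    return hist, good, bad


# Made-up deltas for five moves, only to show the call.
hist, good, bad = summarize([0.0, 0.5, -2.0, 3.0, 12.0])
print(dict(hist), good, bad)
```

A negative delta (the played move did better than the bot's top choice) lands in the lowest bin here, which matches the "or even bettered it" wording above.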

The original analysis included 4 internet games by Carlo Metta and 4 of his games played at regular tournaments.

The current analysis of his internet games is far more extensive. It includes four PGETC games and two games from the Italian Championship Online, all played by Carlo Metta before he was accused of cheating. For a comparison three more PGETC games played by Carlo are included that he played after the accusation. Finally, the analysis of the Bryant-Metta game played in the PGETC qualification match is added as well. The analysis of Carlo's regular games was also updated to include four games played at WAGC and two games played in the Italian Championship Final.

Some notes on the internet games played by Carlo before the accusation:

The data shown in the current analysis are from new runs, so the results are slightly different from those in the earlier analysis (e.g. Carlo had 68% of good moves in the old analysis of the Kulkov-Metta game, it dropped to 64% now). The new runs are more consistent as they come from "automated runs" of Go Review Partner while a good fraction of the original analysis included hand-transcribed winrates. The differences between the original and new delta histograms are relatively small and demonstrate variations due to independent runs of Leela.

Carlo makes almost no big mistakes (marked in red) in his internet games, in contrast to his regular games. One can make only 1-2 (or even 0) big mistakes in a game, but not so consistently, as the results of the other analyzed games show (including the analysis of games played by someone else).

The percentage of good moves in Carlo's internet games is rather consistent, unlike in his regular games. The percentage of good moves drops sharply to 50% in a game against Csaba Mero played after the accusation.

The Bryant-Metta game from the UK-IT qualification match is also interesting. It is the only game analyzed so far in which Leela 0.11 has trouble "understanding the game" and providing stable winrates. It was suggested that Carlo used Leela Zero here, though there is no real proof of it.

Two of these games were also analyzed with the AQ bot. Unfortunately, the winrates estimated by the bot are not as stable as those provided by Leela.

I intend to edit this and the following message from time to time to provide more information and updates on the analysis. Expect more later.

AlesCieply · Post by **AlesCieply** » Mon Jul 09, 2018 1:06 pm

Links to the analysis files:
Carlo Metta (4d) internet games - google sheets, rsgf files
Carlo Metta (4d) regular games - google sheets, rsgf files
Ondrej Kruml (5d) regular games - google sheets, rsgf files

Results of the Pearson's chi2 test of independence:
C.Metta on internet (6 games played before the accusation) vs C.Metta in regular tournaments (6 games) - p=3*10^(-8)
C.Metta in regular games (6 games) vs O.Kruml (10 games) - p=0.44
C.Metta on internet (6 games played before the accusation) vs O.Kruml (10 games) - p=5*10^(-7)

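The p-value is the probability of observing a difference between the two mistake distributions at least as large as the one found, under the assumption that both sets of games were played by "the same person". Here "the same person" means a player with the same underlying distribution of mistakes. A very small p-value therefore indicates that the two sets of games are unlikely to come from the same distribution, while two players of about equal strength and with a similar style of play are not expected to give a small p-value. Details are provided in the attached file, in which the comparison of the mistake distributions (delta histograms) is shown as well.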

Interpretation - C.Metta's play at regular tournaments is relatively close to the one of O.Kruml but VERY different from his play on internet.
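For readers who want to reproduce this kind of comparison, below is a minimal sketch of Pearson's chi-square test of independence applied to two delta histograms with the same bins. It uses only the standard library; the function names are hypothetical and the example counts are made up, not taken from the spreadsheets. In practice `scipy.stats.chi2_contingency` does the same job.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability of the chi-square distribution (integer dof)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        # Even dof: exp(-x/2) * sum_{i < dof/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, dof // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd dof: erfc(sqrt(x/2)) plus a finite series of correction terms.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for r in range(1, (dof - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total


def chi2_independence(row_a, row_b):
    """Pearson chi-square test on two histograms with identical bins."""
    # Bins that are empty in both rows carry no information; drop them.
    cols = [(a, b) for a, b in zip(row_a, row_b) if a + b > 0]
    total_a = sum(a for a, _ in cols)
    total_b = sum(b for _, b in cols)
    grand = total_a + total_b
    stat = 0.0
    for a, b in cols:
        col = a + b
        exp_a = total_a * col / grand  # expected count if rows share a distribution
        exp_b = total_b * col / grand
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    dof = len(cols) - 1
    return stat, dof, chi2_sf(stat, dof)


# Made-up histogram counts for two sets of games (same bins in both).
internet = [300, 60, 40, 20, 10, 5]
regular = [250, 70, 50, 30, 25, 30]
stat, dof, p = chi2_independence(internet, regular)
print(f"chi2={stat:.2f}, dof={dof}, p={p:.3g}")
```

Bins with very small expected counts make the chi-square approximation unreliable, so merging sparse bins before testing is a common precaution.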

bugsti · Post by **bugsti** » Tue Jul 10, 2018 12:57 am

AlesCieply wrote:

Some notes on the internet games played by Carlo before the accusation:
The data shown in the current analysis are from new runs, so the results are slightly different from those in the earlier analysis (e.g. Carlo had 68% of good moves in the old analysis of the Kulkov-Metta game, it dropped to 64% now). The new runs are more consistent as they come from "automated runs" of Go Review Partner while a good fraction of the original analysis included hand-transcribed winrates. The differences between the original and new delta histograms are relatively small and demonstrate variations due to independent runs of Leela.

I also noticed big oscillations between different, independent runs of Leela.

AlesCieply wrote: Carlo makes almost no big mistakes (marked by red color) in his internet games which is in contrast when compared with his regular games. One can make only 1-2 (or even 0) big mistake but not so consistently as my preliminary results for another player show (I am still finalizing the analysis, hope to make it public soon).

You forgot to mention the most important fact among all. Carlo's big mistakes rate is ALWAYS consistent with his opponent's mistakes rate. And also the rate of his good moves.

AlesCieply wrote: The percentage of good moves in Carlo internet games is rather consistent, unlike in his regular games. The percentage of good moves drops sharply to 50% in a game against Csaba Mero played after the accusation.

But also his opponent's (Mero) good moves rate drops down, consistently with Carlo's drop.

I think you found a method not to detect cheating but to detect if a particular game was an "easy" game or a tough one for the players.

The fact that Carlo's moves are always consistent with his opponent's moves could even be proof that cheating did not occur.

AlesCieply · Post by **AlesCieply** » Tue Jul 10, 2018 3:37 am

bugsti wrote:
You forgot to mention the most important fact among all. Carlo's big mistakes rate is ALWAYS consistent with his opponent's mistakes rate. And also the rate of his good moves.

Actually, you can note quite some difference in the 3 PGETC games Carlo played after he was accused, so your statement is not true. I have not looked at other games of Carlo's opponents so it is hard to say whether their low number of mistakes is due to their higher strength or it is game related.

bugsti wrote:
AlesCieply wrote: The percentage of good moves in Carlo internet games is rather consistent, unlike in his regular games. The percentage of good moves drops sharply to 50% in a game against Csaba Mero played after the accusation.
But also his opponent's (Mero) good moves rate drops down, consistently with Carlo's drop.

I have no idea how Mero performs in other games. I guess his performance oscillates, but one cannot say for sure before doing the analysis. Have you noted that Bajenaru also scored only about 50% of good moves while Carlo had 67%? (EDIT: corrected)

bugsti wrote: I think you found a method not to detect cheating but to detect if a particular game was an "easy" game or a tough one for the players.

I agree there is some correlation between the difficulty of the game for the players (and most likely also for the bot used to make the winrate estimates) and the statistical performance (percentage of mistakes and good moves). Still, Carlo normally outperforms his opponents in the internet games played before the accusation, especially in those played after the latest release of Leela. Just have a look at the Bajenaru-Metta and Metta-benDavid games (plus the two from the Italian Online Championship).

Bill Spight · Post by **Bill Spight** » Tue Jul 10, 2018 3:43 am

One thing I have been wondering about is your precision label. Thanks to your link I took a look at the Go Review Partner documentation. I did not find the word, precision, in that documentation. However, I did find this explanation of delta.

GoReviewPartner wrote:By comparing the win rate (or Value Network win rate, Monte Carlo win rate) at one move (when the bot best move would be played) with the win rate of the following move (the case when the actual game move was played), one can draw a delta graph for each color.

This is a graph that indicates by how much the bot believes it could have played better than the human player, or eventually by how much the human player move was better than its own move. The difference between both win rate percentage value is called delta.

Personification aside (the bot doesn't believe anything), the delta is comparing, if not apples and oranges, at least different varieties of apples. For the delta to mean what it claims to mean, the bot must play perfectly. Your precision label indicates the difference in win rates before and after a single play by the bot. If the bot played perfectly that difference would always be zero. But the difference can be substantial, and that casts doubt upon the delta measure, which has the same sources of error.

Two of these sources of error are the different number of playouts for each move in the comparison, and the different game trees that are built for each move. A good example is [move] in Bojanic's analysis of the Metta-Ben David game. Starting from the position after [move], Leela's choice of [move] has a win rate (for Black) of 51.6% with 44,162 playouts for an error term of ± 0.5%. White indeed made Leela's play, and starting from that position, Leela's choice of [move] has a win rate of 55.4% with 11,483 playouts for an error term of ± 1%. But the win rate difference is 4%, much larger than the error estimates. Those error estimates are based upon playouts, but what happened is that, starting from the position after [move], Leela found an apparently better play for Black's next move than it had found starting from the position after [move].

So to find a good delta we don't want to do what Go Review Partner does. It's OK for casual review, but not for scientific purposes. We want to start from the same place, and we want to have an equal number of playouts for each play we are comparing. With Go Review Partner I think we can do that by making each play we are comparing and then running the bot for a certain number of rollouts, or for a certain length of time. That way we are comparing apples with apples.
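As a rough check on the numbers quoted above, here is a sketch that treats each playout as an independent win/loss trial and computes a 95% half-width for each winrate. That independence is an assumption (tree-search playouts are correlated, and the source does not say how the error terms were derived), but it roughly reproduces the quoted ± 0.5% and ± 1% and shows that the gap between the two estimates is well outside the combined sampling error.

```python
import math


def half_width(winrate, playouts, z=1.96):
    """Approximate 95% half-width for a winrate estimated from playouts.

    Assumes independent Bernoulli playouts, which is only a rough model.
    """
    return z * math.sqrt(winrate * (1.0 - winrate) / playouts)


before = (0.516, 44162)  # Leela's choice from the earlier position
after = (0.554, 11483)   # Leela's choice one ply later

err_before = half_width(*before)
err_after = half_width(*after)
combined = math.hypot(err_before, err_after)  # error of the difference
gap = after[0] - before[0]

print(f"error before: {err_before:.4f}")
print(f"error after:  {err_after:.4f}")
print(f"combined:     {combined:.4f}")
print(f"gap:          {gap:.4f}")
print("gap exceeds sampling error:", gap > combined)
```

Since sampling noise alone cannot account for the gap, the remaining explanation is the one given above: the two searches built different game trees.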

A glance at the Bryant-Metta game's precision data indeed suggests that Leela does not understand that game very well. It apparently keeps finding good variations that it missed one ply earlier. At the very least we should compare apples with apples.

Bill Spight · Post by **Bill Spight** » Tue Jul 10, 2018 3:47 am

AlesCieply wrote:Carlo normally outperforms his opponents in the internet games played before the accusation, especially in those played after the latest release of Leela.

Well, he won, didn't he?

And, as I mentioned before, a Chi Square Test comparing Carlo's play with the play of his opponents in those games failed to find a significant difference. It's not even close to the 5% level. The significant difference is the one you found, between Carlo's play under different conditions.

AlesCieply · Post by **AlesCieply** » Tue Jul 10, 2018 4:06 am

Bill, I intend to address the precision of the bot estimates in some detail, and will most likely put it into the second placeholder message. Before I do so, just very briefly:
- I agree with you that to have it "scientifically perfect" one should have about the same numbers of playouts for the estimation of the position after the top move suggested by the bot was played and compare it with the winrate after the game move was played.
- The last column in my data sheets is for precision. I put it there just to see how this can vary when the number of playouts increases. As long as it stays below a 1% difference I consider it fine. When it exceeds the limit I color the cell blue, so I can easily spot where the bot has some trouble estimating the winrate. For the first two games in the new analysis (Kulkov, Kruml) I made additional runs at higher playouts (300k+) to check the extremes (green, red deltas and blue precisions), so those two games are "doctored" to achieve better winrate estimates. If there was a change to the original winrate, top move suggestion or order of the move played, I marked the affected cells in blue. These changes had little (if any) impact on the histograms, but the winrates there are simply better estimated than for the rest of the sheet. As it is quite a lot of additional work, I have not done it for the other games. I also intend to provide the original rsgf files, so everybody can check that this additional "doctoring" was really not done with the purpose of making the data look better or worse for Metta.
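The 1% rule for the precision column can be sketched as a simple filter. The row layout (a dict with `move` and `precision` keys) and the function name are hypothetical; only the 1% threshold comes from the description above.

```python
PRECISION_LIMIT = 1.0  # percentage points, the threshold named above


def flag_unstable(rows, limit=PRECISION_LIMIT):
    """Return the move numbers whose precision value exceeds the limit.

    Each row is a dict with hypothetical keys "move" and "precision",
    where precision is the winrate change across the bot's own top move.
    """
    return [row["move"] for row in rows if abs(row["precision"]) > limit]


# Made-up rows, only to show the call.
rows = [
    {"move": 41, "precision": 0.3},
    {"move": 42, "precision": -2.7},
    {"move": 43, "precision": 1.0},
    {"move": 44, "precision": 4.1},
]
print(flag_unstable(rows))
```

The flagged moves are the candidates for a rerun at a higher playout count.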

Bill Spight · Post by **Bill Spight** » Tue Jul 10, 2018 4:28 am

AlesCieply wrote:Bill, I intend to address the precision of the bot estimates in some detail, most likely will put it into the second placeholder message. Before I do so, just very briefly:
- I agree with you that to have it "scientifically perfect" one should have about the same numbers of playouts for the estimation of the position after the top move suggested by the bot was played and compare it with the winrate after the game move was played.

Well, if I understand Go Review Partner well enough, the deltas are based upon a similar number of playouts for each comparison, because they depend upon the number of playouts for the bot's top choices. (OC, they are different choices, one ply apart.)

I think then, that the main source of error is usually finding a better game tree one ply later. That's why it is important to make comparisons at the same level. Apples vs. apples.

- The last column in my data sheets is for precision. I put it there just to see how this can vary when the number of playouts increases.

That's important.

You can use it in a slightly different way by running the bot more than once from the same position.

Hmmm. That might be a way to assess the difficulty of a position. An easy position with an obvious play may well have less variable results than a more difficult position. We might use a precision measure to decide which plays to compare.
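A minimal sketch of that idea: run the bot several times from the same position, take the spread of the winrate estimates as a stand-in for difficulty, and keep only the moves whose spread is small. The 1-point cutoff, the data layout and the function names are assumptions for illustration.

```python
import statistics


def winrate_spread(runs):
    """Sample standard deviation of repeated winrate estimates (in points)."""
    return statistics.stdev(runs)


def comparable_moves(positions, max_spread=1.0):
    """Move numbers whose repeated estimates agree within max_spread.

    positions maps a move number to the winrates from independent runs
    started at the same position; at least two runs per move are needed.
    """
    return [
        move
        for move, runs in sorted(positions.items())
        if winrate_spread(runs) <= max_spread
    ]


# Made-up winrates from three independent runs per position.
positions = {
    10: [52.0, 52.4, 51.8],  # stable: probably an easy position
    57: [48.0, 55.0, 41.0],  # unstable: the bot struggles here
}
print(comparable_moves(positions))
```

Restricting the histograms to the stable moves would trade sample size for more trustworthy deltas.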

bugsti · Post by **bugsti** » Tue Jul 10, 2018 5:46 am

AlesCieply wrote:
I have no idea how Mero performs in other games. I guess his performance oscillates, but one cannot say for sure before doing the analysis. Have you noted Bajenaru also scored only about 50% of good moves while Carlo had 77%.

... while Carlo had 66.7% according to your spreadsheet.

AlesCieply · Post by **AlesCieply** » Tue Jul 10, 2018 6:23 am

bugsti wrote:
... while Carlo had 66.7% according to your spreadsheet.

Ooops, sorry, I mistyped. Thanks for the correction.

AlesCieply · Post by **AlesCieply** » Tue Jul 10, 2018 6:40 am

Bill Spight wrote:
AlesCieply wrote:Carlo normally outperforms his opponents in the internet games played before the accusation, especially in those played after the latest release of Leela.
Well, he won, didn't he?

I wish it was that simple. It is possible to have a higher percentage of good moves and still lose the game; the Kulkov-Metta PGETC game belongs here in the new analysis. The Bryant-Metta game is another one, but there I really do not know what to think of the analysis. One can also lead for most of the game (having a higher percentage of good moves up to move 180) and then self-atari in the endgame etc. My point in the opening message is that Metta is surprisingly consistent in his internet performances prior to the accusation, and that the consistency is broken afterwards and in his regular games. Just to make sure, I do not consider it proof of cheating; he could just have been in a bad state of mind after the accusation, so it can be explained both ways. Still, it is worth noting.

Bill Spight · Post by **Bill Spight** » Tue Jul 10, 2018 7:54 am

AlesCieply wrote:
Bill Spight wrote:
AlesCieply wrote:Carlo normally outperforms his opponents in the internet games played before the accusation, especially in those played after the latest release of Leela.
Well, he won, didn't he?
I wish it was that simple. It is possible to have a higher percentage of good moves and still lose the game, the Kulkov-Metta PGETC belongs here in the new analysis. The Bryant-Metta is another one but there I really do not know what to think of the analysis. One can also lead for most of the game (having higher percentage of good moves up to move 180) and then self-atari in the endgame etc. My point in the opening message is that Metta is surprisingly consistent in his internet performances prior to the accusation and the consistency is broken afterwards and in his regular games. Just to make sure, I do not consider it as a proof of cheating, he could just have been in a bad state of mind after the accusation so it can be explained both ways. Still, it is worth noting.

I am not talking about proof of cheating, but of the comparison between Metta's play and that of his opponents (win-loss aside). I haven't seen any test of that comparison aside from the one I did on your early data, which showed no significant difference. Statistically, Carlo's play online was better than his play offline in the games analyzed. But I did not find that Carlo's online play was better than the online play of his opponents. If you can show that, please do.

Bill Spight · Post by **Bill Spight** » Tue Jul 10, 2018 8:05 am

Since we are talking about using bots to assess errors, the game in this post, viewtopic.php?p=233790#p233790 , may be of interest. Zen7 loses one point and the game by making an unnecessary protective play. Without that play it estimates its winrate at 39%. The position is easy for an amateur dan player to read out, IMO.

Uberdude · Post by **Uberdude** » Tue Jul 10, 2018 8:41 am

Bill Spight wrote:Since we are talking about using bots to assess errors, the game in this post, viewtopic.php?p=233790#p233790 , may be of interest. Zen7 loses one point and the game by making an unnecessary protective play. Without that play it estimates its winrate at 39%. The position is easy for an amateur dan player to read out, IMO.

I didn't follow all that thread, but as the creator didn't seem to understand about bots being trained on a fixed komi I would want to check the komi configuration was correct before concluding anything (e.g. Zen may have assumed a komi of 6.5 so thought it would win by 0.5 instead of 1.5 after defending, when in fact the game was played under 7.5 komi). Artificial intelligences are also good at artificial stupidity.

Bill Spight · Post by **Bill Spight** » Tue Jul 10, 2018 8:55 am

Uberdude wrote:
Bill Spight wrote:Since we are talking about using bots to assess errors, the game in this post, viewtopic.php?p=233790#p233790 , may be of interest. Zen7 loses one point and the game by making an unnecessary protective play. Without that play it estimates its winrate at 39%. The position is easy for an amateur dan player to read out, IMO.
I didn't follow all that thread, but as the creator didn't seem to understand about bots being trained on a fixed komi I would want to check the komi configuration was correct before concluding anything (e.g. Zen may have assumed a komi of 6.5 so thought it would win by 0.5 instead of 1.5 after defending, when in fact the game was played under 7.5 komi). Artificial intelligences are also good at artificial stupidity.

I just checked. The komi was 6.5 pts. and Black 309 made Black lose by 0.5.

Life In 19x19
