Re: Derived Metrics for the Game of Go
Posted: Tue Nov 10, 2020 10:28 pm
Now I discuss the rest of the paper.
For cheat detection, the paper considers a winrate graph over a game's moves according to the AI's stated probabilities. A single cheating player is described as producing a graph that climbs steadily towards 99%; both players cheating is described as producing a roughly constant graph.
Leaving aside the indirect calculation of winrates, I agree that such graphs can flag suspicious players, because they show a player making essentially no significant mistakes. They are an indication that cheating may have occurred, but not proof of it.
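To make the two described graph shapes concrete, here is a minimal sketch, not the paper's actual method: the thresholds and the drop tolerance are heuristics I invent for illustration, and a real procedure would need the validation I call for below.

```python
# Hypothetical sketch of the two winrate-graph patterns described above.
# Input: one player's per-move winrates in [0, 1]. All thresholds are
# invented for illustration, not taken from the paper.

def classify_winrate_curve(winrates, climb_end=0.99, flat_band=0.05):
    """Label a winrate curve as 'steady climb', 'constant', or 'other'."""
    # Count moves where the winrate drops noticeably (a "mistake").
    drops = sum(1 for a, b in zip(winrates, winrates[1:]) if b < a - 0.02)
    # "One player cheating": curve rises to ~99% with no noticeable drops.
    if winrates[-1] >= climb_end and drops == 0:
        return "steady climb"
    # "Both players cheating": curve stays within a narrow band.
    if max(winrates) - min(winrates) <= flat_band:
        return "constant"
    return "other"

print(classify_winrate_curve([0.50, 0.60, 0.75, 0.90, 0.99]))  # steady climb
print(classify_winrate_curve([0.50, 0.52, 0.49, 0.51, 0.50]))  # constant
```

As the text says, either label is at most grounds for suspicion, never proof.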
The paper claims that a player's average 'effect' and the consistent development of his moves' effects indicate his skill. Since effect is calculated from scoremeans, this is, again, wrong: effects indicate only a model of skill. See my earlier remarks.
The paper repeats the mistakes I have already pointed out for the earlier sections.
It compares a player's moves to KataGo's first move suggestions; close agreement is said to make a player suspicious. Suspicious, why not; every player can be suspected of cheating. However, there are explanations other than cheating: a player may have trained extensively with KataGo or may simply have a similar playing style. Besides, there is the problem that a player might cheat using a different AI program or a different KataGo network; in that case, KataGo's moves are not a particularly suitable baseline for comparison with the player's moves.
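The comparison itself is trivial to compute; that is part of why it is so tempting to over-interpret. A sketch of my reading of it, with made-up coordinates for illustration:

```python
# Hypothetical sketch of the move-matching comparison (my reading of the
# paper, not its code): the fraction of a player's moves that coincide
# with KataGo's top suggestion. The move coordinates are invented.

def match_rate(player_moves, katago_top_moves):
    """Fraction of moves agreeing with the engine's first suggestion."""
    pairs = list(zip(player_moves, katago_top_moves))
    return sum(p == k for p, k in pairs) / len(pairs)

player = ["D4", "Q16", "C3", "R4"]
engine = ["D4", "Q16", "D17", "R4"]
print(match_rate(player, engine))  # 0.75
```

A high rate is compatible with cheating, with having studied much with KataGo, or with a similar style; the number alone does not distinguish these.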
The paper studies four example games, discussing indications of cheating. Again, it frequently repeats its earlier mistakes. Per player and game, a few indicators are considered to judge whether cheating occurred. Most alleged indicators are interpreted as indicating cheating, although they can also be interpreted as the opposite. The paper's systematic repetition of its earlier mistakes, combined with indicators interpreted with a prejudice towards cheating, produces alleged detection of cheaters on the implied assumption that several such aspects combined would be sufficiently convincing evidence. A word of caution that false allegations can occur only serves as an alibi. Such an approach to cheating detection is bound to "detect" cheaters regardless of what percentage of players is judged wrongly.
Another part of the problem is that the paper introduces values (called metrics) and applies them while providing no theory for deciding when values, or combinations of values, do or do not indicate cheating. This is like doing statistics without confidence thresholds: none for individual values, let alone for combinations of several values of the same kind, let alone for combinations of different kinds of values.
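For contrast, here is the kind of explicit threshold I mean, sketched as a simple binomial tail test on a match rate. The baseline agreement probability is an assumption I invent here; establishing a realistic baseline would itself require the large samples discussed below.

```python
# Sketch of an explicit statistical threshold, of the kind the paper
# lacks: how surprising is a given agreement count under an assumed
# baseline agreement probability? The 50% baseline is invented for
# illustration only.

from math import comb

def binomial_p_value(matches, n, p_baseline):
    """P(at least `matches` agreements out of n) under the baseline."""
    return sum(comb(n, k) * p_baseline**k * (1 - p_baseline)**(n - k)
               for k in range(matches, n + 1))

# 45 of 50 moves matching under a 50% baseline is extremely surprising;
# 25 of 50 is entirely unremarkable.
print(binomial_p_value(45, 50, 0.5))
print(binomial_p_value(25, 50, 0.5))
```

Even this toy test only quantifies surprise relative to one assumed baseline for one value; combining several values, or several kinds of values, needs further theory still.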
Value graphs are left to the subconscious interpretation of human arbiters. Instead, there ought to be a theory for interpreting the data represented in the graphs, such as analysing the differences between two value curves in a graph.
Such a theory requires validation against very large samples of games, their values and their graphs. This is also necessary because different board positions and different games can show different behaviours of the values. E.g., imagine a semeai with two local maxima, one correct and one wrong; when an average is calculated over a roughly balanced tree search, the average itself will be a wrong indication. Currently, such a value is treated as being as trustworthy as the values for ordinary positions.
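The semeai objection can be shown in two lines; the numbers are invented. If the visited leaves split roughly evenly between winning and losing the capturing race, the averaged scoremean lies between the two outcomes and corresponds to no actual line of play:

```python
# Toy illustration of the semeai objection above (invented numbers):
# averaging bimodal leaf scores yields a value no real line achieves.

def averaged_scoremean(leaf_scores):
    """Plain mean over the scores of visited search leaves."""
    return sum(leaf_scores) / len(leaf_scores)

# Half the visited leaves win the capturing race (+20 points),
# half lose it (-20 points):
leaves = [20.0] * 5 + [-20.0] * 5
print(averaged_scoremean(leaves))  # 0.0, between the two real outcomes
```

Any metric built on such a scoremean inherits this distortion, yet the paper treats all positions' values alike.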
The paper's value analysis, applied to players, creates an unfair prejudice: players with certain playing styles, players who study with particular AI programs, or players who have studied a great deal with AI are in much greater danger of being wrongly flagged as cheaters.
In conclusion, although the paper suggests some values potentially useful for certain studies or models, the theory is very far from safe application as a means of distinguishing cheating from non-cheating, except for the mentioned cases in which tool usage was followed by a player admitting to cheating. Currently, the theory is very incomplete, is over-interpreted, and is frequently advertised by the paper's authors within the paper as being more than it is: an alleged description of reality, such as "a player's skill", rather than only a model, such as "a model of a player's skill"; furthermore, a model lacking the quality evaluation that is essential for the promoted application of cheating detection.