Re: Derived Metrics for the Game of Go
Posted: Tue Nov 10, 2020 10:28 pm
Now I discuss the rest of the paper.
For cheat detection, the paper considers a winrate graph over a game's moves according to the AI's stated probabilities. A single cheating player is described as producing a graph that climbs steadily towards 99%; both players cheating is described as producing a roughly constant graph.
Leaving aside the indirect calculation of winrates, I agree that such graphs can flag suspicious players, because they show a player making essentially no significant mistakes. They are an indication that cheating may have occurred, but not proof of it.
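To make the two described graph shapes concrete, here is a minimal sketch, not the paper's actual method: the thresholds and the drop tolerance are heuristics I invent for illustration, and a real procedure would need the validation I call for below.

```python
# Hypothetical sketch of the two winrate-graph patterns described above.
# Input: one player's per-move winrates in [0, 1]. All thresholds are
# invented for illustration, not taken from the paper.

def classify_winrate_curve(winrates, climb_end=0.99, flat_band=0.05):
    """Label a winrate curve as 'steady climb', 'constant', or 'other'."""
    # Count moves where the winrate drops noticeably (a "mistake").
    drops = sum(1 for a, b in zip(winrates, winrates[1:]) if b < a - 0.02)
    # "One player cheating": curve rises to ~99% with no noticeable drops.
    if winrates[-1] >= climb_end and drops == 0:
        return "steady climb"
    # "Both players cheating": curve stays within a narrow band.
    if max(winrates) - min(winrates) <= flat_band:
        return "constant"
    return "other"

print(classify_winrate_curve([0.50, 0.60, 0.75, 0.90, 0.99]))  # steady climb
print(classify_winrate_curve([0.50, 0.52, 0.49, 0.51, 0.50]))  # constant
```

As the text says, either label is at most grounds for suspicion, never proof.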
The paper claims that a player's average 'effect' and the consistent development of his moves' effects indicate his skill. Since effect is calculated from scoremeans, this is, again, wrong: effects indicate only a model of skill. See my earlier remarks.
The paper repeats the mistakes I have already pointed out for the earlier sections.
It compares a player's moves to KataGo's first move suggestions; close agreement is said to make a player suspicious. Suspicious, why not; every player can be suspected of cheating. However, there are explanations other than cheating: a player may have trained extensively with KataGo or may simply have a similar playing style. Besides, there is the problem that a player might cheat using a different AI program or a different KataGo network; in that case, KataGo's moves are not a particularly suitable baseline for comparison with the player's moves.
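The comparison itself is trivial to compute; that is part of why it is so tempting to over-interpret. A sketch of my reading of it, with made-up coordinates for illustration:

```python
# Hypothetical sketch of the move-matching comparison (my reading of the
# paper, not its code): the fraction of a player's moves that coincide
# with KataGo's top suggestion. The move coordinates are invented.

def match_rate(player_moves, katago_top_moves):
    """Fraction of moves agreeing with the engine's first suggestion."""
    pairs = list(zip(player_moves, katago_top_moves))
    return sum(p == k for p, k in pairs) / len(pairs)

player = ["D4", "Q16", "C3", "R4"]
engine = ["D4", "Q16", "D17", "R4"]
print(match_rate(player, engine))  # 0.75
```

A high rate is compatible with cheating, with having studied much with KataGo, or with a similar style; the number alone does not distinguish these.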
The paper studies four example games, discussing indications of cheating. Again, it frequently repeats its earlier mistakes. Per player and game, a few indicators are considered to judge whether cheating occurred. Most alleged indicators are interpreted as indicating cheating, although they can also be interpreted as the opposite. The paper's systematic repetition of its earlier mistakes, combined with indicators interpreted with a prejudice towards cheating, produces alleged detection of cheaters on the implied assumption that several such aspects combined would be sufficiently convincing evidence. A word of caution that false allegations can occur only serves as an alibi. Such an approach to cheating detection is bound to "detect" cheaters regardless of what percentage of players is judged wrongly.
Another part of the problem is that the paper introduces values (called metrics) and applies them while providing no theory for deciding when values, or combinations of values, do or do not indicate cheating. This is like doing statistics without confidence thresholds: none for individual values, let alone for combinations of several values of the same kind, let alone for combinations of different kinds of values.
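For contrast, here is the kind of explicit threshold I mean, sketched as a simple binomial tail test on a match rate. The baseline agreement probability is an assumption I invent here; establishing a realistic baseline would itself require the large samples discussed below.

```python
# Sketch of an explicit statistical threshold, of the kind the paper
# lacks: how surprising is a given agreement count under an assumed
# baseline agreement probability? The 50% baseline is invented for
# illustration only.

from math import comb

def binomial_p_value(matches, n, p_baseline):
    """P(at least `matches` agreements out of n) under the baseline."""
    return sum(comb(n, k) * p_baseline**k * (1 - p_baseline)**(n - k)
               for k in range(matches, n + 1))

# 45 of 50 moves matching under a 50% baseline is extremely surprising;
# 25 of 50 is entirely unremarkable.
print(binomial_p_value(45, 50, 0.5))
print(binomial_p_value(25, 50, 0.5))
```

Even this toy test only quantifies surprise relative to one assumed baseline for one value; combining several values, or several kinds of values, needs further theory still.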
Value graphs are left to the subconscious interpretation of human arbiters. Instead, there ought to be a theory for interpreting the data represented in the graphs, such as analysing the differences between two value curves in a graph.
Such a theory requires validation against very large samples of games, their values and their graphs. This is also necessary because different board positions and different games can show different behaviours of the values. E.g., imagine a semeai with two local maxima, one correct and one wrong; when an average is calculated over a roughly balanced tree search, the average itself will be a wrong indication. Currently, such a value is treated as being as trustworthy as the values for ordinary positions.
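The semeai objection can be shown in two lines; the numbers are invented. If the visited leaves split roughly evenly between winning and losing the capturing race, the averaged scoremean lies between the two outcomes and corresponds to no actual line of play:

```python
# Toy illustration of the semeai objection above (invented numbers):
# averaging bimodal leaf scores yields a value no real line achieves.

def averaged_scoremean(leaf_scores):
    """Plain mean over the scores of visited search leaves."""
    return sum(leaf_scores) / len(leaf_scores)

# Half the visited leaves win the capturing race (+20 points),
# half lose it (-20 points):
leaves = [20.0] * 5 + [-20.0] * 5
print(averaged_scoremean(leaves))  # 0.0, between the two real outcomes
```

Any metric built on such a scoremean inherits this distortion, yet the paper treats all positions' values alike.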
The paper's value analysis, applied to players, creates an unfair prejudice: players with certain playing styles, players who study with particular AI programs, or players who have studied a great deal with AI are in much greater danger of being wrongly flagged as cheaters.
In conclusion, although the paper suggests some values potentially useful for certain studies or models, the theory is very far from safe application as a means of distinguishing cheating from non-cheating, except for the mentioned cases in which tool usage was followed by a player admitting to cheating. Currently, the theory is very incomplete, is over-interpreted, and is frequently advertised by the paper's authors within the paper as being more than it is: an alleged description of reality, such as "a player's skill", rather than only a model, such as "a model of a player's skill"; furthermore, a model lacking the quality evaluation that is essential for the promoted application of cheating detection.