A Curious Case Study in KGS Ranks

Comments, questions, rants, etc, that are specifically about KGS go here.
dfunkt
Dies with sente
Posts: 78
Joined: Sun Jan 01, 2012 1:17 pm
Rank: 9k
GD Posts: 0
Universal go server handle: dfunkt
Has thanked: 10 times
Been thanked: 33 times

Re: A Curious Case Study in KGS Ranks

Post by dfunkt »

I'm not a math guy so most of this thread is incomprehensible to me, but as a go player it is much more fun to play on a server where your rank changes easily (although never on its own with no games played). I think Robert is right in that regard. I guess there are people who only want evenly matched games, but I like the rank roulette approach.
RBerenguel
Gosei
Posts: 1585
Joined: Fri Nov 18, 2011 11:44 am
Rank: KGS 5k
GD Posts: 0
KGS: RBerenguel
Tygem: rberenguel
Wbaduk: JohnKeats
Kaya handle: RBerenguel
Online playing schedule: KGS on Saturday I use to be online, but I can be if needed from 20-23 GMT+1
Location: Barcelona, Spain (GMT+1)
Has thanked: 576 times
Been thanked: 298 times

Re: A Curious Case Study in KGS Ranks

Post by RBerenguel »

High volatility means your current even game opponent can easily be 3 stones stronger than you. Fun?
Geek of all trades, master of none: the motto for my blog mostlymaths.net
Splatted
Lives in sente
Posts: 734
Joined: Mon Apr 26, 2010 12:41 pm
Rank: Washed up never was
GD Posts: 0
Universal go server handle: Splatted
Has thanked: 681 times
Been thanked: 138 times

Re: A Curious Case Study in KGS Ranks

Post by Splatted »

RBerenguel wrote:High volatility means your current even game opponent can easily be 3 stones stronger than you. Fun?


Yes
RBerenguel
Gosei
Posts: 1585
Joined: Fri Nov 18, 2011 11:44 am
Rank: KGS 5k
GD Posts: 0
KGS: RBerenguel
Tygem: rberenguel
Wbaduk: JohnKeats
Kaya handle: RBerenguel
Online playing schedule: KGS on Saturday I use to be online, but I can be if needed from 20-23 GMT+1
Location: Barcelona, Spain (GMT+1)
Has thanked: 576 times
Been thanked: 298 times

Re: A Curious Case Study in KGS Ranks

Post by RBerenguel »

Just in case, I ran an example of high volatility. In this new model, if players of rank A and B play each other, the change in rank is as big as abs(A-B); the smallest variation is 0.1 (for players less than 0.1 rank points apart).

With a small number of simulations it seemed as if convergence to "inner strength" was much faster and more effective than with a much smaller variation, but with 150k games played we get:

Screen Shot 2014-03-25 at 22.03.11.png


This, again, is the percentiles of differences between "inner" and "system" rankings.

Code:

Simulation: 100 * gauss(mu=20) for T=150000 steps having 149739 games played:

          top 1%    top 10%    top 25%  top 33.3%     median  bot 33.3%    bot 25%    bot 10%     bot 1%
start      35.89      26.52      19.60      18.25      10.79       6.38       5.14       1.30       0.00
mid         5.37       4.41       3.54       3.16       2.13       1.38       1.13       0.10       0.00
final       5.37       4.55       3.52       3.16       1.97       1.21       0.89       0.20       0.00


Half of the players have ranks that are wrong by about 2 stones. And this is with a very small pool and many, many games.
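For the curious, the high-volatility rule can be sketched in a few lines of Python. The post only pins down the 0.1 floor and the abs(A-B) bound on the rating swing, so the win-probability model, the spread of true strengths (sigma=5), the surprise scaling, and the random pairing below are all my guesses, not the actual simulation:

```python
import random

def win_prob(sa, sb):
    # Hypothetical win model (the post does not specify one): logistic
    # in the true-strength difference, ~76% for a one-rank gap.
    return 1.0 / (1.0 + 10 ** ((sb - sa) / 2.0))

def update(ra, rb, a_won, min_step=0.1):
    # The high-volatility rule as read here: the swing is bounded above
    # by the current rating gap abs(A-B) and below by 0.1, scaled by
    # how surprising the result was (an assumption on my part).
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 2.0))
    surprise = abs((1.0 if a_won else 0.0) - expected_a)
    max_step = max(abs(ra - rb), min_step)
    delta = min(max(surprise * max_step, min_step), max_step)
    if a_won:
        return ra + delta, rb - delta
    return ra - delta, rb + delta

def simulate(n=100, steps=150_000, seed=1):
    rng = random.Random(seed)
    true = [rng.gauss(20, 5) for _ in range(n)]  # sigma=5 is a guess
    rating = [20.0] * n                          # everyone starts equal
    for _ in range(steps):
        a, b = rng.sample(range(n), 2)
        a_won = rng.random() < win_prob(true[a], true[b])
        rating[a], rating[b] = update(rating[a], rating[b], a_won)
    # Sorted |"inner" - "system"| errors, as in the percentile tables.
    return sorted(abs(t - r) for t, r in zip(true, rating))

errs = simulate()
print(f"median error: {errs[len(errs) // 2]:.2f}, worst: {errs[-1]:.2f}")
```

Swapping the step bounds (max 0.1, min 0.0001) reproduces the low-volatility variant discussed next.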

If we instead use a very minor variation (essentially inverting the condition in the ranking shift, so the maximum is 0.1 and the minimum is 0.0001) we get a graph that is qualitatively similar, but quantitatively very different:

Screen Shot 2014-03-25 at 22.07.26.png


Code:

Simulation: 100 * gauss(mu=20) for T=150000 steps having 149883 games played:

          top 1%    top 10%    top 25%  top 33.3%     median  bot 33.3%    bot 25%    bot 10%     bot 1%
start      33.56      24.41      19.88      16.33      12.97       8.47       6.19       2.13       0.43
mid         1.72       1.03       0.76       0.68       0.45       0.28       0.20       0.09       0.00
final       1.78       1.39       0.95       0.73       0.53       0.36       0.30       0.12       0.00


The median is about 1.5 stones better, and the worst cases are far better.

With absurdly small numbers (max change 0.01, min change 0.0001), the convergence is incredibly slow and, surprisingly, not much better:

Code:

Simulation: 100 * gauss(mu=20) for T=500000 steps having 497391 games played:

          top 1%    top 10%    top 25%  top 33.3%     median  bot 33.3%    bot 25%    bot 10%     bot 1%
start      38.11      25.35      18.98      14.10       9.77       6.22       4.74       1.84       0.37
mid         1.98       1.64       1.22       1.01       0.85       0.70       0.61       0.26       0.03
final       1.74       1.39       0.94       0.87       0.64       0.41       0.24       0.08       0.00
Geek of all trades, master of none: the motto for my blog mostlymaths.net
Mef
Lives in sente
Posts: 852
Joined: Fri Apr 23, 2010 8:34 am
Rank: KGS [-]
GD Posts: 428
Location: Central Coast
Has thanked: 201 times
Been thanked: 333 times

Re: A Curious Case Study in KGS Ranks

Post by Mef »

Polama wrote:Nope, not equivalent. Plugging these numbers into a binomial calculator:

If we expect a 41% win rate, the probability of losing at least 236 games out of 242 'by chance' is about 10^-45.

If we factor in the ~17,000 opportunities for that streak, we're still around, call it, 10^-40.

For a player playing 2 games a day, that's 365 games in the 6-month span. If we say he went 0-4, that's 12%. Given 361 four-game spans, it's essentially a given to occur (1-(10^-20) or so?). We'd be extremely surprised if a 41% player didn't have a 4-game losing streak in 365 games, and even more surprised if he had a 3/242 streak in a 17,000 game span.

Wins are streaky by nature, so the probability will be higher in practice. But still, 10^-40 is roughly your odds of being dealt a royal flush in poker, 7 hands in a row.
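Polama's figures are easy to check directly. A short Python sketch (assuming independent games with a fixed 41% win rate, as the quote does):

```python
from math import comb

def loss_tail(games, losses, win_rate):
    """P(at least `losses` losses in `games` independent games)."""
    q = 1 - win_rate  # per-game loss probability
    return sum(comb(games, k) * q**k * win_rate ** (games - k)
               for k in range(losses, games + 1))

p = loss_tail(242, 236, 0.41)
print(f"P(>=236 losses in 242 games) = {p:.1e}")  # on the order of 1e-45

# A single 0-4 run for a 41% player, the "one bad day" case:
print(f"P(single 0-4 run) = {0.59 ** 4:.3f}")  # ~0.121, i.e. the 12%
```

With a per-window probability of ~0.12, some 4-game losing streak in 365 games is indeed a near-certainty, matching the 1-(10^-20) figure above.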


My comment was in reference to how you would expect the rank to be weighted similarly: the "one bad day" is going to account for ~5% of the total rank.

That said, I do enjoy posts that run the numbers and look at things analytically, so thank you! Your analysis up to here is correct, and you are further correct that this data would make it very easy to reject the claim that gnugo2 should be ranked as 11k over this stretch! However beyond that it is a little trickier...


Put another way: auto-resigning most games, he was probably, what, 30 kyu? So the fact that the system thought he'd only fallen 1/5 of a stone was extremely wrong. We know he was much worse than that. He demonstrated it over a very significant number of games. Which, as I understand it, is the most common complaint about the KGS rating system: that it overestimates (in this case, vastly overestimates) how much variation can be explained away by chance as the number of games played increases.


This is where we run into trouble. While we know that there was an implementation bug causing him to forfeit most of his games (and hence could be reasonably classified as 30k), the system does not have that benefit.

To put this another way: we know that it was behaving like a 30k, but just looking at the game results, how would we think it was behaving? Assuming this streak were perfectly representative of the true strength of this player, a win rate of 2.5% in properly handicapped games would project the player to be mis-ranked by somewhere in the ballpark of 2-3 stones. So looking at the data one might ask, "Which is more likely? It has suddenly lost the full 3 stones of strength, or it suddenly lost some strength but is also having a string of poor luck?" Another point worth noting is that we only have a resolution of 1 data point assigned to the rating graph for that day. Because of this we do not truly know how low the rating dropped; we can, however, assume that it did not drop below 12.0 for any appreciable amount of time.

If you look at the calculations being performed mid-streak in this regard, it gets a little more reasonable. Assume that instead of being 11k, the bot is actually a very weak 12k; then suddenly you perform your binomial computation to see what the odds are that it goes 5-195 with a 90% expected loss rate, and the probability is something like 3x10^-5, not unreasonable to observe in a 17,000 game streak. This is of course before we attempt to explain the 16,700 games of playing at 11k strength.
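That mid-streak binomial computation can be reproduced directly (a sketch assuming independent games at the stated 90% loss rate):

```python
from math import comb

# P(at most 5 wins in 200 games) for a player who wins each game with
# probability 0.10 -- the hypothetical "very weak 12k" going 5-195.
p = sum(comb(200, k) * 0.10**k * 0.90 ** (200 - k) for k in range(6))
print(f"P(<=5 wins in 200 games) = {p:.1e}")  # a few times 10^-5
```

That is the same order of magnitude as the 3x10^-5 quoted above, which is why the streak stops looking impossible once the assumed strength drops a stone.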

Another (oversimplified) way we could look at this is to approach the situation like one ez4u asked about a while back: what would you expect if you have a player who plays half the time >1D and the other half the time <30k? In this case we have a player who plays 95% of the time like an 11k, and 5% like a 30k. If you were to predict the outcome of their next game, slotting them as a 12k for the over-under isn't too unreasonable.
RobertJasiek
Judan
Posts: 6273
Joined: Tue Apr 27, 2010 8:54 pm
GD Posts: 0
Been thanked: 797 times

Re: A Curious Case Study in KGS Ranks

Post by RobertJasiek »

RBerenguel wrote:Is this close to your idea?


My rough idea for global stabilisation against in/deflation: Presume a logarithmic(?) function for the desired fraction of players above a given rating. Determine all those too high / low and their total difference from the desired distribution. Shift each such player by the total excess in either direction divided by the number of affected players in the same direction. - Such a global shift is small per player, and of course it does not need any anchors or predictions on player developments. - Could you run a sample with such a global stabilisation, please?

High volatility means your current even game opponent can easily be 3 stones stronger than you.


3 stones real [world] rank difference frequently occurs also under the current system, but there the problem is greater, because players can be permanently placed around a wrong mean rating.

Apart from various minor details and country exceptions, the EGF rating system does it much better, because the data is much better: it is retrieved mainly from McMahon tournaments, which already have a good pre-sorting of players.
RBerenguel
Gosei
Posts: 1585
Joined: Fri Nov 18, 2011 11:44 am
Rank: KGS 5k
GD Posts: 0
KGS: RBerenguel
Tygem: rberenguel
Wbaduk: JohnKeats
Kaya handle: RBerenguel
Online playing schedule: KGS on Saturday I use to be online, but I can be if needed from 20-23 GMT+1
Location: Barcelona, Spain (GMT+1)
Has thanked: 576 times
Been thanked: 298 times

Re: A Curious Case Study in KGS Ranks

Post by RBerenguel »

RobertJasiek wrote:
RBerenguel wrote:Is this close to your idea?


My rough idea for global stabilisation against in/deflation: Presume a logarithmic(?) function for the desired fraction of players above a given rating. Determine all those too high / low and their total difference from the desired distribution. Shift each such player by the total excess in either direction divided by the number of affected players in the same direction. - Such a global shift is small per player, and of course it does not need any anchors or predictions on player developments. - Could you run a sample with such a global stabilisation, please?

High volatility means your current even game opponent can easily be 3 stones stronger than you.


3 stones real [world] rank difference frequently occurs also under the current system, but there the problem is greater, because players can be permanently placed around a wrong mean rating.

Apart from various minor details and country exceptions, the EGF rating system does it much better, because the data is much better: it is retrieved mainly from McMahon tournaments, which already have a good pre-sorting of players.


I'm not sure I totally understand the stabilisation method. Should a normal (for instance) distribution be imposed on the rankings of players, so that correcting ranks adjusts all rankings toward a normal? This introduces a problem: is a normal distribution the correct model? I guess in my model I can add a "refit" step after every (say) 20 games to get closer to a normal, but I'm not sure how to do it quickly and easily. I'll think about it; let me know if this is the idea.
Geek of all trades, master of none: the motto for my blog mostlymaths.net
RobertJasiek
Judan
Posts: 6273
Joined: Tue Apr 27, 2010 8:54 pm
GD Posts: 0
Been thanked: 797 times

Re: A Curious Case Study in KGS Ranks

Post by RobertJasiek »

It is not a normal (Gaussian / binomial) distribution, because weak ranks are not rare. I do not know which function is the best. Logarithm is just a first guess. Deriving a function from Cieply's statistics would be another approach. OTOH, a reasonably good approximation to observed rank distributions in a known environment will do.

At a particular moment of time (at a different moment, you start afresh), let us suppose that P1..Pn are the ordered ratings of n players. (Maybe they can be rounded to multiples of 0.1.) Assume that we have found a function F, from which we can derive an ideal distribution of ratings I1..In of an ideal player population. Determine Pi - Ii for all i=1..n. This will classify the actual players into three kinds: too great rating, ideal rating, too small rating. If there are more / fewer players with too great / small rating, the inflation / deflation must be corrected as follows: Modify only the players with either too great / small rating. For the set of modified players, calculate the sum of Pj - Ij. Divide by the number of modified players. Apply the value to each modified player.

This is just a sketch, surely details can be varied. Maybe two modifications should be made, one for the too great, one for the too small ratings group of players.

Every day, week or month (I do not know yet which frequency is needed), make such a modification. Note: you start by renumbering the players afresh!
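Read literally, the sketch above might look like this in Python. The ideal ratings I1..In are taken as given (the choice of the function F is left open in the post), and this version implements the suggested variant where only the larger of the two mismatched groups is shifted:

```python
def stabilise(ratings, ideal):
    """One global anti-in/deflation step, following the sketch above.

    ratings: current ratings P1..Pn (any order).
    ideal:   ideal ratings I1..In drawn from the desired distribution F.
    Returns corrected ratings, sorted descending (i-th best actual
    player is compared with the i-th best ideal rating).
    """
    p = sorted(ratings, reverse=True)
    i = sorted(ideal, reverse=True)
    diffs = [pj - ij for pj, ij in zip(p, i)]   # Pi - Ii
    high = [d for d in diffs if d > 0]          # rated too high
    low = [d for d in diffs if d < 0]           # rated too low
    if len(high) > len(low):                    # net inflation
        shift = sum(high) / len(high)           # average excess
        return [r - shift if d > 0 else r for r, d in zip(p, diffs)]
    if len(low) > len(high):                    # net deflation
        shift = sum(low) / len(low)             # negative average
        return [r - shift if d < 0 else r for r, d in zip(p, diffs)]
    return p                                    # balanced: no change

# Toy example: two players over-rated by 2, one under-rated by 2.
print(stabilise([10, 9, 3], [8, 7, 5]))  # -> [8.0, 7.0, 3]
```

Note the toy example shows a side effect worth checking: the under-rated player is left untouched, so one correction step removes the inflation but does not make every player ideal.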
SmoothOper
Lives in sente
Posts: 946
Joined: Thu Apr 19, 2012 9:38 am
Rank: IGS 5kyu
GD Posts: 0
KGS: KoDream
IGS: SmoothOper
Has thanked: 1 time
Been thanked: 41 times

Re: A Curious Case Study in KGS Ranks

Post by SmoothOper »

Long term stable ranks based on extensive histories aren't really a selling point IMO. To some extent it is nice not to have to worry about dropping a game here or there, but not that much.
RBerenguel
Gosei
Posts: 1585
Joined: Fri Nov 18, 2011 11:44 am
Rank: KGS 5k
GD Posts: 0
KGS: RBerenguel
Tygem: rberenguel
Wbaduk: JohnKeats
Kaya handle: RBerenguel
Online playing schedule: KGS on Saturday I use to be online, but I can be if needed from 20-23 GMT+1
Location: Barcelona, Spain (GMT+1)
Has thanked: 576 times
Been thanked: 298 times

Re: A Curious Case Study in KGS Ranks

Post by RBerenguel »

RobertJasiek wrote:It is not a normal (Gaussian / binomial) distribution, because weak ranks are not rare. I do not know which function is the best. Logarithm is just a first guess. Deriving a function from Cieply's statistics would be another approach. OTOH, a reasonably good approximation to observed rank distributions in a known environment will do.

At a particular moment of time (at a different moment, you start afresh), let us suppose that P1..Pn are the ordered ratings of n players. (Maybe they can be rounded to multiples of 0.1.) Assume that we have found a function F, from which we can derive an ideal distribution of ratings I1..In of an ideal player population. Determine Pi - Ii for all i=1..n. This will classify the actual players into three kinds: too great rating, ideal rating, too small rating. If there are more / fewer players with too great / small rating, the inflation / deflation must be corrected as follows: Modify only the players with either too great / small rating. For the set of modified players, calculate the sum of Pj - Ij. Divide by the number of modified players. Apply the value to each modified player.

This is just a sketch, surely details can be varied. Maybe two modifications should be made, one for the too great, one for the too small ratings group of players.

Every day, week or month (I do not know yet which frequency is needed), make such a modification. Note: you start by renumbering the players afresh!


Rank/rating seems to behave in a more complex way than that:

Histogram of current GoR with 30 breaks:

Screen Shot 2014-03-26 at 14.19.50.png


From this plot, it almost appears to be a bimodal distribution (with mass accumulating at 100 and 1500). Increasing the number of breaks in the histogram hints at even more "minor" modes; actually, just increasing from 30 (this image) to 40 adds a clear-cut peak-and-valley between 1500 and 2000. In other words, if I saw this plot and was asked to model it, I'd probably dust off the Fourier tools before the statistical tools.

Here's with 60 breaks:

Screen Shot 2014-03-26 at 14.27.22.png


Edit: hid the plots in spoiler tags to avoid messing up the layout
Last edited by RBerenguel on Wed Mar 26, 2014 1:05 pm, edited 1 time in total.
Geek of all trades, master of none: the motto for my blog mostlymaths.net
RobertJasiek
Judan
Posts: 6273
Joined: Tue Apr 27, 2010 8:54 pm
GD Posts: 0
Been thanked: 797 times

Re: A Curious Case Study in KGS Ranks

Post by RobertJasiek »

Thanks, but these data can be influenced by how frequently players attend tournaments or play regularly on servers, which might be lower for beginner ranks.
Polama
Lives with ko
Posts: 248
Joined: Wed Nov 14, 2012 1:47 pm
Rank: DGS 2 kyu
GD Posts: 0
Universal go server handle: Polama
Has thanked: 23 times
Been thanked: 148 times

Re: A Curious Case Study in KGS Ranks

Post by Polama »

Mef wrote:My comment was in reference to how you would expect the rank to be weighted similarly: the "one bad day" is going to account for ~5% of the total rank.

That said, I do enjoy posts that run the numbers and look at things analytically, so thank you! Your analysis up to here is correct, and you are further correct that this data would make it very easy to reject the claim that gnugo2 should be ranked as 11k over this stretch! However beyond that it is a little trickier...


It was actually the math that drew me into this conversation =). I don't have particularly strong opinions on the KGS scoring algorithm, but I was intrigued by how extreme the probability was that this span would happen by chance.

This is where we run into trouble. While we know that there was an implementation bug causing him to forfeit most of his games (and hence could be reasonably classified as 30k), the system does not have that benefit.


That's why I think it made such a good test case, because we know the truth and the algorithm didn't. If a player complains they should be 2 stones stronger, we don't know for certain if the algorithm is slow or they're overestimating their ability. Here we can compare the 'truth' to the system's perception of the truth.

If you look at the calculations being performed mid-streak in this regard, it gets a little more reasonable. Assume that instead of being 11k, the bot is actually a very weak 12k; then suddenly you perform your binomial computation to see what the odds are that it goes 5-195 with a 90% expected loss rate, and the probability is something like 3x10^-5, not unreasonable to observe in a 17,000 game streak. This is of course before we attempt to explain the 16,700 games of playing at 11k strength.


Yes, I agree that roughly 2 stones weaker would be on the low side of consistent with the evidence (I think a 10% win rate for 13 kyu vs 11 kyu is low, though; the variation is higher in the DDKs). Also, I think looking at it mid-streak would actually exacerbate this case, because we could reasonably infer at least 1 stone of lost strength sometime in the first 50 games or so, and if we implemented the change then, we'd continue to see the same bad performance against weaker and weaker opponents.

It's not that I think the bot should've plummeted to 30 kyu immediately, but 1/5 of a stone seems wildly low compared with the evidence.

Another (oversimplified) way we could look at this is to approach the situation like one ez4u asked about a while back: what would you expect if you have a player who plays half the time >1D and the other half the time <30k? In this case we have a player who plays 95% of the time like an 11k, and 5% like a 30k. If you were to predict the outcome of their next game, slotting them as a 12k for the over-under isn't too unreasonable.


The key, as I see it, is that the player isn't randomly playing badly 5% of the time, but is playing badly in contiguous bursts 5% of the time. Given that, your best prediction is to look at how he was doing in his most recent games.

That his results aren't inconsistent with being 2-3 stones off is, I suppose, part of the point. Let's say it was a human who played gigantic numbers of blitz 11-kyu games and never improved. Finally, he started studying under a professional teacher, devoting his time to tsumego and playing carefully. He switches to slower games, so he's playing 1/5 as many games as before. Over a couple of weeks he goes completely undefeated in every match. If the system isn't willing to say that's worth at least 2 stones, it seems to be way too rigid. The model seems perfectly tuned for a static rank, but it seems to be under-appreciating our ability to improve or get worse at the game in bursts.
arndt
Beginner
Posts: 3
Joined: Fri Oct 11, 2013 12:07 am
GD Posts: 0
KGS: arndt

Re: A Curious Case Study in KGS Ranks

Post by arndt »

I'm coming in late, and I haven't read the preceding discussion in detail, but has it been stated what the ranks of the users challenging the bot during its resigning period were? If they were 15k to 10k, why should it plummet to 30k rather than 15k?

Sorry if this was already asked and answered.
skydyr
Oza
Posts: 2495
Joined: Wed Aug 01, 2012 8:06 am
GD Posts: 0
Universal go server handle: skydyr
Online playing schedule: When my wife is out.
Location: DC
Has thanked: 156 times
Been thanked: 436 times

Re: A Curious Case Study in KGS Ranks

Post by skydyr »

Polama wrote:
Mef wrote:My comment was in reference to how you would expect the rank to be weighted similarly: the "one bad day" is going to account for ~5% of the total rank.

That said, I do enjoy posts that run the numbers and look at things analytically, so thank you! Your analysis up to here is correct, and you are further correct that this data would make it very easy to reject the claim that gnugo2 should be ranked as 11k over this stretch! However beyond that it is a little trickier...


It was actually the math that drew me into this conversation =). I don't have particularly strong opinions on the KGS scoring algorithm, but I was intrigued by how extreme the probability was that this span would happen by chance.

This is where we run into trouble. While we know that there was an implementation bug causing him to forfeit most of his games (and hence could be reasonably classified as 30k), the system does not have that benefit.


That's why I think it made such a good test case, because we know the truth and the algorithm didn't. If a player complains they should be 2 stones stronger, we don't know for certain if the algorithm is slow or they're overestimating their ability. Here we can compare the 'truth' to the system's perception of the truth.

If you look at the calculations being performed mid-streak in this regard, it gets a little more reasonable. Assume that instead of being 11k, the bot is actually a very weak 12k; then suddenly you perform your binomial computation to see what the odds are that it goes 5-195 with a 90% expected loss rate, and the probability is something like 3x10^-5, not unreasonable to observe in a 17,000 game streak. This is of course before we attempt to explain the 16,700 games of playing at 11k strength.


Yes, I agree that roughly 2 stones weaker would be on the low side of consistent with the evidence (I think a 10% win rate for 13 kyu vs 11 kyu is low, though; the variation is higher in the DDKs). Also, I think looking at it mid-streak would actually exacerbate this case, because we could reasonably infer at least 1 stone of lost strength sometime in the first 50 games or so, and if we implemented the change then, we'd continue to see the same bad performance against weaker and weaker opponents.

It's not that I think the bot should've plummeted to 30 kyu immediately, but 1/5 of a stone seems wildly low compared with the evidence.

Another (oversimplified) way we could look at this is to approach the situation like one ez4u asked about a while back: what would you expect if you have a player who plays half the time >1D and the other half the time <30k? In this case we have a player who plays 95% of the time like an 11k, and 5% like a 30k. If you were to predict the outcome of their next game, slotting them as a 12k for the over-under isn't too unreasonable.


The key, as I see it, is that the player isn't randomly playing badly 5% of the time, but is playing badly in contiguous bursts 5% of the time. Given that, your best prediction is to look at how he was doing in his most recent games.

That his results aren't inconsistent with being 2-3 stones off is, I suppose, part of the point. Let's say it was a human who played gigantic numbers of blitz 11-kyu games and never improved. Finally, he started studying under a professional teacher, devoting his time to tsumego and playing carefully. He switches to slower games, so he's playing 1/5 as many games as before. Over a couple of weeks he goes completely undefeated in every match. If the system isn't willing to say that's worth at least 2 stones, it seems to be way too rigid. The model seems perfectly tuned for a static rank, but it seems to be under-appreciating our ability to improve or get worse at the game in bursts.


If this is perceived as a problem, it would seem that it could be fixed by changing the historical weighting algorithm so that older games expire or are devalued faster than they currently are, or so that the most recent X games are given more weight than normal.
Mef
Lives in sente
Posts: 852
Joined: Fri Apr 23, 2010 8:34 am
Rank: KGS [-]
GD Posts: 428
Location: Central Coast
Has thanked: 201 times
Been thanked: 333 times

Re: A Curious Case Study in KGS Ranks

Post by Mef »

Polama wrote:
That's why I think it made such a good test case, because we know the truth and the algorithm didn't. If a player complains they should be 2 stones stronger, we don't know for certain if the algorithm is slow or they're overestimating their ability. Here we can compare the 'truth' to the system's perception of the truth.

If you look at the calculations being performed mid-streak in this regard, it gets a little more reasonable. Assume that instead of being 11k, the bot is actually a very weak 12k; then suddenly you perform your binomial computation to see what the odds are that it goes 5-195 with a 90% expected loss rate, and the probability is something like 3x10^-5, not unreasonable to observe in a 17,000 game streak. This is of course before we attempt to explain the 16,700 games of playing at 11k strength.


Yes, I agree that roughly 2 stones weaker would be on the low side of consistent with the evidence (I think a 10% win rate for 13 kyu vs 11 kyu is low, though; the variation is higher in the DDKs). Also, I think looking at it mid-streak would actually exacerbate this case, because we could reasonably infer at least 1 stone of lost strength sometime in the first 50 games or so, and if we implemented the change then, we'd continue to see the same bad performance against weaker and weaker opponents.

It's not that I think the bot should've plummeted to 30 kyu immediately, but 1/5 of a stone seems wildly low compared with the evidence.

Another (oversimplified) way we could look at this is to approach the situation like one ez4u asked about a while back: what would you expect if you have a player who plays half the time >1D and the other half the time <30k? In this case we have a player who plays 95% of the time like an 11k, and 5% like a 30k. If you were to predict the outcome of their next game, slotting them as a 12k for the over-under isn't too unreasonable.


The key, as I see it, is that the player isn't randomly playing badly 5% of the time, but is playing badly in contiguous bursts 5% of the time. Given that, your best prediction is to look at how he was doing in his most recent games.

That his results aren't inconsistent with being 2-3 stones off is, I suppose, part of the point. Let's say it was a human who played gigantic numbers of blitz 11-kyu games and never improved. Finally, he started studying under a professional teacher, devoting his time to tsumego and playing carefully. He switches to slower games, so he's playing 1/5 as many games as before. Over a couple of weeks he goes completely undefeated in every match. If the system isn't willing to say that's worth at least 2 stones, it seems to be way too rigid. The model seems perfectly tuned for a static rank, but it seems to be under-appreciating our ability to improve or get worse at the game in bursts.



So now we are mixing lots of things --

First, while we talk about it as making a shift at a certain point, KGS is actually constantly adjusting the predicted rank so that it fits the maximum likelihood of the data available. You are correct that in this case KGS would have continued to see bad results as it dropped the rank, and had the bot not been stopped after 1 day its rating drop would have continued accelerating (once again, we don't actually know just how far it dropped into the 12k range, just that it was somewhere between 0.2 and 1.3 stones). Again, it's also worth noting that the performance observed wasn't necessarily indicative of a huge change, just one of perhaps 2-3 stones. The characterization of playing like a 30k was probably not entirely accurate; for instance, here is one of the games it won:

[embedded game record]
Now, moving on to your hypothetical, we are venturing into the part of these discussions I dislike: instead of having real data and real results being analyzed, we have a made-up situation, with vague "data" being presented, theorized behavior being speculated on, and then the rating system criticized for it. If you literally have a 3-stone step change overnight and change your playing patterns in order to exacerbate how long it will take the rating system to compensate... then yes, it is possible that you are a corner case that the system was not set up to handle in an ideal manner. This has happened before and it has disrupted the KGS rating system before, but to my knowledge it has never been by a person; it was when a stronger open-source bot was released, so all the GnuGo bots got upgraded (and were 2 stones stronger overnight).

A more realistic scenario would be something like what some members of this forum have done: take a summer off to travel to China and study go. It's entirely plausible that after a 90-day break a player returns 3 stones stronger. Let's assume this player comes back to KGS and, as you say, plays games at 1/5 their original rate. If we assume this player was 5k or weaker before leaving, then by the end of the first week their "new" games will compose ~23.35% of their rank. By the end of two weeks, >40% of their rank will be based on the new games. It will not take long for them to find a new equilibrium close to where they should be.
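The shape of that arithmetic can be sketched if one assumes, purely for illustration, that game weights decay exponentially with age. The half-life and play rates below are made-up parameters, not KGS's actual ones, so the printed fractions are only qualitatively comparable to the figures above:

```python
from math import exp, log

def recent_weight_fraction(days, old_rate, new_rate, half_life):
    """Fraction of total rating weight carried by games from the last
    `days` days, for a player who previously played `old_rate`
    games/day (for a long time) and now plays `new_rate` games/day,
    when each game's weight halves every `half_life` days."""
    lam = log(2) / half_life
    new_w = new_rate * (1 - exp(-lam * days)) / lam   # recent era
    old_w = old_rate * exp(-lam * days) / lam         # decayed history
    return new_w / (new_w + old_w)

# Illustrative only: 10 games/day before the break, 2/day after
# (the 1/5 rate from the discussion), and a guessed 5-day half-life.
for d in (7, 14):
    print(f"after {d} days: {recent_weight_fraction(d, 10, 2, 5):.0%}")
```

Even at 1/5 the playing rate, the decayed history loses ground quickly, which is the point being made: the returning player converges to a new equilibrium within a few weeks.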

Nevertheless, even for the corner case KGS has a simple way to solve this problem: play games handicapped at the rating you think you should be! This will let you reach your equilibrium faster and, unlike many other rating systems, it does not penalize the opponents who help you get there.