Post subject: Re: A Curious Case Study in KGS Ranks
Post #21 Posted: Tue Mar 25, 2014 9:32 am
RBerenguel (Rank: KGS 5k)

Polama wrote:
RBerenguel wrote:
Can't this just be explained by history inertia? It may be statistically relevant, but the KGS ranking system (IIRC, it's been a while since I checked it) is almost a predictor-corrector system (sorry for the term; it is used in numerical analysis, for example): it will heavily rely on history to predict the rank, probably correcting after more data points are available. Sure, a huge losing streak is significant, and current, but the historical weight says otherwise, and dampens the current "error".


The algorithm's choice can be explained by history inertia. But the actual performance can't be. If you view a rank as a fixed, static thing and you hit a 200-game losing streak, the best you can do is throw your hands up, say "that was weird!", and adjust your prediction down slightly. But this streak clearly demonstrates that this account's ability is not static, and that the previous 17,000 games are no longer particularly meaningful. When we're at 10^-40 probability, it's significantly more likely that, say, the person suffered extreme head trauma than that they're having a bad day.

The model may work better with humans. But this case is a demonstration that at extreme numbers of games it can no longer respond to absurdly strong signals of a change in rank.

Now, it may be that there's an explicit time mechanism, and that if this account were left to run for a month it would eventually plummet rapidly to 30 kyu. That would be sensible, because the most likely case seems to be that somebody else logged into this account today. You'd want measures from multiple days to be certain. But if we're just looking at game results, the effect should definitely be way, way, way stronger.


I assume the 6-month history is log- or inverse-exponentially weighted, so that the first game used barely signals anything. So I guess that after 3-4 days of losing everything, the rank may start plummeting, faster and faster (since as soon as you start losing to lower-ranked players the rank will fall faster). The weights used, the history timings and other details essentially determine the volatility of rank within the system, but from what I saw of the formula (as I say, it was a long time ago since I last checked it), it's essentially as easy to go up as it is to go down (as Mef explains with the "low 4d" / "high 4d" numbers in one of the linked threads).
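To make the idea concrete, here is a toy sketch in Python of the kind of decaying weighting I mean (this is not the actual KGS formula; the half-life, the logistic expectation and the scale are made-up parameters):

Code:
import math

def weight(age_days, half_life=45.0):
    # Older results count exponentially less; half_life is a made-up parameter.
    return 0.5 ** (age_days / half_life)

def expected_score(rating, opp_rating, scale=1.0):
    # Logistic win expectancy; "scale" rating points per stone is an assumption.
    return 1.0 / (1.0 + math.exp((opp_rating - rating) / scale))

def weighted_surprise(rating, games):
    # games: list of (age_days, opponent_rating, result) with result 1=win, 0=loss.
    # The rating "fits" the history when this weighted sum of (result - expectation)
    # is near zero, so one bad day barely moves it against months of older games.
    return sum(weight(a) * (r - expected_score(rating, o)) for a, o, r in games)

With weights like these, even a full day of losses is only a small fraction of the total weight, which is exactly the inertia we are seeing.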

Post subject: Re: A Curious Case Study in KGS Ranks
Post #22 Posted: Tue Mar 25, 2014 9:44 am
KGS: 2d

Mef wrote:
KGS's rating system aims to provide the most accurate rank it can with all data available. It aims to do the best job of predicting the probable outcome between any two players and any handicap (though in practice it only accepts feedback from games H6 or less).

Tygem's rating system does not make any predictions. It does not handle handicap games. It does not make any attempt to ensure proper rank spacing. It suffers from large amounts of noise being introduced by players setting their own ranks. Under an ideal set of assumptions (all ranks properly spaced, all players properly ranked, etc.) you would still expect to spend 30% of your time at the wrong rank. Tygem's rating system has a place in the go world and many people find it fun. Accurately assessing your go strength and comparing yourself on a fixed scale to a larger pool of players isn't it.


Yes, a Tygem rank x covers a wider range; on KGS the same range would span two ranks. But even on Tygem you will not see a 5D beating a 7D more than 2 out of 10 games, so these ranks are accurate, just not as accurate as KGS ranks, and therefore not as well suited for ranked handicap games (though not inadequate either). Of course there is the exception of players who start on Tygem and self-ranked themselves, but with thousands of players on Tygem, that sheer mass "cleans" things up, in my opinion.

I do believe that after 100 games Tygem gives you a rank that compares you accurately to the big player pool. KGS does it more precisely, but at the cost of fun, thrill and motivation. Because, believe it or not, knowing that the next game will decide your promotion gives you chills you will not get on KGS^^. It's like comparing regular-season games to playoff games in sports.

Post subject: Re: A Curious Case Study in KGS Ranks
Post #23 Posted: Tue Mar 25, 2014 9:56 am
RobertJasiek

Inflation/deflation of the global player population: this is a problem for every rating system, because players enter and leave the pool and some of them improve. It is possible to add global corrections for that to every system, including my quick draft of one.

Handicaps: I dislike rating handicap games, because one needs to make fairly arbitrary assumptions that by no means all players will meet.

Post subject: Re: A Curious Case Study in KGS Ranks
Post #24 Posted: Tue Mar 25, 2014 10:18 am
HermanHiddema (Rank: Dutch 4D)

@Robert: So we add anchors or whatever to stabilize the rating.

Next problems:

1. Wildly inaccurate ranks change slowly. If a new player enters as 10k, but is actually 1d, he could go 50-0 and still be only 5k.

2. Constant volatility. The chances that a 15k will have improved a rank after playing 50 games are larger than those of a 5k improving after 50 games, which are again larger than a 5d improving after 50 games, but your ranking system does not reflect that.


Post subject: Re: A Curious Case Study in KGS Ranks
Post #25 Posted: Tue Mar 25, 2014 10:35 am
RBerenguel (Rank: KGS 5k)

HermanHiddema wrote:
@Robert: So we add anchors or whatever to stabilize the rating.

Next problems:

1. Wildly inaccurate ranks change slowly. If a new player enters as 10k, but is actually 1d, he could go 50-0 and still be only 5k.

2. Constant volatility. The chances that a 15k will have improved a rank after playing 50 games are larger than those of a 5k improving after 50 games, which are again larger than a 5d improving after 50 games, but your ranking system does not reflect that.


You beat me to it. We need anchors to stabilise the ranks, add predictions (essentially to estimate this hidden "inner strength"), and we end up with a ranking system very similar or equivalent to KGS's.

Post subject: Re: A Curious Case Study in KGS Ranks
Post #26 Posted: Tue Mar 25, 2014 11:24 am
RobertJasiek

HermanHiddema wrote:
So we add anchors or whatever to stabilize the rating.


No anchors, and their artificial problems. I prefer a global method to balance in/deflation.

Quote:
1. Wildly inaccurate ranks change slowly.


New players: they choose an appropriate rank. If they guessed badly, they reset their initial rank.

Fast improving players: +0.1 per win is fast enough.

Quote:
2. Constant volatility.


Great. This reflects reality.

Quote:
The chances that a 15k will have improved a rank after playing 50 games are larger than those of a 5k improving after 50 games, which are again larger than a 5d improving after 50 games, but your ranking system does not reflect that.


Great. It should not reflect that. The system is designed to be very volatile, and sufficiently volatile for 15k or 5d.

RBerenguel wrote:
We need anchors to stabilise the ranks, add predictions (essentially to estimate this hidden "inner strength") and we end with a ranking system very similar or equivalent to KGS's


Anchors and predictions are NOT needed to stabilise the (volatile) ranks within the global population. Instead one can use an assumption about a global distribution. The system would be very different from KGS, because global stabilisation does not / need not / should not prevent each player's possible great volatility.


Post subject: Re: A Curious Case Study in KGS Ranks
Post #27 Posted: Tue Mar 25, 2014 11:30 am
HermanHiddema (Rank: Dutch 4D)

RobertJasiek wrote:
Quote:
2. Constant volatility.


Great. This reflects reality.



You really think that a 15 kyu and a 5 dan improve at the same rate?

Post subject: Re: A Curious Case Study in KGS Ranks
Post #28 Posted: Tue Mar 25, 2014 11:32 am
RobertJasiek

We seem to have a misunderstanding about what "constant" refers to. :) You: players of different ranks have different volatility (yes!). Me: regardless of those differences, a rating system can be kept simpler by using a constant volatility regardless of rank.

EDITED


Post subject: Re: A Curious Case Study in KGS Ranks
Post #29 Posted: Tue Mar 25, 2014 11:36 am
HermanHiddema (Rank: Dutch 4D)

RobertJasiek wrote:
Of course not. But I consider it overkill to treat them differently. I think rating systems can and should be as simple as possible.


Perhaps, then, you should have said "this does not reflect reality, but I am willing to sacrifice accuracy for simplicity".

Post subject: Re: A Curious Case Study in KGS Ranks
Post #30 Posted: Tue Mar 25, 2014 12:16 pm
RBerenguel (Rank: KGS 5k)

RobertJasiek wrote:
HermanHiddema wrote:
So we add anchors or whatever to stabilize the rating.


No anchors, and their artificial problems. I prefer a global method to balance in/deflation.

Quote:
1. Wildly inaccurate ranks change slowly.


New players: they choose an appropriate rank. If they guessed badly, they reset their initial rank.

Fast improving players: +0.1 per win is fast enough.

Quote:
2. Constant volatility.


Great. This reflects reality.

Quote:
The chances that a 15k will have improved a rank after playing 50 games are larger than those of a 5k improving after 50 games, which are again larger than a 5d improving after 50 games, but your ranking system does not reflect that.


Great. It should not reflect that. The system is designed to be very volatile, and sufficiently volatile for 15k or 5d.

RBerenguel wrote:
We need anchors to stabilise the ranks, add predictions (essentially to estimate this hidden "inner strength") and we end with a ranking system very similar or equivalent to KGS's


Anchors and predictions are NOT needed to stabilise the (volatile) ranks within the global population. Instead one can use an assumption about a global distribution. The system would be very different from KGS, because global stabilisation does not / need not / should not prevent each player's possible great volatility.


This seems to imply that the method you suggest may be: consider the previous 4 games (for instance; just a much smaller sample than 6 months) as rank-estimation dampeners (to keep volatility slightly under control), and take the estimate of the "inner strength" as the valid, current rank. Is this close to your idea?

Post subject: Re: A Curious Case Study in KGS Ranks
Post #31 Posted: Tue Mar 25, 2014 1:24 pm
dfunkt (Rank: 9k)

I'm not a math guy, so most of this thread is incomprehensible to me, but as a go player it is much more fun to play on a server where your rank changes easily (although never on its own, with no games played). I think Robert is right in that regard. I guess there are people who only want evenly matched games, but I like the rank-roulette approach.

Post subject: Re: A Curious Case Study in KGS Ranks
Post #32 Posted: Tue Mar 25, 2014 1:52 pm
RBerenguel (Rank: KGS 5k)

High volatility means your current even game opponent can easily be 3 stones stronger than you. Fun?

Post subject: Re: A Curious Case Study in KGS Ranks
Post #33 Posted: Tue Mar 25, 2014 2:02 pm
Splatted (Rank: Washed up never was)

RBerenguel wrote:
High volatility means your current even game opponent can easily be 3 stones stronger than you. Fun?


Yes

Post subject: Re: A Curious Case Study in KGS Ranks
Post #34 Posted: Tue Mar 25, 2014 2:11 pm
RBerenguel (Rank: KGS 5k)

Just in case, I ran an example of high volatility. In this new model, if players of rank A and B play each other, the change in rank is as big as abs(A-B), with a minimum variation of 0.1 (for players less than 0.1 rank points apart).
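For reference, the toy simulation behind the numbers below looks roughly like this (a simplified Python sketch; the random pairing, the noise in game outcomes and the common starting rank are my own assumptions, not part of any real server's system):

Code:
import random

def play(inner_a, inner_b, noise=1.0):
    # The winner is decided by noisy "inner strength"; the noise model is assumed.
    return inner_a + random.gauss(0, noise) > inner_b + random.gauss(0, noise)

def update(rank_a, rank_b, a_wins):
    # High-volatility rule: shift both players by the rank gap, never less than 0.1.
    delta = max(abs(rank_a - rank_b), 0.1)
    if a_wins:
        return rank_a + delta, rank_b - delta
    return rank_a - delta, rank_b + delta

# 100 players with hidden strengths drawn from gauss(mu=20); system ranks start at 20.
inner = [random.gauss(20, 1) for _ in range(100)]
ranks = [20.0] * 100
for _ in range(150000):
    a, b = random.sample(range(100), 2)
    ranks[a], ranks[b] = update(ranks[a], ranks[b], play(inner[a], inner[b]))

diffs = sorted(abs(i - r) for i, r in zip(inner, ranks))
print("median |inner - system| rank difference:", diffs[len(diffs) // 2])

Something along these lines produces the kind of percentile tables shown below.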

With a small number of simulations it seemed as if convergence to the "inner strength" was much faster and more effective than with a much smaller variation, but with 150k games played we get:

Attachment: Screen Shot 2014-03-25 at 22.03.11.png


This, again, shows the percentiles of the differences between the "inner" and "system" rankings.

Code:
Simulation: 100 * gauss(mu=20) for T=150000 steps having 149739 games played:

          top 1%    top 10%    top 25%  top 33.3%     median  bot 33.3%    bot 25%    bot 10%     bot 1%
start      35.89      26.52      19.60      18.25      10.79       6.38       5.14       1.30       0.00
mid         5.37       4.41       3.54       3.16       2.13       1.38       1.13       0.10       0.00
final       5.37       4.55       3.52       3.16       1.97       1.21       0.89       0.20       0.00


Half of the players have ranks that are off by about 2 stones or more. And this is with a very small pool and many, many games.

If we instead use a very minor variation (essentially inverting the condition in the ranking shift, so the maximum is 0.1 and the minimum is 0.0001), we get a graph that is qualitatively similar, but quantitatively very different:

Attachment: Screen Shot 2014-03-25 at 22.07.26.png


Code:
Simulation: 100 * gauss(mu=20) for T=150000 steps having 149883 games played:

          top 1%    top 10%    top 25%  top 33.3%     median  bot 33.3%    bot 25%    bot 10%     bot 1%
start      33.56      24.41      19.88      16.33      12.97       8.47       6.19       2.13       0.43
mid         1.72       1.03       0.76       0.68       0.45       0.28       0.20       0.09       0.00
final       1.78       1.39       0.95       0.73       0.53       0.36       0.30       0.12       0.00


At the median it is about 1.5 stones better, and far better in the worst cases.

With absurdly small numbers (max change 0.01, min change 0.0001), the convergence is incredibly slow and, surprisingly, not much better:

Code:
Simulation: 100 * gauss(mu=20) for T=500000 steps having 497391 games played:

          top 1%    top 10%    top 25%  top 33.3%     median  bot 33.3%    bot 25%    bot 10%     bot 1%
start      38.11      25.35      18.98      14.10       9.77       6.22       4.74       1.84       0.37
mid         1.98       1.64       1.22       1.01       0.85       0.70       0.61       0.26       0.03
final       1.74       1.39       0.94       0.87       0.64       0.41       0.24       0.08       0.00

Post subject: Re: A Curious Case Study in KGS Ranks
Post #35 Posted: Tue Mar 25, 2014 3:10 pm
Rank: KGS [-]

Polama wrote:
Nope, not equivalent. Plugging these numbers into a binomial calculator:

If we expect a 41% win rate, the probability of losing at least 236 games out of 242 'by chance' is about 10^-45.

If we factor in the ~17,000 opportunities for that streak, we're still around, call it, 10^-40.

For a player playing 2 games a day, that's 365 games in the 6-month span. If we say he went 0-4, that's 12%. Given 361 4-game spans, that's essentially guaranteed to occur (1-(10^-20) or so?). We'd be extremely surprised if a 41% player didn't have a 4-game losing streak in 365 games, and even more surprised if he had a 3/242 streak in a 17,000 game span.

Wins are streaky by nature, so the probability will be higher in practice. But still, 10^-40 is roughly your odds of being dealt a royal flush in poker, 7 hands in a row.


My comment was in reference to how you would expect the rank to be weighted; similarly, the "one bad day" is going to account for ~5% of the total rank.

That said, I do enjoy posts that run the numbers and look at things analytically, so thank you! Your analysis up to here is correct, and you are further correct that this data would make it very easy to reject the claim that gnugo2 should be ranked as 11k over this stretch! However, beyond that it gets a little trickier...


Quote:
Put another way, auto-resigning most games, he was probably, what, 30 kyu? So the fact that the system thought he'd only fallen 1/5 of a stone was extremely wrong. We know he was much worse than that. He demonstrated it over a very significant number of games. Which, as I understand it, is the most common complaint about the KGS rating system: that it overestimates (in this case, vastly overestimates) how much variation can be explained away by chance as the number of games played increases.


This is where we run into trouble. While we know that there was an implementation bug causing him to forfeit most of his games (and hence that he could reasonably be classified as 30k), the system does not have that benefit.

To put this another way: we know that it was behaving like a 30k, but just looking at the game results, how would we think it was behaving? Assuming this streak were perfectly representative of the true strength of this player, a win rate of 2.5% in properly handicapped games would project the player to be mis-ranked by somewhere in the ballpark of 2-3 stones. So looking at the data one might say, "Which is more likely? That it has suddenly lost the full 3 stones of strength, or that it has suddenly lost some strength but is also having a string of poor luck?" Another point worth noting is that we only have a resolution of 1 data point on the rating graph for that day. Because of this we do not truly know how low the rating dropped; we can, however, assume that it did not drop below 12.0 for any appreciable amount of time.

If you look at the calculations being performed mid-streak in this light, it gets a little more reasonable. Assume that instead of being 11k, the bot is actually a very weak 12k. Then you perform your binomial computation to see what the odds are that it goes 5-195 with a 90% expected loss rate, and the probability is something like 3x10^-5, not unreasonable to observe in a 17,000 game span. This is of course before we attempt to explain the 16,700 games of playing at 11k strength.
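For anyone who wants to check these tails themselves, a quick plain-Python sketch (the game counts and loss rates are just the ones quoted above):

Code:
from math import comb

def prob_at_least_losses(n_games, min_losses, p_loss):
    # Exact binomial upper tail: P(at least min_losses losses in n_games games).
    return sum(comb(n_games, k) * p_loss**k * (1 - p_loss)**(n_games - k)
               for k in range(min_losses, n_games + 1))

# Polama's figure: >= 236 losses in 242 games at a 59% loss rate (~10^-45).
print(prob_at_least_losses(242, 236, 0.59))
# Mid-streak view: >= 195 losses in 200 games at a 90% loss rate (a few times 10^-5).
print(prob_at_least_losses(200, 195, 0.90))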

Another (oversimplified) way we could look at it is to approach this situation like one ez4u asked about a while back: what would you expect from a player who plays half the time >1D and the other half the time <30k? In this case we have a player who plays 95% of the time like an 11k and 5% of the time like a 30k. If you were to predict the outcome of their next game, slotting them in as a 12k for the over-under isn't too unreasonable.

Post subject: Re: A Curious Case Study in KGS Ranks
Post #36 Posted: Tue Mar 25, 2014 8:12 pm
RobertJasiek

RBerenguel wrote:
Is this close to your idea?


My rough idea for global stabilisation against in/deflation: presume a logarithmic(?) function for the desired fraction of players above a given rating. Determine all those rated too high or too low and their total difference from the desired distribution. Shift each such player by the total excess in either direction divided by the number of affected players in the same direction. Such a global shift is small per player, and of course it does not need any anchors or predictions about player development. Could you run a sample with such a global stabilisation, please?

Quote:
High volatility means your current even game opponent can easily be 3 stones stronger than you.


A 3-stone real-world rank difference also occurs frequently under the current system, but there the problem is greater, because players can be permanently placed around a wrong mean rating.

Apart from various minor details and country exceptions, the EGF rating system does it much better, because the data is much better: it is retrieved mainly from McMahon tournaments, which already have a good pre-sorting of players.

Post subject: Re: A Curious Case Study in KGS Ranks
Post #37 Posted: Wed Mar 26, 2014 3:32 am
RBerenguel (Rank: KGS 5k)

RobertJasiek wrote:
RBerenguel wrote:
Is this close to your idea?


My rough idea for global stabilisation against in/deflation: presume a logarithmic(?) function for the desired fraction of players above a given rating. Determine all those rated too high or too low and their total difference from the desired distribution. Shift each such player by the total excess in either direction divided by the number of affected players in the same direction. Such a global shift is small per player, and of course it does not need any anchors or predictions about player development. Could you run a sample with such a global stabilisation, please?

Quote:
High volatility means your current even game opponent can easily be 3 stones stronger than you.


A 3-stone real-world rank difference also occurs frequently under the current system, but there the problem is greater, because players can be permanently placed around a wrong mean rating.

Apart from various minor details and country exceptions, the EGF rating system does it much better, because the data is much better: it is retrieved mainly from McMahon tournaments, which already have a good pre-sorting of players.


I'm not sure I totally understand the stabilisation method. A normal (for instance) distribution should be imposed on the rankings of players (so correcting ranks adjusts all rankings towards a normal)? This introduces a problem: is a normal distribution the correct model? I guess in my model I can impose a "refit" step after every (say) 20 games to get closer to a normal, but I'm not sure how to do it quickly and easily. I'll think about it; let me know if this is the idea.

Post subject: Re: A Curious Case Study in KGS Ranks
Post #38 Posted: Wed Mar 26, 2014 3:58 am
RobertJasiek

It is not a normal (Gaussian / binomial) distribution, because weak ranks are not rare. I do not know which function is the best. Logarithm is just a first guess. Deriving a function from Cieply's statistics would be another approach. OTOH, a reasonably good approximation to observed rank distributions in a known environment will do.

At a particular moment of time (at a different moment, you start afresh), let us suppose that P1..Pn are the ordered ratings of n players. (Maybe they can be rounded to multiples of 0.1.) Assume that we have found a function F, from which we can derive an ideal distribution of ratings I1..In of an ideal player population. Determine Pi - Ii for all i=1..n. This will classify the actual players into three kinds: too great rating, ideal rating, too small rating. If there are more / fewer players with too great / small rating, the inflation / deflation must be corrected as follows: Modify only the players with either too great / small rating. For the set of modified players, calculate the sum of Pj - Ij. Divide by the number of modified players. Apply the value to each modified player.

This is just a sketch, surely details can be varied. Maybe two modifications should be made, one for the too great, one for the too small ratings group of players.

Every day, week or month (I do not know yet which frequency is needed), make such a modification. Note: you start by numbering the players afresh!
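In code, one such correction step might look something like this (just a sketch of the procedure above, using the two-modifications variant; the target ratings I1..In derived from the chosen function F are simply passed in as a list):

Code:
def global_correction(ratings, ideal):
    # ratings: ordered ratings P1..Pn; ideal: target ratings I1..In from F,
    # in the same order. Returns the corrected ratings.
    too_high = [i for i, (p, t) in enumerate(zip(ratings, ideal)) if p > t]
    too_low = [i for i, (p, t) in enumerate(zip(ratings, ideal)) if p < t]
    corrected = list(ratings)
    if too_high:
        # Average excess of the over-rated group, subtracted from each of them.
        shift = sum(ratings[i] - ideal[i] for i in too_high) / len(too_high)
        for i in too_high:
            corrected[i] -= shift
    if too_low:
        # Average deficit of the under-rated group, added to each of them.
        shift = sum(ideal[i] - ratings[i] for i in too_low) / len(too_low)
        for i in too_low:
            corrected[i] += shift
    return corrected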

Post subject: Re: A Curious Case Study in KGS Ranks
Post #39 Posted: Wed Mar 26, 2014 4:48 am
KGS: KoDream / IGS: SmoothOper (Rank: IGS 5 kyu)

Long-term stable ranks based on extensive histories aren't really a selling point, IMO. To some extent it is nice not to have to worry about dropping a game here or there, but not that much.

Post subject: Re: A Curious Case Study in KGS Ranks
Post #40 Posted: Wed Mar 26, 2014 6:26 am
RBerenguel (Rank: KGS 5k)

RobertJasiek wrote:
It is not a normal (Gaussian / binomial) distribution, because weak ranks are not rare. I do not know which function is the best. Logarithm is just a first guess. Deriving a function from Cieply's statistics would be another approach. OTOH, a reasonably good approximation to observed rank distributions in a known environment will do.

At a particular moment of time (at a different moment, you start afresh), let us suppose that P1..Pn are the ordered ratings of n players. (Maybe they can be rounded to multiples of 0.1.) Assume that we have found a function F, from which we can derive an ideal distribution of ratings I1..In of an ideal player population. Determine Pi - Ii for all i=1..n. This will classify the actual players into three kinds: too great rating, ideal rating, too small rating. If there are more / fewer players with too great / small rating, the inflation / deflation must be corrected as follows: Modify only the players with either too great / small rating. For the set of modified players, calculate the sum of Pj - Ij. Divide by the number of modified players. Apply the value to each modified player.

This is just a sketch, surely details can be varied. Maybe two modifications should be made, one for the too great, one for the too small ratings group of players.

Every day, week or month (I do not know yet which frequency is needed), make such a modification. Note: you start by numbering the players afresh!


Rank/rating seems to behave in a more complex way than that:

Histogram of current GoR with 30 breaks:

Attachment: Screen Shot 2014-03-26 at 14.19.50.png


From this plot, it almost appears to be a bimodal distribution (with mass accumulating at 100 and 1500). Increasing the number of breaks in the histogram hints at even more "minor" modes; just going from 30 (this image) to 40 adds a clear-cut peak-and-valley between 1500 and 2000. In other words, if I saw this plot and was asked to model it, I'd probably dust off the Fourier tools before the statistical tools.

Here's with 60 breaks:

Attachment: Screen Shot 2014-03-26 at 14.27.22.png
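For the record, these plots are nothing more than histograms of the ratings with a varying number of bins; a minimal matplotlib sketch (here gor would be the vector of current GoR ratings, which I am not attaching):

Code:
import matplotlib.pyplot as plt

def gor_histogram(gor, breaks=30):
    # "breaks" is the number of bins; going from 30 to 60 is what
    # reveals the extra minor modes discussed above.
    plt.hist(gor, bins=breaks)
    plt.xlabel("GoR rating")
    plt.ylabel("Number of players")
    plt.title("Histogram of current GoR (%d breaks)" % breaks)
    plt.show()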



 