Life In 19x19
http://www.lifein19x19.com/

A Curious Case Study in KGS Ranks
http://www.lifein19x19.com/viewtopic.php?f=24&t=10051
Page 1 of 5

Author:  Mef [ Mon Mar 24, 2014 7:41 pm ]
Post subject:  A Curious Case Study in KGS Ranks

There are many complaints espoused here and elsewhere about the inability of KGS's rating system to satisfy the needs of edge-case users. Frequently these discussions are emotionally charged, with only vague references to unsourced anecdotes, while my preference would be for them to be more data-driven. A strange turn of events has occurred recently that has allowed for an interesting evaluation of KGS's rating system behavior under extreme circumstances. Not being one to pass up a chance for investigation, I posit to L19 a case study. Specifically, what I feel it tests are the following two claims:

- If you play too many rated games, is it possible for your rating to become "stuck" to the point where even large streaks cannot move your rank? (If you play too many games, will it take a very long time for your rank to move?)

- Does KGS unnecessarily penalize losing streaks over winning streaks, to where players cannot advance due to having 1 bad day? (Does a losing streak "weigh you down" more than a winning streak can "bring you up"?)



The Details:

The bot GnuGo2 has played approximately 17,000 rated games in the last six months, averaging about a 41% win rate (41.7% if you remove the anomaly we're about to discuss). This places it firmly in the mid-to-lower 11k range and makes it quite possibly as stable as any rank will ever be. Due to an unfortunate error in how the user running this bot had it implemented, in mid-March there was one day where the bot forfeited virtually all of its games, ultimately going 6-236 on the day. For your review I've attached a clipped version of the bot's rating graph for this year where the day in question is clearly visible.

To cover the highlights:

- Having 1 poor day (2.5% win rate) encompassing approximately 1.5% of the total games played in the 6 month period caused the bot's rating to drop about 1/5 of a stone (graph is only updated once / day so there is no finer resolution to use) in spite of having 17,000 games "anchoring" the rank.

- Upon being restored to "normal strength" the bot played 887 games (~5% of total games played in the 6 months), winning ~49% of them, and it took less than a week for the rank to essentially fully recover.

- The bot's winrate while being rated 1 stone lower than normal was ~57.5%, so nothing terribly extraordinary.
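
The percentages in these highlights follow directly from the raw counts. A minimal Python check, assuming the round figure of 17,000 total games quoted above (the variable names are purely illustrative):

```python
# Figures quoted in the post; 17,000 is the approximate six-month total.
total_games = 17_000
bad_day_wins, bad_day_losses = 6, 236
recovery_games = 887

bad_day_games = bad_day_wins + bad_day_losses      # games played on the bad day
bad_day_win_rate = bad_day_wins / bad_day_games    # win rate on that day
bad_day_share = bad_day_games / total_games        # share of all six-month games
recovery_share = recovery_games / total_games      # share taken by the recovery games

print(f"bad-day games: {bad_day_games}")
print(f"bad-day win rate: {bad_day_win_rate:.1%}")
print(f"bad day as share of all games: {bad_day_share:.1%}")
print(f"recovery games as share of all games: {recovery_share:.1%}")
```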


To me this suggests that even if you are an extreme edge case (I don't know of any human users who have managed 17,000 games in 6 months, in spite of how hard many have tried), your rank is still mobile if you truly have statistically significant streaks. Further, it suggests to me that no matter how bad a day you have (because this was basically the worst of bad days), it is not a particularly excessive burden to overcome (the rank was restored to normal without an excessively high win rate).


Thoughts?

Attachments:
GnuGo2.JPG (Annotated Rank Graph)

Author:  illluck [ Mon Mar 24, 2014 8:10 pm ]
Post subject:  Re: A Curious Case Study in KGS Ranks

That seems like a demonstration of immobile rank to me - 6:236 and only dropping a fifth of a stone is pretty ridiculous.

Author:  Mef [ Mon Mar 24, 2014 8:23 pm ]
Post subject:  Re: A Curious Case Study in KGS Ranks

illluck wrote:
That seems like a demonstration of immobile rank to me - 6:236 and only dropping a fifth of a stone is pretty ridiculous.


To put this in perspective, this is the equivalent to a normal player who plays 2 games /day having a 4 game losing streak in a day.

Author:  Dante31 [ Mon Mar 24, 2014 9:30 pm ]
Post subject:  Re: A Curious Case Study in KGS Ranks

Those who are willing to look at KGS ranks rationally know that KGS ranks do not get stuck. It's just that there are people who need something to blame for the fact that they are not progressing as fast as they would like.

Author:  RobertJasiek [ Mon Mar 24, 2014 9:43 pm ]
Post subject:  Re: A Curious Case Study in KGS Ranks

The case study does not compare well to human players with frequent games, who need, without significant interruption, to win ca. 70+% for weeks up to a few months in order to improve a rank, after it has been VERY MUCH easier to drop a rank.

The problem can already be observed when 1 loss demotes a rank, but the next 2 or 3 games won do not necessarily promote a rank.

For any rating system to be perceived fair, there must be symmetry in the difficulties of decreasing and increasing one's rating. The KGS system lacks such a symmetry.

Author:  Mef [ Mon Mar 24, 2014 10:13 pm ]
Post subject:  Re: A Curious Case Study in KGS Ranks

RobertJasiek wrote:
The case study does not compare well to human players with frequent games, who need, without significant interruption, to win ca. 70+% for weeks up to a few months in order to improve a rank, after it has been VERY MUCH easier to drop a rank.

The problem can already be observed when 1 loss demotes a rank, but the next 2 or 3 games won do not necessarily promote a rank.

For any rating system to be perceived fair, there must be symmetry in the difficulties of decreasing and increasing one's rating. The KGS system lacks such a symmetry.



This has never been documented, only alluded to in unsupported anecdote that falls apart whenever data is collected. In fact, you personally were used as an example in a previous case study to demonstrate that this effect doesn't exist!

Edit: My apologies, I should have said: Two previous case studies

Author:  RobertJasiek [ Tue Mar 25, 2014 1:56 am ]
Post subject:  Re: A Curious Case Study in KGS Ranks

1) I have experienced my described rating / ranking behaviour for myself several (not only one, as you suggest) times.

2) Your linked case studies might be used for OTHER arguments (such as that I do not permanently win 70% of my KGS games, e.g., because(!!!) it is by far too frustrating to maintain a winning attitude when affected by the mentioned experience and continue playing only when not tired), but they do not refute my experience.

3) I have heard from (or watched) several people that they have had similar experiences.

4) Since the effects have been experienced, they DO exist. (And no, I have not bothered to protocol them. I have better uses for my time.)

Author:  RBerenguel [ Tue Mar 25, 2014 2:28 am ]
Post subject:  Re: A Curious Case Study in KGS Ranks

RobertJasiek wrote:
4) Since the effects have been experienced, they DO exist. (And no, I have not bothered to protocol them. I have better uses for my time.)


¿¿?? Robert, you are a mathematician. Come on!

Author:  RobertJasiek [ Tue Mar 25, 2014 3:06 am ]
Post subject:  Re: A Curious Case Study in KGS Ranks

A fix for the rating system? Easy, use a different system:

- +0.1 ranks for a win, -0.1 ranks for a loss.
- Ignore all handicap games (incl. those with handicap 1).
- Ignore games with a rank difference >2.
- Maximum rank 9d.

Author:  RBerenguel [ Tue Mar 25, 2014 4:09 am ]
Post subject:  Re: A Curious Case Study in KGS Ranks

RobertJasiek wrote:
A fix for the rating system? Easy, use a different system:

- +0.1 ranks for a win, -0.1 ranks for a loss.
- Ignore all handicap games (incl. those with handicap 1).
- Ignore games with a rank difference >2.
- Maximum rank 9d.


I'm tempted to run a Monte Carlo simulation of such a system. Maybe I'll do, could be fun.

Author:  Charles Alden [ Tue Mar 25, 2014 4:51 am ]
Post subject:  Re: A Curious Case Study in KGS Ranks

RBerenguel wrote:
RobertJasiek wrote:
A fix for the rating system? Easy, use a different system:

- +0.1 ranks for a win, -0.1 ranks for a loss.
- Ignore all handicap games (incl. those with handicap 1).
- Ignore games with a rank difference >2.
- Maximum rank 9d.


I'm tempted to run a Monte Carlo simulation of such a system. Maybe I'll do, could be fun.


Under which system, in Mef's example, the bot's rating would have moved to 34k the following day?
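
The 34k figure can be reproduced with a quick calculation. This is a rough sketch that assumes the bot starts at 11k and that every game of the 6-236 day counts, ignoring the proposal's handicap and rank-difference filters and any rank floor:

```python
STEP = 0.1                 # ranks gained per win and lost per loss in the proposal
wins, losses = 6, 236      # the bot's record on the bad day
start_kyu = 11             # assumed starting rank (11k)

net_ranks_lost = (losses - wins) * STEP    # net losses times the step size
end_kyu = start_kyu + net_ranks_lost       # a larger kyu number is a weaker rank

print(f"net ranks lost: {net_ranks_lost:.0f}")
print(f"rank after the bad day: {end_kyu:.0f}k")
```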

Author:  HermanHiddema [ Tue Mar 25, 2014 5:41 am ]
Post subject:  Re: A Curious Case Study in KGS Ranks

RobertJasiek wrote:
A fix for the rating system? Easy, use a different system:

- +0.1 ranks for a win, -0.1 ranks for a loss.
- Ignore all handicap games (incl. those with handicap 1).
- Ignore games with a rank difference >2.
- Maximum rank 9d.


Which is deflationary. Every 20k that enters the system and moves up to 1d has removed 20 ranks total from the other players. That's no problem in a small playing pool like a club, where I think this kind of system is fine, as you can just manually recalibrate all ranks every once in a while, but on a go server it is unsuitable.

In a deflationary system, playing more games means you lose rating quicker. So you're replacing "My rating is stuck because I play so much" with "My rating keeps dropping because I play so much". How is that better?
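
The bookkeeping behind this argument can be sketched in a few lines of Python. Because each game adds 0.1 to the winner and takes 0.1 from the loser, the sum of all ratings never changes, so whatever a newcomer gains comes out of the existing pool. The pool size, the 70% win probability, and the numeric scale (higher number means stronger) are assumptions for illustration only, and the rank-difference filter is ignored:

```python
import random

random.seed(0)
STEP = 0.1

# Hypothetical pool: 50 established players, all rated 20.0 on a scale where a
# higher number is stronger. The newcomer starts 20 ranks below them.
pool = [20.0] * 50
newcomer = 0.0
TARGET = 20.0              # stop once the newcomer has climbed 20 ranks

pool_total_before = sum(pool)

while newcomer < TARGET - 1e-9:
    i = random.randrange(len(pool))     # pick a random opponent from the pool
    if random.random() < 0.7:           # assumed: the improving newcomer wins 70%
        newcomer += STEP
        pool[i] -= STEP
    else:
        newcomer -= STEP
        pool[i] += STEP

drained = pool_total_before - sum(pool)
print(f"newcomer gained {newcomer:.1f} ranks")
print(f"pool lost {drained:.1f} ranks in total")
```

With no mechanism to add rating points back, every improving newcomer lowers the average of everyone else by the same mechanism.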

Author:  Pippen [ Tue Mar 25, 2014 6:52 am ]
Post subject:  Re: A Curious Case Study in KGS Ranks

I am a 5D-Tygem and 1d-KGS. From my experience with Tygem I can say: Ranks at KGS are more stable and consistent than Tygem's. On Tygem you will find some more differences within one rank. Sometimes you play guys that seem like 1-2 stones weaker, sometimes 1-2 stronger, but all have the same rank. But here comes the advantage of such a thing: It's more fun, you have faster chances to get promoted/demoted and to play stronger players that wouldn't play you otherwise. KGS ranking is sounder, but more boring and since ranking is maybe the main motivation to play and stay in Go, it's significant.

I'd like KGS to copy Tygem's ranking system, i.e. a system of x-game series where you get promoted when you win y games and demoted when you lose z games out of it.

Author:  uPWarrior [ Tue Mar 25, 2014 7:18 am ]
Post subject:  Re: A Curious Case Study in KGS Ranks

It's funny how Robert just proposed removing all handicap games from the calculation while in a different topic I proposed that only handicap games should be considered so we don't rely on arbitrary win percentages.

Author:  Mef [ Tue Mar 25, 2014 7:48 am ]
Post subject:  Re: A Curious Case Study in KGS Ranks

Pippen wrote:
I am a 5D-Tygem and 1d-KGS. From my experience with Tygem I can say: Ranks at KGS are more stable and consistent than Tygem's. On Tygem you will find some more differences within one rank. Sometimes you play guys that seem like 1-2 stones weaker, sometimes 1-2 stronger, but all have the same rank. But here comes the advantage of such a thing: It's more fun, you have faster chances to get promoted/demoted and to play stronger players that wouldn't play you otherwise. KGS ranking is sounder, but more boring and since ranking is maybe the main motivation to play and stay in Go, it's significant.

I'd like KGS to copy Tygem's ranking system, i.e. a system of x-game series where you get promoted when you win y games and demoted when you lose z games out of it.


KGS's rating system aims to provide the most accurate rank it can with all data available. It aims to do the best job of predicting the probable outcome between any two players and any handicap (though in practice it only accepts feedback from games H6 or less).

Tygem's rating system does not make any predictions. It does not handle handicap games. It does not make any attempt to ensure proper rank spacing. It suffers from large amounts of noise being introduced by players setting their own ranks. Under an ideal set of assumptions (all ranks properly spaced, all players properly ranked, etc.) you still expect to spend 30% of your time at the wrong rank. Tygem's rating system has a place in the go world and many people find it fun. Accurately assessing your go strength and comparing yourself on a fixed scale to a larger pool of players isn't it.

Author:  Mef [ Tue Mar 25, 2014 7:54 am ]
Post subject:  Re: A Curious Case Study in KGS Ranks

uPWarrior wrote:
It's funny how Robert just proposed removing all handicap games from the calculation while in a different topic I proposed that only handicap games should be considered so we don't rely on arbitrary win percentages.



Someone with a stronger math background than myself could probably come up with a better answer for what the rating system thinks is ideal, but I would think that the best case would be for all players to have an even distribution of games across the whole range of handicaps the system aims to predict. On KGS that would mean 7.69% giving H6, H5, H4, etc. This would leave approximately 23% of your games as having no handicap (e.g. either even or +- 1 stone). Also you would probably want to fix the cultural affinity for using 0.5 komi and make it reverse komi.
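
One reading that reproduces the 7.69% and 23% figures is an even split over 13 settings: giving or taking 2 to 6 stones, plus three "no handicap" settings. The bucket names below are an assumption; the post does not list them.

```python
# Hypothetical bucket list: giving or taking 2-6 stones, plus the three
# "no handicap" settings (even game, or one stone either way).
giving = [f"give H{n}" for n in range(6, 1, -1)]     # H6 down to H2
taking = [f"take H{n}" for n in range(2, 7)]         # H2 up to H6
no_handicap = ["give 1 stone", "even", "take 1 stone"]

buckets = giving + no_handicap + taking
share = 1 / len(buckets)                             # equal share per bucket

print(f"{len(buckets)} buckets, {share:.2%} of games each")
print(f"no-handicap share: {len(no_handicap) * share:.1%}")
```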

Author:  RBerenguel [ Tue Mar 25, 2014 7:57 am ]
Post subject:  Re: A Curious Case Study in KGS Ranks

I ran a simulation, just for fun. As Herman points out, the system is deflationary. To compensate, I use a closed pool of players. Each player is given a rank from 0 to 40 (9d to 30k, so to say) and an "inner rank," which in some sense is used to model its real rank. So for example, a player can be (4, 8), i.e. system rank 4 with inner rank 8, so he should be losing rank eventually. To calculate game results, I use the difference of inner rank between players and an Elo-like winning probability, and only consider games between players with at most 1 rank difference. The distribution of ranks is a normal distribution, mean 20, sigma 2/3*mu. The population (and ranks) are corrected so that min is 0 and max is 40.

To plot and display, I use the percentiles in the difference between "inner rank" and "system rank." The results with a pool of just 100 players and 50000 games (real games played: 49847) roughly look like:

Code:
Simulation: 100 * gauss(mu=20) for T=50000 steps having 49847 games played (bot stands for bottom):

          top 1%    top 10%    top 25%  top 33.3%     median  bot 33.3%    bot 25%    bot 10%     bot 1%
start      40.00      29.16      22.98      20.82      14.88      10.37       8.22       2.81       0.02
mid         4.66       4.04       2.77       2.16       1.65       0.96       0.74       0.26       0.00
final       3.14       2.38       1.71       1.50       1.21       0.88       0.72       0.26       0.00


Or graphically,
Attachment:
Screen Shot 2014-03-25 at 15.51.56.png


Even such a simple ranking model has a big flaw (assuming closed pool of players, sure): it takes an awful lot of games to get to a "real strength," and even with 150k playthroughs (just simulated this, to check) the worst result is almost 2 "stones" off (the median is half a stone off).
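
For anyone who wants to try something similar, here is a minimal Python sketch of a simulation along the lines described above. It is not the code behind the table: the random starting system ranks, the logistic scale of 1.0, the way opponents are picked, the clamping at both ends of the scale, and the reduced step count (to keep the run short) are all assumptions.

```python
import math
import random

random.seed(1)
N_PLAYERS, N_STEPS, STEP = 100, 20_000, 0.1
MU = 20.0
SIGMA = 2.0 / 3.0 * MU          # spread quoted in the post
LO, HI = 0.0, 40.0              # rank scale: 0 ("9d") to 40 ("30k")

def clamp(x):
    return max(LO, min(HI, x))

# The "inner" rank models true strength. The system rank starts out unrelated
# to it (an assumption, chosen because the quoted table starts with large errors).
inner = [clamp(random.gauss(MU, SIGMA)) for _ in range(N_PLAYERS)]
system = [random.uniform(LO, HI) for _ in range(N_PLAYERS)]

def p_first_wins(a, b, scale=1.0):
    """Elo-like logistic win probability; a lower inner rank means stronger."""
    return 1.0 / (1.0 + math.exp((inner[a] - inner[b]) / scale))

def median_error():
    errors = sorted(abs(i - s) for i, s in zip(inner, system))
    return errors[len(errors) // 2]

start = median_error()
played = 0
for _ in range(N_STEPS):
    a = random.randrange(N_PLAYERS)
    # Only opponents within one system rank are eligible.
    nearby = [b for b in range(N_PLAYERS)
              if b != a and abs(system[a] - system[b]) <= 1.0]
    if not nearby:
        continue                # nobody close enough, so no game this step
    b = random.choice(nearby)
    played += 1
    winner, loser = (a, b) if random.random() < p_first_wins(a, b) else (b, a)
    system[winner] = clamp(system[winner] - STEP)   # winner moves toward 0 ("9d")
    system[loser] = clamp(system[loser] + STEP)     # loser moves toward 40 ("30k")

final = median_error()
print(f"games played: {played} of {N_STEPS} steps")
print(f"median |inner - system|: start {start:.2f}, final {final:.2f}")
```

Changing `N_STEPS`, the pool size, or the logistic scale is an easy way to see how slowly a flat 0.1 step converges under different assumptions.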

Author:  Polama [ Tue Mar 25, 2014 8:46 am ]
Post subject:  Re: A Curious Case Study in KGS Ranks

Mef wrote:
illluck wrote:
That seems like a demonstration of immobile rank to me - 6:236 and only dropping a fifth of a stone is pretty ridiculous.


To put this in perspective, this is the equivalent to a normal player who plays 2 games /day having a 4 game losing streak in a day.


Nope, not equivalent. Plugging these numbers into a binomial calculator:

If we expect a 41% win rate, the probability of losing at least 236 games out of 242 'by chance' is about 10^-45.

If we factor in the ~17,000 opportunities for that streak, we're still around, call it, 10^-40.

For a player going 2 games a day, that's 365 games in the 6 month span. If we say he went 0-4, that's 12%. Given 361 4 game spans, that's essentially a given to occur (1-(10^-20) or so?). We'd be extremely surprised if a 41% player didn't have a 4 game losing streak in 365 games, and even more surprised if he had a 6/242 streak in a 17,000 game span.

Wins are streaky by nature, so the probability will be higher in practice. But still, 10^-40 is roughly your odds of being dealt a royal flush in poker, 7 hands in a row.

Put another way, auto-resigning most games, he was probably, what? 30 kyu? So the fact that the system thought he'd only fallen 1/5 of a stone was extremely wrong. We know he was much worse than that. He demonstrated it over a very significant number of games. Which, as I understand it, is the most common complaint about the KGS rating system: that it overestimates (in this case, vastly overestimates) how much variation can be explained away by chance as the number of games played increases.
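
The orders of magnitude above can be checked without an online calculator. A short Python sketch using only the standard library, under the same assumption the post makes (independent games at a fixed 41% win rate); the royal flush line assumes five-card hands dealt from a full deck:

```python
import math

def binom_tail_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

p_win = 0.41

# Probability that a 41% player wins at most 6 of 242 independent games.
p_bad_day = binom_tail_at_most(6, 242, p_win)
print(f"P(at most 6 wins in 242) ~ 10^{math.log10(p_bad_day):.1f}")

# Probability of losing four particular games in a row.
p_lose_four = (1 - p_win) ** 4
print(f"P(0-4) = {p_lose_four:.1%}")

# Seven royal flushes in a row: 4 royal flushes among C(52, 5) possible hands.
p_royal = 4 / math.comb(52, 5)
print(f"P(7 royal flushes in a row) ~ 10^{7 * math.log10(p_royal):.1f}")
```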

Author:  RBerenguel [ Tue Mar 25, 2014 8:54 am ]
Post subject:  Re: A Curious Case Study in KGS Ranks

Polama wrote:
Mef wrote:
illluck wrote:
That seems like a demonstration of immobile rank to me - 6:236 and only dropping a fifth of a stone is pretty ridiculous.


To put this in perspective, this is the equivalent to a normal player who plays 2 games /day having a 4 game losing streak in a day.


Nope, not equivalent. Plugging these numbers into a binomial calculator:

If we expect a 41% win rate, the probability of losing at least 236 games out of 242 'by chance' is about 10^-45.

If we factor in the ~17,000 opportunities for that streak, we're still around, call it, 10^-40.

For a player going 2 games a day, that's 365 games in the 6 month span. If we say he went 0-4, that's 12%. Given 361 4 game spans, that's essentially a given to occur (1-(10^-20) or so?). We'd be extremely surprised if a 41% player didn't have a 4 game losing streak in 365 games, and even more surprised if he had a 6/242 streak in a 17,000 game span.

Wins are streaky by nature, so the probability will be higher in practice. But still, 10^-40 is roughly your odds of being dealt a royal flush in poker, 7 hands in a row.

Put another way, auto-resigning most games, he was probably, what? 30 kyu? So the fact that the system thought he'd only fallen 1/5 of a stone was extremely wrong. We know he was much worse than that. He demonstrated it over a very significant number of games. Which, as I understand it, is the most common complaint about the KGS rating system: that it overestimates (in this case, vastly overestimates) how much variation can be explained away by chance as the number of games played increases.


Can't this just be explained by history inertia? It may be statistically relevant, but the KGS ranking system (IIRC, it's been a while since I checked it) is almost a predictor-corrector system (sorry for the term, this is used in numerical analysis, for example): it will heavily rely on history to predict the rank, probably correcting after more data points are available. Sure, a huge losing streak is significant, and current, but the historical weight says otherwise, and dampens the current "error".
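
As a crude picture of that dampening (and explicitly not KGS's actual algorithm, which is not spelled out in this thread), here is what happens if every game in the six months is simply given equal weight. The figures are the ones quoted at the top of the thread:

```python
total_games = 17_000          # approximate six-month total from the thread
bad_wins, bad_losses = 6, 236
normal_rate = 0.417           # win rate with the bad day excluded

bad_games = bad_wins + bad_losses
normal_games = total_games - bad_games
normal_wins = normal_rate * normal_games

# Equal weighting: the bad day is just 242 games among roughly 17,000.
pooled_rate = (normal_wins + bad_wins) / total_games

print(f"win rate without the bad day: {normal_rate:.1%}")
print(f"win rate with the bad day pooled in: {pooled_rate:.1%}")
```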

Author:  Polama [ Tue Mar 25, 2014 9:14 am ]
Post subject:  Re: A Curious Case Study in KGS Ranks

RBerenguel wrote:
Can't this just be explained by history inertia? It may be statistically relevant, but the KGS ranking system (IIRC, it's been a while since I checked it) is almost a predictor-corrector system (sorry for the term, this is used in numerical analysis, for example): it will heavily rely on history to predict the rank, probably correcting after more data points are available. Sure, a huge losing streak is significant, and current, but the historical weight says otherwise, and dampens the current "error".


The algorithm's choice can be explained by history inertia. But the actual performance can't be. If you view a rank as a fixed, static thing and you hit a 200 loss streak, the best you can do is throw your hands up and say "that was weird!" and adjust your prediction down slightly. But this streak clearly demonstrates that this account's ability is not static, that the previous 17,000 games are no longer particularly meaningful. When we're at 10^-40 probability, it's significantly more likely that, say, the person suffered extreme head trauma than that they're having a bad day.

The model may work better with humans. But this case is a demonstration that at extreme numbers of games it can no longer respond to absurdly strong signals of a change in rank.

Now, it may be that there's an explicit time mechanism, and that if this account were let to run for a month it would eventually plummet rapidly to 30 kyu. That would be sensible, because the most likely case seems to be that somebody else logged into this account today. You'd want measures from multiple days to be certain. But if we're just looking at game results, the effect should definitely be way, way, way stronger.
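
The "changed ability versus bad day" comparison can be made concrete with a likelihood ratio. This is only a sketch: the 5% win rate for a malfunctioning account and the one-in-a-million prior are made-up illustrative numbers, not anything measured.

```python
import math

wins, losses = 6, 236
p_usual = 0.41       # long-run win rate quoted in the thread
p_broken = 0.05      # hypothetical win rate for an account that stopped playing properly

def log10_likelihood(p):
    """log10 of p^wins * (1 - p)^losses; the binomial coefficient cancels in the ratio."""
    return wins * math.log10(p) + losses * math.log10(1 - p)

# How much better the "something changed" hypothesis explains the day's results.
log10_ratio = log10_likelihood(p_broken) - log10_likelihood(p_usual)
print(f"the day's results favour 'strength changed' by about 10^{log10_ratio:.0f}")

# Even a very sceptical prior (assumed here: one in a million) is overwhelmed.
log10_prior_odds = -6
print(f"posterior odds of a change: about 10^{log10_ratio + log10_prior_odds:.0f}")
```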

All times are UTC - 8 hours [ DST ]