Oddities in KGS ranking system

hyperpape · **#21**

HermanHiddema wrote:

Allow a player to earn the right to a temporary promotion. For example: If a player wins 4 games in a row, they get one "promotion credit", with which they can start a single game at one rank higher than their usual rank. This is invisible to the opponent. Such a game, because it is played at the normal handicap for one rank higher, gives a player a chance to gain rating more quickly. I think many players would be really psyched to earn and play such games.

From a UI perspective, this could get weird. How would ranks be displayed when this happens?

HermanHiddema wrote:

Allow a player to request a reevaluation every X games (say 50). If they do this, their next 3 games count more strongly for their rating. This allows a player who feels that his rating is lagging to quickly gain some points. The game counts for the opponent's rating as usual, not extra.

This is a neat idea, but I wonder how many players would take to automatically requesting a reevaluation.

wms · **#22**

Kaya.gs wrote:

First of all, how do we know a rating system is accurate? How can we compare accuracy between KGS and Wbaduk?

Basically, to determine the accuracy of a rating system, you feed in all but the last month's worth of games. Then you take the ranks it gives you, and use those ranks to determine who should win each game in the last month. More games predicted correctly means a better system.
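A minimal sketch of that hold-out check, assuming even games only and a plain dictionary of ratings. The function name, the rating values, and the `(winner, loser)` game format are illustrative assumptions, not anything KGS actually uses.

```python
from typing import Dict, Iterable, Tuple

# A held-out game is recorded as (winner, loser); this format is an assumption.
Game = Tuple[str, str]

def prediction_accuracy(ratings: Dict[str, float], holdout: Iterable[Game]) -> float:
    """Fraction of held-out games won by the player the ratings favoured."""
    games = list(holdout)
    if not games:
        raise ValueError("holdout set is empty")
    # A game counts as predicted correctly when the winner had the higher rating.
    correct = sum(1 for winner, loser in games if ratings[winner] > ratings[loser])
    return correct / len(games)

# Ratings computed from everything except the last month (made-up numbers).
ratings = {"A": 3.2, "B": 2.9, "C": 1.0, "D": 1.4}
# The last month's games, which the rating system never saw.
last_month = [("A", "B"), ("C", "D"), ("A", "D")]
accuracy = prediction_accuracy(ratings, last_month)
```

Here two of the three held-out games went to the higher-rated player. Handicap games would need the handicap folded into the comparison, which this sketch leaves out.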

If you want to get even better, instead of just predicting the win/loss of each game, have the rank system predict the probability of each player winning the game. Then you score points according to the sum of the logs of the probabilities the system assigned to the actual outcomes, and compare systems to see which is better. (This is equivalent to comparing the product of all those probabilities, but computing the product will underflow if you have a lot of games, so summing the logs is more practical.)

You can also, for example, score only the handicap games, to see how well your system computes proper handicaps. This was important to me so I used that as a second metric.

Once I built a system that would take any algorithm and spit out a score on how well it did, I was able to in the space of just a couple weeks of tuning and tweaking come up with a system that to me works extremely well. It does have quirks, but all rating systems do, and the quirks (e.g., your rank moves even when you don't play) aren't things that bother me very much, while I'm very happy with the accuracy.

Edit: For an example, you have two games, A vs. B and C vs. D. If A and C won, and system1 said A has a 50% chance of winning, while C has a 60% chance, then system1 gets ln(0.5)+ln(0.6) = -1.204 points. If system2 said A had an 80% chance of winning and C had a 40% chance of winning, then system2 gets ln(0.8)+ln(0.4) = -1.139 points. System2 has a higher score, so system2 is the better rank system. (Note that the scores will always be negative, because probabilities are always less than 1, so whichever system is closer to a score of 0 is the better one).
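The two-game example can be reproduced in a few lines. This is only a sketch: `log_score` is a hypothetical helper name, and its input is assumed to be the probability each system gave to the player who actually won.

```python
import math

def log_score(winner_probabilities):
    """Sum of ln(p), where p is the probability the system gave the actual winner."""
    return sum(math.log(p) for p in winner_probabilities)

# A and C won. system1 gave them 50% and 60%; system2 gave them 80% and 40%.
system1 = log_score([0.5, 0.6])
system2 = log_score([0.8, 0.4])

# Why logs: multiplying many probabilities underflows to zero in floating point,
# while the equivalent sum of logs stays perfectly representable.
many_games = [0.5] * 2000
naive_product = math.prod(many_games)
safe_sum = log_score(many_games)
```

`system2` comes out closer to zero than `system1`, matching the numbers in the post, and `naive_product` collapses to `0.0` while `safe_sum` is still an ordinary negative number.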

Kaya.gs · **#23**

HermanHiddema wrote:

Playing strength varies enormously depending on all sorts of conditions, such as thinking time, alcohol, lack of sleep, or whatever. Any rating system that tries to capture that playing strength in a single number is guaranteed to be inaccurate in that respect. That's why a rating system like Glicko also reports a deviation. So a 4kyu with deviation of 2 is 95% likely to play with a strength between 6kyu and 2kyu. That does not mean there is some precise actual strength between 2kyu and 6kyu that they really are. Rather, it means that even though their playing strength varies, the playing strength in any one game is very likely to be between those values.

But of course, for all sorts of purposes, from determining handicap to sorting players, you very much need a single number.

Now I think that often, players themselves are very much aware when their own strength is likely to be better or worse than their average. That is why people create separate accounts for blitz, or for playing casually instead of seriously. They don't want games that are likely to be bad to damage their rating too much.

Now there may be some ways to work around this issue, based on the player's own knowledge. Here's a few ideas:

Allow a player to secretly mark a game as "bad" before their first move. If they mark it as such, it will count less heavily for the rating (say, only 50%). This way, if you're tired, drunk or otherwise not in great shape, you can play with your main account with less chance of damage to your rating.

Allow a player to earn the right to a temporary promotion. For example: If a player wins 4 games in a row, they get one "promotion credit", with which they can start a single game at one rank higher than their usual rank. This is invisible to the opponent. Such a game, because it is played at the normal handicap for one rank higher, gives a player a chance to gain rating more quickly. I think many players would be really psyched to earn and play such games.

Allow a player to request a reevaluation every X games (say 50). If they do this, their next 3 games count more strongly for their rating. This allows a player who feels that his rating is lagging to quickly gain some points. The game counts for the opponent's rating as usual, not extra.

This feels way too complex. And users playing with the rating system makes me feel uneasy. It should be global and simple.

Remember that KGS makes you think about your bad games, but on Wbaduk you don't worry as much. Yes, you are likely to lose, but who cares, you can get it back with a single victory, and a likely one. A loss on KGS feels like it's there to drag you down forever, or for 6 months, which is pretty much the same.

I like the concept of a fixed number of games: 14 victories = rank up. Makes it veeeery predictable. But I worry that, being so unsophisticated, it will give inaccurate results, and people will find it too plain.

Then again, I don't know what "accurate" means. What is the price of inaccuracy? Winning easy games and losing hard games all the time?

wms · **#24**

Kaya.gs wrote:

Then again, I don't know what "accurate" means. What is the price of inaccuracy? Winning easy games and losing hard games all the time?

Possible costs of inaccuracy (different inaccurate systems will have different sets of problems):

* Some players will be overranked, making all their games very hard. Others will be underranked, making their games very easy.
* Handicap games will be consistently very easy (or very hard) for white to win
* Clusters of players who play each other a lot can drift away from the population, meaning that if you play somebody from a group of friends, then even though your ranks are equal the game could be an easy win for one or the other of you.

There are others, but that's off the top of my head.

Mef · **#25**

Kaya.gs wrote:

A loss on KGS feels like it's there to drag you down forever, or for 6 months, which is pretty much the same.

I like the concept of a fixed number of games: 14 victories = rank up. Makes it veeeery predictable. But I worry that, being so unsophisticated, it will give inaccurate results, and people will find it too plain.

Then again, I don't know what "accurate" means. What is the price of inaccuracy? Winning easy games and losing hard games all the time?

Just because something feels a certain way doesn't mean that feeling is true, or even rational. (=

Last time I crunched the numbers (things may have changed since then, but I would imagine they are still close), the worst-case scenario if you play games at a consistent rate was: ~40% of your rating is from games within the last month, ~80% of your rating is from games within the last three months.
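Those two figures are roughly what a plain exponential decay would give. The sketch below is an assumption-laden back-of-the-envelope check, not the actual KGS formula: it assumes a constant playing rate, no hard cutoff on old games, and a weight that halves every `half_life_days`.

```python
import math

def recent_weight_fraction(days: float, half_life_days: float) -> float:
    """Share of total weight carried by games newer than `days`.

    Assumes a constant playing rate and a per-game weight of
    0.5 ** (age_in_days / half_life_days), with no cutoff for old games.
    """
    return 1.0 - 0.5 ** (days / half_life_days)

# Pick the half-life that puts exactly 40% of the weight inside the last 30 days.
half_life = 30 * math.log(0.5) / math.log(0.6)

one_month = recent_weight_fraction(30, half_life)
three_months = recent_weight_fraction(90, half_life)
```

Under these assumptions the half-life works out to about 41 days, and the three-month share comes to 1 - 0.6^3 = 0.784, close to the ~80% quoted.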

Also, I think it would be reasonable to ask up front -- Do you want handicap games to be meaningful? If you are trying to have properly spaced handicap games, a simple winning streak formula is unlikely to give you reasonable results. So much so in fact, that it would be a poor idea to allow handicap games to be considered for rank. At that point I would recommend discarding the traditional rank structure anyway, because what does being three stones stronger than someone even mean if you can't give them three stones? If you are merely predicting even game winning chances, perhaps just give an elo style rating.

Kaya.gs · **#26**

wms wrote:

Once I built a system that would take any algorithm and spit out a score on how well it did, I was able to in the space of just a couple weeks of tuning and tweaking come up with a system that to me works extremely well. It does have quirks, but all rating systems do, and the quirks (e.g., your rank moves even when you don't play) aren't things that bother me very much, while I'm very happy with the accuracy.

Edit: For an example, you have two games, A vs. B and C vs. D. If A and C won, and system1 said A has a 50% chance of winning, while C has a 60% chance, then system1 gets ln(0.5)+ln(0.6) = -1.204 points. If system2 said A had an 80% chance of winning and C had a 40% chance of winning, then system2 gets ln(0.8)+ln(0.4) = -1.139 points. System2 has a higher score, so system2 is the better rank system. (Note that the scores will always be negative, because probabilities are always less than 1, so whichever system is closer to a score of 0 is the better one).

Possible costs of inaccuracy (different inaccurate systems will have different sets of problems):

* Some players will be overranked, making all their games very hard. Others will be underranked, making their games very easy.
* Handicap games will be consistently very easy (or very hard) for white to win
* Clusters of players who play each other a lot can drift away from the population, meaning that if you play somebody from a group of friends, then even though your ranks are equal the game could be an easy win for one or the other of you.

There are others, but that's off the top of my head.

Your expertise here is very much appreciated! It is simple: one system is more accurate if it can predict results better.

I am a practical man, but I believe in theory also. The idea of comparing different rating systems and tweaking them makes me think of making this open source.

I will talk to Polly right away about making a project on GitHub that runs the statistics and can potentially have plug-and-play systems. This would allow us to compare different settings and also make it easy to tweak.

karaklis · **#27**

Kaya.gs wrote:

I do believe Wbaduk has a larger sample of players, which means it should show less inaccuracy. However, they have the issue that from 3d to weak 7d players are almost the same strength, and then within 7d you feel a 2-stone difference.
I don't know why that happens.

In the dan ranks, WBaduk is more or less ok, but in the kyu ranks, below about 2k it is completely crap.

In spite of its weaknesses the KGS ranking system seems to be the most accurate among the common realtime go servers especially in these areas.

danielm · **#28**

HermanHiddema wrote:

Allow a player to earn the right to a temporary promotion. For example: If a player wins 4 games in a row, they get one "promotion credit", with which they can start a single game at one rank higher than their usual rank. This is invisible to the opponent. Such a game, because it is played at the normal handicap for one rank higher, gives a player a chance to gain rating more quickly. I think many players would be really psyched to earn and play such games.

This reminds me of one thing I read about the StarCraft 2 league system, where it would put a player up against a player from a stronger league occasionally to test their skill.

The concept of this seems very promising to me, as it should solve the issue of very long winning (or losing) streaks without making ranks too volatile. E.g. if a 4k wins four games in a row (or sooner), the account could be marked as 4k+ (or something else to avoid confusion with the IGS +, or it doesn't have to be visibly marked at all), meaning that the player is still considered a 4k, but will play the next game(s) as a 3k handicap-wise to test his strength more severely. Perhaps this would only apply to automatching, and the default handicap suggestions in manual games.

This could also happen the other way around with players playing one handicap stone weaker (4k- playing as 5k), which might make serious slumps more frustrating, but at the same time might also help to recover if the player regains confidence from playing truly weaker players in even games.

In chess, the lack of handicaps has the advantage that one can increase (or ruin...) one's rating quite fast by playing significantly higher or lower rated players, and this concept would bring some of that to go without losing the advantages of proper handicap games. While it would more often lead to non-proper handicap games, that is not necessarily a bad thing, as the rating system will take those differences into account of course (and there is nothing wrong in essence with occasionally playing an easy or hard game, after all chess players do this almost every single time they play).

It might be harshest on the opponents of e.g. a 4k+ player, because they stand a lot to lose from losing against a 4k in an even game who might actually be stronger, but I'm sure that can be balanced out with some math geekery.

E.g. rating change could be less severe for the opponents of such a "tested" player, or corrected afterwards if the rating of the 4k+ actually changes (which I believe something like WHR would do anyway?).

Mef · **#29**

danielm wrote:

It might be harshest on the opponents of e.g. a 4k+ player, because they stand a lot to lose from losing against a 4k in an even game who might actually be stronger, but I'm sure that can be balanced out with some math geekery.

E.g. rating change could be less severe for the opponents of such a "tested" player, or corrected afterwards if the rating of the 4k+ actually changes (which I believe something like WHR would do anyway?).

For Whole-history and Decayed-history (KGS) there is no penalty for helping an underranked player get promoted, as ultimately the promotion is figured in with the ranking calculations. For incremental systems like Elo, or "win X number of games to promote", helping an underranked person earn a promotion requires a bit of altruism, as the risk/reward scenario is more one-sided.
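To put a number on that one-sidedness, here is a sketch using the textbook Elo formulas. The 400-point scale, K = 32, and the specific ratings are conventional chess-style values chosen for illustration, not parameters of any go server.

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard logistic Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating: float, expected: float, actual: float, k: float = 32.0) -> float:
    """Incremental update: the result is applied once and never revisited."""
    return rating + k * (actual - expected)

established = 1500.0
listed = 1500.0          # the opponent's displayed rating
true_strength = 1700.0   # what the underranked opponent is really worth

# Points the established player loses when judged against the listed rating...
loss_vs_listed = established - elo_update(
    established, elo_expected(established, listed), 0.0)
# ...versus what the loss would have cost had the opponent been rated correctly.
loss_vs_true = established - elo_update(
    established, elo_expected(established, true_strength), 0.0)
```

With these made-up numbers the loss costs 16 points instead of roughly 7.7, and an incremental system never hands the difference back once the opponent's rating catches up. A history-based system re-evaluates the same game against the opponent's later rating.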

shapenaji · **#30**

Herman: You know, I've often wondered if the best approach to rank is to track a player's distribution and mean, and then use Bayes' theorem to update the distribution based on the distribution of their defeated and victorious opponents.

Then you could look at each player's unique distribution, rather than just assuming everybody has a normal-distribution...
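A rough sketch of what such an update could look like on a discrete grid of strengths. Everything here is an assumption made for illustration: the grid, the logistic win-probability curve, and the function names are not taken from any existing rating system.

```python
import math

def win_probability(strength_a: float, strength_b: float, scale: float = 1.0) -> float:
    """Assumed logistic model: chance that A beats B given both strengths."""
    return 1.0 / (1.0 + math.exp(-(strength_a - strength_b) / scale))

def bayes_update(prior, opponent, grid, won: bool):
    """Posterior over a player's strength after one game.

    `prior` and `opponent` are probability lists over the same `grid`.
    The likelihood of the result is averaged over the opponent's distribution.
    """
    posterior = []
    for s_a, p_a in zip(grid, prior):
        p_win = sum(p_b * win_probability(s_a, s_b) for s_b, p_b in zip(grid, opponent))
        likelihood = p_win if won else 1.0 - p_win
        posterior.append(p_a * likelihood)
    total = sum(posterior)
    return [p / total for p in posterior]

def mean(dist, grid) -> float:
    return sum(s * p for s, p in zip(grid, dist))

grid = [-2.0, -1.0, 0.0, 1.0, 2.0]       # strengths in ranks, relative to some baseline
uniform = [0.2] * 5                      # we know nothing about this player yet
opponent = [0.0, 0.0, 1.0, 0.0, 0.0]     # opponent known to sit exactly at 0
after_win = bayes_update(uniform, opponent, grid, won=True)
```

After the win the mass shifts toward the stronger end of the grid, and nothing forces the resulting shape to be a normal distribution.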

Harleqin · **#31**

We do need ranks, even though we know that their meaning is purely statistical (as Herman pointed out).

We want these ranks to be correlated to "stones" difference, i.e. handicaps.

We can measure a system's accuracy, as wms outlined, by measuring its predictive power.

I believe that the best we can currently do is the following:

First of all, we need good data. Good data need to

* have standard conditions for each game,
* have as few isolated subpopulations as possible, and
* cover the whole handicap range.

To achieve that, I believe that a go server should derive its ratings only from the games of an ongoing tournament with standard settings. The pairing should be completely random each round, and full handicap given.

Second, we need a good model. A good model needs to

* use the data,
* have good predictive power,
* need as few non-data parameters as possible.

The following are assessments from my educated guesses. Please take them as a motivation for research.

I believe that all points-based models are bad. By points-based I mean WBaduk, IGS, Elo, Glicko. The reason for this assessment is that they all use a lot of arbitrary parameters that have no basis in the data. For example, the EGF system (which is a modified Elo) has at least three parameters that are completely arbitrary (a, con, and epsilon, plus some rules about rating resets). They seem to work, but that is actually very dependent on things that go beyond the pure game data: more or less isolated subpopulations, different strength improvements by region, different numbers of new players all make any hope of getting everything right for the whole of Europe completely futile. Another big problem is that after a result has been entered and has effected a rating change, it is forgotten. At each point, the history is discarded.

The KGS decayed history model is a big improvement, because it does not discard the data after processing. The continuous reassessment of a player's strength based on all games in his recent history has several advantages:

* It increases cohesion between subpopulations, because each single game continues to hold the two players together over a long time. This means that this model needs much fewer games to come to a good approximation.
* Players do not need to fear anything from other players with unclear ranks, because the games are not judged on their ranks at the time of playing, but on their most current rank. Bad rankings correct themselves automatically without any adverse effect. (This is why I believe that [xx?]-players should not be discriminated against like it is currently done on KGS, by the way.)

However, this model also has a flaw: it assumes that the real strength is a single value, and in determining that value just gradually forgets old data so that the value can move.

This is where the whole history rating comes in: it assumes that the real strength changes, and it thus regards it not as a single value, but as a continuous function of time. In comparison to the KGS system, it doesn't just add a new point at the end of the rank graph, but wiggles the whole line of the graph to fit all the data.

When you think about it, it seems obvious to me that it is a better model.

I believe that if we gather better data, as outlined at the start of this post, we may be able to see the differences between the predictive powers of our models more clearly. If I were to design a go server's ranking system now, I should use the ongoing tournament together with an implementation of WHR.

daniel_the_smith · **#32**

I think there is a better way to score rating systems than WMS's. A score for a rating system should be (how well it predicts outcomes) / (how much data it required to attain that level of prediction). That is, a rating system that predicts as well as Elo but only requires half as much data to make the prediction is twice as good as Elo-- it extracts twice as much useful information from the data. The important thing, in my mind, is to squeeze the maximum information out of the data you have. (Of course, WMS is measuring the first half of the equation-- which is light-years beyond the other servers, AFAIK.)

KGS already does decently with as few as 3 or 4 games, so it scores quite well (compare IGS, which at one point (still?) took 20 games).

Remi's paper shows WHR doing only marginally better than its competitors. However, I'm willing to bet that if he did more comparisons while giving the systems progressively worse data, WHR's margin would grow, perhaps significantly.

Therefore, I agree with everything Harleqin said with one exception: forcing the humans to change how they naturally want to play games is definitely not my preference.

Forcing humans to play a certain way (an ongoing explicitly paired tournament) would indeed give great data for calculating ratings, and perhaps it would be worth having a rating explicitly done that way-- you could have your "casual rating" and your "tournament rating".

flOvermind · **#33**

Harleqin wrote:

To achieve that, I believe that a go server should derive its ratings only from the games of an ongoing tournament with standard settings. The pairing should be completely random each round, and full handicap given.

I'm not sure what you mean by "ongoing tournament". You can't really force players to play pre-scheduled games; most players just want to be able to play whenever they want. And determining a matchup and making both players schedule a time for themselves won't ever work in practice, because of timezone differences or, more likely, just general laziness.

EDIT (after reading the post of Daniel): And that's not just "I don't like to force players to do X". That also has practical consequences. Would you prefer a few data points with high quality, or rather many more data points with not so good quality?

But wouldn't a simple "automatch" implementation fulfill all your conditions, as long as you don't allow restricting the opponents? That is, have a button "play against a random player who also happens to have pressed the button right now". True, that will somewhat segment your player base into timezones, but there is no way around that since you can't force players to be available 24 hours a day.

daniel_the_smith · **#34**

To be fair, an ongoing ratings tournament wouldn't need to be that inconvenient. Just, every 2 hours start a round, anyone who is online and wants to play pushes a button a few minutes before the round starts to be included in the round. Personally, I think that'd be really cool, actually-- is there any record to be broken for tournament with greatest number of rounds? :mrgreen:

ez4u · **#35**

AFAIK KGS is an "ongoing rating tournament". :scratch:

Everyone participates, if they wish to, by choosing to play "rated" games rather than "free" games. If you think that standard conditions are necessary to produce "correct" ratings, you must think that those correct ratings will only be useful for playing under standard conditions, right? So what are the proposed "standard" conditions?

tapir · **#36**

wms wrote:

Kaya.gs wrote:

First of all, how do we know a rating system is accurate? How can we compare accuracy between KGS and Wbaduk?

Basically, to determine the accuracy of a rating system, you feed in all but the last month's worth of games. Then you take the ranks it gives you, and use those ranks to determine who should win each game in the last month. More games predicted correctly means a better system.

I guess I am a bad case for any rating system. July record: 0:5, August record: 18:3.

But... I liked my proposal from the danigabi.gs thread. You have the data already, so setting up more than one algorithm to keep track of ranks shouldn't be impossible. We would learn something about the performance of the different systems as well. If you make "display glicko rating" an option for KGS+ users, all the rating nerds will join KGS+. Good news.

Another idea may be hiding the drift. You apply it every time a game is played, by adding or subtracting up to, say, 20% of the change effected by the actual game (and you have a variable somewhere telling you how much drift is still to be applied). The system remains basically the same, but it will feel different. Nobody will have the feeling that he got promoted for nothing or demoted although he didn't lose a game.
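One way to read that proposal, sketched as a small class. The class name, the field names, and the way drift is fed in are all hypothetical. The only things taken from the post are the pending-drift variable and the 20% cap per game.

```python
from dataclasses import dataclass

@dataclass
class DisplayedRank:
    """Rank as shown to the player, with drift held back until games are played."""
    value: float
    pending_drift: float = 0.0  # drift computed by the rating system but not yet shown

    def add_drift(self, amount: float) -> None:
        # Drift never moves the displayed value directly; it is only queued.
        self.pending_drift += amount

    def apply_game(self, game_change: float, max_share: float = 0.2) -> float:
        # Release at most `max_share` of the game's own change from the queue.
        budget = abs(game_change) * max_share
        released = max(-budget, min(budget, self.pending_drift))
        self.pending_drift -= released
        self.value += game_change + released
        return self.value

rank = DisplayedRank(value=5.0)
rank.add_drift(-0.1)                   # the system wants to drift the player down
shown_before_game = rank.value         # still 5.0: nothing visible happened
shown_after_game = rank.apply_game(0.2)  # a win worth +0.2 absorbs 0.04 of the drift
```

Because the released drift is capped at a fraction of the game's own change, a win still shows as a gain and a loss as a drop. The displayed rank only moves when the player actually plays.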

palapiku · **#37**

shapenaji wrote:

Herman: You know, I've often wondered if the best approach to rank is to track a player's distribution and mean. And then use bayes theorem to update the distribution based on the distribution of their defeated and victorious opponents.

Of course it is... why wouldn't it be?

tapir · **#38**

Kaya.gs wrote:

Back then, when playing with the danigabi[5d] account, I played certain 2ds giving them 3 handicap stones. I would win and lose, and I think I won a tad more than I lost (say 60%). The impressive part happens later. Right after losing a game, I would log back in with Rakuen[7d] and play the very same player with 6H. Suddenly, I would win almost 80%.

How is it possible that by giving many more stones, my chances to win go up? My current account, DexMorgan, has been brought up to 7d with a similar effect.

I don't know. But what you report here is an oddity in your gameplay or in that of the 2d, not in the KGS rating system. Also, I doubt you have a sample big enough to claim you have better results against the 2d's with 6 than with 3 handicap stones with any reasonable confidence. (Pushing accounts to high levels with a relatively small number of games is pretty popular, that you work harder in "important" games is understandable as well.)

zazen5 · **#39**

Rank is a useful tool to players to give not only better games but games that enable both players to learn and progress. Rank as an endpoint I believe is similar to people arguing about how much their house should be worth or that market value has any validity. Rank is ever changing, allowing you to determine if what you are doing during play and during training is having the effect that you want, similar to using a metronome during music practice. It's a gauge for information. There should be no emotion attached to it because there will always be someone worse or better than yourself, even on the pro level, if you reach that high.

ez4u · **#40**

Kaya.gs wrote:

...
Besides accounts being heavy and such, there is an impressive psychological aspect of the system that does not seem to affect point-based systems like those on Wbaduk or Tygem.

Back then, when playing with the danigabi[5d] account, I played certain 2ds giving them 3 handicap stones. I would win and lose, and I think I won a tad more than I lost (say 60%). The impressive part happens later. Right after losing a game, I would log back in with Rakuen[7d] and play the very same player with 6H. Suddenly, I would win almost 80%.

How is it possible that by giving many more stones, my chances to win go up? My current account, DexMorgan, has been brought up to 7d with a similar effect.

I think this is a specific anomaly of this history-based rating system, where the psychology of the players deeply affects the end results and hence its accuracy.
...

Call me anal, but I can't see a claim like this without wanting to check the facts. (That is also why I like GoGoD so much!) Happily we have the KGS Archives. Memory is a tricky little beast. I am sure that we have all had the experience of retelling moments of remembered glory over a beer only to find out afterwards that things weren't quite like that. So I was not too surprised that an examination of the archives for Rakuen and danigabi did not immediately turn up a lot of examples that fit the situation described above. Maybe kaya.gs could point out which games he was referring to? ;-)
