All times are UTC - 8 hours [ DST ]
Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #41 Posted: Wed Mar 26, 2014 7:20 am 
Judan

Posts: 6160
Liked others: 0
Was liked: 788
Thanks, but these data can be influenced by the frequencies to attend tournaments / play regularly on servers, which might be lower for beginner ranks.

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #42 Posted: Wed Mar 26, 2014 7:21 am 
Lives with ko

Posts: 248
Liked others: 23
Was liked: 148
Rank: DGS 2 kyu
Universal go server handle: Polama
Mef wrote:
My comment was in reference to how you would expect the rank to be weighted similarly; the "one bad day" is going to account for ~5% of the total rank.

That said, I do enjoy posts that run the numbers and look at things analytically, so thank you! Your analysis up to here is correct, and you are further correct that this data would make it very easy to reject the claim that gnugo2 should be ranked as 11k over this stretch! However beyond that it is a little trickier...


It was actually the math that drew me in to this conversation =). I don't have particularly strong opinions on the KGS scoring algorithm, but I was intrigued by how extreme the probability was that this span would happen by chance.

Quote:
This is where we run into trouble. While we know that there was an implementation bug causing him to forfeit most of his games (and hence could be reasonably classified as 30k), the system does not have that benefit.


That's why I think it made such a good test case, because we know the truth and the algorithm didn't. If a player complains they should be 2 stones stronger, we don't know for certain if the algorithm is slow or they're overestimating their ability. Here we can compare the 'truth' to the system's perception of the truth.

Quote:
If you look at the calculations being performed mid-streak in this regard, it gets a little more reasonable. Assume that instead of being 11k, the bot is actually a very weak 12k; then suddenly your binomial computation of the odds that it goes 5-195 with a 90% expected loss rate gives a probability of something like 3x10^-5, not unreasonable to observe in a 17,000 game streak. This is of course before we attempt to explain 16,700 games of playing at 11k strength.


Yes, I agree that roughly 2 stones weaker would be on the low side of consistent with the evidence (I think a 10% win rate for 13 kyu vs 11 kyu is low though; the variation is higher in the ddk's). Also, I think looking at it mid-streak would actually exacerbate this case, because we could reasonably infer at least 1 stone's loss of strength sometime in the first 50 games or so, and if we implemented the change then, we'd continue to see the same bad performance against weaker and weaker opponents.

It's not that I think the bot should've plummeted to 30 kyu immediately, but 1/5 of a stone seems wildly low compared with the evidence.
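As a quick check of the binomial figure in the quote above: the numbers below (a 5-195 record and a 10% expected win rate) are taken from that quote, and the exact tail sum confirms the order of magnitude.

```python
from math import comb

# Tail of Binomial(n=200, p=0.10): chance of winning at most 5 of 200
# games with a 90% expected loss rate, as in the quote above.
n, p, max_wins = 200, 0.10, 5
prob = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(max_wins + 1))
print(f"P(wins <= 5) = {prob:.1e}")
```

This comes out on the order of 10^-5, consistent with the 3x10^-5 in the quote.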

Quote:
Another (oversimplified) way we could look at this is to approach the situation like one ez4u asked about a while back: what would you expect if you have a player who plays half the time >1D and the other half of the time <30k? In this case we have a player who plays 95% of the time like an 11k, and 5% like a 30k. If you were to predict the outcome of their next game, slotting them as a 12k for the over-under isn't too unreasonable.


The key, as I see it, is that the player isn't randomly playing badly 5% of the time, but is playing badly in contiguous bursts 5% of the time. Given that, your best prediction is to look at how he was doing in his most recent games.

That his results aren't inconsistent with being 2-3 stones off is, I suppose, part of the point. Let's say it was a human who played gigantic numbers of blitz 11kyu go and never improved. Finally, he started studying under a professional teacher, devoting his time to tsumego, playing carefully. He switches to slower games, so he's playing 1/5 as many games as before. Over a couple weeks he goes completely undefeated in every match. If the system isn't willing to say that's worth at least 2 stones, it seems to be way too rigid. The model seems perfectly tuned for a static rank, but it seems to be under-appreciating our ability to improve or get worse at the game in bursts.


This post by Polama was liked by: Bonobo
Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #43 Posted: Wed Mar 26, 2014 8:09 am 
Beginner

Posts: 3
Liked others: 0
Was liked: 0
KGS: arndt
I'm coming in late, and I haven't read the preceding discussion in detail, but has it been stated what the ranks of the users challenging the bot during its resigning period were? If they were 15k to 10k, why should it plummet to 30k rather than 15k?

Sorry if this was already asked and answered.

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #44 Posted: Wed Mar 26, 2014 8:13 am 
Oza

Posts: 2494
Location: DC
Liked others: 157
Was liked: 442
Universal go server handle: skydyr
Online playing schedule: When my wife is out.
Polama wrote:
Mef wrote:
My comment was in reference to how you would expect the rank to be weighted similarly; the "one bad day" is going to account for ~5% of the total rank.

That said, I do enjoy posts that run the numbers and look at things analytically, so thank you! Your analysis up to here is correct, and you are further correct that this data would make it very easy to reject the claim that gnugo2 should be ranked as 11k over this stretch! However beyond that it is a little trickier...


It was actually the math that drew me in to this conversation =). I don't have particularly strong opinions on the KGS scoring algorithm, but I was intrigued by how extreme the probability was that this span would happen by chance.

Quote:
This is where we run into trouble. While we know that there was an implementation bug causing him to forfeit most of his games (and hence could be reasonably classified as 30k), the system does not have that benefit.


That's why I think it made such a good test case, because we know the truth and the algorithm didn't. If a player complains they should be 2 stones stronger, we don't know for certain if the algorithm is slow or they're overestimating their ability. Here we can compare the 'truth' to the system's perception of the truth.

Quote:
If you look at the calculations being performed mid-streak in this regard, it gets a little more reasonable. Assume that instead of being 11k, the bot is actually a very weak 12k; then suddenly your binomial computation of the odds that it goes 5-195 with a 90% expected loss rate gives a probability of something like 3x10^-5, not unreasonable to observe in a 17,000 game streak. This is of course before we attempt to explain 16,700 games of playing at 11k strength.


Yes, I agree that roughly 2 stones weaker would be on the low side of consistent with the evidence (I think a 10% win rate for 13 kyu vs 11 kyu is low though; the variation is higher in the ddk's). Also, I think looking at it mid-streak would actually exacerbate this case, because we could reasonably infer at least 1 stone's loss of strength sometime in the first 50 games or so, and if we implemented the change then, we'd continue to see the same bad performance against weaker and weaker opponents.

It's not that I think the bot should've plummeted to 30 kyu immediately, but 1/5 of a stone seems wildly low compared with the evidence.

Quote:
Another (oversimplified) way we could look at this is to approach the situation like one ez4u asked about a while back: what would you expect if you have a player who plays half the time >1D and the other half of the time <30k? In this case we have a player who plays 95% of the time like an 11k, and 5% like a 30k. If you were to predict the outcome of their next game, slotting them as a 12k for the over-under isn't too unreasonable.


The key, as I see it, is that the player isn't randomly playing badly 5% of the time, but is playing badly in contiguous bursts 5% of the time. Given that, your best prediction is to look at how he was doing in his most recent games.

That his results aren't inconsistent with being 2-3 stones off is, I suppose, part of the point. Let's say it was a human who played gigantic numbers of blitz 11kyu go and never improved. Finally, he started studying under a professional teacher, devoting his time to tsumego, playing carefully. He switches to slower games, so he's playing 1/5 as many games as before. Over a couple weeks he goes completely undefeated in every match. If the system isn't willing to say that's worth at least 2 stones, it seems to be way too rigid. The model seems perfectly tuned for a static rank, but it seems to be under-appreciating our ability to improve or get worse at the game in bursts.


If this is perceived as a problem, it would seem that it could be fixed by changing the historical weighting algorithm so that older games expire or are devalued faster than they currently are, or so that the most recent X games carry more weight than normal.
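As a toy sketch of that idea (purely illustrative: this is not KGS's actual algorithm, and the half-life constants are invented), weight each game by an exponential decay in its age, so recent results dominate when the half-life is short:

```python
def weighted_win_rate(results, half_life):
    """Recency-weighted win rate.

    results: list of (games_ago, won) pairs; half_life: age in games
    at which a result's weight halves. Illustrative only.
    """
    decay = 0.5 ** (1.0 / half_life)
    num = den = 0.0
    for games_ago, won in results:
        w = decay ** games_ago
        num += w * (1.0 if won else 0.0)
        den += w
    return num / den

# A long ~50% stretch followed by 100 recent straight losses:
history = [(i, i % 2 == 0) for i in range(100, 300)]  # older games, ~50% wins
history += [(i, False) for i in range(100)]           # recent loss streak

print(weighted_win_rate(history, half_life=500))  # long memory: reacts slowly
print(weighted_win_rate(history, half_life=25))   # short memory: drops fast
```

The shorter the half-life, the faster the estimate tracks a genuine change in strength, at the cost of more noise from ordinary streaks.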

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #45 Posted: Wed Mar 26, 2014 8:48 am 
Lives in sente

Posts: 852
Location: Central Coast
Liked others: 201
Was liked: 333
Rank: KGS [-]
GD Posts: 428
Polama wrote:

That's why I think it made such a good test case, because we know the truth and the algorithm didn't. If a player complains they should be 2 stones stronger, we don't know for certain if the algorithm is slow or they're overestimating their ability. Here we can compare the 'truth' to the system's perception of the truth.

Quote:
If you look at the calcuations being performed mid-streak in this regard, it gets a little more reasonable. Assume that instead of being 11k, the bot is actually a very weak 12k then suddenly you perform your binomial computation to see what are the odds that it goes 5-195 with a 90% expected loss rate and the probability is something like 3x10^-5 not unreasonable to observe in a 17,000 game streak. This is of course before we attempt to explain 16,700 games of playing at 11k strength.


Yes, I agree that roughly 2 stones weaker would be on the low side of consistent with the evidence (I think 10% win rate for 13 kyu vs 11 kyu is low though, the variation is higher in the ddk's). Also, I think looking at it mid-streak would actually exacerbate this case, because we could reasonably infer at least 1 stones loss of strength sometime in the first 50 games or so, and if we implement the change then, we'd continue to see the same bad performance against weaker and weaker opponents.

It's not that I think the bot should've plummeted to 30 kyu immediately, but 1/5 a stone seems wildly low compared with the evidence.

Quote:
Another (oversimplified) way we could look at is is to approach this situation like one ez4u asked about a while back: What would you expect if you have a player who plays half the time >1D and the other half the time <30k? In this case we have a player who plays 95% of the time like a 11k, and 5% like a 30k. If you were to predict the outcome of their next game, slotting them as a 12k for the over-under isn't too unreasonable.


The key, as I see it, is that the player isn't randomly playing bad at 5% of the time, but is playing bad in contiguous bursts 5% of the time. Given that, your best prediction is to look how he was doing in his most recent games.

That his results aren't inconsistent with being 2-3 stones off is, I suppose, part of the point. Let's say it was a human who played gigantic numbers of blitz 11kyu go and never improved. Finally, he started studying under a professional teacher, devoting his time to tsumego, playing carefully. He switches to slower games, so he's playing 1/5 as many games as before. Over a couple weeks he goes completely undefeated in every match. If the system isn't willing to say that's worth at least 2 stones, it seems to be way too rigid. The model seems perfectly tuned for a static rank, but it seems to be under-appreciating our ability to improve or get worse at the game in bursts.



So now we are mixing lots of things --

First is that while we talk about it as making a shift at a certain point, KGS is actually constantly adjusting the predicted rank so that it fits the maximum likelihood of the data available. You are correct that in this case KGS would have continued to see bad results as it dropped the rank, and had the bot not been stopped after 1 day its rating drop would have continued accelerating (once again, we don't actually know just how far it dropped into the 12k range, just that it was somewhere between 0.2 and 1.3 of a stone). It's also worth noting that the performance observed wasn't necessarily indicative of a huge change, just one of perhaps 2-3 stones. The characterization of playing like a 30k was probably not entirely accurate; for instance, here is one of the games it won:




Now, moving on to your hypothetical, we are venturing into the part of these discussions I dislike: instead of having real data and real results being analyzed, we have a made-up situation, with vague "data" being presented, theorized behavior being speculated on, and then the rating system criticized for it. If you literally have a 3 stone step change overnight and change your playing patterns in order to exacerbate how long it will take the rating system to compensate... then yes, it is possible that you are a corner case that the system was not set up to handle in an ideal manner. This has happened before and it has disrupted the KGS rating system before, but to my knowledge it has never been by a person; it was when a stronger open-source bot was released, so all the GnuGo bots got upgraded (and were 2 stones stronger overnight).

A more realistic scenario would be something like some members of this forum have done: take a summer off to travel to China and study go. It's entirely plausible that after a 90 day break a player returns 3 stones stronger. Let's assume this player comes back to KGS and, as you say, plays games at 1/5 their original rate. If we assume this player was 5k or weaker before leaving, then by the end of the first week their "new" games will compose ~23.35% of their rank. By the end of two weeks >40% of their rank will be based on the new games. It will not take long for them to find a new equilibrium close to where they should be.

Nevertheless, even for the corner case KGS has a simple way to solve this problem: play games handicapped at the rating you think you should be! This will allow you to reach your equilibrium faster, and unlike many other rating systems, it does not penalize the opponents who help you get there.

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #46 Posted: Wed Mar 26, 2014 8:51 am 
Lives in sente

Posts: 852
Location: Central Coast
Liked others: 201
Was liked: 333
Rank: KGS [-]
GD Posts: 428
arndt wrote:
I'm coming in late, and I haven't read the preceding discussion in detail, but has it been stated what the ranks of the users challenging the bot during its resigning period were? If they were 15k to 10k, why should it plummet to 30k rather than 15k?

Sorry if this was already asked and answered.



Bots will typically play a wide range of players at a wide range of handicaps. For the most part this bot plays 5k-18k players at anywhere from -6H to +6H. The distribution of games between white/black was roughly equal, and for the most part at least they were the default handicaps assigned by KGS.

That said, you are correct: looking at just the win/loss rate over those games, you would probably expect this bot to get placed around 15k (3 stones below the 12k it was ranked at when its rating was dropping).

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #47 Posted: Wed Mar 26, 2014 9:05 am 
Beginner

Posts: 4
Liked others: 0
Was liked: 1
Mef wrote:
Now, moving on to your hypothetical, we are venturing into the part of these discussions I dislike: instead of having real data and real results being analyzed, we have a made-up situation, with vague "data" being presented, theorized behavior being speculated on, and then the rating system criticized for it.


Yes, you seem to prefer real data and real results... as long as those results fit within the narrow model of how you think ranking should behave. Of course you can find data to explain how the KGS ranking system works. You can't seem to see past the idea that the model itself may not be desirable for some people.

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #48 Posted: Wed Mar 26, 2014 9:16 am 
Gosei
User avatar

Posts: 1585
Location: Barcelona, Spain (GMT+1)
Liked others: 577
Was liked: 298
Rank: KGS 5k
KGS: RBerenguel
Tygem: rberenguel
Wbaduk: JohnKeats
Kaya handle: RBerenguel
Online playing schedule: KGS on Saturday I use to be online, but I can be if needed from 20-23 GMT+1
RobertJasiek wrote:
Thanks, but these data can be influenced by the frequencies to attend tournaments / play regularly on servers, which might be lower for beginner ranks.


Histogram of GoR for players with more than 5 tournaments, 40 breaks (4149 players)

Attachment: Screen Shot 2014-03-26 at 17.15.08.png

Multimodal anyway

_________________
Geek of all trades, master of none: the motto for my blog mostlymaths.net

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #49 Posted: Wed Mar 26, 2014 9:23 am 
Tengen

Posts: 4380
Location: North Carolina
Liked others: 499
Was liked: 733
Rank: AGA 3k
GD Posts: 65
OGS: Hyperpape 4k
I can't speak for Mef, but I'm perfectly willing to accept that some people would trade away a portion of the predictive power that KGS has for the predictability that other systems have. I don't think that trade-off is absolutely wrong, though it's not the one I'd make. But I also expect them to be clear that this is what they're doing and how it might work, and to be accurate about what KGS actually does.

_________________
Occupy Babel!

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #50 Posted: Wed Mar 26, 2014 9:37 am 
Lives in sente

Posts: 852
Location: Central Coast
Liked others: 201
Was liked: 333
Rank: KGS [-]
GD Posts: 428
hashimoto wrote:
Mef wrote:
Now, moving on to your hypothetical, we are venturing into the part of these discussions I dislike: instead of having real data and real results being analyzed, we have a made-up situation, with vague "data" being presented, theorized behavior being speculated on, and then the rating system criticized for it.


Yes, you seem to prefer real data and real results... as long as those results fit within the narrow model of how you think ranking should behave. Of course you can find data to explain how the KGS ranking system works. You can't seem to see past the idea that the model itself may not be desirable for some people.



I prefer real data and real results, period. When real data is unavailable, modeling and analysis are a reasonable substitute. I prefer having clearly stated, objectively measurable goals with which to evaluate a rating system. I strongly dislike vague hypotheticals and baseless speculation.

I understand completely that different people want different things out of rating systems and frequently acknowledge that. KGS's system strives for accuracy over noise. Many have stated they prefer noise because to them it is more fun. There's nothing wrong with that, it's just not a goal of KGS's rating system.

In KGS's subforum discussing KGS's rating system, I assume unless otherwise stated that we are evaluating systems based on prediction accuracy, because I assume it is well known what the rating system's aim is.

If we wanted to establish other objectively measurable goals for a rating system I would be happy to evaluate with those in mind, but for here and now I chose these.

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #51 Posted: Wed Mar 26, 2014 10:36 am 
Lives with ko

Posts: 248
Liked others: 23
Was liked: 148
Rank: DGS 2 kyu
Universal go server handle: Polama
Mef wrote:
Now, moving on to your hypothetical, we are venturing into the part of these discussions I dislike: instead of having real data and real results being analyzed, we have a made-up situation, with vague "data" being presented, theorized behavior being speculated on, and then the rating system criticized for it.


Ok, let's stay away from hypotheticals. What we strictly, factually know is that over 242 games this account was at least 3 stones weaker, potentially more depending on the exact nature of the bug. The length of the streak was such that bad luck is completely out of the question as an explanation. We can consider the previous 17,000 games, but again, the streak was long enough that we can very easily see that these are distinct distributions. Any statistician looking at the results would state that there's no longer a connection between the earlier record and the newest record.

I think the only possible conclusion without bringing in hypothetical factors is that the algorithm was wrong in this case. Given the time series of results, I would think a student in a statistics class would not be marked correct for estimating the win% of the next game against an 11 kyu at 30%, or whatever the lowest rank reached would suggest.
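One way to make that concrete is a likelihood comparison. The 5-195 record and the two win probabilities below are illustrative assumptions, not the bot's exact numbers:

```python
from math import log

# Log-likelihood of a 5-195 record under two hypotheses about the
# player's true win probability (both probabilities are assumptions).
wins, losses = 5, 195

def log_likelihood(p_win):
    return wins * log(p_win) + losses * log(1 - p_win)

ll_unchanged = log_likelihood(0.50)  # still playing at the old rank
ll_weaker = log_likelihood(0.05)     # several stones weaker

# A large positive gap means "weaker" fits the streak far better.
print(ll_weaker - ll_unchanged)
```

The gap is enormous, which is why a forecast that ignores the streak looks hard to defend.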

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #52 Posted: Wed Mar 26, 2014 11:05 am 
Gosei
User avatar

Posts: 1585
Location: Barcelona, Spain (GMT+1)
Liked others: 577
Was liked: 298
Rank: KGS 5k
KGS: RBerenguel
Tygem: rberenguel
Wbaduk: JohnKeats
Kaya handle: RBerenguel
Online playing schedule: KGS on Saturday I use to be online, but I can be if needed from 20-23 GMT+1
Polama wrote:
Mef wrote:
Now, moving on to your hypothetical, we are venturing into the part of these discussions I dislike: instead of having real data and real results being analyzed, we have a made-up situation, with vague "data" being presented, theorized behavior being speculated on, and then the rating system criticized for it.


Ok, let's stay away from hypotheticals. What we strictly, factually know is that over 242 games this account was at least 3 stones weaker, potentially more depending on the exact nature of the bug. The length of the streak was such that bad luck is completely out of the question as an explanation. We can consider the previous 17,000 games, but again, the streak was long enough that we can very easily see that these are distinct distributions. Any statistician looking at the results would state that there's no longer a connection between the earlier record and the newest record.

I think the only possible conclusion without bringing in hypothetical factors is that the algorithm was wrong in this case. Given the time series of results, I would think a student in a statistics class would not be marked correct for estimating the win% of the next game against an 11 kyu at 30%, or whatever the lowest rank reached would suggest.


A student in statistics wouldn't look at the data and say, "hey, this player is a sucker now!" Instead he'd fit hundreds of players' results and games with an ARMA or ARIMA process (for instance), and dismiss the error in this particular case as "well, fit happens."

_________________
Geek of all trades, master of none: the motto for my blog mostlymaths.net

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #53 Posted: Wed Mar 26, 2014 11:29 am 
Oza

Posts: 2494
Location: DC
Liked others: 157
Was liked: 442
Universal go server handle: skydyr
Online playing schedule: When my wife is out.
RBerenguel wrote:
Polama wrote:
Ok, let's stay away from hypotheticals. What we strictly, factually know is that over 242 games this account was at least 3 stones weaker, potentially more depending on the exact nature of the bug. The length of the streak was such that bad luck is completely out of the question as an explanation. We can consider the previous 17,000 games, but again, the streak was long enough that we can very easily see that these are distinct distributions. Any statistician looking at the results would state that there's no longer a connection between the earlier record and the newest record.

I think the only possible conclusion without bringing in hypothetical factors is that the algorithm was wrong in this case. Given the time series of results, I would think a student in a statistics class would not be marked correct for estimating the win% of the next game against an 11 kyu at 30%, or whatever the lowest rank reached would suggest.


A student in statistics wouldn't look at the data and say, "hey, this player is a sucker now!" Instead he'd fit hundreds of players' results and games with an ARMA or ARIMA process (for instance), and dismiss the error in this particular case as "well, fit happens."


I would additionally point out that given a long term perspective, the data for the loss streak is a fluke and should not be counted heavily. The corollary to "the rank should have dropped at least X stones" is that as soon as the 12 hour issue was over, the new rank would be equally off from the presumably correct rank where it was before the problem occurred. I suppose you could argue that it would then be a feature of the proposed system's volatility that the rank goes back up relatively quickly, but it seems like it would be better not to have the huge rating discrepancy in the first place. If you look at 24 hours before the occurrence and 24 hours after, the ratings system as it is seems significantly more correct than a proposed more volatile one.

Looking at the somewhat different rating systems used for DGS and the old OGS (not sure about the new one) they both suffer from problems where a player stops playing, and loses some large number of games that time out over the weeks that they are gone. By having their rank drop 10 or more stones at a blow, when they start playing again, the act of them fighting their way back up to the old rank destabilises the entire ranking system to some degree, as all the ranks get corrected, and may end up skewing it in one direction or another over time.

Going with the assumption that rank differences should be relatively predictive of game outcomes, why is this a good thing? And as Mef mentioned, if you don't think that rank differences should be relatively predictive of game outcomes, you should be looking at a different system, or asking if you actually need to worry about rank at all, rather than the KGS one which has this explicit goal.


This post by skydyr was liked by: RBerenguel
Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #54 Posted: Wed Mar 26, 2014 12:14 pm 
Lives with ko

Posts: 248
Liked others: 23
Was liked: 148
Rank: DGS 2 kyu
Universal go server handle: Polama
RBerenguel wrote:
A student in statistics wouldn't look at the data and say, "hey, this player is a sucker now!" Instead he'd fit hundreds of players' results and games with an ARMA or ARIMA process (for instance), and dismiss the error in this particular case as "well, fit happens."


I'm not a statistician so I can't speak with any authority, but I think an advanced statistical model would view this case as a meaningful shift, as an extreme outlier, or at least as not conforming to the expected distribution.

skydyr wrote:
I would additionally point out that given a long term perspective, the data for the loss streak is a fluke and should not be counted heavily. The corollary to "the rank should have dropped at least X stones" is that as soon as the 12 hour issue was over, the new rank would be equally off from the presumably correct rank where it was before the problem occurred.


My point is that I think people are underestimating how powerful a signal a loss streak of this magnitude really is. I understand we have priors that say "people don't drop multiple stones all at once", but this should overwhelm those priors. We wouldn't expect this sort of streak by chance even with trillions of go players. Something definitely happened above and beyond a bad day. In this case it was counteracted the next day, but I see no reason to assume that such a shift will inherently be followed immediately by a recovery. I'd bet there's never been a streak with a hundredth of this probability in an established, non-robot account.
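To put a rough number on that intuition (assuming, for illustration only, roughly even games over the 242-game span mentioned earlier):

```python
from math import comb, log10

# Chance an unchanged player wins at most 5 of 242 roughly-even games,
# and the expected number of such streaks among 10^12 players.
# The streak length and 50% win chance are illustrative assumptions.
n, p, max_wins = 242, 0.5, 5
prob = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(max_wins + 1))
print(f"per-player chance ~ 10^{log10(prob):.0f}")
print(f"expected among 10^12 players ~ 10^{log10(prob) + 12:.0f}")
```

Even with a trillion players, the expected number of such streaks is vanishingly small, so chance alone can't explain the record.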

Now, this case is a bizarre one. It's an extreme edge case. I'm fine with the algorithm not handling it well. I find it interesting for its extremeness, and it's not something you should draw conclusions from for general players. My point was merely that I wouldn't hold this up as an example of the algorithm getting an extreme case right. I think it got an extreme case wrong in the way I'd expect it to get it wrong.

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #55 Posted: Wed Mar 26, 2014 12:24 pm 
Gosei
User avatar

Posts: 1585
Location: Barcelona, Spain (GMT+1)
Liked others: 577
Was liked: 298
Rank: KGS 5k
KGS: RBerenguel
Tygem: rberenguel
Wbaduk: JohnKeats
Kaya handle: RBerenguel
Online playing schedule: KGS on Saturday I use to be online, but I can be if needed from 20-23 GMT+1
Polama wrote:
RBerenguel wrote:
A student in statistics wouldn't look at the data and say, "hey, this player is a sucker now!" Instead he'd fit hundreds of players' results and games with an ARMA or ARIMA process (for instance), and dismiss the error in this particular case as "well, fit happens."


I'm not a statistician so I can't speak with any authority, but I think an advanced statistical model would view this case as a meaningful shift, as an extreme outlier, or at least as not conforming to the expected distribution.


I'm no statistician either, but I know a little about it (and I know people who are into statistical modelling.) Models, as such, are general. Outliers? Well, they are outliers. A model only needs to model most of the subjects; if it does a good job with most subjects, it is a good model. Probably there's a better model than KGS's current one (one that takes into account history, weights, fast improvement, etc.), but it is probably too hard to be worth finding.

_________________
Geek of all trades, master of none: the motto for my blog mostlymaths.net

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #56 Posted: Wed Mar 26, 2014 12:47 pm 
Lives with ko

Posts: 248
Liked others: 23
Was liked: 148
Rank: DGS 2 kyu
Universal go server handle: Polama
RBerenguel wrote:
Polama wrote:
RBerenguel wrote:
A student in statistics wouldn't look at the data and say, "hey, this player is a sucker now!" Instead he'd fit hundreds of players' results and games with an ARMA or ARIMA process (for instance), and dismiss the error in this particular case as "well, fit happens."


I'm not a statistician so I can't speak with any authority, but I think an advanced statistical model would view this case as a meaningful shift, as an extreme outlier, or at least as not conforming to the expected distribution.


I'm no statistician either, but I know a little about it (and I know people who are into statistical modelling.) Models, as such, are general. Outliers? Well, they are outliers. A model only needs to model most of the subjects; if it does a good job with most subjects, it is a good model. Probably there's a better model than KGS's current one (one that takes into account history, weights, fast improvement, etc.), but it is probably too hard to be worth finding.


Agreed. Models usually aren't judged on their handling of outliers, although there are obvious differences in how well they handle them. In some fields outliers are exactly what you're most interested in, though that's not the case here.

But the performance of the KGS algorithm on an extreme outlier is specifically what this thread was created about. And although I agree it isn't particularly important, it's interesting. There's clearly disagreement on how the ranking of this performance should have gone. I'm arguing that, given the full KGS records and asked about this account at that point in time, I certainly wouldn't say he was off by less than a stone. That's not meant to imply it's a bad algorithm, just that in the case under discussion, I disagree with the conclusion.
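For concreteness, here is a rough sketch of the kind of outlier computation being discussed. It uses Python and the 5-195 record mentioned earlier in the thread; the per-game win rates are illustrative assumptions of mine, not anything the KGS algorithm actually computes:

```python
from math import comb

def tail_prob(wins, games, p):
    """P(observing `wins` or fewer wins in `games` independent
    games, given per-game win probability `p`)."""
    return sum(comb(games, k) * (p ** k) * ((1 - p) ** (games - k))
               for k in range(wins + 1))

# A 5-195 record, evaluated under two hypothetical win rates:
even = tail_prob(5, 200, 0.5)   # vs. rated-even opponents
weak = tail_prob(5, 200, 0.1)   # a player already losing 90% of games

print(f"p=0.5: {even:.3g}")  # astronomically small
print(f"p=0.1: {weak:.3g}")  # tiny, yet conceivable over thousands of games
```

Under the even-game assumption the record is essentially impossible by chance; under the already-weak assumption it is merely very unlikely, which is the distinction drawn earlier in the thread.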

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #57 Posted: Wed Mar 26, 2014 12:56 pm 
Gosei
User avatar

Posts: 1639
Location: Ponte Vedra
Liked others: 642
Was liked: 490
Universal go server handle: Bantari
hashimoto wrote:
You can't seem to see past the idea that the model itself may not be desirable for some people.

This seems to be the crux of the argument here. Yes, some people prefer fun over accuracy and predictive power, and such people have, for example, Tygem to have fun on. However, some people prefer accuracy and predictive power over wildly inaccurate ratings, and such people have, for example, KGS to play on.

I see absolutely no reason why all servers should be the same, catering to one specific group of people and making sure *those* selected people have more fun. It is a big world, and there is certainly room for a few *different* approaches. Especially since the preference for one over the other seems purely subjective, at least over a short stretch.

As for rank stability and inertia, I think both systems have advantages and disadvantages. For improperly ranked players, the KGS system offers much quicker adjustment than Tygem (as already stated). For properly rated players, a single winning/losing streak (like a bad or good day) can dislodge them from their proper rank much faster - and thus makes players "improperly" rated much more easily. Wait... ok, it seems one system has more advantages than the other.

Somebody called Tygem ratings a roulette. It's fun, it's fast-paced, and it's exciting, and so there is a place for it in the world. Just like there is a place for arcade games and shoot-'em-up galleries and maybe even peep shows.

Personally, I think accuracy and predictive power are more valuable than the cheap thrills of seeing the numbers by your name change daily. But that's just me. Or is it?

How about real-world ratings? Let's say RJ is 5d, and on the verge of being invited to a prestigious tournament based on this rank. But look, there is a 4d player who just won 20 games in a row from his friend, and now he is invited instead, as a 6d. Ha ha ha, very exciting! So in reality, if you adopt a +/-0.1-per-game rating system, you will have to include all kinds of weights, checks, balances, and factors - just to make it behave more sensibly, more like the current system (or like the KGS system).

Arcade games and cheap thrills are fine on a server, and I am glad such a server exists for those who like that kind of thing. But this simply cannot be the *only* model we use, nor even the main one. This is what I think, even though I am not going to get into all this math stuff - I have enough of that at work without playing with it in my free time.
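The dislodging described above is easy to see in a toy simulation. This sketch is my own illustration of a flat +/-0.1-per-game scheme, not Tygem's actual formula: a correctly rated player who genuinely wins 50% of even games still performs a random walk and drifts away from their true rank, and a 20-game winning streak moves a rank by exactly 20 x 0.1 = 2 stones, which is the 4d-to-6d jump in the example.

```python
import random

def simulate(true_rank, games, step=0.1, seed=0):
    """Flat-step scheme: win -> +step, loss -> -step.
    The player genuinely wins 50% of rated-even games."""
    rng = random.Random(seed)
    rank = true_rank
    worst_drift = 0.0
    for _ in range(games):
        rank += step if rng.random() < 0.5 else -step
        worst_drift = max(worst_drift, abs(rank - true_rank))
    return rank, worst_drift

final, drift = simulate(true_rank=4.0, games=500)
print(f"final rank: {final:.1f}, worst drift: {drift:.1f} stones")
```

Over a few hundred games the expected drift scales like 0.1 times the square root of the number of games, so wandering a stone or two with no change in real strength is routine.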

_________________
- Bantari
______________________________________________
WARNING: This post might contain Opinions!!

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #58 Posted: Wed Mar 26, 2014 1:07 pm 
Judan

Posts: 6160
Liked others: 0
Was liked: 788
Bantari, my proposal is not meant for accurate real world ranks.

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #59 Posted: Wed Mar 26, 2014 1:19 pm 
Gosei
User avatar

Posts: 1639
Location: Ponte Vedra
Liked others: 642
Was liked: 490
Universal go server handle: Bantari
RobertJasiek wrote:
Bantari, my proposal is not meant for accurate real world ranks.

This is obvious.

What you need to explain is why you would want such a system - one that is not good enough for the real world - implemented on every server.
And before you object - if you only want it on *some* servers, it already is - on Tygem, no?
So I don't get what the fuss is about. Just play there and be happy as a clam.

You certainly have to admit that there *is* room for a major server with a more accurate, real-world-like rating system.
Or do you not want to admit that, and is this the point of contention?

_________________
- Bantari
______________________________________________
WARNING: This post might contain Opinions!!

Offline
 Post subject: Re: A Curious Case Study in KGS Ranks
Post #60 Posted: Wed Mar 26, 2014 1:50 pm 
Judan

Posts: 6160
Liked others: 0
Was liked: 788
Bantari wrote:
What you need to explain is why you would want such a system - one that is not good enough for the real world - implemented on every server.
And before you object - if you only want it on *some* servers, it already is - on Tygem, no?
So I don't get what the fuss is about. Just play there and be happy as a clam.

You certainly have to admit that there *is* room for a major server with a more accurate, real-world-like rating system.


There are also other reasons why I do not play much on other servers, such as strongly disliking having to use different software for every server.

There are other reasons to like KGS, so I want the worst part of KGS (the rating system) to improve so that I can better enjoy the good features of KGS.

Whether my rating proposal or something similar is adopted on KGS is not so important. It is also a thought model meant to encourage overcoming the excessive rating stability that affects quite a few players. I made other proposals that were rejected, but it is not the specific proposal that matters. Instead, it is the aim of overcoming the problem.

The proposal is a rough draft; I do not mind if it is improved, completed, or changed to also model accuracy to some reasonable extent, etc.

I have not said that one system must be used on all servers. You have made this up.

There is room for a server with real-world ratings. In fact, there is so much room that such a server does not even remotely exist. Don't even try to pretend that KGS is such a server; that would be ridiculous. On KGS, equally ranked players can easily be 5 real-world ranks apart.

There is also room for a server with accurate ratings, i.e., where almost equal ratings imply winning chances close to 50% (in even games with non-integer komi). As before, such a server does not exist. Which "accuracy" do you ascribe to KGS? Your dream of how accurate it should be? Or accuracy measured by letting two KGS players also play real-world games and assessing their winning percentages?

My system (when worked out to have global non-deflationary stability) would have much greater volatility, but I am not at all convinced it would have lower accuracy. Rather, I think that, on average for every particular player, it would have greater accuracy, because it can correct his temporarily wrong ratings much more quickly.
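The claim that greater volatility can mean quicker correction of a temporarily wrong rating can be illustrated with a toy Elo-style update. The K-factors, the 400-point logistic curve, and the deterministic "average game" shortcut below are all my own assumptions for this sketch, not the KGS algorithm or the proposal itself:

```python
def expected_score(r_player, r_opp):
    """Standard logistic expectation (Elo-style), 400 points per order
    of magnitude in odds."""
    return 1.0 / (1.0 + 10 ** ((r_opp - r_player) / 400))

def games_to_converge(start, true, k, tolerance=50):
    """Deterministic sketch: each game, the rating moves by K times the
    player's average surplus score against opponents at their current
    rating. Returns the number of games until the rating is within
    `tolerance` points of true strength."""
    r = start
    for game in range(1, 10001):
        r += k * (expected_score(true, r) - 0.5)
        if abs(r - true) < tolerance:
            return game
    return None

# A player whose true strength jumped 400 points (roughly 4 stones):
print(games_to_converge(1000, 1400, k=16))  # stable, low-volatility system
print(games_to_converge(1000, 1400, k=64))  # volatile system: converges faster
```

With these numbers the volatile system reaches the new strength several times faster; the trade-off, as noted earlier in the thread, is that it also lets correctly rated players wander.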
