Alternate goals and alternate aims of rating systems

Mef · #1

A thread in the KGS forum has kicked off a couple of long discussions about rating systems. There were a few lines in a few comments that I wanted to respond to but didn't, because doing so would have been tangential and, honestly, overly pedantic. Nevertheless, because I think the conversations are worth having in their own right, I am starting a new thread. I'm not sure where the best forum for "general discussions of go rating systems" should be, so I have chosen the general forum as a default.

To provide a brief background for those who are uninterested in rating system argument minutiae:

- KGS has a sophisticated mathematical algorithm that aims to most accurately predict a game outcome between two arbitrary players on the server. The trade-off is that the system can seem intractable at times, much to the frustration of players who wish to make sense of what they must do to change their rank.

- Tygem has a very simple system that makes it easy to understand how and why ratings move, accepting the tradeoff of clarity for potentially inaccurate ranks and mismatches.

- Most other servers fall somewhere between these two ends of the spectrum. (The notable exception is GoShrine, which to my knowledge falls even farther toward that end of the spectrum than KGS, but it doesn't have quite the same player base, so you rarely hear complaints about it.)

These above mentioned bullet points I feel we can all more or less agree on, and I don't want this to be another thread hashing out those points. Instead, during the discussion some alternate possibilities for rating systems came up (sometimes in jest, but I think they are worth considering), so I thought it would be fun to outline various scenarios or goals you might have for a rating system and then discuss possible ways to achieve them. This will hopefully lead to some fun thought experiments and interesting discussion.

There were quite a few things I thought were interesting, but to avoid running off on too many tangents I'll start slow. The first one that I'll throw out to consider comes from one of Bantari's comments (emphasis mine):

Bantari wrote:

Let's look at your various playing "modes" hypothetically. Let's say that when you play only casual and fun games you play like 3d, when you play seriously you play like 5d, and when you play a mixture of both modes you play like 4d. This is how it can hypothetically look when taking your history into account, and this is pretty much what you are saying as well. Now what you seem to want is a system which lets you generally play in the mixed mode but ranks you as if you were constantly in the serious mode. This is not reasonable, and no system should do that.

Again, it's perhaps a bit pedantic of me... but I do try to err on the side against absolute statements, so this got me thinking: could there be a time when you do want to do this? And if so, how would you do it?

As a discussion starting point, I will posit a time when I think you may want to do this:

Imagine you are a go teacher and you have a class of pupils. You are aiming to select the most promising pupils, whom you will then encourage to move on to a more advanced group or perhaps take dedicated lessons. Here you would be trying to select for those who have the highest "peak" potential, so it may be useful to figure out who, when playing at their best, is the strongest (as opposed to who, on average, is strongest).

So, the questions now become:

- What type of rating system would be best for selecting these top "peak potential" candidates?
- What type of challenges might one face when implementing such a system?
- In what other situations might one want to separate the "strongest" one plays from the "average" one plays?

Aside from this, if anyone has some other interesting scenarios or other interesting goals a rating system may want to have, I would be interested in hearing them!

RobertJasiek · #2

Even if the basic theory of a rating system is sound, it can still be a failure if its parameters are set wrongly. For example:

* In a KGS-like rating system, the parameters are set wrongly if they force some of the players to play 1000+ games to improve a rank. 1000 is just an extreme number. The best maximal number for any player would be set by a compromise, so that undesired side effects, such as too many players restarting with new accounts to circumvent the rating system, do not occur.

* In a Tygem-like rating system, the parameters are set wrongly if one win or loss moves a player's rank by 7 ranks. Again, this is just an extreme number, but you get the idea: a good parameter must be set so that people do not run away from or circumvent the system.

Parameters are not god-given by the programmer; they must be evaluated and adjusted properly. This is so for every rating system. There is no good rating system without proper calibration of the parameters in their conflict with human preferences.
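
To make the parameter point concrete, here is a minimal sketch built on a generic Elo-style update. It is not the KGS or Tygem algorithm. The step size `k`, the assumed scale of 100 points per rank, and all function names are illustrative assumptions.

```python
def expected_score(rating: float, opponent: float, scale: float = 400.0) -> float:
    """Win expectancy of a generic Elo-style logistic model (an assumption, not any server's formula)."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / scale))


def update(rating: float, opponent: float, won: bool, k: float) -> float:
    """Move the rating toward the observed result by a step proportional to k."""
    return rating + k * ((1.0 if won else 0.0) - expected_score(rating, opponent))


def wins_needed(gap: float, k: float, limit: int = 100_000) -> int:
    """Count consecutive wins against equally rated opponents needed to gain `gap` points."""
    start = rating = 0.0
    for games in range(1, limit + 1):
        # The opponent is always assumed to match the player's current rating.
        rating = update(rating, rating, won=True, k=k)
        if rating - start >= gap:
            return games
    return limit
```

Under these assumptions each win against an equal opponent gains `k / 2` points, so with 100 points per rank a tiny `k` such as 0.2 needs on the order of 1000 straight wins per rank, while `k = 1400` moves a player 7 ranks in one game. Both extremes come from the same formula; only the parameter differs.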

DrStraw · #3

As Bantari says, it is not only unreasonable, it is impractical. A go server, by its very nature, cannot be a source for a reliable rating across all games. Serious competitive games are rarely played online, and when they are, real-world rankings are usually used to determine handicaps or, more likely, the game is just an even game. So expecting an online server to provide reliable rankings in all scenarios is simply impractical in my opinion. All they can be expected to do is provide a fairly accurate assessment of the handicap which should be used to provide an enjoyable game between two players.

I believe that playing online is an excellent way to improve one's skills, but I don't think that using online rankings to absolutely determine one's strength is a good idea. This can only be achieved by serious over-the-board play. If you really want to push for more reliability with them, you need to have an additional parameter in the setting up of an account. A mode of play would need to be selected for each account, and that account could only play games within certain time limits (the only way to judge seriousness online, as far as I can see). This would require everyone to have separate accounts for serious and fun games.

Bill Spight · #4

The question of determining peaks is an interesting one, and, as you say, relevant to the question of identifying potential and setting goals for improvement. Most of us underperform.

I am sure that there is a literature on this. When I was a kid, I read about how to walk along the beach and remain close to the incoming waves but not get your feet wet. You could see in the sand where the highest waves had come and could stay close to that line but on the dry side of it. (That worked because the tide did not change very quickly. It is not infallible, however. When I was in Hawai'i there were people who were sitting or lying on dry beach who were swept out to sea by killer waves.)

Economists may be interested in the potential performance of an economy or sector, and thus be interested in the upper envelope of trend data. The upper envelope of climate data is also important, as it is the extremes that cause or accompany natural disasters.

uPWarrior · #5

I guess you could design a rating system that provides people with a confidence interval instead of a single number, sort of not hiding the volatility of one's rating. In that scenario you could rate someone as [3d-5d] instead of 4d, or [4d-5d], or even [1k-5d] if they are somehow likely to be playing drunk.
I would say that trying to get any finer granularity (e.g. trying to keep a rating distribution per player) will likely be impossible due to lack of data.
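
A minimal sketch of the interval idea, assuming each game already yields a numeric performance estimate on a hypothetical scale where 1.0 to 2.0 is 1d, 0.0 to 1.0 is 1k, and so on. How those per-game numbers would be obtained is left open; the function names and the 10th/90th percentile cut-offs are assumptions.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, with p given in percent (0 < p <= 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def rating_interval(performances, low=10, high=90):
    """Return the (low, high) percentile band of per-game performance estimates."""
    return percentile(performances, low), percentile(performances, high)


def format_rank(value):
    """Convert a number on the assumed scale to a dan/kyu label (4.3 -> '4d', 0.5 -> '1k')."""
    n = math.floor(value)
    return f"{n}d" if n >= 1 else f"{1 - n}k"


def describe(performances):
    """Render the band in the bracket notation used above, e.g. '[3d-5d]'."""
    lo, hi = rating_interval(performances)
    return f"[{format_rank(lo)}-{format_rank(hi)}]"
```

With only a handful of games the band is very noisy, which is the same lack-of-data concern raised above for anything finer grained.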

Mef · #6

DrStraw wrote:

If you really want to push for more reliability with them...

I don't!

(at least not in this thread)

This was meant to be "What other goals might you want to achieve with a rating system?" It also wasn't meant to only apply to servers. Clubs might want to do something special too.

One simple example: a streak-limiting system. You could design a system that goes out of its way to pair people who are starting winning or losing streaks (aiming to keep the size and frequency of streaks at a minimum).
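
One way such a pairing rule might look, as a rough sketch only: rank players by their rating plus a small bonus per game of their current streak, then pair neighbours in that order, so a player on a winning run meets slightly tougher opposition and a player on a losing run slightly easier opposition. The `bump` size, the `Player` fields, and the neighbour-pairing rule are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    rating: float      # rating in rank units (hypothetical scale)
    streak: int = 0    # > 0: consecutive wins, < 0: consecutive losses


def pair_players(players, bump=0.25):
    """Pair neighbours after sorting by rating adjusted for the current streak."""
    ordered = sorted(players, key=lambda p: p.rating + bump * p.streak, reverse=True)
    # With an odd number of players the last one is left unpaired.
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]
```

Setting `bump=0` reduces this to ordinary nearest-rating pairing, so the streak-limiting behaviour is controlled by a single parameter.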

wineandgolover · #7

Mef wrote:

One simple example: a streak-limiting system. You could design a system that goes out of its way to pair people who are starting winning or losing streaks (aiming to keep the size and frequency of streaks at a minimum).

IMHO, you are overthinking this. Nobody is asking for streak busters. We just don't see the harm in promoting somebody on a hot streak. If the promotion is valid, they will prove it. If not, they will regress to their old rank naturally. Having a non-mathematically optimal rating for a few games doesn't hurt anybody. Encouraging improvement through positive reinforcement could have lasting value.

Your math ain't wrong. Your incentives are.

pwaldron · #8

Mef wrote:

So, the questions now become:

- What type of rating system would be best for selecting these top "peak potential" candidates?
- What type of challenges might one face when implementing such a system?
- In what other situations might one want to separate the "strongest" one plays from the "average" one plays?

Several years ago there was a good article in the Nordic Go Journal about the time evolution of people's ratings. It turns out that the data was well fit by a decaying exponential towards a terminal strength. If you're looking to predict peak rating then one way would be to fit the data we do have on a player's rating and extract the terminal strength value.
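
A minimal sketch of that fitting idea, assuming the curve has the form r(t) = r_inf - a * exp(-t / tau). This functional form, the parameter names, and the grid search over `tau` are assumptions for illustration and are not taken from the article.

```python
import math


def fit_terminal_strength(times, ratings, taus):
    """Fit r(t) = r_inf - a * exp(-t / tau) and return (r_inf, a, tau).

    For each candidate tau the model is linear in exp(-t / tau), so r_inf and a
    follow from ordinary least squares; the tau with the smallest squared error wins.
    """
    n = len(times)
    mean_r = sum(ratings) / n
    best_sse, best_fit = None, None
    for tau in taus:
        xs = [math.exp(-t / tau) for t in times]
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0:
            continue  # all x identical, the slope is undefined
        slope = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, ratings)) / sxx
        r_inf = mean_r - slope * mean_x  # intercept = rating as t grows without bound
        sse = sum((r_inf + slope * x - r) ** 2 for x, r in zip(xs, ratings))
        if best_sse is None or sse < best_sse:
            best_sse, best_fit = sse, (r_inf, -slope, tau)
    if best_fit is None:
        raise ValueError("need at least two distinct time points")
    return best_fit
```

The returned `r_inf` is the projected terminal strength. On real rating histories the estimate can be unstable when a player's curve has not yet started to flatten, so it is better treated as a rough projection than as a prediction.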

Mef · #9

wineandgolover wrote:

IMHO, you are overthinking this. Nobody is asking for...

Just to reiterate, because there still appears to be some confusion, I am not trying to start yet another thread about the same tired arguments about rating systems... There are plenty of other places to talk about those. This was meant as a thread to discuss other questions/issues/etc you might want a rating system to address in certain cases, and how you would go about doing this.

The streak busting and peak prediction are just two examples of alternate goals you might have in mind.

Another possible question you might want to answer - "Who was the most valuable player to their AGA city league team last year?"

This question would be fundamentally different from asking "Who performed the strongest?" and different still from "Who would we expect to win the championship if the games were replayed?"

DrStraw · #10

Mef wrote:

This was meant as a thread to discuss other questions/issues/etc you might want a rating system to address in certain cases, and how you would go about doing this.

Okay, since you put it that way, I only want a rating system to do one thing: give a good approximation of the handicap I should give or take against a potential opponent.

A corollary to this, but not part of the need per se, is that by determining handicaps I can also determine a rank relative to some strong player whose ability is stable. It is up to others to determine the stationary point that individual represents, but from that I can assign a dan/kyu rank to myself.

Anything beyond this is unnecessary from a rating system. A ranking system, on the other hand, is usually considered a lifetime achievement award and represents the highest level a person has achieved during their life. The two should not be confused.

Boidhre · #11

Mef wrote:

One simple example: a streak-limiting system. You could design a system that goes out of its way to pair people who are starting winning or losing streaks (aiming to keep the size and frequency of streaks at a minimum).

Isn't there natural streak limiting in go though? At sudden break points your handicap versus everyone else changes. I mean, you constantly see this on KGS, with a player having a good stretch brought to a sudden halt as soon as they dip their toe into the next rank. If the streaks are limited to ratings movements within one stone, is there a problem other than people not understanding statistics?

You could argue for moving to komi changes to reflect ratings changes, rather than dividing people into ranks one stone apart, but I'm not convinced this gives you much for the amount of complication it introduces.

hashimoto · #12

Mef wrote:

Just to reiterate, because there still appears to be some confusion, I am not trying to start yet another thread about the same tired arguments about rating systems...

Sorry, but it's hard to believe you when you throw in your own personal opinions from other threads:

Mef wrote:

KGS has a sophisticated mathematical algorithm that aims to most accurately predict a game outcome between two arbitrary players on the server.

Mef wrote:

Tygem has a very simple system that makes it easy to understand how and why ratings move, accepting the tradeoff of clarity for potentially inaccurate ranks and mismatches.

Mef wrote:

These above mentioned bullet points I feel we can all more or less agree on, and I don't want this to be another thread hashing out those points.

You argue that you're not trying to start another argument about rating systems while still ignoring the idea that your opinions may not be universal.

Bantari · #13

DrStraw wrote:

As Bantari says, it is not only unreasonable, it is impractical. A go server

Exactly!! I probably used the wrong word in saying "system" in my post when I meant a generic "server", which was the context of the discussion in which my statement was made.

As Mef said - there might be situations in which you want to look and evaluate peak plays and periods, but a generic Go server should never do that, imho.

DrStraw wrote:

, by its very nature, cannot be a source for a reliable rating across all games. Serious competitive games are rarely played online, and when they are, real-world rankings are usually used to determine handicaps or, more likely, the game is just an even game. So expecting an online server to provide reliable rankings in all scenarios is simply impractical in my opinion. All they can be expected to do is provide a fairly accurate assessment of the handicap which should be used to provide an enjoyable game between two players.

I believe that playing online is an excellent way to improve one's skills, but I don't think that using online rankings to absolutely determine one's strength is a good idea. This can only be achieved by serious over-the-board play. If you really want to push for more reliability with them, you need to have an additional parameter in the setting up of an account. A mode of play would need to be selected for each account, and that account could only play games within certain time limits (the only way to judge seriousness online, as far as I can see). This would require everyone to have separate accounts for serious and fun games.

This is what I think as well.

What I can also add is that it would help if separate ratings were kept for separate kinds of games, per account. So you don't need multiple accounts, just one account with a switch for what kind of game you wish to play at the moment. Blitz, small boards, and so on: all kinds of games which can introduce noise to a rating value. The ideal situation, imho, would be for each account to have multiple ratings, for example by time controls, board size, and seriousness of the game. I understand that would complicate matters tremendously in many ways, but it might be worth it. I think some chess servers already have something like that, or at least separate blitz and regular game ratings.
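
A rough sketch of what "one account, several ratings" could look like as a data structure. The `GameKind` categories, the field names, and the fallback to a default rating when a category has no games yet are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GameKind:
    """One rating category; frozen so it can be used as a dictionary key."""
    board_size: int      # e.g. 9, 13, 19
    time_control: str    # e.g. "blitz", "regular"
    serious: bool        # the "switch" chosen before the game


@dataclass
class Account:
    name: str
    default_rating: float
    ratings: dict = field(default_factory=dict)  # GameKind -> rating

    def rating_for(self, kind: GameKind) -> float:
        """Rating for this kind of game, or the default if none has been recorded yet."""
        return self.ratings.get(kind, self.default_rating)

    def record(self, kind: GameKind, new_rating: float) -> None:
        """Store the rating produced by whatever update rule the server uses."""
        self.ratings[kind] = new_rating
```

The data structure itself is simple. The harder part is that every extra category splits a player's games into smaller samples, so each individual rating is based on less data.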

Bantari · #14

Mef wrote:

wineandgolover wrote:

IMHO, you are overthinking this. Nobody is asking for...

Just to reiterate, because there still appears to be some confusion, I am not trying to start yet another thread about the same tired arguments about rating systems... There are plenty of other places to talk about those. This was meant as a thread to discuss other questions/issues/etc you might want a rating system to address in certain cases, and how you would go about doing this.

The streak busting and peak prediction are just two examples of alternate goals you might have in mind.

Another possible question you might want to answer - "Who was the most valuable player to their AGA city league team last year?"

This question would be fundamentally different from asking "Who performed the strongest?" and different still from "Who would we expect to win the championship if the games were replayed?"

Oh, I see what you mean.
I was misled by the thread title. To me a "rating system" in Go is a system which serves mainly to assign people some values to help them figure out handicaps.

What you are talking about is another kind of algorithm - I think to avoid confusion we should call it something else, but it might be that I have to widen my definition.

Be that as it may, any such algorithm will start with the same thing, the bulk data about games played, and then it will evaluate this data. A conventional rating algorithm will try to assign each player a value as stated above. But you are right: there might be many other goals for other algorithms. You mention the MVP goal, which is a good one. I can see other goals. In tournaments, you might want to give rewards based on longest winning streaks. Or in a club, for best improvement over a period of time. Or even for most games played. All of those are pretty trivial to program.

Very interesting.
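
As a small illustration of how simple such award statistics are, here is a sketch that computes longest winning streaks and games played from the bulk game data. It assumes the data is a chronological list of `(winner, loser)` name pairs; that format and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def longest_win_streaks(games):
    """Map each player to their longest run of consecutive wins."""
    current = defaultdict(int)
    best = defaultdict(int)
    for winner, loser in games:
        current[winner] += 1
        best[winner] = max(best[winner], current[winner])
        current[loser] = 0          # a loss resets the running streak
        best.setdefault(loser, 0)   # make sure winless players appear too
    return dict(best)


def games_played(games):
    """Count how many games each player took part in."""
    counts = Counter()
    for winner, loser in games:
        counts[winner] += 1
        counts[loser] += 1
    return counts
```

An award for best improvement over a period would additionally need rating snapshots at the start and end of the period, which this game list alone does not provide.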

Mef · #15

pwaldron wrote:

Several years ago there was a good article in the Nordic Go Journal about the time evolution of people's ratings. It turns out that the data was well fit by a decaying exponential towards a terminal strength. If you're looking to predict peak rating then one way would be to fit the data we do have on a player's rating and extract the terminal strength value.

Interesting... so if I understand this correctly, you're talking about using the slope of a traditional rating graph over time to project an ultimate maximum value? I could see this being useful for the issue as stated (find students who project high early and focus efforts on them). I could also see this being used to rate training programs (how do "ultimate peak" and "time to 95% of peak" compare before, during, and after training?).

My first thought was something similar to uPWarrior's idea: use a system with a confidence interval and see whose 75% projection (or whatnot) was highest.

ez4u · #16

pwaldron wrote:

Several years ago there was a good article in the Nordic Go Journal about the time evolution of people's ratings. It turns out that the data was well fit by a decaying exponential towards a terminal strength. If you're looking to predict peak rating then one way would be to fit the data we do have on a player's rating and extract the terminal strength value.

The article 'Search for a Universal Rating Progress Function' in this issue?
