Whole History Rating open source implementation.

yoyoma · **#21**

pete wrote:

yoyoma wrote:

player gives anchor 600 Elo advantage (maybe 3 stones, depends on your model), 50% wins

Yes, giving it crazy input can produce crazy output. I should have documented the handicap parameter better, but I'm not sure what the exact range is.

On GoShrine, handicap values for a single stone range roughly between 30-60 elo, depending on the strength of the players. So 600 elo is quite a bit. I've yet to see it go unstable with real data.

I'm guessing what is happening is that we're running into floating point precision issues with certain params. If stability does end up being a problem for real data, I'd definitely take a deeper look at it, though I might need some help from Remi.

-Pete

60 elo for a handicap stone? That is far too low! KGS uses 148 per rank for 30k-5k, and 226 per rank for 2d+ (The constants are given in a different form here http://senseis.xmp.net/?KGSRatingMath log(e^0.85)*400=148 to convert to Elo form). EGF uses similar numbers. Besides, even using your too low value of 60, this is just a 10 rank improvement over 6 months. It's very easy to go from 25kyu to 15kyu in 6 months.

quantumf · **#22**

So after 5 games (3 wins 2 losses) I still don't have a rank. This is somewhat frustrating and not encouraging me to carry on trying. In general I prefer servers that allow one to self-select a starting rank, and find KGS quite annoying, but even KGS gives me a rank after 2 games. Kind of off-topic, but relevant in the sense that there are usability considerations that override perfection/accuracy in ranking systems.

Rémi · **#23**

yoyoma wrote:

I found what look like some numerical stability problems. I had similar problems when I implemented this as well with Newton's method failing or oscillating.

Newton's method is very efficient but tricky. In order to guarantee it works, it is necessary to check that the Newton iteration brings an improvement in the log-likelihood. If it does not, a fallback method should be used (such as a line search in the gradient direction).

IIRC, in my implementation I add a small negative constant to the diagonal of the Hessian before inversion. This prevents instability very well, at almost no cost in terms of efficiency. Maybe a good fallback method would be to increase the magnitude of this additional diagonal term until the Newton step increases the log-likelihood.

Rémi
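A minimal sketch of the safeguard described in #23, on a one-dimensional toy problem (one player's natural rating, opponents fixed). It is not Rémi's implementation and not code from the whr gem; all method names and the damping schedule (start at 0.001, multiply by 10) are assumptions made for illustration.

```ruby
# Toy Bradley-Terry problem: estimate one natural rating r from games against
# opponents of known natural rating. games is an array of [opponent, won] pairs.
# All names below are hypothetical and do not come from the whr gem.

def log_likelihood(r, games)
  games.sum do |opp, won|
    p_win = 1.0 / (1.0 + Math.exp(opp - r))
    Math.log(won ? p_win : 1.0 - p_win)
  end
end

# First derivative of the log-likelihood with respect to r.
def gradient(r, games)
  games.sum do |opp, won|
    p_win = 1.0 / (1.0 + Math.exp(opp - r))
    (won ? 1.0 : 0.0) - p_win
  end
end

# Second derivative (the 1x1 "Hessian"); it is never positive.
def hessian(r, games)
  games.sum do |opp, _won|
    p_win = 1.0 / (1.0 + Math.exp(opp - r))
    -p_win * (1.0 - p_win)
  end
end

# One Newton step with a negative constant added to the Hessian. If the step
# does not improve the log-likelihood, the constant is made larger and the
# step is retried, which shrinks it toward a small move along the gradient.
def safeguarded_newton_step(r, games, damping: 0.001, max_tries: 30)
  current = log_likelihood(r, games)
  g = gradient(r, games)
  h = hessian(r, games)
  max_tries.times do
    candidate = r - g / (h - damping)
    return candidate if log_likelihood(candidate, games) >= current
    damping *= 10.0
  end
  r # no improving step found; keep the current estimate
end

games = [[0.0, true], [0.0, true], [0.0, false]]
r = 0.0
20.times { r = safeguarded_newton_step(r, games) }
puts r # converges to ln(2), the rating at which the win rate is 2/3
```

Starting far from the optimum (for example r = 20 with the same games), the undamped step overshoots badly, and the loop above only accepts a step once the damping has grown enough for the log-likelihood to improve.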

Rémi · **#24**

hyperpape wrote:

So you're not asserting this is true of ordinary players who have (previously) played opponents selected more or less at random?

If you don't play on KGS, your rating will improve like your opponents.

Rémi

pete · **#25**

yoyoma wrote:

60 elo for a handicap stone? That is far too low! KGS uses 148 per rank for 30k-5k, and 226 per rank for 2d+ (The constants are given in a different form here http://senseis.xmp.net/?KGSRatingMath log(e^0.85)*400=148 to convert to Elo form). EGF uses similar numbers. Besides, even using your too low value of 60, this is just a 10 rank improvement over 6 months. It's very easy to go from 25kyu to 15kyu in 6 months.

KGS and WHR have a different elo scale, I believe. The total spread of ranks from my 40k games on GoShrine is ~2000 elo which, if spread evenly, is 50 elo per rank/stone.

Again, I'd be interested to know if you see this with real data. The example you propose is not completely impossible in general usage, but is one I would certainly not see on GoShrine (600 elo is a 15 stone handicap at 25 kyu).

-Pete

Rémi · **#26**

pete wrote:

yoyoma wrote:

60 elo for a handicap stone? That is far too low! KGS uses 148 per rank for 30k-5k, and 226 per rank for 2d+ (The constants are given in a different form here http://senseis.xmp.net/?KGSRatingMath log(e^0.85)*400=148 to convert to Elo form). EGF uses similar numbers. Besides, even using your too low value of 60, this is just a 10 rank improvement over 6 months. It's very easy to go from 25kyu to 15kyu in 6 months.

KGS and WHR have a different elo scale, I believe. The total spread of ranks from my 40k games on GoShrine is ~2000 elo which, if spread evenly, is 50 elo per rank/stone.

Again, I'd be interested to know if you see this with real data. The example you propose is not completely impossible in general usage, but is one I would certainly not see on GoShrine (600 elo is a 15 stone handicap at 25 kyu).

-Pete

How did you select the volatility meta-parameter of WHR? handicap values?

In my experiments, it was very clear that the handicap value changes a lot with player strength, and also volatility. When choosing the volatility in order to optimize prediction quality over the KGS database, it was too low (14 Elo^2/Day) for beginners, so it produced very "compressed" ratings.

For a rating system to properly understand the variations of strength in a pool of players that mixes beginners and experts, it is really necessary to consider that the strengths of beginners change faster than the strengths of experts.

Rémi

pete · **#27**

quantumf wrote:

So after 5 games (3 wins 2 losses) I still don't have a rank. This is somewhat frustrating and not encouraging me to carry on trying. In general I prefer servers that allow one to self-select a starting rank, and find KGS quite annoying, but even KGS gives me a rank after 2 games. Kind of off-topic, but relevant in the sense that there are usability considerations that override perfection/accuracy in ranking systems.

Thanks for the feedback, quantum. I'm leaning towards implementing what Remi suggested about using the lower confidence bound as the rating, which would give you a rank much sooner (though probably lower than your actual rank).

pete · **#28**

Rémi wrote:

How did you select the volatility meta-parameter of WHR? handicap values?

I did some optimization runs, and came up with 300 Elo^2/day, somehow. You can configure the library like this:

Code:

@whr = WholeHistoryRating::Base.new(:w2 => 300)

I know 300 seems like a lot. But it does still seem to produce sensible results, and allows beginners to make more rapid progress.

BTW, yoyoma, if you bump :w2 down below 100, your example remains stable.
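For a sense of scale: WHR models a player's rating as a Wiener process, so under that prior the rating change over t days has variance w2 * t (in Elo^2 when w2 is in Elo^2/day). The sketch below just evaluates that square root; `drift_std_elo` is a made-up helper name, not something the library provides.

```ruby
# Standard deviation (in Elo) of the rating drift the prior allows over a
# number of days, for a volatility w2 given in Elo^2/day.
# drift_std_elo is a hypothetical helper, not part of the whr gem.
def drift_std_elo(w2, days)
  Math.sqrt(w2 * days)
end

puts drift_std_elo(300, 180).round # 232
puts drift_std_elo(14, 180).round  # 50
```

Under these assumptions, a 600 Elo gain in six months is roughly 2.6 standard deviations of the prior at 300 Elo^2/day, and about 12 at 14 Elo^2/day.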

Rémi wrote:

In my experiments, it was very clear that the handicap value changes a lot with player strength, and also volatility. When choosing the volatility in order to optimize prediction quality over the KGS database, it was too low (14 Elo^2/Day) for beginners, so it produced very "compressed" ratings.

For a rating system to properly understand the variations of strength in a pool of players that mixes beginners and experts, it is really necessary to consider that the strengths of beginners change faster than the strengths of experts.

Rémi

Are you working on a new version of WHR that takes this into consideration?

pete · **#29**

As an aside, I'm glad to finally have some questions and feedback on this code that I struggled to write.

I'm certainly open to the possibility that there may be mistakes in the code, and would love to have someone other than me look it over. That's one of the reasons I open sourced it. If you see anything, or have questions, send a pull request on github, or just send me an email.

-Pete

yoyoma · **#30**

pete wrote:

yoyoma wrote:

60 elo for a handicap stone? That is far too low! KGS uses 148 per rank for 30k-5k, and 226 per rank for 2d+ (The constants are given in a different form here http://senseis.xmp.net/?KGSRatingMath log(e^0.85)*400=148 to convert to Elo form). EGF uses similar numbers. Besides, even using your too low value of 60, this is just a 10 rank improvement over 6 months. It's very easy to go from 25kyu to 15kyu in 6 months.

KGS and WHR have a different elo scale, I believe. The total spread of ranks from my 40k games on GoShrine is ~2000 elo which, if spread evenly, is 50 elo per rank/stone.

Again, I'd be interested to know if you see this with real data. The example you propose is not completely impossible in general usage, but is one I would certainly not see on GoShrine (600 elo is a 15 stone handicap at 25 kyu).

-Pete

I like playing around with rating math, sorry for the tldr text.

I did convert from the KGS scale to the standard Elo scale, and it looks like your WHR code handicap parameter takes a standard Elo scale number.
KGS: P = 1 / ( 1 + e^(k*(RankB-RankA)) ) [k=0.85 for 30k-5k, k=1.3 for 2d+]
Elo: P = 1 / ( 1 + 10^((RankB-RankA)/400) )

So for kyu players and 1 rank difference: RankB-RankA=1 and k=0.85. Then you can solve for what the Elo difference is. EGF has some statistics on even games here: http://gemma.ujf.cas.cz/~cieply/GO/statev.html
Generally for weaker kyu players the chance of upset is around 45%, for stronger players it goes down. I put the expected win rates for KGS and EGF formulas, along with the observed win rates for EGF tournaments here:

Code:

|           | KGS   | EGF   | EGF   | KGS   | EGF   | EGF   |
|           | exp.  | exp.  | obs.  | exp.  | exp.  | obs.  |
| even game | win % | win % | win % | elo   | elo   | elo   |
|-----------|-------|-------|-------|-------|-------|-------|
| 10k vs 9k | 30.0  | 33.9  | 44.8  | 148   | 116   | 36    |
| 5d vs 6d  | 21.4  | 20.1  | 27.8  | 226   | 232   | 166   |
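The Elo columns follow from the two formulas above: setting e^k equal to 10^(Elo/400) gives Elo = k*400/ln(10), and a win rate p for the weaker player corresponds to an Elo gap of 400*log10((1-p)/p). A small sketch of both conversions (the method names are invented for this example):

```ruby
# Elo points equivalent to one rank under the KGS formula with constant k.
# Hypothetical helper names; neither method comes from KGS or the whr gem.
def kgs_k_to_elo(k)
  k * 400.0 / Math.log(10)
end

# Elo gap implied by the weaker player's win rate p (0 < p < 1).
def win_rate_to_elo(p)
  400.0 * Math.log10((1.0 - p) / p)
end

puts kgs_k_to_elo(0.85).round     # 148
puts kgs_k_to_elo(1.3).round      # 226
puts win_rate_to_elo(0.448).round # 36
puts win_rate_to_elo(0.278).round # 166
```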

You can see quite a discrepancy between the win rates predicted by the EGF formula and those observed. Since ratings are estimated values of random variables, the weaker player's observed win% will usually be higher than the expected win% (errors in the rating estimation tend to create more upsets than expected). Also, these statistics are mostly from McMahon tournaments, which tend to match underrated 10kyus with overrated 9kyus.

Remi do you have any numbers like this for observed KGS games to get numbers for Elo/Rank from them?

Rémi · **#31**

yoyoma wrote:

Remi do you have any numbers like this for observed KGS games to get numbers for Elo/Rank from them?

I did most of my experiments without handicap. If I find time in the days to come, I'll try to take a closer look. But I have been saying this to myself since the WHR paper in 2008, so I am not sure I'll do it soon.

Rémi

pete · **#32**

Yoyoma,

I'm wondering if we have different models in our heads at this point. When you present a probability statement like P = 1 / ( 1 + 10^((RankB-RankA)/400) ) and then go on to say that RankB and RankA are actual kyu/dan ranks, I don't follow.

The model that WHR uses (and Remi, correct me if I misspeak) is P(A wins) = NaturalA/(NaturalA+NaturalB). To convert from Natural ratings to ELO, use the formula (NaturalX * 400.0)/ln(10). WHR primarily works on Natural scaled ratings internally. In my library, I convert the user's input into Natural ratings, and convert output back into ELO.

This produces a "linear" strength scale. Linear in the sense that the probability of a 1000 ELO player beating a 900 ELO player is the same as that of a 200 ELO player beating a 100 ELO player. (see the test_winrates_are_equal_for_same_elo_delta test in the library).
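A quick way to see this property outside the library, as a sketch that assumes the usual base-10 Elo convention (strength = 10^(Elo/400)); `win_probability` is a name made up here and is not the gem's API:

```ruby
# Probability that player A beats player B when both ratings are on the
# Elo scale. Hypothetical helper, written only to illustrate the model.
def win_probability(elo_a, elo_b)
  strength_a = 10.0**(elo_a / 400.0)
  strength_b = 10.0**(elo_b / 400.0)
  strength_a / (strength_a + strength_b)
end

puts win_probability(1000, 900).round(3) # 0.64
puts win_probability(200, 100).round(3)  # 0.64
```

Only the difference in Elo enters the result, because the ratio of the two strengths is 10^((EloA-EloB)/400).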

Historically, Go ranks are tied to handicap stones, and stronger players can use stones more effectively, thus ranks are not an equal distance apart in terms of strength. So it is in the conversion from ELO to ranks (which happens outside of the library, and in GoShrine code), that the strength scale takes on a curve.

Since a handicap stone is a varying amount of ELO, based on the players' strengths, the library supports the use of a callback, which allows the calling code to implement a curve for handicap values as well.
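To give an idea of what such a curve could look like, here is a sketch of a function a caller might supply. Everything in it is an assumption: the method name, its arguments, the 0 to 2000 Elo span and the linear interpolation between 30 and 60 Elo per stone are illustrative only, and the callback signature the library actually expects is not shown in this thread.

```ruby
# Hypothetical handicap curve: the Elo value of a stone grows linearly from
# 30 Elo for the weakest players to 60 Elo for the strongest ones.
# None of these constants or names come from the whr gem or from GoShrine.
WEAK_ELO = 0.0
STRONG_ELO = 2000.0
MIN_STONE_ELO = 30.0
MAX_STONE_ELO = 60.0

def handicap_elo(stones, player_elo)
  # Position of the player on the strength span, clamped to [0, 1].
  t = ((player_elo - WEAK_ELO) / (STRONG_ELO - WEAK_ELO)).clamp(0.0, 1.0)
  stones * (MIN_STONE_ELO + t * (MAX_STONE_ELO - MIN_STONE_ELO))
end

puts handicap_elo(1, 0.0)    # 30.0
puts handicap_elo(2, 1000.0) # 90.0
puts handicap_elo(1, 2000.0) # 60.0
```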

Does this clear matters up? Essentially WHR knows nothing about the curved scale of go ranks and go handicaps, but just does what it's good at, computing estimates of relative strengths on a flat scale.

-Pete

yoyoma · **#33**

Yes we need to be clear what scales we're talking about. What you call Natural I thought was called Gamma.
Natural = ln(Gamma).
Elo = Natural*400/ln(10)
I think these are the same as the definitions given in 2.1 of http://remi.coulom.free.fr/WHR/WHR.pdf (Greek letter gamma = Gamma, lowercase r = Natural, uppercase R = Elo).

Code:

|Elo    |Natural|Gamma  | win%|
|0.00   |0      |1.00   |0.50 |
|30.00  |0.173  |1.19   |0.46 |
|60.00  |0.345  |1.41   |0.41 |
|400.00 |2.303  |10.00  |0.09 |
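The three scales can be converted with a few one-liners, taking the definitions above at face value (Natural = ln(Gamma), Elo = Natural*400/ln(10)). The method names are invented for this sketch; win% here is the win rate of the player who is that many Elo behind.

```ruby
LN10 = Math.log(10)

# Elo -> natural rating r, from Elo = r * 400 / ln(10).
def elo_to_natural(elo)
  elo * LN10 / 400.0
end

# Elo -> gamma, from gamma = e^r = 10^(Elo/400).
def elo_to_gamma(elo)
  10.0**(elo / 400.0)
end

# Win rate of the player who is elo_gap points weaker than the opponent.
def underdog_win_rate(elo_gap)
  1.0 / (1.0 + elo_to_gamma(elo_gap))
end

[0, 30, 60, 400].each do |elo|
  puts format("%6.2f  %.3f  %5.2f  %.3f",
              elo, elo_to_natural(elo), elo_to_gamma(elo), underdog_win_rate(elo))
end
```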

Am I right that the handicap argument for Game::initialize is on the classic Elo scale? I see these bits of code that make me think so:

opponent_elo = bpd.elo + black_advantage # Addition used here, as I expected
rval = 10**(opponent_elo/400.0) # Here is the conversion from Elo to Gamma

When I wrote: "So for kyu players and 1 rank difference: RankB-RankA=1 and k=0.85.", that was for the KGS formula, which uses a Natural scale: P = 1 / ( 1 + e^(k*(RankB-RankA)) ). So for that formula ranks are fixed to always be 1 rank = 1.0 on the Natural scale. And the "k" parameter is used to change expected win rates for dans vs kyus.

So to compare apples to apples I converted from that formula to the classic Elo formula which uses log10 and has the 400 constant in there. I did a similar conversion from EGF GoR's parameter they call "a" (http://www.europeangodatabase.eu/EGD/EG ... system.php).

When you wrote your system used 30-60 Elo per rank, I assumed you meant the classic Elo scale using log10 and the 400 constant, is that right? I added a table for those values:

Code:

|           | KGS   | EGF   | EGF   | KGS   | EGF   | EGF   |
|           | exp.  | exp.  | obs.  | exp.  | exp.  | obs.  |
| even game | win % | win % | win % | elo   | elo   | elo   |
|-----------|-------|-------|-------|-------|-------|-------|
| 10k vs 9k | 30.0  | 33.9  | 44.8  | 148   | 116   | 36    |
| 5d vs 6d  | 21.4  | 20.1  | 27.8  | 226   | 232   | 166   |

30 Elo difference | 45.7% |  (go shrine lower end 1 rank difference)
60 Elo difference | 41.5% |  (go shrine upper end 1 rank difference)

pete · **#34**

yoyoma wrote:

Code:

|           | KGS   | EGF   | EGF   | KGS   | EGF   | EGF   |
|           | exp.  | exp.  | obs.  | exp.  | exp.  | obs.  |
| even game | win % | win % | win % | elo   | elo   | elo   |
|-----------|-------|-------|-------|-------|-------|-------|
| 10k vs 9k | 30.0  | 33.9  | 44.8  | 148   | 116   | 36    |
| 5d vs 6d  | 21.4  | 20.1  | 27.8  | 226   | 232   | 166   |

30 Elo difference | 45.7% |  (go shrine lower end 1 rank difference)
60 Elo difference | 41.5% |  (go shrine upper end 1 rank difference)

Ok, I understand the table now, thanks for being patient. Your assumptions are correct about handicap being in ELO, and that the ELO in my WHR implementation is the same ELO you are talking about. The 30 & 60 elo deltas do indeed give the winrates that you list in the table above.

I'm wondering if you would indulge my curiosity and expand upon your explanation for why the observed values in the table above are at such odds with the expected winrates. "errors in the rating estimation" should create errors in both directions, overestimating, and underestimating, no? And why do McMahon tournaments match underrated 10kyus with overrated 9kyus? Wouldn't they also match overrated 10kyus with underrated 9kyus?

I'm willing to accept that my ELO values might be low, but perhaps existing rating systems are also erring on the high side, as the above tables might suggest.

-Pete

yoyoma · **#35**

pete wrote:

yoyoma wrote:

Code:

|           | KGS   | EGF   | EGF   | KGS   | EGF   | EGF   |
|           | exp.  | exp.  | obs.  | exp.  | exp.  | obs.  |
| even game | win % | win % | win % | elo   | elo   | elo   |
|-----------|-------|-------|-------|-------|-------|-------|
| 10k vs 9k | 30.0  | 33.9  | 44.8  | 148   | 116   | 36    |
| 5d vs 6d  | 21.4  | 20.1  | 27.8  | 226   | 232   | 166   |

30 Elo difference | 45.7% |  (go shrine lower end 1 rank difference)
60 Elo difference | 41.5% |  (go shrine upper end 1 rank difference)

Ok, I understand the table now, thanks for being patient. Your assumptions are correct about handicap being in ELO, and that the ELO in my WHR implementation is the same ELO you are talking about. The 30 & 60 elo deltas do indeed give the winrates that you list in the table above.

I'm wondering if you would indulge my curiosity and expand upon your explanation for why the observed values in the table above are at such odds with the expected winrates. "errors in the rating estimation" should create errors in both directions, overestimating, and underestimating, no? And why do McMahon tournaments match underrated 10kyus with overrated 9kyus? Wouldn't they also match overrated 10kyus with underrated 9kyus?

I'm willing to accept that my ELO values might be low, but perhaps existing rating systems are also erring on the high side, as the above tables might suggest.

-Pete

I probably shouldn't have thrown in the errors in rating estimation part, because I don't know much about it. I read that somewhere but I can't find it. Basically what I understood is that when you have two players who are estimated to be 1500 and 1600, with some normal distribution of what their ratings *really* are... Blah blah lots of math I can't do on my own (hehe), turns out just using the 1500 and 1600 numbers by themselves gives a lower probability of upsets than using the full distributions? Honestly I don't know how that works so maybe someone can explain better, or maybe I'll find where I read it.
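The effect can at least be checked numerically. The sketch below averages the underdog's win probability over a normal distribution of the true rating gap and compares it with the value obtained from the point estimates alone. The 100 Elo standard deviation for the gap is an arbitrary assumption, and both method names are made up for this example.

```ruby
# Underdog's win probability for a given Elo gap (classic base-10 Elo formula).
def underdog_win_prob(gap_elo)
  1.0 / (1.0 + 10.0**(gap_elo / 400.0))
end

# Average of underdog_win_prob when the true gap is normally distributed
# around mean_gap with standard deviation sd_gap (sd_gap must be positive).
# Midpoint rule over plus or minus 6 standard deviations; the normal
# weights are renormalised so the truncation does not bias the result.
def blurred_underdog_win_prob(mean_gap, sd_gap, steps: 2001)
  low = mean_gap - 6.0 * sd_gap
  width = 12.0 * sd_gap / steps
  total = 0.0
  total_weight = 0.0
  steps.times do |i|
    gap = low + (i + 0.5) * width
    z = (gap - mean_gap) / sd_gap
    weight = Math.exp(-0.5 * z * z)
    total += weight * underdog_win_prob(gap)
    total_weight += weight
  end
  total / total_weight
end

point   = underdog_win_prob(100.0)               # 1500 vs 1600, taken at face value
blurred = blurred_underdog_win_prob(100.0, 100.0) # same gap, but uncertain
puts point
puts blurred
```

With these inputs the blurred value comes out above the point value and below 50%, that is, uncertainty about the ratings pulls the predicted win rate toward an even game, which means more upsets than the raw numbers suggest.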

The McMahon one is easier to understand. Take a tournament with two 9ks and two 10ks, and many 30k-11k and 8k+. In round one, the 9ks play each other and the 10ks play each other. In round 2, the 9k loser plays the 10k winner. Typically this will be whichever 9k was most overrated and whichever 10k was most underrated. So in general in McMahon tournaments, underrated players go up and overrated players go down, meeting each other and creating more upsets than expected. How big this effect is I don't know.

daniel_the_smith · **#36**

I don't have anything to contribute but I'm very much enjoying the thread!

Kaya.gs · **#37**

It's a nice discussion.

I think it could be a valuable effort to set up an environment for testing different rating systems. I had planned on doing this on OpenKaya, but I never compiled a set of games to make estimates with.

It can be very fruitful to agree on some systematic testing, so that every time we try out new rating systems, or more specifically tweaks to those systems, we can easily compare them.

Just imagine running tests against different compilations (with handicap, without handicap, with bots, etc.) and getting figures directly like:

Accuracy
Glicko -> 40%
WHR(GoShrine's) -> 47%
WHR(yoyoma's) -> 49%
Tygem's -> ?

Performance
Glicko -> X operations
WHR(GoShrine's) -> Y operations
WHR(yoyoma's) -> Z operations
Tygem's -> ?

and so on.
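As a rough idea of what the accuracy half of such a harness could look like, here is a sketch. `GameRecord`, `prediction_accuracy` and the block interface are all hypothetical; each rating system under test would be plugged in as a block that returns its predicted probability that black wins.

```ruby
# One game result; winner is :black or :white.
# GameRecord and prediction_accuracy are made-up names for this sketch.
GameRecord = Struct.new(:black, :white, :winner)

# Fraction of games in which the side the rating system favoured actually won.
# The block receives (black, white) and must return P(black wins).
def prediction_accuracy(games)
  return 0.0 if games.empty?
  correct = games.count do |game|
    p_black = yield(game.black, game.white)
    predicted = p_black >= 0.5 ? :black : :white
    predicted == game.winner
  end
  correct.to_f / games.size
end

# Example with fixed ratings standing in for a real rating system.
ratings = { "alice" => 1600.0, "bob" => 1500.0 }
games = [
  GameRecord.new("alice", "bob", :black),
  GameRecord.new("bob", "alice", :white),
  GameRecord.new("alice", "bob", :white),
  GameRecord.new("bob", "alice", :white)
]
accuracy = prediction_accuracy(games) do |black, white|
  1.0 / (1.0 + 10.0**((ratings[white] - ratings[black]) / 400.0))
end
puts accuracy # 0.75
```

A fair comparison would have each system predict a game using only the games played before it; that bookkeeping is left out here.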

I'd like to get this rodeo going at some point, although it's not top priority for us now.

Making it an open standard could end up being useful in other places, like chess, or for a novel use like comparing EGF ratings with what different systems produce from the same game results.

hyperpape · **#38**

I wonder: while having real life games is nice, there is the problem that game pairings are influenced by the rating system. Perhaps that's not an issue for reasonable systems, but since some systems (Tygem) are very slow to fix large errors, that could introduce a real distortion.

bakekoq · **#39**

hello.
may I know how to install it?
it can be good for me and my clubs in the future.

pete · **#40**

bakekoq wrote:

hello.
may I know how to install it?
it can be good for me and my clubs in the future.

There are instructions on the linked page. It's a Ruby gem, so you need to be familiar with Ruby and RubyGems first.
