KGS ranking system

hibbs · **#41**

Mef wrote:

Boidhre wrote:

hibbs: Probabilities most likely aren't independent when it comes to a series of go game results for most people given how human psychology makes a difference on the board. If it were bots playing each other you'd be correct and the probabilities would be independent. I'd be very surprised if someone on a winning streak was not more likely to win their next game than someone on a losing streak.

I would take this one step further (especially for games played on the same day). A person on a winning streak is likely well-rested, not distracted, not hungry, etc. (i.e. closer to their peak playing condition, playing at a stronger level even outside of psychology). The same person on a losing streak is more likely tired, nervous, angry/frustrated, thinking more about the problems they had at work that day, and so on (i.e. playing below their average strength). Of course once the streak starts, the psychological feedback loop you mention is probably only going to amplify whatever effect is already being observed.

First of all, the statistical independence is a necessary assumption for the various calculations to be meaningful (As I wrote, otherwise these calculations would not be valid).

The question you bring up here is an interesting one: Is this assumption also true in reality?
Against it, you can argue as you do: undoubtedly psychology plays a role in all kinds of sports, and feeling stronger actually makes you stronger, so you would get a positive feedback loop out of winning streaks.

In favor of it: There are countless examples where people believe in such patterns, and none of them hold up against a critical look at the data. I recently read a nice publication in which all soccer matches since the introduction of the German Bundesliga were analyzed. It turns out that all winning streaks (better: the frequencies of winning or losing streaks) are in perfect agreement with pure chance. A side note: Most people believe that in a game with many goals, it is more likely that the team scores another one, because the team is on a run or is temporarily playing in perfect rapport. If you watch such a game it seems to be totally obvious. However, the frequency of goals scored in a game is in total agreement with the assumption that scoring a goal is an independent random event. (It only deviates from the statistical Poisson distribution for games that end 0:0 or 1:0.) Of course the outcome of soccer games is not entirely random, because there are better teams that have an overall higher baseline frequency of scoring or winning, but that can be properly modeled.
Also in baseball there is the common and intuitive phenomenon that a batter is on a streak. As someone has mentioned earlier in this thread, this is also a myth that was debunked (there is no positive feedback from a winning streak).
More generally: Humans are masters of pattern recognition, even to the point of seeing patterns where there is only randomness (look at the stars, for example). Someone who is not really trained at this will usually see meaningful patterns in random events, or, when he sees something that looks random, totally overlook that it is actually a pattern. Most untrained people underestimate the statistical frequency of winning or losing streaks and believe they see something real. So most often things like "I must have been tired" are in fact a post-hoc rationalization of a random event.

For all these reasons, my initial guess would be that the outcomes of games are indeed independent of previous results.

Who would be right? There is no reason to argue about it; all that is needed is to have a look at the data: Find a few players who have played many rated games at the proper handicap without improving in that period. Check the frequencies of streaks (how often they won 2, 3, 4, ... games in a row) and compare those frequencies with what would be expected statistically. I would be surprised if no one has done this before, but I feel tempted now to have a look if I have some spare time.
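The streak check described above can be sketched in a few lines of Python. This is only a minimal sketch, not anyone's actual analysis: the function names and the sample `results` list are made up for illustration, and it assumes every game is won independently with the same probability `p_win`.

```python
from itertools import groupby


def win_streak_counts(results):
    """Count maximal win streaks by length. `results` holds booleans (True = win)."""
    counts = {}
    for is_win, group in groupby(results):
        if is_win:
            length = sum(1 for _ in group)
            counts[length] = counts.get(length, 0) + 1
    return counts


def expected_streak_count(n_games, p_win, length):
    """Expected number of maximal win streaks of exactly `length` games,
    assuming n_games independent games that are each won with probability p_win."""
    if length > n_games:
        return 0.0
    if length == n_games:
        return p_win ** n_games
    # Streaks touching the first or last game need a loss on one side only.
    edge = 2 * p_win ** length * (1 - p_win)
    # All other streaks need a loss on both sides.
    interior = (n_games - length - 1) * p_win ** length * (1 - p_win) ** 2
    return edge + interior


# Hypothetical game history, purely for illustration.
results = [True, True, False, True, False, False, True, True, True]
observed = win_streak_counts(results)
p_win = sum(results) / len(results)
for length in sorted(observed):
    expected = expected_streak_count(len(results), p_win, length)
    print(length, observed[length], round(expected, 2))
```

With a real game history, the observed and expected counts per streak length could then be compared, for example with a chi-squared test.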

I would like to give one possible explanation why the outcomes could indeed be random in spite of the psychological effects: If you feel stronger because of a random streak, you are likely to play a bit more aggressively and are therefore more likely to make some overplays that occasionally get punished. So feeling stronger may not necessarily correlate with actually playing stronger.

Mef · **#42**

hibbs wrote:

First of all, the statistical independence is a necessary assumption for the various calculations to be meaningful (As I wrote, otherwise these calculations would not be valid).
*Snip the rest of the post for length*

I agree with much of what you say, and also agree that, for the most part, each individual game in a series is essentially a random event. I would still contend that there are non-random factors that can increase the likelihood of a winning or losing streak. Once again, you might have a person who is tired, sick or nervous playing a lot of games in a row on a given day, and they are not at their peak performance (hence have a greater than normal chance of losing their games).

While you point to data from sports about "streakiness" (which is in and of itself true), there is also precedent for situations where non-random streaks do exist due to external conditions. One example would be pitcher performance prior to being placed on the disabled list. During the onset of an injury a pitcher will generally see a drop in their fastball velocity, greater variation in their release point, and other injury-related things, which often manifest themselves in a "bad streak" shortly before they take time to have surgery, recover, etc. These are cases of legitimate "losing streaks" so to speak.

Of course you could say that even in these cases your outcomes are still independent events; it's merely that the expected value has shifted prior to the streak due to an external condition. When you are coming from the perspective of estimating the expected value though, playing especially well on one day and especially poorly on another would look just about identical to the "psychology of streaks" so to speak.

At the end of the day though, I'm with you, I'd prefer to see someone dig into some data and see if there's anything worthwhile there.

Mef · **#43**

Mef wrote:

At the end of the day though, I'm with you, I'd prefer to see someone dig into some data and see if there's anything worthwhile there.

All right...since KGS analytics just spits out a CSV with all the game results....and I ended up having a bit of free time...I made a quick and dirty excel macro that analyzed streaks in game histories. I looked at 3 players who I like to use for KGS statistical data because A: Their ratings are fairly consistent, B: They play a ton of games, and C: They are fairly recognizable KGS personalities, here are my results:

Streak = 3 games

Streak= 4 games

Streak=5 Games

I need to go to sleep now, but later today I'll try to double check my script and make sure there's no glaring errors. Also I may rework it to try and test my "Good days / bad days" theory.

Mike Novack · **#44**

hibbs wrote:

Since all the probabilities are known, the probability for each outcome can be calculated, e.g. the probability for outcome 1 is 0.15 (chance to improve in the workshop) * 0.75^5 (probability to win five games in a row at a 75% average win rate) = 3.5%. Other probabilities can be calculated in a similar way, the probability that the person did not improve in the workshop and got a 5 game win streak is 2.7 %

What is important now: We have observed the outcome “five wins in a row”, which means under the given assumptions it is actually more likely that the person has really improved than not. And even though winning 5 games in a row is a rather common event that happens by chance in 3% of all cases, and even though it is unlikely to improve by attending the workshop, the person may still correctly feel that he should get a promotion. (Everyone: Please do not start a discussion if this should be reflected in a ranking system… Read above disclaimer first)

I think that is perhaps the crux of the disagreement. Possibly related to the usual and customary certainties expected before "publication" in the different sciences. Yes, .56 (3.5/6.2) is greater than .44 (2.7/6.2) but not a whole lot greater. If the system gave promotions to 100 mythical players based upon attending this class and then having a five game winning streak, it would have been correct to do so 56 times and incorrect to do so 44 times. That's a pretty bad "error rate". The calculation might be redone to determine what lengths of streaks would have been necessary to get the error rate down to below 10%, below 5%, etc.
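The redone calculation can be sketched as follows. This is a hedged illustration only: the 15% prior and the 75% win rate after improving come from the quoted example, while the 50% baseline win rate is an assumption inferred from the quoted 2.7% figure (0.85 × 0.5^5 ≈ 2.7%). The function names are made up for this sketch.

```python
def posterior_improved(streak, prior=0.15, p_improved=0.75, p_baseline=0.5):
    """Probability that the player really improved, given `streak` straight wins.

    prior      : chance of improving in the workshop (from the quoted example)
    p_improved : win rate after improving (from the quoted example)
    p_baseline : win rate without improvement (assumed to be 50%)
    """
    improved = prior * p_improved ** streak
    not_improved = (1 - prior) * p_baseline ** streak
    return improved / (improved + not_improved)


def streak_needed(max_error, **kwargs):
    """Shortest win streak for which promoting is wrong at most `max_error` of the time."""
    streak = 1
    while 1 - posterior_improved(streak, **kwargs) > max_error:
        streak += 1
    return streak


print(round(posterior_improved(5), 2))  # roughly the .56 above, before rounding 3.5 and 2.7
print(streak_needed(0.10))
print(streak_needed(0.05))
```

Under these assumptions each additional win multiplies the odds in favor of "improved" by 0.75/0.5 = 1.5, so the required streak length grows only logarithmically as the tolerated error rate shrinks.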

Boidhre · **#45**

hibbs wrote:

First of all, the statistical independence is a necessary assumption for the various calculations to be meaningful (As I wrote, otherwise these calculations would not be valid).

There's plenty of maths out there for dealing with non-independent events statistically. I've forgotten most/all of it since college since I no longer work with it, but assuming non-independent events to be independent just so you can use a linear regression or whatever just gives you misleading results.

skydyr · **#46**

One other thing to consider is that the system picks anchors for the ratings from active and relatively stable players, so if one of them improves suddenly and rapidly, they may warp everyone else's ratings around the anchor instead of changing ranks themselves. As far as I know, there is no way to tell who is an anchor by design, without access to the underlying database.

hibbs · **#47**

Mike Novack wrote:

hibbs wrote:

Since all the probabilities are known, the probability for each outcome can be calculated, e.g. the probability for outcome 1 is 0.15 (chance to improve in the workshop) * 0.75^5 (probability to win five games in a row at a 75% average win rate) = 3.5%. Other probabilities can be calculated in a similar way, the probability that the person did not improve in the workshop and got a 5 game win streak is 2.7 %

What is important now: We have observed the outcome “five wins in a row”, which means under the given assumptions it is actually more likely that the person has really improved than not. And even though winning 5 games in a row is a rather common event that happens by chance in 3% of all cases, and even though it is unlikely to improve by attending the workshop, the person may still correctly feel that he should get a promotion. (Everyone: Please do not start a discussion if this should be reflected in a ranking system… Read above disclaimer first)

I think that is perhaps the crux of the disagreement. Possibly related to the usual and customary certainties expected before "publication" in the different sciences. Yes, .56 (3.5/6.2) is greater than .44 (2.7/6.2) but not a whole lot greater. If the system gave promotions to 100 mythical players based upon attending this class and then having a five game winning streak, it would have been correct to do so 56 times and incorrect to do so 44 times. That's a pretty bad "error rate". The calculation might be redone to determine what lengths of streaks would have been necessary to get the error rate down to below 10%, below 5%, etc.

I think you got this one wrong, but first I want to re-iterate two earlier statements:
The crux of the disagreement (or at least what I consider a flaw in the KGS rating system) is that the number of games it takes to be promoted depends on how frequently you play. It is not a priori clear why this should be the case, and it may lead to a situation where someone has a winning streak beyond reasonable statistical doubt and still does not get the promotion (at least not immediately). WMS himself has stated that the KGS ranking system does not account well for sudden or fast improvements in strength. What I consider a flaw here is apparently the price to pay for an otherwise probabilistically correct system.
I have also stated that a ranking system cannot account for things like someone attending a class, so it must not. The rest of the discussion is entirely hypothetical.

Now, if you think an error rate of about 50%, as in the example, is too high: that is the way it is. Bayesian inference does not help us calculate the number of wins in a row needed to get a smaller error rate. The question that is answered is this: after the observation "Attended a class and played a streak of 5 won games in a row", what is the probability that the observation is caused by a real improvement? This depends on the prior probability of improving by attending the class. So what would be the probability that there was a real improvement after attending a cooking class? Zero. But the better the class actually is, the more likely an observation of 5 won games in a row points towards a real improvement of the player in question.

jts · **#48**

It's very reasonable. The more data you have attesting a statistic, the more confident you can be that the observed statistic is close to the actual statistic. The more confident you are, the more data you need to readjust your conclusions. Easy peasy.

hibbs · **#49**

skydyr wrote:

One other thing to consider is that the system picks anchors for the ratings from active and relatively stable players, so if one of them improves suddenly and rapidly, they may warp everyone else's ratings around the anchor instead of changing ranks themselves. As far as I know, there is no way to tell who is an anchor by design, without access to the underlying database.

Whether someone is an anchor or not should make no difference. The system first calculates all ranks independently of whether someone is an anchor or not. After that, it shifts all ranks so that the average difference between the calculated ranks of the anchors and their "anchored ranks" becomes minimal. That means all rankings are affected by the anchor system to the same extent.
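The shifting step described above can be illustrated with a toy example. This is a sketch of the description in this post only, not of the actual KGS implementation: the function name, the player names and the numbers are hypothetical, and "minimal average difference" is interpreted here as making the mean offset of the anchors zero.

```python
def shift_to_anchors(calculated, anchored):
    """Shift every calculated rating by the same amount so that, on average,
    the anchors land on their anchored ratings.

    calculated : dict of player name -> rating computed from game results
    anchored   : dict of anchor name -> rating the anchor is pinned to
    """
    offsets = [anchored[name] - calculated[name] for name in anchored]
    shift = sum(offsets) / len(offsets)
    return {name: rating + shift for name, rating in calculated.items()}


# Hypothetical ratings on an arbitrary scale.
calculated = {"anchor_a": 1.0, "anchor_b": 2.0, "player_c": 3.0}
anchored = {"anchor_a": 1.5, "anchor_b": 2.5}
print(shift_to_anchors(calculated, anchored))
```

Because the same shift is applied to every player, anchors and non-anchors move by the same amount, which is the point being made above.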

hibbs · **#50**

Boidhre wrote:

hibbs wrote:

First of all, the statistical independence is a necessary assumption for the various calculations to be meaningful (As I wrote, otherwise these calculations would not be valid).

There's plenty of maths out there for dealing with non-independent events statistically. I've forgotten most/all of it since college since I no longer work with it, but assuming non-independent events to be independent just so you can use a linear regression or whatever just gives you misleading results.

That is right. In this case the assumption of statistical independence should nevertheless be the first one to consider. It should only be changed if the observed behavior is not consistent with it. Since mef has embarked on figuring this out, we should just wait...

Mike Novack · **#51**

I think I see the problem. Confusion about the observer.

"The question that is answered is that after the observation "Attended a class and played a streak of 5 won games in a row": What is the probability that the observation is caused by a real improvement? This depends on the prior probability of improving by attending the class. So what would be the probability that there was a real improvement after attending a cooking class? Zero. But the better the class actually is, the more likely an observation of 5 won games in a row points towards a real improvement of the player in question."

No, the question was when the "observer" (the rating system) had reason to conclude an improvement had taken place based solely upon observation of a streak of wins of size M. The "observer" in this case has no knowledge of any class the player may or may not have taken immediately before this streak, let alone whether such a class was a class on go or a class on cooking. Another observer (the player who took the class) has additional information and therefore reaches a different conclusion.

That is what I meant by "subjective". The player who thinks the streak plenty long enough may have confused his or her judgement of the probability with that of the rating system. There is nothing wrong with using Bayesian inference here as long as you are looking from the point of view of the correct observer.

ez4u · **#52**

Mef wrote:

At the end of the day though, I'm with you, I'd prefer to see someone dig into some data and see if there's anything worthwhile there.

All right...since KGS analytics just spits out a CSV with all the game results....and I ended up having a bit of free time...I made a quick and dirty excel macro that analyzed streaks in game histories. I looked at 3 players who I like to use for KGS statistical data because A: Their ratings are fairly consistent, B: They play a ton of games, and C: They are fairly recognizable KGS personalities, here are my results:

Streak = 3 games

Streak= 4 games

Streak=5 Games

I need to go to sleep now, but later today I'll try to double check my script and make sure there's no glaring errors. Also I may rework it to try and test my "Good days / bad days" theory.

Bored with waiting for mef to wake up from his nap, I started to fool with this stuff too. IANAS (I am not a statistician) so my method was to try to parrot the statistical work of Gilovich, Vallone, and Tversky in The Hot Hand in Basketball: On the Misperception of Random Sequences. I used the downloaded records of twoeye, sum, thecaptain (as described by mef earlier), and our own speedchase. Since I downloaded the csv files on a different date than mef, my numbers are slightly different than his.

In the paper they begin by looking at the question, "Do players hit a higher percentage of their shots after having just made their last shot (or last several shots), than after having just missed their last shot (or last several shots)?" Of course for Go this translates to the question - Do players win a higher percentage of their games after having just won their last game (or last several games), than after having just lost their last game (or last several games)?

For the most basic answer to this question I constructed table 1 below. Here we see the
* total games,
* total wins,
* total losses, and
* average winning/losing percentages in the upper section.

Under that we have the summary figures from a more detailed analysis of streaks to be described later. This gives us the number of:

Streak extending:
* wins that occurred following a win,
* losses that occurred following a loss,

Streak ending:
* wins that occurred following a loss, and
* losses that occurred following a win
together with their applicable winning/losing percentages.

We can clearly see that for all four players the winning percentage following a win and the losing percentage following a loss were higher than the average winning and losing percentages. Meanwhile the winning percentages following a loss and the losing percentages following a win were lower than the average winning and losing percentages. (Note that percentages below the average are highlighted in red for easier reading.)

So the simple answer to the question above is "YES". Unlike professional basketball players, our sample KGS'ers do seem to have hot hands!
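The comparison behind table 1 can be sketched roughly like this. It is only an illustration under assumptions: the function name and the sample history are hypothetical, and the input is assumed to be a chronological list of booleans (True = win).

```python
def conditional_win_rates(results):
    """Overall win rate plus the win rate right after a win and right after a loss."""
    wins_after_win = games_after_win = 0
    wins_after_loss = games_after_loss = 0
    for previous, current in zip(results, results[1:]):
        if previous:
            games_after_win += 1
            wins_after_win += current
        else:
            games_after_loss += 1
            wins_after_loss += current
    return {
        "overall": sum(results) / len(results),
        "after_win": wins_after_win / games_after_win if games_after_win else None,
        "after_loss": wins_after_loss / games_after_loss if games_after_loss else None,
    }


# Hypothetical history, purely for illustration.
history = [True, True, False, True, False, False, True]
print(conditional_win_rates(history))
```

A "hot hand" would show up as an `after_win` rate above `overall` and an `after_loss` rate below it, which is the pattern described for the four players above.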

Next in the paper they used the Wald-Wolfowitz runs test to check whether the number of runs observed was consistent with a random distribution of hits or misses (wins or losses for us). Here a "run" means a series of one or more wins or losses. What we call winning "streaks" are unexpectedly long runs of wins. The more "streaky" our data, the fewer runs we will observe as the players continue to win or lose longer than expected before losing or winning and thereby starting a new run.

The WW runs test calculates an expected number of runs and standard deviation from the number of wins and losses that we actually observe. It then calculates the difference between the actual runs observed in our data and the expected number. The difference is expressed as a Z statistic (i.e. a measure expressed as a number of standard deviations). With a Z table (downloaded from the internet in my case) we can find the probability that the observed runs were produced by a random process. The result for our four players is shown in table 2 below.

Here we see that the observed number of runs for twoeye, sum, and thecaptain are all quite far away (in standard deviations) from the expected figure. We can reject the idea that they are randomly produced at a quite high confidence level (99.99%). In the case of speedchase we can not reject the idea that the number of runs we see is simply a random fluctuation in the data with much confidence (72%).
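For readers who want to try this on their own records, here is a minimal sketch of the runs test using the standard large-sample formulas for the expected number of runs and its variance. The function name and the sample data are hypothetical, and the two-sided p-value from the normal approximation stands in for the Z table lookup; it may differ from how the figures in table 2 were derived.

```python
import math
from itertools import groupby


def runs_test(results):
    """Wald-Wolfowitz runs test on a chronological list of booleans (True = win).

    Returns (observed runs, expected runs, z, two-sided p-value)."""
    n_wins = sum(1 for r in results if r)
    n_losses = len(results) - n_wins
    if n_wins == 0 or n_losses == 0:
        raise ValueError("need at least one win and one loss")
    n = n_wins + n_losses
    runs = sum(1 for _ in groupby(results))  # each group is one run
    expected = 2 * n_wins * n_losses / n + 1
    variance = (2 * n_wins * n_losses * (2 * n_wins * n_losses - n)) / (n ** 2 * (n - 1))
    z = (runs - expected) / math.sqrt(variance)
    # Normal approximation: chance of a deviation at least this large.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return runs, expected, z, p_two_sided


# Hypothetical, very streaky history: five wins followed by five losses.
print(runs_test([True] * 5 + [False] * 5))
```

A strongly negative z means fewer runs than expected, i.e. longer streaks than a random sequence with the same wins and losses would typically show.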

Finally in the paper they create a test for non-stationarity or the idea that players temporarily become hot or cold with an elevated or depressed winning percentage over short periods of time. For basketball players they cut their shooting records into four-shot intervals, totaled the number of hits in each "set" of shots and looked for unusually high numbers of "high performance" and "low performance" sets. I did the same for collections of four-game sets for each of our players. As in the basketball paper, I repeated the set-building process three more times, stepping one game forward in the overall player history each time. This gave four related but different sets of data. The resulting numbers were tested against the expected numbers from a random process with the same overall winning rate using the chi squared test. The results are shown in table 3 below.

Here we can see that twoeye again is an outlier. Unlike the basketball players, his data seems to strongly indicate that his wins are not produced by a single random process. In other words he plays in streaks. Our other two big guns, sum and thecaptain, are less clear in this regard. Some of their data sets produce low probability measures, like twoeye, but others are higher. Finally, our man speedchase puts up results that fit a random process fairly well.
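The four-game-set test can be sketched as below. This is a hedged illustration, not the exact procedure behind table 3: the function name and data are hypothetical, the sketch keeps one category per possible number of wins in a set (the paper may group the sets differently), and it only returns the chi-squared statistic, which would still need to be looked up against a chi-squared distribution with the appropriate degrees of freedom.

```python
from math import comb


def block_chi_squared(results, block=4, offset=0):
    """Compare wins per `block`-game set with a binomial model.

    results : chronological list of 1/0 (or True/False) game outcomes
    offset  : number of games skipped at the start, to build the shifted sets
    Returns (observed set counts, expected set counts, chi-squared statistic)."""
    p_win = sum(results) / len(results)  # overall winning rate
    usable = results[offset:]
    n_blocks = len(usable) // block
    observed = [0] * (block + 1)
    for i in range(n_blocks):
        wins = sum(usable[i * block:(i + 1) * block])
        observed[wins] += 1
    expected = [
        n_blocks * comb(block, k) * p_win ** k * (1 - p_win) ** (block - k)
        for k in range(block + 1)
    ]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return observed, expected, chi2


# Hypothetical, extremely streaky history: four wins, four losses, repeated.
history = [1, 1, 1, 1, 0, 0, 0, 0] * 10
for offset in range(4):
    print(offset, block_chi_squared(history, offset=offset)[0])
```

Calling it with `offset` 0 to 3 gives the four related but different sets of data described above.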

Overall, all of this may be nonsense because of errors stemming from my ignorance. Hopefully our more erudite posters will point such out if they see them. Otherwise I would say there are pretty strong indications here that streaks happen more often than expected on KGS for at least some of the players.

daal · **#53**

ez4u wrote:

IANAS (I am not a statistician)

Missed your calling? :cool:

pwaldron · **#54**

ez4u wrote:

Here we can see that twoeye again is an outlier. Unlike the basketball players, his data seems to strongly indicate that his wins are not produced by a single random process. In other words he plays in streaks.

Nice work!

The only thing that comes to mind is the analysis (or at least the null hypothesis) assumes that game results are independent and identically distributed. Of course it isn't the case--my chance of winning vs. a 3-dan is much higher than of winning against a 5-dan. A player who plays a streak of 3-dans would be expected to score a streak of wins also.

hibbs · **#55**

pwaldron wrote:

Nice work!

I totally agree.

pwaldron wrote:

The only thing that comes to mind is the analysis (or at least the null hypothesis) assumes that game results are independent and identically distributed. Of course it isn't the case--my chance of winning vs. a 3-dan is much higher than of winning against a 5-dan. A player who plays a streak of 3-dans would be expected to score a streak of wins also.

If the games are played with a proper handicap, then this should not matter. One of the assumptions of the KGS rating system is that a proper handicap brings the win ratio on average to around 50%. (At least within the accuracy of the system; if a strong 3D plays a weak 4D things would be different, of course.)

Probably it would be a good idea to open a new thread for this line of discussion?

speedchase · **#56**

I just had an idea: what if the handicap (default, and for automatch) were calculated using the difference in rating instead of the difference in rank?

wms · **#57**

speedchase wrote:

I just had an idea: what if the handicap (default, and for automatch) were calculated using the difference in rating instead of the difference in rank?

This is more a matter of preference than accuracy. I thought it would be annoying if a 5k played a 6k, and might get anything from an even game up through h-2. As it stands, if you see your rank and somebody else's, you know exactly what the default handicap/komi will be.

yoyoma · **#58**

hibbs wrote:

pwaldron wrote:

Nice work!

I totally agree.

pwaldron wrote:

The only thing that comes to mind is the analysis (or at least the null hypothesis) assumes that game results are independent and identically distributed. Of course it isn't the case--my chance of winning vs. a 3-dan is much higher than of winning against a 5-dan. A player who plays a streak of 3-dans would be expected to score a streak of wins also.

If the games are played with a proper handicap, then this should not matter. One of the assumptions of the KGS rating system is that a proper handicap brings the win ratio on average to around 50%. (At least within the accuracy of the system; if a strong 3D plays a weak 4D things would be different, of course.)

Probably it would be a good idea to open a new thread for this line of discussion?

According to KGS rating math:
A middle 4-dan is expected to beat a very weak 3-dan 79% of the time (taking white, no komi)
A middle 4-dan is expected to beat a very strong 5-dan 21% of the time (taking black, no komi)

In both cases their ratings are 1.5 stones apart, and the handicap only corrects for 0.5 stones, leaving a 1.0 stone difference. That 1.0 stone difference translates into a 79:21 win ratio.
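The 79:21 figure can be reproduced with a logistic win-probability curve. The sketch below assumes the form P = 1 / (1 + e^(-k·d)), where d is the rating difference in stones left over after the handicap, with k = 1.30 for dan-level players; both the formula and the constant are recalled from the KGS rating math description rather than stated in this thread, so treat them as assumptions. The function name is made up.

```python
import math


def expected_win_rate(stone_advantage, k=1.30):
    """Win probability for the player who is `stone_advantage` stones stronger
    after the handicap has been applied (negative if that player is weaker).

    k = 1.30 is the assumed steepness for dan-level players."""
    return 1 / (1 + math.exp(-k * stone_advantage))


# Ratings 1.5 stones apart, handicap corrects 0.5 stones, 1.0 stone remains.
print(round(expected_win_rate(1.5 - 0.5), 2))     # middle 4-dan vs very weak 3-dan
print(round(expected_win_rate(-(1.5 - 0.5)), 2))  # middle 4-dan vs very strong 5-dan
```

With these assumptions a leftover difference of one full stone is already enough to move an even game to roughly four wins in five.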

So winning streaks could be the 4-dan rematching the weak 3-dan. Losing streaks could be the 4-dan rematching the strong 5-dan.

ETA: Also ez4u, you just took win/loss data from sum's history? Going how far back? His rank has changed from 5d to 4d 3 months ago. Throwing games played as 5d and games played as 4d together will confuse things a lot I think.

speedchase · **#59**

wms wrote:

This is more a matter of preference than accuracy. I thought it would be annoying if a 5k played a 6k, and might get anything from an even game up through h-2. As it stands, if you see your rank and somebody else's, you know exactly what the default handicap/komi will be.

Perhaps, but under the current system, if you are at a given rank and about to rank up, almost all of your matches (>90%) will favor you to win, so it will be difficult for you to "prove yourself" and ultimately increase your rank. Using my idea, you always have a 50% chance of being favored to win, which would remove the stickiness associated with passing through ranks.
Ultimately, I suppose everything is a matter of preference, but I would "prefer" the ranking system to be less sticky, rather than being able to guess something that the system will tell me anyway.

ez4u · **#60**

yoyoma wrote:

...

ETA: Also ez4u, you just took win/loss data from sum's history? Going how far back? His rank has changed from 5d to 4d 3 months ago. Throwing games played as 5d and games played as 4d together will confuse things a lot I think.

I think this is a good point, but I don't know how good. :scratch:

The table below shows the winning record of the three bigs broken down by their rank at the time the game was played (our speedster lies on a completely different scale and so does not make it into this graph). Each of the three has two different ranks making up significant portions of their records, with clearly different winning percentages. This alone should force the issue of non-stationarity into the statistics if I understand that concept correctly.

BTW, I downloaded and used the complete kgs history of each player. They appeared on kgs (in their current username anyway):
* thecaptain 2002-09-26
* sum 2004-03-25
* twoeye 2004-05-12
