LZ's progression

For discussing go computing, software announcements, etc.
ez4u
Oza
Posts: 2414
Joined: Wed Feb 23, 2011 10:15 pm
Rank: Jp 6 dan
GD Posts: 0
KGS: ez4u
Location: Tokyo, Japan
Has thanked: 2351 times
Been thanked: 1332 times

Re: LZ's progression

Post by ez4u »

The test matches are only 400 games long. As a result, they still reflect a good deal of luck. The project appears to use a threshold win rate of 55% in selecting the next “best network”. That 55% does not represent a reliable measurement of the difference in strength. It simply signals that ‘probably’ the new “best” is stronger than the old one. There are undoubtedly any number of candidates that have a sub-50% win rate in a 400-game match that would be over 50% in a 10,000-game match. However, that 55% threshold has worked to support an automated process of developing stronger and stronger networks.

If you think about it (I didn’t until now), the graph showing the strength progression is just cute PR. To check the real change in strength, we would need to go back and test across different ranges in the progression of best nets. However, we are much more interested in LZ's ability to beat humans or other AIs, so why waste the time?
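ez4u's point about luck at the 55% threshold can be checked with an exact binomial calculation. This is an illustrative sketch, not part of the LZ pipeline; the true winrates fed in are hypothetical:

```python
from math import comb

def pass_probability(p_true, games=400, threshold=0.55):
    """Probability that a net with true winrate p_true scores at or above
    the promotion threshold in a fixed-length match (exact binomial tail)."""
    need = int(games * threshold)  # 220 wins out of 400
    return sum(comb(games, k) * p_true**k * (1 - p_true)**(games - k)
               for k in range(need, games + 1))

for p in (0.50, 0.52, 0.55, 0.58):
    print(f"true winrate {p:.0%}: passes 55%-of-400 with prob {pass_probability(p):.1%}")
```

A net that is genuinely no stronger passes only a few percent of the time, but a net that is only very slightly stronger (say 52%) already passes fairly often, which is ez4u's point: 55% of 400 is a promotion signal, not a strength measurement.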
Dave Sigaty
"Short-lived are both the praiser and the praised, and rememberer and the remembered..."
- Marcus Aurelius; Meditations, VIII 21
Vargo
Lives in gote
Posts: 337
Joined: Sat Aug 17, 2013 5:28 am
GD Posts: 0
Has thanked: 22 times
Been thanked: 97 times

Re: LZ's progression

Post by Vargo »

Bill Spight wrote:
Using odds, (3/2) (3/2) = 9/4. :)
A is weaker than B
C is weaker than B by the exact same ratio,
A and C must be the same strength.

But you're right, the mathematical model can probably not be a perfect fit here.
moha wrote:
This assumes that the observed winrates equal to their theoretical values (without sampling errors), and also that there are no distorting factors (like various correlations).

Both assumptions seem wrong here, the first one in particular. Consider the extreme: a program has a bug and plays randomly with all networks. You would still see a climbing Elo graph (in a few percent of matches one side would get above the promotion threshold by pure luck), but the latest net would not do well against the first one.
Sampling errors on too small a sample (100 games) are most likely, but they should go both ways, and I would have hoped that they more or less cancel each other. Your bugged program is a good example, but the progression would be so slow (probably logarithmic) as to be almost nonexistent.
ez4u wrote:
To check the real change in strength, we would need to go back and test across different ranges in the progression of best nets.
I'll do that !

For the rest, you're right, and my much too small sample must be the principal cause.
But you're definitely WRONG ;-) about the waste of time: I find it fascinating to pit different programs or different networks against each other.


And also, I like the feeling of being part of the LZ experiment by using autogtp and contributing to better networks.
Uberdude
Judan
Posts: 6727
Joined: Thu Nov 24, 2011 11:35 am
Rank: UK 4 dan
GD Posts: 0
KGS: Uberdude 4d
OGS: Uberdude 7d
Location: Cambridge, UK
Has thanked: 436 times
Been thanked: 3718 times

Re: LZ's progression

Post by Uberdude »

Something I saw raised on the Leela Zero github pages was whether a broader test than ">55% win vs last best network" would be better (such as a league against several previous versions). One could imagine three versions of Leela that form a cycle of A beats B, B beats C, C beats A, each by >55% (e.g. maybe B sucks at ladders and A doesn't, so A can often win; but B is good at semeai and C sucks; and C has the best positional judgement and manages to make that count against A). If the training happens to select these in order then the self-play Elo will keep going up and up when really it's not getting stronger, just going round in circles.
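Uberdude's cycle is easy to simulate. The 58% pairwise winrates below are made up; the point is only that a gating Elo accumulated from successive promotion matches keeps climbing while the pool of "best" networks just cycles:

```python
import math

# Hypothetical pairwise winrates for three fixed styles that form a cycle:
# each "next" style beats the incumbent 58% of the time head-to-head.
beats = {("B", "A"): 0.58, ("C", "B"): 0.58, ("A", "C"): 0.58}

def elo_gain(winrate):
    """Elo difference implied by a head-to-head winrate (logistic model)."""
    return 400 * math.log10(winrate / (1 - winrate))

incumbent, elo = "A", 0.0
history = [(incumbent, elo)]
for _ in range(9):  # nine promotions = three full laps around the cycle
    challenger = next(c for (c, i) in beats if i == incumbent)
    elo += elo_gain(beats[(challenger, incumbent)])
    incumbent = challenger
    history.append((incumbent, elo))

print(history)  # the Elo column climbs roughly 56 points per promotion,
                # yet every third promotion the "new best" is the same net again
```

The self-play Elo graph only measures winrate against the previous incumbent, so it cannot distinguish this cycle from genuine progress.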
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: LZ's progression

Post by Bill Spight »

Vargo wrote:
Bill Spight wrote:
Using odds, (3/2) (3/2) = 9/4. :)
A is weaker than B
C is weaker than B by the exact same ratio,
A and C must be the same strength.

But you're right, the mathematical model can probably not be a perfect fit here.
Let's back up. :)
Vargo wrote:Andrew would win 69.23% of his games against Charlie, I think.
69.23% ≅ 9/13 , so the win/loss odds are 9/4.
60% = ⅗, so the win/loss odds are 3/2.

Andrew beats Bob 60% of the time, with win/loss odds of 3/2; Bob beats Charlie 60% of the time, with win/loss odds of 3/2. Assuming transitivity and no error, Andrew beats Charlie with win/loss odds of (3/2) (3/2) = 9/4, or 9/13 of the time.

In terms of the log of the odds, log(3/2) + log(3/2) = log(9/4). :)
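Bill's odds arithmetic in a few lines, for anyone who wants to play with other winrates:

```python
def to_odds(p):
    """Winrate -> win/loss odds."""
    return p / (1 - p)

def to_winrate(odds):
    """Win/loss odds -> winrate."""
    return odds / (1 + odds)

odds_ab = to_odds(0.60)        # 3/2: Andrew vs Bob
odds_bc = to_odds(0.60)        # 3/2: Bob vs Charlie
odds_ac = odds_ab * odds_bc    # 9/4, assuming perfect transitivity
print(to_winrate(odds_ac))     # 9/13 ≈ 0.6923
```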
A is weaker than B
C is weaker than B by the exact same ratio,
A and C must be the same strength.
In that case, using odds, the odds that A beats B are p/q and the odds that B beats C are q/p; assuming transitivity and no error, the odds that A beats C are (p/q) (q/p) = 1. Or log(p/q) + log(q/p) = log(p/q) - log(p/q) = 0. (Obviously, if A always loses to B and C always loses to B, A and C do not have to be the same strength. ;))
Bill Spight wrote: However, in a multi-skill game like go, I would expect the odds to be less than 9/4.
Vargo wrote:But you're right, the mathematical model can probably not be a perfect fit here.
It is not clear to me that you get my point. Lack of transitivity is a well known phenomenon where Andrew usually beats Bob, Bob usually beats Charlie, and Charlie usually beats Andrew. This lack of transitivity is not just a question of errors. Each player has a number of go skills, at different strengths. Thus a comparison of their strength at go is multi-dimensional, even though any one on one comparison reduces to a win/loss ratio. The win/loss ratio does not tell the whole story. That means that we cannot derive the win/loss ratio of Andrew vs. Charlie from the win/loss ratios of Andrew vs. Bob and the win/loss ratio of Bob vs. Charlie. Fortunately, however, in both chess and go transitivity approximately holds. I have never heard of a case where, except perhaps for short periods of time, Andrew can give two stones to Bob, who can give two stones to Charlie, who can give two stones to Andrew.

It is not just that the model, which assumes transitivity, is not a perfect fit, we are lucky that it is a fit at all. ;) My point is that the model will overestimate the win/loss ratio of A vs. C, as calculated from the win/loss ratios of A vs. B and B vs. C, when each of the ratios is greater than 1. The reason has to do with multi-dimensionality, and is similar to the phenomenon of regression to the mean.

Darwin's cousin Galton discovered that tall fathers had tall sons, on average, but the sons were not as tall, on average, as the fathers of a particular height. At first he thought that he had discovered a law of evolution, whereby the height of the sons approached average height over time. But it is actually a phenomenon of reducing a two-dimensional plot of father-son heights to one dimension, a line of regression. That becomes obvious when you notice that it works the other way. Given sons of a particular height, the fathers are not as tall as the sons, on average.

In such a case you cannot predict the height of the grandsons from the difference in average height of the sons, given the height of the fathers. It is not like the sons are on average 1" shorter, so the grandsons are 2" shorter. Ceteris paribus, the grandsons are probably only 1" shorter, as well.
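Bill's Galton example can be reproduced with a small simulation. The correlation and height figures below are invented for illustration; the point is only that regression to the mean runs in both directions:

```python
import random

random.seed(0)
r, mean, sd = 0.5, 70.0, 2.0   # hypothetical father-son height correlation, inches

pairs = []
for _ in range(200_000):
    father = random.gauss(mean, sd)
    # the son shares fraction r of the father's deviation, plus independent noise
    son = mean + r * (father - mean) + random.gauss(0, sd * (1 - r * r) ** 0.5)
    pairs.append((father, son))

sons_of_tall = [s for f, s in pairs if f >= 72]     # sons of tall (72+) fathers
fathers_of_tall = [f for f, s in pairs if s >= 72]  # fathers of tall (72+) sons
avg_son = sum(sons_of_tall) / len(sons_of_tall)
avg_father = sum(fathers_of_tall) / len(fathers_of_tall)
print(f"sons of tall fathers average {avg_son:.2f} in.")
print(f"fathers of tall sons average {avg_father:.2f} in.")
```

Both averages fall between the population mean and the selected group's height: sons of tall fathers regress toward the mean, and so do fathers of tall sons, which is why it is a projection effect and not a law of evolution.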

Anyway, multi-dimensionality not only destroys (perfect) transitivity, it tends to do so in one direction. moha gives the example of pure drift, where successive winners in the contests are not actually better than the losers. This phenomenon does not depend upon the number of games played. ez4u points out that the assumption of progress needs to be checked out by play against more than the previous winner. Ideally you would play against all previous winners, but playing against the previous 3 or 4 is probably good enough.

Drift is a real concern, especially with self-play, where the players have similar strengths in all dimensions. You can see drift with hill-climbing, near the top of the hill. Randomness may be enough to stall progress, so that successive winners are no closer to the hilltop.
Last edited by Bill Spight on Wed May 16, 2018 3:50 am, edited 1 time in total.
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: LZ's progression

Post by Bill Spight »

Uberdude wrote:Something I saw raised on the Leela Zero github pages was whether a broader test than ">55% win vs last best network" would be better (such as a league against several previous versions). One could imagine three versions of Leela that form a cycle of A beats B, B beats C, C beats A, each by >55% (e.g. maybe B sucks at ladders and A doesn't, so A can often win; but B is good at semeai and C sucks; and C has the best positional judgement and manages to make that count against A). If the training happens to select these in order then the self-play Elo will keep going up and up when really it's not getting stronger, just going round in circles.
An excellent example of drift. :)

I would go further, and, since we are talking go, not just require the candidate winner to beat the previous version, but to beat each previous version by a greater margin. The reason is that the winner can play worse in some aspects than the loser, if it plays better in other aspects. Thus, skills can be lost in succeeding winners. Let's say that winning by 55% is roughly equivalent to taking White with no komi to produce an even game.

Then say that we have previous winners, A - E in alphabetical order. Suppose candidate F beats E 55% of the time. Now we have F take White vs. E with no komi. If F wins approximately 50% of the time, then let F play vs. D, taking White and giving (reverse) komi. Let's say that again, F wins about 50% of the time. Then let F give 2 stones to C. Now, surprise(!), C wins more than 50% of the time. That may well mean that C is stronger than D and E in some regard, to the detriment of F. At this point we train F against C until we get a version, F', that plays even with C at 2 stones. Now we go back and play F' versus E, taking White. Etc., etc.
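A sketch of the ladder Bill describes. Everything here is hypothetical: `winrate` stands in for an actual engine match (simulated with a toy logistic model), strengths are numbers on a made-up scale, and one handicap step is modeled as a fixed strength offset:

```python
import random

def winrate(candidate, reference, handicap=0, games=400):
    """Stub for a fixed-length match; returns the candidate's winrate.
    Simulated here; in practice this would call the actual engines."""
    edge = candidate - reference - 0.055 * handicap  # hypothetical strength scale
    p = 1 / (1 + 10 ** (-edge * 10))                 # toy logistic win model
    return sum(random.random() < p for _ in range(games)) / games

def passes_ladder(candidate, previous_bests):
    """Bill's proposal, sketched: the candidate must beat each earlier best
    by a growing margin, approximated here as a growing handicap.
    previous_bests is ordered oldest-first."""
    for age, old in enumerate(reversed(previous_bests)):   # newest first
        if winrate(candidate, old, handicap=age) < 0.5:
            return False                                   # go train vs `old` some more
    return True

random.seed(1)
bests = [0.0, 0.1, 0.2, 0.3]   # hypothetical strengths of past winners A..D
print(passes_ladder(0.36, bests))
print(passes_ladder(0.05, bests))
```

A candidate that has genuinely accumulated strength clears every rung; one that merely edged past the latest incumbent while losing older skills fails somewhere down the ladder, which is exactly the regression the scheme is meant to catch.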
Vargo
Lives in gote
Posts: 337
Joined: Sat Aug 17, 2013 5:28 am
GD Posts: 0
Has thanked: 22 times
Been thanked: 97 times

Re: LZ's progression

Post by Vargo »

For Bill Spight.
Sorry, for the 9/4 I thought you were talking about
A wins 50% against B, who wins 50% against C (--> A wins 50% against C)
or A wins 1 game out of 3 against B, who wins 2 games out of 3 against C (A wins 50% against C)
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: LZ's progression

Post by moha »

Vargo wrote:Sampling errors on too small a sample (100 games) are most likely, but they should go both ways, and I would have hoped that they more or less cancel each other. Your bugged program is a good example, but the progression would be so slow (probably logarithmic) as to be almost nonexistent.
Quite the contrary, I think: it will be linear (since promotion probability does not decrease over time, as would normally be the case), just less steep. In fact, such linearity of the Elo graph can be an indication of problems in the learning process. The standard deviation over 400 games is 10 games (2.5%), so 55% is not completely out of reach of luck (esp. with early SPRT stops).

Even for nets that are truly stronger, luck will usually play a part, as that is still an easy route to promotion. Most networks promoted with >55% observed winrates will in fact be around 52% or so in true strength.
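moha's last claim can be illustrated by simulating the promotion filter. The prior on candidate strength below (normal around 51% with sd 2%) is invented, and the promoted average depends entirely on it, so treat the number as qualitative: selection on a noisy 400-game match promotes nets whose true strength is well below their observed 55%:

```python
import random

random.seed(2)
GAMES, THRESHOLD = 400, 0.55

promoted_true_rates = []
for _ in range(20_000):
    # Hypothetical prior: a freshly trained candidate's true winrate vs the
    # incumbent is close to 50%, with a slight upward drift from training.
    p_true = random.gauss(0.51, 0.02)
    wins = sum(random.random() < p_true for _ in range(GAMES))
    if wins / GAMES >= THRESHOLD:
        promoted_true_rates.append(p_true)

avg = sum(promoted_true_rates) / len(promoted_true_rates)
print(f"promoted: {len(promoted_true_rates)} nets, mean true winrate {avg:.3f}")
```

The promoted nets' average true winrate sits between 51% and 55%: stronger than the incumbent on average, but noticeably weaker than the 55% their match results suggested.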
Bill Spight wrote:I would go further, and, since we are talking go, not just require the candidate winner to beat the previous version, but to beat each previous version by a greater margin.
But don't forget we don't just randomly pick nets here. Each one is the same as the last promoted net, after showing it a few more training samples. Each net is expected to be slightly stronger, even if this does not manifest in the test match. The test is there to prevent catastrophic regression, but as AlphaZero showed, it is not absolutely necessary.


BTW, this A>B>C problem reminds me of an earlier idea:
Bill Spight wrote:69.23% ≅ 9/13 , so the win/loss odds are 9/4.
60% = ⅗, so the win/loss odds are 3/2.

Andrew beats Bob 60% of the time, with win/loss odds of 3/2; Bob beats Charlie 60% of the time, with win/loss odds of 3/2. Assuming transitivity and no error, Andrew beats Charlie with win/loss odds of (3/2) (3/2) = 9/4, or 9/13 of the time.

In terms of the log of the odds, log(3/2) + log(3/2) = log(9/4). :)
I would try to model a player's level as a distribution of total errors made in a game. Assuming the simplest case with no distorting factors and a roughly bell shaped distribution (binomial? questionable OC), a player can be described as dropping an average of P points with an sd of D. The outcome of the game is the difference of dropped points, which should follow a shifted bell curve. If the players are close in strength (no adjustments for sd difference), the original question can be seen as shifting it twice as much as necessary for a 60% result.

It's a pity today's programs don't reliably analyse expected score, as that would enable a more accurate error distribution analysis.
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: LZ's progression

Post by Bill Spight »

moha wrote:
Bill Spight wrote:I would go further, and, since we are talking go, not just require the candidate winner to beat the previous version, but to beat each previous version by a greater margin.
But don't forget we don't just randomly pick nets here. Each one is the same as the last promoted net, after showing it a few more training samples. Each net is expected to be slightly stronger, even if this does not manifest in the test match. The test is there to prevent catastrophic regression, but as AlphaZero showed, it is not absolutely necessary.
Drift is still possible, as is the accumulation of deleterious changes in the successive winners. A skill that an earlier winner had can be lost, without the loss being enough for that earlier winner to beat the current winner, unless the current winner is required to show sufficient superiority (as in giving a handicap).

BTW, this A>B>C problem reminds me of an earlier idea:
Bill Spight wrote:69.23% ≅ 9/13 , so the win/loss odds are 9/4.
60% = ⅗, so the win/loss odds are 3/2.

Andrew beats Bob 60% of the time, with win/loss odds of 3/2; Bob beats Charlie 60% of the time, with win/loss odds of 3/2. Assuming transitivity and no error, Andrew beats Charlie with win/loss odds of (3/2) (3/2) = 9/4, or 9/13 of the time.

In terms of the log of the odds, log(3/2) + log(3/2) = log(9/4). :)
I would try to model a player's level as a distribution of total errors made in a game. Assuming the simplest case with no distorting factors and a roughly bell shaped distribution (binomial? questionable OC), a player can be described as dropping an average of P points with an sd of D. The outcome of the game is the difference of dropped points, which should follow a shifted bell curve. If the players are close in strength (no adjustments for sd difference), the original question can be seen as shifting it twice as much as necessary for a 60% result.
While a single player's variation in overall skill may be roughly normal (bell shaped), that is not the shape of the presumed "fitness landscape" for advancing players. Both Elo (and I, when I set up a ratings system years ago for New Mexico) assumed a kind of power law shape, which is decidedly not bell shaped.
It's a pity today's programs don't reliably analyse expected score, as that would enable a more accurate error distribution analysis.
I agree. But the "win rate" estimate worked better in producing stronger Monte Carlo programs.
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: LZ's progression

Post by moha »

Bill Spight wrote:Drift is still possible, as is the accumulation of deleterious changes in the successive winners. A skill that an earlier winner had can be lost, without the loss being enough for that earlier winner to beat the current winner, unless the current winner is required to show sufficient superiority (as in giving a handicap).
The only difference I see from the simple common case, when a net is continuously trained on existing data, is a potential negative feedback loop through the selfplay games (generated by the current partially trained net). But apparently that didn't happen with AlphaZero (without promotion matches), at least not to an extent that caused real problems.
While a single player's variation in overall skill may be roughly normal (bell shaped), that is not the shape of the presumed "fitness landscape" for advancing players. Both Elo (and I, when I set up a ratings system years ago for New Mexico) assumed a kind of power law shape, which is decidedly not bell shaped.
Could you elaborate on the "fitness landscape for advancing players" and its role in the A>B>C case?

I think go is a bit different from chess (Elo) in that the accumulation of those tiny errors (which gives a player's performance some normality) is actually visible (in points) and verifiable here (with a strong enough program, and enough match samples). Which distribution would we see on expected points dropped (sum of single errors), and on the expected match score between two players (difference of the sums of single errors)? One distorting factor I see is that winning players (like programs) trade margin for safety and simplicity, intentionally dropping some points.
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: LZ's progression

Post by Bill Spight »

moha wrote:
Bill Spight wrote:Drift is still possible, as is the accumulation of deleterious changes in the successive winners. A skill that an earlier winner had can be lost, without the loss being enough for that earlier winner to beat the current winner, unless the current winner is required to show sufficient superiority (as in giving a handicap).
The only difference I see from the simple common case, when a net is continuously trained on existing data, is a potential negative feedback loop through the selfplay games (generated by the current partially trained net).
I don't want to strain a metaphor too far, but Uberdude's post exemplifies the potential problem which might mean that LeelaZero is making less progress than it appears to be making. Different players have different weaknesses, and it is possible for successive winners to cycle between different strengths and weaknesses without making overall progress. I don't mean that the cycle is only three winners long, but the accumulation of small errors in exchange for small advantages elsewhere can produce the effect. Both randomness and multiple skills make this phenomenon possible.
But apparently that didn't happen with AlphaZero (without promotion matches), at least not to an extent that caused real problems.
In hill-climbing this kind of phenomenon tends to happen near the top of a hill. Perhaps we have not seen it with AlphaGo Zero because it is not near the hilltop for go.

However, I suspect that it did happen with AlphaZero (chess), which is why they played against a hobbled version of Stockfish. Considering the rapid initial progress of AlphaZero, reaching top level play in only a few hours, why did they not run it for a few days more and take on the best, including an opening book and endgame table bases? My guess is that AlphaZero stalled out. That does not minimize their accomplishment, nor does it alter the fact that the way AlphaZero plays chess is more human like than the play of other chess engines. But stalling out is not so good from a PR standpoint. {shrug}
While a single player's variation in overall skill may be roughly normal (bell shaped), that is not the shape of the presumed "fitness landscape" for advancing players. Both Elo (and I, when I set up a ratings system years ago for New Mexico) assumed a kind of power law shape, which is decidedly not bell shaped.
Could you elaborate on "fitness landscape for advancing players" and it's role in the A>B>C case?
Consider the case of pool (pocket billiards). One test of skill in straight pool, where you can shoot the ball you pick and call the shot, is the average length of a run, how many balls, on average, that you can sink in a row. If the probability of sinking each ball is constant (not true, but perhaps approximately so), then in a sense the gain in skill for a poor player with an average run of 1 to increase it to 2 is approximately the same as for a much better player with an average run of 50 to increase it to 51. OC, the much better player has a harder time improving by 1 ball, in general, because he is much nearer the limit of the skills needed to play pool than the poor player (nearer the top of the hill).
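Bill's pool example can be made concrete. Under his constant-probability assumption, run lengths are geometric, so the average run length pins down the per-ball probability:

```python
def per_ball_prob(avg_run):
    """Per-ball sinking probability implied by an average run length,
    assuming every shot succeeds independently with fixed probability p
    (run lengths are then geometric with mean p / (1 - p))."""
    return avg_run / (avg_run + 1)

for run in (1, 2, 50, 51):
    print(f"average run {run:2d}: per-ball probability {per_ball_prob(run):.4f}")
```

Going from an average run of 1 to 2 means improving per-ball accuracy from 50% to about 66.7%; going from 50 to 51 needs only a move from roughly 98.04% to 98.08%. Equal gains on the "run" scale require ever smaller, but ever harder-won, gains near the ceiling of the skill.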

Let us say that if Player B is one "level" better than Player A that he can beat Player A with a win/loss ratio of 1.5, and Player C is one level better than Player B. Then, based upon the structure of the levels (the "fitness landscape"), and not upon the shape of the variation in each player's play, we may, with simplifying assumptions, expect that Player C can beat Player A with a win/loss ratio of 1.5^2 = 2.25. The less the variation in each player's play, the more accurate that estimate will be. (Edit: But, as both of us have pointed out, it is more likely to be an overestimate than an underestimate.)
I think go is a bit different to chess (elo) in that the accumulation of those tiny errors (which makes some normality of a players performance) is actually visible (in points) and verifiable here (with a strong enough program, and enough match samples). Which distribution would we see on expected points dropped (sum of single errors), and on the expected match score between two players (difference of the sums of single errors)?
I took advantage of that in my rating system by basing ratings on the ability to give handicaps, not simply upon win/loss ratios of even games. :)
One distorting factor I see is winning players (like programs) trade margin for safety and simplicity, intentionally dropping some points.
Right. That is one reason to use handicap stones or variable komi to measure progress, so that the winner cannot afford to slack off against a weaker player.

Edit: Since I based ratings on the ability to give handicaps and komi, I did not follow in Elo's footsteps and had no reason to study that system. I infer Elo's "fitness landscape" from Vargo's remarks. I may well be mistaken about that.
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: LZ's progression

Post by moha »

Bill Spight wrote:Consider the case of pool (pocket billiards). One test of skill in straight pool, where you can shoot the ball you pick and call the shot, is the average length of a run, how many balls, on average, that you can sink in a row. If the probability of sinking each ball is constant (not true, but perhaps approximately so), then in a sense the gain in skill for a poor player with an average run of 1 to increase it to 2 is approximately the same as for a much better player with an average run of 50 to increase it to 51. OC, the much better player has a harder time improving by 1 ball, in general, because he is much nearer the limit of the skills needed to play pool than the poor player (nearer the top of the hill).

Let us say that if Player B is one "level" better than Player A that he can beat Player A with a win/loss ratio of 1.5, and Player C is one level better than Player B. Then, based upon the structure of the levels (the "fitness landscape"), and not upon the shape of the variation in each player's play, we may, with simplifying assumptions, expect that Player C can beat Player A with a win/loss ratio of 1.5^2 = 2.25. The less the variation in each player's play, the more accurate that estimate will be.
I don't see how you can ignore the shape of the actual distribution of single-game performances. I think the correctness of this oddswise estimate depends heavily on it. For normal distributions it can be reasonably correct (69.4% instead of 69.23% for the A>B>C 60%+60% case), but for other distributions couldn't it be completely wrong? (Even without distorting factors like correlation or rock-paper-scissors.)
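moha's two figures can be checked directly. A Thurstone-style model (each player's single-game performance is normal with equal variance, win probability given by the normal CDF of the mean gap) yields about 69.4%, while the Elo/Bradley-Terry logistic model, which multiplies odds, yields exactly 9/13. Only the Python standard library is needed:

```python
from statistics import NormalDist

N = NormalDist()

# Thurstone-style model: performances are equal-variance normals, so the
# winrate is the normal CDF of the standardized mean gap.
gap_60 = N.inv_cdf(0.60)          # gap that produces a 60% winrate
p_ac_normal = N.cdf(2 * gap_60)   # A vs C: twice the gap

# Elo / Bradley-Terry (logistic) model: multiply the win/loss odds.
odds = 0.60 / 0.40
p_ac_logistic = odds * odds / (1 + odds * odds)

print(f"normal model:   {p_ac_normal:.4f}")    # ≈ 0.694
print(f"logistic model: {p_ac_logistic:.4f}")  # 9/13 ≈ 0.6923
```

For this case the two models disagree by less than 0.2 percentage points, which is why both fit rating data tolerably well; distributions with heavier tails or asymmetry could diverge much further, which is moha's worry.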
dfan
Gosei
Posts: 1599
Joined: Wed Apr 21, 2010 8:49 am
Rank: AGA 2k Fox 3d
GD Posts: 61
KGS: dfan
Has thanked: 891 times
Been thanked: 534 times

Re: LZ's progression

Post by dfan »

moha wrote:I don't see how you can ignore the shape of the actual distribution of single-game performances. I think the correctness of this oddswise estimate depends heavily on it. For normal distributions it can be reasonably correct (69.4% instead of 69.23% for the A>B>C 60%+60% case), but for other distributions couldn't it be completely wrong? (Even without distorting factors like correlation or rock-paper-scissors.)
Indeed. See my earlier comment #13, which I think got lost in the shuffle a little, for a trivial example.
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: LZ's progression

Post by Bill Spight »

moha wrote:
Bill Spight wrote:Consider the case of pool (pocket billiards). One test of skill in straight pool, where you can shoot the ball you pick and call the shot, is the average length of a run, how many balls, on average, that you can sink in a row. If the probability of sinking each ball is constant (not true, but perhaps approximately so), then in a sense the gain in skill for a poor player with an average run of 1 to increase it to 2 is approximately the same as for a much better player with an average run of 50 to increase it to 51. OC, the much better player has a harder time improving by 1 ball, in general, because he is much nearer the limit of the skills needed to play pool than the poor player (nearer the top of the hill).

Let us say that if Player B is one "level" better than Player A that he can beat Player A with a win/loss ratio of 1.5, and Player C is one level better than Player B. Then, based upon the structure of the levels (the "fitness landscape"), and not upon the shape of the variation in each player's play, we may, with simplifying assumptions, expect that Player C can beat Player A with a win/loss ratio of 1.5^2 = 2.25. The less the variation in each player's play, the more accurate that estimate will be.
I don't see how you can ignore the shape of the actual distribution of single-game performances.
I am not ignoring it. We are talking about two different things, that is all. We have to consider the distribution of game results to address the question of whether the winner in a set of games is better than his opponent or opponents. But in the case of A vs. B vs. C, we are assuming that C is better than B and B is better than A. The question then is how often the winner should win, given that difference. Different questions.

I think the correctness of this oddswise estimate depends heavily on that. For normal distributions it can be reasonably correct (69.4% instead of 69.23% for the A>B>C 60%+60% case), but for other distributions couldn't it be completely wrong? (Even without distorting factors like correlation or rock-paper-scissors.)
For the purpose of model building I think that we can assume the normality of the results of a sufficiently large number of games. I suspect that the log odds is the best measure of results, but with enough games it should not make much difference if we use percentages.

The oddswise estimate is only correct if there is no variability in the win/loss odds. Otherwise it should be an overestimate. (Edit: For games like go and chess, I mean.) You can see that phenomenon with go ranks. With even matches (alternating Black and White) a 10 kyu will have a lower winning percentage against a 12 kyu than a 3 dan will have against a shodan, because the dan players' results are less variable. (I am assuming that if the stronger player of each pair alternates between giving two or three stones, the results are even. I do not assume that ranking based upon even games will behave that way.)
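Bill's overestimate claim has a simple mechanism: the logistic curve is concave above 50%, so symmetric game-to-game variability in the effective strength gap pulls the long-run winrate below the fixed-gap figure. A sketch, with invented amounts of variability:

```python
import math
import random

random.seed(3)

def winrate(log_odds):
    """Logistic map from log-odds of winning to win probability."""
    return 1 / (1 + math.exp(-log_odds))

d = math.log(9 / 4)   # fixed strength gap worth 9/13 (about 69.2%) per game
avgs = []
for spread in (0.0, 0.5, 1.0):
    # let the effective gap vary from game to game, symmetrically around d
    samples = [winrate(random.gauss(d, spread)) for _ in range(200_000)]
    avgs.append(sum(samples) / len(samples))
    print(f"gap spread {spread}: long-run winrate {avgs[-1]:.4f}")
```

With zero spread the winrate is exactly 9/13; as the spread grows, the average winrate falls, even though the average gap is unchanged. The more variable the players, the more 9/4 overestimates the realized odds, matching Bill's 10 kyu vs. 3 dan observation.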
Last edited by Bill Spight on Thu May 17, 2018 3:55 pm, edited 1 time in total.
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: LZ's progression

Post by Bill Spight »

dfan wrote:
moha wrote:I don't see how you can ignore the shape of the actual distribution of single-game performances. I think the correctness of this oddswise estimate depends heavily on it. For normal distributions it can be reasonably correct (69.4% instead of 69.23% for the A>B>C 60%+60% case), but for other distributions couldn't it be completely wrong? (Even without distorting factors like correlation or rock-paper-scissors.)
Indeed. See my earlier comment #13, which I think got lost in the shuffle a little, for a trivial example.
How is it an example? Doesn't it depend upon the structure of the game and a presumed definition of expertise at it, rather than the distribution of the game results per se?

Your point that there is no necessary relationship between the win rates of A vs. B, B vs. C, and A vs. C is well taken. But I don't think that is what moha is saying.
dfan
Gosei
Posts: 1599
Joined: Wed Apr 21, 2010 8:49 am
Rank: AGA 2k Fox 3d
GD Posts: 61
KGS: dfan
Has thanked: 891 times
Been thanked: 534 times

Re: LZ's progression

Post by dfan »

Bill Spight wrote:
dfan wrote:Indeed. See my earlier comment #13, which I think got lost in the shuffle a little, for a trivial example.
How is it an example? Doesn't it depend upon the structure of the game and a presumed definition of expertise at it, rather than the distribution of the game results per se?

Your point there is no necessary relationship between the win rates of A vs. B, B vs. C, and A vs. C is well taken. But I don't think that is what moha is saying.
I thought it was an example of a "shape of the actual distribution of single game performances" (uniform rather than Gaussian or somesuch) but maybe I was misinterpreting moha's phrase. For one thing, I was interpreting "game performance" as being a function of a single player, and whoever has the better performance wins; perhaps something else was meant.