Home-made Elo ratings for some engines

xela
Lives in gote
Posts: 652
Joined: Sun Feb 09, 2014 4:46 am
Rank: Australian 3 dan
GD Posts: 200
Location: Adelaide, South Australia
Has thanked: 219 times
Been thanked: 281 times

Re: Home-made Elo ratings for some engines

Post by xela »

Just for fun, let's do some dodgy mathematical analysis of the 1-minute and 5-minute results, to see if we can extrapolate what will happen in 20-minute games. (I'll post some actual 20-minute results tomorrow. I did the analysis last week, just didn't get around to posting until today.)

We already know that small networks beat bigger networks in fast games, but we might expect the bigger networks to catch up in slower games. Let's pretend that each engine/network combination has a "baseline" strength (how well it can play on minimal thinking time) plus an ability to get stronger with more time. There will be diminishing returns: you'd expect a big difference between 1 minute and 10 minutes, but not much difference between 60 minutes and 69 minutes. But the strength is theoretically unbounded (Monte Carlo search converges to the best move given unlimited time and unlimited memory).

So a half way reasonable model might be:
                              Elo rating = b + alpha times log(t)
where b is the baseline strength, t is the thinking time in minutes per player per game (absolute time, because I don't want to get into complications around byo-yomi), and alpha represents how well the engine/network can make use of extra thinking time.

At 1 minute time limits, t=1, log(t)=0, so b is just the 1-minute Elo rating. Then we can calculate alpha as (5-minute Elo minus 1-minute Elo)/log(4). (For me, log means natural log, because I did too much calculus as a teenager, so log(4) is about 1.386.) And then the expected 20-minute rating from this model would be b + alpha times log(19), or b+2.944 alpha.

If we have gnugo at 1500 on both rating scales, then it gets alpha=0, meaning that gnugo gets no stronger when it thinks for a long time. Worse, a few of the weaker engines get negative numbers for alpha. I don't believe that, so I'm going to subtract 200 from all the 1-minute ratings, just to get some more reasonable alpha values.

Finally, this projects pachi_nn to be about 3300 in 20 minute games, which isn't realistic (it's nowhere near pro strength), so I'm going to subtract a few rating points from the results to put pachi_nn at 2400.
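The arithmetic described above can be sketched in a few lines. The original analysis was done in R and that code isn't shown, so this is only a Python reconstruction under assumptions: the function names are invented for illustration, the log(4) and log(19) factors are taken exactly as stated in the post, and the two sample rows use the rounded Elo1_adjusted and Elo5 values from the table, so the output can differ from the table by a point or so.

```python
import math

LOG_5MIN = math.log(4)    # factor the post uses for the 5-minute games (about 1.386)
LOG_20MIN = math.log(19)  # factor the post uses for the 20-minute projection (about 2.944)


def fit_alpha(elo1_adjusted, elo5):
    """Slope of the model: rating gained per unit of the log-time factor."""
    return (elo5 - elo1_adjusted) / LOG_5MIN


def project_elo20(elo1_adjusted, elo5):
    """Unshifted 20-minute projection: b + alpha * log factor."""
    return elo1_adjusted + fit_alpha(elo1_adjusted, elo5) * LOG_20MIN


def shift_to_anchor(projections, anchor_name, anchor_rating):
    """Subtract a constant so that the anchor engine lands on anchor_rating."""
    shift = projections[anchor_name] - anchor_rating
    return {name: rating - shift for name, rating in projections.items()}


# Rounded inputs copied from two rows of the table: (Elo1_adjusted, Elo5).
engines = {"LZ_ELF": (3517, 4544), "pachi_nn": (1828, 2429)}
raw = {name: project_elo20(e1, e5) for name, (e1, e5) in engines.items()}
projected = shift_to_anchor(raw, "pachi_nn", 2400)

for name, rating in projected.items():
    print(f"{name}: alpha={fit_alpha(*engines[name]):.0f}, Elo20={rating:.0f}")
```

Because the shift is a single constant applied to every engine, it changes the projected ratings but not the projected ranking.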

Then a few lines of R programming gives these results:

Code: Select all

             Name Elo1_adjusted Elo5 rank1 rank5 alpha Elo20 rank20
1          LZ_ELF          3517 4544     3     1   741           4993      1
2       LZ_ELF_6t          3436 4496     4     3   765           4982      2
3         LM_GX47          3558 4504     2     2   682           4862      3
4      LZ_phoenix          2956 4133    17     9   849           4751      4
5          LZ_157          3580 4457     1     4   633           4738      5
6          LZ_173          3319 4307     7     5   713           4712      6
7          LZ_141          3271 4280     8     7   728           4709      7
8          LZ_174          3390 4291     5     6   650           4599      8
9       LZ_174_6t          3125 4120    11    10   718           4533      9
10    ray_ELF_12t          3342 4145     6     8   579           4343     10
11          LM_B5          2978 3932    16    13   688           4299     11
12    ray_173_12t          3089 3969    14    11   635           4253     12
13          LM_Z2          3108 3933    13    12   595           4155     13
14     ray_173_6t          3111 3874    12    15   550           4027     14
15         LZ_116          3169 3881    10    14   514           3976     15
16     ray_173_2t          2903 3706    19    18   579           3904     16
17         LM_W11          3059 3786    15    17   524           3898     17
18    ray_W11_12t          2950 3649    18    19   504           3730     18
19          LM_E8          3250 3794     9    16   392           3700     19
20        ray_ELF          2699 3487    23    20   568           3668     20
21        ray_173          2687 3392    24    22   509           3479     21
22          leela          2790 3414    21    21   450           3410     22
23         LZ_zed          2810 3372    20    23   405           3299     23
24   dream_ponder          2559 3232    26    25   485           3283     24
25          LZ_91          2735 3289    22    24   400           3207     25
26          dream          2437 3130    27    27   500           3204     26
27        LM_E8_c          2272 3035    31    28   550           3188     27
28        ray_W11          2578 3172    25    26   428           3135     28
29     LZ_116_c2t          2297 2988    30    29   498           3060     29
30       LM_W11_c          2106 2874    35    30   554           3032     30
31        leela_c          1926 2759    39    33   601           2990     31
32      leela_c2t          1935 2740    37    34   581           2940     32
33      LZ_91_c2t          1933 2648    38    35   516           2747     33
34      LM_GX47_c          2361 2842    29    31   347           2678     34
35     oakfoam_nn          2424 2839    28    32   299           2600     35
36      leela_c1t          2002 2535    36    39   384           2429     36
37       pachi_nn          1828 2429    40    40   434           2400     37
38        LM_Z2_c          2170 2601    34    36   311           2380     38
39          LZ_57          2173 2597    33    38   306           2369     39
40      LZ_57_c2t          1292 2093    44    42   578           2288     40
41        LM_B5_c          2271 2599    32    37   237           2263     41
42       pachi_1t          1013 1858    49    45   610           2103     42
43          pachi          1618 2137    41    41   374           2015     43
44          fuego          1138 1865    47    44   524           1977     44
45    leela_nonet          1583 2086    42    43   363           1946     45
46 leela_nonet_1t          1186 1856    45    46   483           1904     46
47          michi           934 1466    51    48   384           1359     47
48          gnugo          1300 1500    43    47   144           1020     48
49       oakfoam1          1163 1287    46    49    89            721     49
50        oakfoam           995 1068    50    50    53            445     50
51        matilda           864  975    52    52    80            395     51
52   oakfoam_book          1056  998    48    51   -42            228     52
So we can see for example that LZ_phoenix comes 17th in 1-minute games, but 9th in 5-minute games, giving it a big alpha value (it's making great use of the extra thinking time), and we'd expect it to shoot up to 4th place in 20-minute games. On the other hand, LM_E8 (with a 128x10 network) did better at 1 minute than at 5 minutes, so its alpha is lower, and we'd expect it to rank even lower at 20 minutes. Then again, the alpha values for LZ 141 and 174 don't look quite right.

This is a pretty simplistic model, so I don't expect the results to be at all accurate (we can tell it's not right by the way gnugo has dropped 500 points in the output), but it's interesting food for thought.
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: Home-made Elo ratings for some engines

Post by moha »

xela wrote:This is a pretty simplistic model, so I don't expect the results to be at all accurate (we can tell it's not right by the way gnugo has dropped 500 points in the output), but it's interesting food for thought.
The basic idea usually is that each doubling of thinking time gives a roughly similar strength increase (of course, this is not necessarily a reasonable assumption for all engines). Your formula could capture this if you didn't subtract 1 from 5 and 20 before taking the log.

But in these rating pools one's result depends on others' performances as well, quite a problem for this approach. Maybe you could anchor at gnugo=1500 for 1 min, and anchor other times at a guessed gnugo improvement factor / rating. If you expect your numbers to go up with more time, then you basically compare performance to 1-min gnugo (how strong I should be to play this well in 1-min games), so going up into otherwise "pro" number range is not surprising and does not necessarily mean pro strength.
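The "same gain per doubling" idea can be written down as a small sketch. This is only an illustration of that idea in Python, not anyone's actual code from this thread: the function names are invented, the numbers are made up, and it ignores the rating-pool problem raised in the previous paragraph.

```python
import math


def elo_per_doubling(elo_short, elo_long, t_short, t_long):
    """Rating gain per doubling of thinking time, from two measured time controls."""
    return (elo_long - elo_short) / math.log2(t_long / t_short)


def project(elo_short, gain, t_short, t_target):
    """Extrapolate to t_target assuming the same gain for every doubling."""
    return elo_short + gain * math.log2(t_target / t_short)


# Made-up numbers for illustration: 3000 at 1 minute, 3400 at 4 minutes.
gain = elo_per_doubling(3000, 3400, t_short=1, t_long=4)  # two doublings
print(gain)                        # 200.0
print(project(3000, gain, 1, 16))  # 3800.0
```

With the time controls in this thread, the 5-minute games would count as log2(5) doublings (about 2.32) and the 20-minute games as log2(20) doublings (about 4.32) relative to 1 minute.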

Re: Home-made Elo ratings for some engines

Post by xela »

OK, time for some actual results at 20 minutes per game. To start with, I decided to do this as a "win and continue" series of 8-game matches, starting with pachi_nn and introducing opponents about 100-200 Elo points above the previous winner (going by my dodgy projected ratings, to see just how bad they are). I'd expect each new engine to win 5-3 or 6-2. I also decided that if an engine wins its match 8-0, I should backtrack and look for something slightly weaker, in the interest of making the ratings a bit more accurate. Once I get to the top of the list, then I'll go back and add some more games to try and reduce the error margins, and maybe add a couple more engines if anything looks especially interesting. Without gnugo in the list, I've decided to anchor pachi_nn's rating at 2400, so the ratings are still in the same ballpark as my other lists.
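As a rough check on the "5-3 or 6-2" expectation, here is a small sketch using the standard logistic Elo curve. Treating the projected ratings as if they followed that curve is an assumption, and the function names are made up for illustration.

```python
def expected_score(elo_diff):
    """Expected score for the higher-rated side under the standard logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def expected_wins(elo_diff, games=8):
    """Expected number of wins in a match of the given length."""
    return games * expected_score(elo_diff)


for diff in (100, 200):
    print(diff, round(expected_wins(diff), 1))
```

Under that assumption, a 100-point gap works out to roughly 5 wins out of 8 and a 200-point gap to roughly 6.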

Round 1: pachi_nn vs oakfoam_nn. This was the first surprise: pachi_nn won the match 5-3. It seems that oakfoam has some problems with time management. It plays at about the right pace in 1-minute or 5-minute games, but in 20-minute games it only uses 6 or 7 minutes total, so it's giving pachi a bit of an advantage. Oakfoam seemed to be ahead in the opening and early middlegame of each game, and then managed to misread something and lose.

Round 2: LZ_91_c2t d pachi_nn 6-2.

Round 3: leela_c2t d LZ_91_c2t 5-3

Round 4: LM_W11_c d leela_c2t 5-3

Round 5: LM_W11_c d dream 6-2. Another surprise: I was expecting dream to do a bit better against a CPU-only engine. Here there were two games with "disputed" scores (both engines agreed that LM_W11_c had won, but gave different winning margins): each game involved a seki.

Round 6: LZ_zed d LM_W11_c 5-3

Round 7: leela vs LZ_zed was a 4-4 tie. I decided to add a tiebreaker match: leela d LM_W11_c 6-2. This put leela on top of the list, because it did a better job of beating up LM.

Round 8: ray_ELF d leela 8-0. Backtrack: ray_173 d leela 6-2. Again there was one game with disputed score, another seki. ray_173 d ray_ELF 5-3, not what I expected! At this point, BayesElo had both ray_173 and ray_ELF on exactly the same rating (3082): 173 had won the head to head, but ELF did better against leela, and these factors cancelled out. There was also another game with disputed result (agreed that ELF won, but disagreed on the amount), but not a seki this time; instead, the scoring was messed up because ray_173 passed before the game was actually finished. (No harm done, it was losing anyway.)

Round 9: Instead of running another tiebreaker match, I decided to just give the next engine 6 games each against the tied leaders:
  • ray_173 vs LM_W11 3-3
  • ray_ELF d LM_W11 4-2
Another surprise: I'd expected LM_W11 to be stronger.

Round 10: ray_173_6t d ray_ELF 8-0

Remember that ray defaults to one thread. In 1-minute or 5-minute games it gets a little bit stronger given extra threads, but not by a huge amount. It looks like it gets a lot more benefit from those extra threads in slower games! (LZ uses two threads by default, and seems to actually get weaker given extra threads, at least in short games. But maybe it's worth retesting this theory in longer games?)

Backtrack: ray_173_2t d ray_ELF 5-3; ray_173_6t d ray_173_2t 7-1. In the 6t vs 2t match, there were two games with disputed result: both players passed early and disagreed on who was ahead. I decided to step in as referee and, looking at the positions, awarded one game to each player.

Round 11: LM_Z2 d ray_173_6t 7-1. Two more games where ray passed early from a losing position.

And throwing all this into BayesElo, the rating list so far is:

Code: Select all

Name        Elo   Elo+  Elo-  games  score  avg_opp
LM_Z2       3726  312   203   8      88%    3505
ray_173_6t  3505  165   154   24     67%    3350
ray_173_2t  3212  169   177   16     38%    3309
ray_ELF     3113  112   111   38     47%    3140
ray_173     3087  143   133   22     64%    2991
LM_W11      3053  166   179   12     42%    3100
leela       2823  119   125   32     38%    2922
LZ_zed      2796  156   149   16     56%    2758
LM_W11_c    2692  109   109   32     50%    2697
leela_c2t   2618  150   150   16     50%    2620
dream       2553  197   256   8      25%    2692
LZ_91_c2t   2547  156   150   16     56%    2509
pachi_nn    2400  146   151   16     44%    2442
oakfoam_nn  2336  193   217   8      38%    2400
To be continued...
Last edited by xela on Sat Nov 10, 2018 4:04 am, edited 1 time in total.
Kris Storm
Beginner
Posts: 3
Joined: Sun Oct 14, 2018 12:01 pm
GD Posts: 0

Re: Home-made Elo ratings for some engines

Post by Kris Storm »

Hi xela. It's a good idea to do such a comparison.

How are you using BayesELO with SGF files? I found it useful only for chess PGN files.

Re: Home-made Elo ratings for some engines

Post by xela »

Kris Storm wrote:How are you using BayesELO with SGF files?
I've written a few lines of Python code to read the *.dat files created by GoGui and output the results as PGN. (You could do this manually too for a small number of games, and you could do it just as well from the SGF instead of the DAT.) The PGN file can be pretty minimal. BayesElo doesn't need the moves of the game; it's happy running on something that looks like this:

Code: Select all

[White "leela_c"][Black "LM_E8_c"][Result "1-0"] 1-0
[White "LM_E8_c"][Black "leela_c"][Result "1-0"] 1-0
[White "LM_E8_c"][Black "leela_c"][Result "1-0"] 1-0
[White "LM_E8_c"][Black "leela_c"][Result "1-0"] 1-0
[White "leela_c"][Black "LM_W11_c"][Result "1-0"] 1-0
[White "leela_c"][Black "LM_W11_c"][Result "1-0"] 1-0
[White "LM_W11_c"][Black "leela_c"][Result "1-0"] 1-0
[White "LM_W11_c"][Black "leela_c"][Result "1-0"] 1-0
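
A minimal sketch of producing lines in this shape follows. It is not the script described in this post, which hasn't been shared: the function name and the list of (winner, loser) pairs are hypothetical, and reading the GoGui .dat or SGF files is left out because their layout isn't shown here. It assumes the winner is always written as White with result 1-0, as every line of the sample does, which only makes sense together with the "advantage 0" setting.

```python
def results_to_pgn(results):
    """Build minimal PGN text from (winner, loser) pairs.

    The winner is always written as White with result 1-0, matching the
    sample above; the colours actually played are not preserved.
    """
    lines = [
        f'[White "{winner}"][Black "{loser}"][Result "1-0"] 1-0'
        for winner, loser in results
    ]
    return "\n".join(lines) + "\n"


# Hypothetical input: in practice these pairs would come from parsing the
# GoGui .dat or SGF files, which is not shown here.
games = [("leela_c", "LM_E8_c"), ("LM_E8_c", "leela_c")]
pgn_text = results_to_pgn(games)
print(pgn_text, end="")
```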
Then I feed these commands to BayesElo:

Code: Select all

readpgn filename.pgn
elo
offset 2000
advantage 0
drawelo 0.01
mm
exactdist
ratings
The "advantage 0" part means that it doesn't care who played black or white, so I can put the winner's name first in my PGN file and all the results as 1-0, which makes it simpler to create the PGN. There was a forum post somewhere by Rémi Coulom recommending the "advantage 0" and "drawelo 0.01" settings for go games. The "offset 2000" part means that the average rating of the outputs will be 2000; I have another Python script which changes 2000 to a different number, which is how I anchor the ratings (run it twice, figure out which offset will put gnugo at 1500).
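
The "run it twice" anchoring step comes down to one subtraction. The sketch below assumes that changing the offset shifts every rating by the same amount; the function name and the first-run gnugo rating are made up for illustration.

```python
def anchored_offset(current_offset, anchor_rating, target_rating):
    """Offset that moves the anchor engine to target_rating.

    Assumes changing the offset shifts every rating by the same amount.
    """
    return current_offset + (target_rating - anchor_rating)


# Hypothetical first run: with "offset 2000", gnugo comes out at 1372.
print(anchored_offset(2000, anchor_rating=1372, target_rating=1500))  # 2128
```

The second run would then use the printed value in place of 2000 in the "offset" command.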

Re: Home-made Elo ratings for some engines

Post by Kris Storm »

Thanks for your explanation. That is a clever method. I have a lot of .dat files from GoGui tournaments and always wanted to make such an Elo list. Maybe you can share your Python code. I'm sure it would be useful for others.

Re: Home-made Elo ratings for some engines

Post by xela »

Kris Storm wrote:Maybe you can share your Python code. I'm sure it would be useful for others.
It needs a bit of a rewrite before I can share it. At the moment it wouldn't work on someone else's computer because of all the hard-coded path names (and I'd be embarrassed to let it out in this shape). I'll add it to my to-do list.

Re: Home-made Elo ratings for some engines

Post by xela »

Going slow this week, because my system keeps crashing! It doesn't like the combination of strong bots and slow games. I think I need to upgrade an nvidia driver, but it's hard to find information on which drivers are more stable. At the moment, I can queue up 8 games to be played overnight, but I'll wake up to a black screen and an unresponsive system, and when I reboot it looks like only two or three games got played.

Anyway, new this week:

AQ has entered the 1-minute and 5-minute ratings. It's at a disadvantage because it was trained for Japanese rules and 6.5 komi, and I'm playing all the games with Chinese rules and 7.5 komi (this works for the majority of bots). My guess is that AQ's rating will therefore be 50-100 points below its true strength, but I can't think of a good way to measure how much difference it actually makes.

Looking at the 5-minute games:
  • AQ played 52 games
  • 50 were won or lost by resignation.
  • One was lost by AQ by 47.5 points. I can't figure out why AQ didn't resign. The game was 449 moves long. A large group died at move 234, and analysis with LZ_157 says the winrate was below 5% for the rest of the game. The position could have been scored at move 346, but AQ kept playing inside its own territory and trying to live inside black's territory for another 100 moves.
  • One game was lost by AQ by 2.5 points. Ray actually gave away 2 points with a slack endgame move, so AQ had been behind by 4.5 points before that, which is hard to explain as a 6.5 versus 7.5 komi issue.
  • Of course we don't know how many of the resigned games were due to an overplay that wouldn't have happened with the correct komi.
A few more bots added to the 20-minute ratings. My mathematical model (post number 16 above) is looking about as bad as expected :-) At the slower time limit, it looks like LZ now gets stronger with more threads, unlike in the fast games.

Results so far at 1 minute time limit, based on 1350 games with 62 engines:

Code: Select all

Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_157          4164  97    91    60     72%    3998
LM_GX47         4142  102   95    62     73%    3900
LZ_ELF          4100  101   98    48     58%    4034
LZ_ELF_6t       4020  96    101   48     42%    4076
LZ_174          3973  87    88    64     47%    3995
ray_ELF_12t     3926  93    95    58     45%    3962
LZ_173          3903  115   112   48     60%    3794
LZ_141          3854  115   114   44     59%    3741
LM_E8           3833  116   115   50     64%    3630
LZ_116          3753  99    98    58     55%    3701
LZ_174_6t       3708  112   113   42     50%    3699
ray_173_6t      3694  112   110   36     53%    3676
LM_Z2           3691  98    94    60     67%    3518
ray_173_12t     3672  115   113   34     53%    3652
LM_W11          3642  111   115   44     50%    3611
LM_B5           3561  102   101   56     59%    3457
LZ_phoenix      3539  119   123   36     39%    3648
ray_W11_12t     3533  114   118   36     42%    3595
ray_173_2t      3486  129   129   28     50%    3487
LZ_zed          3393  123   126   34     41%    3471
leela           3374  115   114   56     55%    3294
LZ_91           3318  99    105   76     32%    3516
ray_ELF         3279  128   124   34     50%    3305
ray_173         3271  131   125   34     59%    3206
ray_W11         3129  104   103   48     50%    3138
dream_ponder    3129  119   123   40     53%    3050
AQ              3054  146   149   24     46%    3086
oakfoam_nn      2993  117   119   84     62%    2784
dream           2992  116   119   36     44%    3033
LM_GX47_c       2969  122   117   34     59%    2909
LZ_116_c2t      2866  109   112   60     30%    3144
LM_E8_c         2851  141   141   22     50%    2851
LM_B5_c         2849  134   134   24     50%    2849
LZ_116_c6t      2828  137   139   24     50%    2824
LM_Z2_c         2746  121   114   36     61%    2667
LZ_57           2745  114   116   52     50%    2725
LM_W11_c        2683  125   134   28     39%    2755
leela_c1t       2576  108   108   42     52%    2551
leela_c2t       2508  128   136   30     37%    2622
LZ_91_c2t       2506  137   140   26     46%    2533
leela_c         2499  103   101   88     59%    2377
pachi_nn        2400  110   107   76     64%    2228
pachi           2190  127   123   68     54%    2179
leela_nonet     2156  105   102   88     58%    2094
gnugo           1872  88    83    84     64%    1774
gnugo_l7        1871  120   122   52     38%    2005
LZ_57_c2t       1864  246   218   8      63%    1791
gnugo_M         1842  140   133   34     53%    1844
gnugo_l1        1824  91    89    84     48%    1882
gnugo_l4        1807  141   139   32     47%    1862
leela_nonet_1t  1758  241   333   10     10%    2160
oakfoam1        1735  126   122   32     56%    1692
pachi_pat       1711  385   328   2      50%    1711
fuego           1711  90    90    78     37%    1945
oakfoam_book    1628  113   119   40     38%    1731
pachi_1t        1585  199   231   14     14%    1894
oakfoam         1567  92    101   72     25%    1806
oakfoam2        1524  130   150   30     23%    1725
pachi_monte     1523  357   203   2      0%     1711
pachi_plain     1523  357   203   2      0%     1711
michi           1506  312   189   4      0%     1791
matilda         1437  141   120   44     9%     1877
1_min_crosstable-2018-10-20.csv
(11.36 KiB) Downloaded 548 times
Results so far at 5 minute time limit, based on 1448 games with 53 engines:

Code: Select all

Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF          4546  2     114   46     67%    4416
LM_GX47         4505  40    113   44     66%    4371
LZ_ELF_6t       4497  47    105   48     65%    4382
LZ_157          4458  82    124   32     59%    4386
LZ_173          4308  94    93    66     55%    4243
LZ_174          4292  103   98    64     66%    4137
LZ_141          4282  90    90    78     58%    4190
ray_ELF_12t     4146  102   105   46     46%    4174
LZ_phoenix      4134  121   122   36     47%    4155
LZ_174_6t       4121  84    82    92     57%    4067
ray_173_12t     3970  103   108   46     39%    4055
LM_Z2           3934  95    97    72     50%    3885
LM_B5           3932  112   111   44     57%    3861
LZ_116          3882  87    88    94     51%    3845
ray_173_6t      3876  99    101   48     44%    3925
LM_E8           3793  115   113   46     54%    3759
LM_W11          3791  109   105   60     63%    3663
ray_173_2t      3707  114   114   50     54%    3664
ray_W11_12t     3627  113   104   52     67%    3493
ray_ELF         3490  102   102   62     39%    3646
AQ              3457  110   112   52     54%    3385
ray_173         3406  102   107   54     33%    3568
leela           3400  93    96    94     44%    3438
LZ_zed          3360  109   108   52     48%    3397
LZ_91           3271  94    98    74     35%    3431
dream_ponder    3208  118   117   40     58%    3094
ray_W11         3144  104   105   52     44%    3207
dream           3095  125   127   32     47%    3117
LM_E8_c         3006  108   104   54     56%    2970
LZ_116_c2t      2959  106   105   64     56%    2903
LM_W11_c        2845  116   114   36     53%    2828
LM_GX47_c       2814  105   105   46     46%    2863
oakfoam_nn      2811  92    89    72     51%    2824
leela_c         2732  91    92    74     50%    2732
leela_c2t       2711  82    83    78     49%    2717
LZ_91_c2t       2620  105   101   56     59%    2539
LM_Z2_c         2573  109   110   38     47%    2596
LM_B5_c         2570  114   121   34     38%    2655
LZ_57           2569  113   114   38     45%    2616
leela_c1t       2507  95    100   68     41%    2596
pachi_nn        2400  106   113   64     39%    2515
pachi           2109  112   108   80     58%    2005
LZ_57_c2t       2064  132   121   40     70%    1872
leela_nonet     2058  137   150   42     36%    2157
fuego           1837  107   105   72     65%    1662
pachi_1t        1829  119   115   54     65%    1662
leela_nonet_1t  1827  124   117   52     69%    1624
gnugo           1472  142   -77   106    20%    1763
michi           1438  230   -112  40     55%    1403
oakfoam1        1258  377   -291  28     43%    1402
oakfoam         1039  562   -509  26     27%    1357
oakfoam_book    970   615   -578  32     13%    1406
matilda         947   645   -601  26     15%    1379
5_min_crosstable-2018-10-20.csv
(8.8 KiB) Downloaded 581 times
Results so far at 20 minute time limit, based on 188 games with 19 engines:

Code: Select all

Name         Elo   Elo+  Elo-  games  score  avg_opp
LZ_174_6t    4337  276   194   16     94%    3948
LZ_174       4183  228   202   8      63%    4116
ray_ELF_12t  4116  118   120   32     50%    4098
LM_Z2_6t     4090  186   167   16     69%    3948
LM_Z2        3781  143   146   40     40%    3895
ray_173_6t   3504  184   166   24     67%    3368
LM_B5        3427  238   520   8      0%     3781
ray_173_2t   3212  172   181   16     38%    3308
ray_ELF      3112  114   114   38     47%    3140
ray_173      3087  145   135   22     64%    2991
LM_W11       3053  170   182   12     42%    3100
leela        2823  121   127   32     38%    2922
LZ_zed       2796  160   152   16     56%    2758
LM_W11_c     2692  111   111   32     50%    2697
leela_c2t    2618  153   153   16     50%    2620
dream        2553  201   261   8      25%    2692
LZ_91_c2t    2547  159   153   16     56%    2509
pachi_nn     2400  149   154   16     44%    2442
oakfoam_nn   2336  197   221   8      38%    2400
20_min_crosstable-2018-10-20.csv
(1.61 KiB) Downloaded 598 times
Last edited by xela on Sat Nov 10, 2018 4:05 am, edited 2 times in total.

Re: Home-made Elo ratings for some engines

Post by xela »

Here are the two games where I thought AQ should have resigned.


Attachments
ray_ELF-AQ-2018-10-11_21-15.sgf
AQ loses by 2.5 points; ray gives away points at move 282
(2.85 KiB) Downloaded 1536 times
AQ-ray_W11_12t-2018-10-11_20-15.sgf
AQ loses by 47.5 points; resignable from move 242; game could have been scored at move 346
(3.07 KiB) Downloaded 1562 times

Re: Home-made Elo ratings for some engines

Post by xela »

Sorry for the long gap between updates! I spent a lot of time figuring out how to update my graphics drivers, but I still haven't solved the crashing problem. It looks like I can't reliably run LZ with 6 or more threads in long games. But that's OK, I've found out what I originally wanted to, which is that LZ (even with 2 threads) seems to achieve superhuman performance on a fairly ordinary computer.

I'm a little surprised to see ELF still at the top of the list, as I thought recent LZ networks had overtaken ELF at time parity. Over the next couple of weeks I'll add some more games to reduce some of the error margins, maybe throw LZ_157 into the mix, and maybe do some benchmarking to see how many visits per second I'm getting for various different networks.

Oh, and for anyone who's observant: in previous posts, the Elo+ and Elo- columns were the wrong way round. I've gone back and edited the earlier posts so they're now correct.

Results so far at 20 minute time limit, based on 228 games with 22 engines:

Code: Select all

Name         Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF       4657  160   136   24     79%    4451
LM_GX47      4567  143   132   24     63%    4481
LZ_188       4440  132   143   24     38%    4523
LZ_phoenix   4346  107   112   40     38%    4439
LZ_174       4318  148   133   24     71%    4180
ray_ELF_12t  4214  147   157   24     54%    4144
LZ_141       3980  233   482   8      0%     4318
LM_Z2        3769  199   165   24     63%    3716
ray_173_6t   3504  182   166   24     67%    3364
LM_B5        3431  233   482   8      0%     3769
ray_173_2t   3212  173   182   16     38%    3308
ray_ELF      3112  115   115   38     47%    3140
ray_173      3087  147   136   22     64%    2990
LM_W11       3052  171   184   12     42%    3099
leela        2823  122   128   32     38%    2921
LZ_zed       2795  161   153   16     56%    2757
LM_W11_c     2691  112   112   32     50%    2697
leela_c2t    2617  155   155   16     50%    2619
dream        2552  203   263   8      25%    2691
LZ_91_c2t    2547  160   154   16     56%    2508
pachi_nn     2400  150   155   16     44%    2441
oakfoam_nn   2336  199   222   8      38%    2400
20_min_crosstable-2018-11-09.csv
(1.99 KiB) Downloaded 551 times
pangafu
Beginner
Posts: 2
Joined: Tue Dec 04, 2018 8:24 pm
GD Posts: 0

Re: Home-made Elo ratings for some engines

Post by pangafu »

@xela

I am the author of LeelaMaster Weight.
I saw that you ran some Elo tests with LM, so could I add this post to the README of LeelaMaster Weight?

https://github.com/pangafu/LeelaMasterWeight/

About LeelaMaster strength(elo)
.....

Home-made Elo ratings for some engines (by xela@lifein19x19.com)

https://lifein19x19.com/viewtopic.php?f=18&t=16086

....

Thanks for your great work~

Re: Home-made Elo ratings for some engines

Post by pangafu »

Hello @xela

I am the author of Leela Master weight, and I'm glad to see you ran some tests with the LM weights.

So could I add this post to the readme of Leela Master weight?


About LeelaMaster strength(elo)
....
Home-made Elo ratings for some engines (by xela@lifein19x19.com)

viewtopic.php?f=18&t=16086
....


Please enjoy the human style of go game~

Re: Home-made Elo ratings for some engines

Post by xela »

pangafu wrote:Hello @xela

I am the author of Leela Master weight, and I'm glad to see you ran some tests with the LM weights.

So could I add this post to the readme of Leela Master weight?
Yes. Thanks for asking!

Re: Home-made Elo ratings for some engines

Post by xela »

Here are the final results (unless I get inspired to do more). Looking at the error bounds, we can't say for sure which of the top 6 is actually the strongest, but they all seem to be definitely in the "superhuman" range (considering that the bottom of this list is already amateur dan level). Just for interest, on my hardware LZ_174 and LZ_188 get about 300 visits per second, ELF about 700, GX47 around 1200, LZ_157 around 1500 (numbers are approximate because they vary from one game to another, possibly depending on the board position and how much of the tree is reused from previous moves).

Results at 20 minute time limit, based on 426 games with 25 engines:

Code: Select all

Name         Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF       4481  132   115   32     72%    4330
LM_GX47      4394  106   102   40     58%    4343
LZ_157       4378  98    94    48     58%    4320
LZ_188       4324  102   105   40     45%    4357
LZ_174       4308  97    93    56     63%    4201
LZ_phoenix   4225  90    95    56     39%    4299
LZ_173       4191  118   123   32     44%    4233
ray_ELF_12t  4020  115   119   40     43%    4079
LZ_141       3873  134   134   32     44%    3942
LM_Z2        3801  132   130   32     56%    3741
ray_173_6t   3640  119   112   48     65%    3509
LM_B5        3433  112   118   40     35%    3555
AQ           3348  144   144   20     50%    3346
ray_173_2t   3348  121   124   32     44%    3398
ray_ELF      3169  114   115   42     43%    3242
ray_173      3121  138   132   24     58%    3062
LM_W11       3118  144   138   22     59%    3053
leela        2898  109   116   40     35%    3009
LZ_zed       2870  162   155   16     56%    2832
LM_W11_c     2765  116   116   32     50%    2764
leela_c2t    2695  158   158   16     50%    2695
LZ_91_c2t    2624  114   109   36     61%    2544
dream        2594  113   112   36     53%    2573
pachi_nn     2400  134   148   24     33%    2517
oakfoam_nn   2335  160   194   16     25%    2504
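
To get a feel for what a rating gap in the table means in games, here is a rough sketch that converts an Elo difference into an expected score. It assumes the usual 400-point logistic Elo curve; the rating tool behind the table may apply extra parameters, so treat the output as a ballpark figure. The function name `expected_score` and the `ratings` dict are illustrative only, with ratings copied from the table above.

```python
def expected_score(elo_a, elo_b):
    """Expected score of A against B under the standard logistic Elo curve.

    Assumption: plain 400-point logistic scale, no draw or first-move terms.
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


# Ratings copied from the 20-minute table above.
ratings = {"LZ_ELF": 4481, "LZ_phoenix": 4225}

p = expected_score(ratings["LZ_ELF"], ratings["LZ_phoenix"])
print(f"LZ_ELF vs LZ_phoenix: expected score {p:.2f}")
```

With error bars of roughly 100 points on each rating, the real expected score could sit a fair way either side of this value.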

Attachments
20_min_crosstable-2018-12-08.csv (2.47 KiB)

Re: Home-made Elo ratings for some engines

Post by xela »

Updated with KataGo (OpenCL version), plus some recent LZ weights for comparison. Fast games only this time; I didn't get around to updating the 20-minute results.

kata_6b is the 6-block network, and you can probably guess the names for 10, 15, 20 blocks. In the 1 minute games I also tried different numbers of threads but didn't see much potential for significant improvement. The suggestion in the config file of trying more threads than you have cores wasn't a success on my hardware.

Results at 1 minute time limit, based on 1520 games with 72 engines:

Name            Elo   Elo+  Elo-  games  score  avg_opp
kata_15b        4224  184   166   16     75%    4041
LZ_242          4194  215   218   8      63%    4117
LZ_157          4186  94    85    74     74%    3993
LZ_188          4172  174   166   14     57%    4128
LM_GX47         4160  101   94    64     72%    3921
kata_20b        4142  179   167   16     63%    4047
LZ_ELF          4130  85    82    72     61%    4039
kata_10b        4037  118   110   44     68%    3873
LZ_ELF_6t       4024  92    97    54     39%    4100
LZ_174          3993  83    85    72     47%    4010
LZ_173          3941  107   106   54     59%    3836
ray_ELF_12t     3920  88    92    66     39%    3997
LZ_141          3907  98    97    60     58%    3805
kata_10_12t     3895  141   139   24     50%    3902
kata_10_6t      3856  130   134   28     43%    3912
LM_E8           3827  107   109   56     59%    3673
kata_10_2t      3820  140   145   24     42%    3886
LZ_116          3752  89    89    74     49%    3757
LZ_174_6t       3733  108   109   46     50%    3723
ray_173_6t      3698  107   106   40     53%    3681
kata_10_24t     3689  159   186   20     25%    3888
LM_Z2           3679  96    92    62     65%    3525
ray_173_12t     3672  112   110   36     53%    3653
LM_W11          3649  113   116   44     50%    3619
LZ_phoenix      3554  116   118   38     42%    3641
LM_B5           3545  99    99    58     57%    3460
kata_6b         3540  185   188   14     43%    3608
ray_W11_12t     3518  111   116   38     39%    3599
ray_173_2t      3489  124   124   30     50%    3489
LZ_zed          3402  119   122   36     42%    3476
leela           3378  116   115   56     55%    3298
LZ_91           3319  99    105   80     30%    3548
ray_ELF         3280  128   124   34     50%    3308
ray_173         3272  132   126   34     59%    3206
ray_W11         3130  104   103   48     50%    3139
dream_ponder    3129  119   123   40     53%    3051
AQ              3054  146   149   24     46%    3087
oakfoam_nn      2993  117   119   84     62%    2785
dream           2992  116   120   36     44%    3034
LM_GX47_c       2969  123   117   34     59%    2909
LZ_116_c2t      2865  109   112   60     30%    3142
LM_E8_c         2851  141   141   22     50%    2851
LM_B5_c         2849  135   135   24     50%    2849
LZ_116_c6t      2828  138   139   24     50%    2824
LM_Z2_c         2746  121   115   36     61%    2667
LZ_57           2744  114   116   52     50%    2725
LM_W11_c        2683  126   134   28     39%    2754
leela_c1t       2576  109   108   42     52%    2551
leela_c2t       2508  129   136   30     37%    2621
LZ_91_c2t       2506  137   140   26     46%    2533
leela_c         2499  103   101   88     59%    2377
pachi_nn        2400  111   107   76     64%    2228
pachi           2190  127   123   68     54%    2179
leela_nonet     2156  105   102   88     58%    2094
gnugo           1872  89    83    84     64%    1774
gnugo_l7        1871  120   122   52     38%    2005
LZ_57_c2t       1864  246   218   8      63%    1791
gnugo_M         1842  140   133   34     53%    1844
gnugo_l1        1823  91    89    84     48%    1882
gnugo_l4        1807  141   139   32     47%    1862
leela_nonet_1t  1758  244   255   10     10%    2160
oakfoam1        1735  126   122   32     56%    1692
pachi_pat       1711  394   220   2      50%    1711
fuego           1711  90    90    78     37%    1945
oakfoam_book    1628  113   115   40     38%    1731
pachi_1t        1585  207   110   14     14%    1894
oakfoam         1567  93    85    72     25%    1806
oakfoam2        1524  137   53    30     23%    1725
pachi_monte     1523  387   49    2      0%     1711
pachi_plain     1523  387   49    2      0%     1711
michi           1506  339   34    4      0%     1791
matilda         1437  170   -31   44     9%     1877

Results at 5 minute time limit, based on 1680 games with 59 engines:

Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_242          4662  -15   144   34     79%    4442
LZ_ELF          4506  90    87    66     64%    4389
LM_GX47         4499  97    94    58     64%    4380
kata_20b        4465  106   106   42     52%    4446
LZ_188          4454  112   114   36     47%    4473
LZ_ELF_6t       4444  92    90    62     56%    4390
LZ_157          4390  106   105   44     55%    4354
LZ_174          4259  91    89    76     61%    4149
LZ_173          4257  86    86    80     54%    4189
LZ_141          4243  87    86    82     57%    4160
kata_15b        4156  103   104   46     48%    4172
LZ_phoenix      4144  102   102   54     52%    4109
ray_ELF_12t     4134  97    99    50     48%    4144
LZ_174_6t       4093  80    78    100    57%    4034
ray_173_12t     3944  98    101   54     41%    4017
LM_Z2           3912  92    95    76     49%    3878
LM_B5           3905  112   111   46     59%    3808
ray_173_6t      3853  99    101   48     44%    3902
LZ_116          3844  84    85    98     50%    3827
LM_W11          3802  105   99    64     66%    3657
LM_E8           3787  109   106   50     56%    3740
kata_10b        3697  114   115   48     44%    3777
ray_173_2t      3682  106   108   54     52%    3659
ray_W11_12t     3636  107   99    58     67%    3497
ray_ELF         3491  98    99    66     38%    3641
AQ              3445  102   106   60     50%    3411
ray_173         3422  96    99    62     37%    3548
leela           3386  90    93    98     44%    3425
LZ_zed          3367  105   103   56     50%    3388
LZ_91           3282  92    95    76     37%    3419
kata_6b         3192  109   112   48     38%    3354
ray_W11         3190  94    93    62     50%    3202
dream_ponder    3181  110   112   44     52%    3113
dream           3091  117   121   36     44%    3130
LM_E8_c         3011  104   102   58     53%    2988
LZ_116_c2t      2959  105   104   66     55%    2916
LM_W11_c        2846  116   114   36     53%    2829
LM_GX47_c       2817  106   105   46     46%    2867
oakfoam_nn      2811  92    89    72     51%    2823
leela_c         2733  92    92    74     50%    2731
leela_c2t       2712  83    84    78     49%    2718
LZ_91_c2t       2620  106   101   56     59%    2540
LM_Z2_c         2573  109   110   38     47%    2596
LM_B5_c         2571  114   121   34     38%    2655
LZ_57           2569  113   114   38     45%    2616
leela_c1t       2507  95    100   68     41%    2596
pachi_nn        2400  107   113   64     39%    2514
pachi           2108  112   108   80     58%    2005
LZ_57_c2t       2064  132   122   40     70%    1872
leela_nonet     2058  137   150   42     36%    2157
fuego           1836  108   105   72     65%    1662
pachi_1t        1829  119   114   54     65%    1662
leela_nonet_1t  1827  125   116   52     69%    1624
gnugo           1472  214   -177  106    20%    1763
michi           1438  298   -211  40     55%    1403
oakfoam1        1258  462   -391  28     43%    1401
oakfoam         1039  657   -609  26     27%    1357
oakfoam_book    970   710   -678  32     13%    1406
matilda         947   741   -701  26     15%    1379
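
For anyone who wants to plug the new entries into the log-time model from earlier in the thread (Elo = b + alpha times log(t)), here is a rough sketch. It assumes natural log and t = 5 minutes, so the slope is the rating change divided by ln(5). The 1-minute and 5-minute lists are separate rating runs, so differences between them are only loosely comparable, and with these error bars the slopes are very noisy. The names `ratings` and `alpha` are illustrative only.

```python
import math

# (1-minute Elo, 5-minute Elo) copied from the two tables above.
ratings = {
    "kata_6b": (3540, 3192),
    "kata_10b": (4037, 3697),
    "kata_15b": (4224, 4156),
    "kata_20b": (4142, 4465),
    "LZ_242": (4194, 4662),
}


def alpha(elo_1min, elo_5min):
    """Slope of Elo against ln(t) between t = 1 and t = 5 minutes."""
    return (elo_5min - elo_1min) / math.log(5)


slopes = {name: alpha(*pair) for name, pair in ratings.items()}

# Print the engines from largest to smallest gain per unit of log-time.
for name, a in sorted(slopes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s} alpha = {a:7.1f}")
```

A negative alpha here most likely reflects noise and the different opponent pools in the two lists, not an engine that really gets weaker with more time.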

Attachments
1_min_crosstable-2019-09-16.csv (14.68 KiB)
5_min_crosstable-2019-09-16.csv (10.58 KiB)