Home-made Elo ratings for some engines

xela
Lives in gote
Posts: 652
Joined: Sun Feb 09, 2014 4:46 am
Rank: Australian 3 dan
GD Posts: 200
Location: Adelaide, South Australia
Has thanked: 219 times
Been thanked: 281 times

Post by xela »

Just how far ahead of us puny humans is Leela Zero by now? On a home PC, is it at human pro strength, or is it already superhuman? How much difference does it make whether or not you use a GPU?

Inspired by the excellent Engine Tournament, I'm trying to calculate some Elo ratings for a few engines. (I know that CGOS has already done this, but it's very hard to get information about exactly what hardware, software and configuration was used for those engines.)

The good news about using Elo, compared with a league tournament format: it doesn't just tell you "this engine is stronger than that one", it also measures how big a difference it is. Using BayesElo, you can even get error bounds, so you can see roughly how accurate the ratings are.
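To give a feel for what "how big a difference" means, here is a minimal sketch of the textbook Elo expected-score formula. BayesElo fits its own model on top of this idea, so treat the numbers as a rough guide only; `expected_score` is just a name picked for the illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # Print the expected score for a few rating gaps.
    for gap in (0, 100, 200, 400):
        score = expected_score(1500 + gap, 1500)
        print(f"{gap:>4} points ahead: expected score {score:.0%}")
```

Under this model a 400-point gap corresponds to the stronger side scoring 10 points out of every 11.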

The bad news: you need a much larger number of games to get accurate ratings. I won't be able to run 50 engines at an hour per player per game and play the 1000 or so games you'd need for high quality data.
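As a back-of-the-envelope check on why so many games are needed, the sketch below estimates the uncertainty of a rating difference measured from a head-to-head match between two engines. It assumes independent games with no draws and uses the delta method on the Elo formula; it is not how BayesElo computes its bounds, and `elo_standard_error` is a hypothetical helper.

```python
import math


def elo_standard_error(games: int, score: float = 0.5) -> float:
    """Approximate standard error (in Elo points) of a rating difference
    estimated from `games` games between two engines.

    Delta method on d = 400 * log10(p / (1 - p)), combined with the
    binomial standard error of the observed score p.
    """
    if games <= 0 or not 0.0 < score < 1.0:
        raise ValueError("need games > 0 and 0 < score < 1")
    return 400.0 / math.log(10) / math.sqrt(games * score * (1.0 - score))


if __name__ == "__main__":
    for n in (10, 50, 100, 1000):
        half_width = 1.96 * elo_standard_error(n)
        print(f"{n:>5} games: roughly +/- {half_width:.0f} Elo (95%)")
```

The error shrinks with the square root of the number of games, so halving the error bars takes about four times as many games.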

What I've done so far: play a bunch of games at 1 minute absolute time, for a quick check that I've configured everything correctly (actually I caught a few mistakes this way), and to get a ballpark estimate of the ratings. Then more games at 5 minutes, for something that I hope is slightly more accurate.

Soon I plan to start a series at 20 minutes per player per game, so we have some data at roughly human-like time controls. I'll have to limit this series to about 15 or 20 engines, otherwise it will take years to generate enough data. But first there are a few more engines and configurations I want to try out.

My system:
GTX 1070 GPU, 1920 cores, 8GB memory
Ryzen 5 2600 CPU, 6 cores (12 threads)
16 GB RAM
Linux operating system: Ubuntu 18.04

Engines tested so far:
I've used short names for the engines, so that when I view the full crosstable I can fit more columns on my screen :-) I hope this isn't too cryptic...

Leela versions:
LZ is Leela Zero version 0.15
Numbers are the network number from https://zero.sjeng.org/
for example LZ_174 is LZ with the 256x40 network c9d70c41
LZ_ELF is Leela Zero with the ELF weights from http://physik.de/CNNlast.tar.gz
LM is Leela Zero version 0.15 using one of the Leela Master networks from https://github.com/pangafu/LeelaMasterWeight
Just plain "leela" is the one from https://sjeng.org/leela.html, version 0.11
_c on the end means CPU-only mode
_1t means running with one thread only, similar for other numbers
By default, LZ in GPU mode uses 2 threads, LZ in CPU mode uses 12 threads, leela uses 12 threads in either mode

ray is the Ray lz branch from https://github.com/zakki/Ray.git checked out on 12th September, using Leela Zero weights

oakfoam_nn is Oakfoam 0.2.1-dev with the included nicego-cnn-06.gtp configuration file, meaning that it uses a neural network
Plain oakfoam is Oakfoam 0.1.3 with no configuration
Other oakfoams are failed attempts at getting a better configuration, before I figured out how to make oakfoam_nn work

pachi_nn is pachi 12.10 using the network http://physik.de/CNNlast.tar.gz
pachi is pachi 12.10 with the --nodcnn option
pachi_monte and pachi_pat are alternative engines for pachi (with the "-e" option), which turned out to be not very good

fuego is version 1.1.SVN

gnugo is version 3.8
gnugo_M is gnugo with more memory (cache size increased from default of 80M to 7G)
gnugo_l1 is gnugo on level 1; similarly for gnugo_l4 and gnugo_l7
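For anyone who wants to process the tables by script, the suffix conventions described for the Leela versions (`_c` for CPU-only, `_Nt` for N threads, and the stated defaults) can be decoded roughly like this. It is a sketch only: `parse_engine_name` and `EngineConfig` are made-up names, and treating LM like LZ for default thread counts is an assumption.

```python
import re
from typing import NamedTuple, Optional

# Default thread counts as stated in the post. Treating "LM" like "LZ" is an
# assumption (LM runs on Leela Zero); engines not listed here give None.
DEFAULT_THREADS = {
    ("LZ", False): 2, ("LZ", True): 12,
    ("LM", False): 2, ("LM", True): 12,
    ("leela", False): 12, ("leela", True): 12,
}

# Last name component: optional "c" (CPU-only), optional "<N>t" (thread count).
SUFFIX = re.compile(r"^(c)?(?:(\d+)t)?$")


class EngineConfig(NamedTuple):
    base: str                # name with the suffix removed, e.g. "LZ_116"
    cpu_only: bool
    threads: Optional[int]   # None when the post gives no default


def parse_engine_name(name: str) -> EngineConfig:
    """Split a short name such as 'LZ_116_c2t' into base, CPU flag and threads."""
    head, sep, tail = name.rpartition("_")
    match = SUFFIX.match(tail) if sep and tail else None
    if match is None:
        # No recognised suffix: keep the whole name, assume non-CPU-only mode.
        base, cpu_only, threads = name, False, None
    else:
        base = head
        cpu_only = match.group(1) is not None
        threads = int(match.group(2)) if match.group(2) else None
    if threads is None:
        family = base.split("_")[0]
        threads = DEFAULT_THREADS.get((family, cpu_only))
    return EngineConfig(base, cpu_only, threads)


if __name__ == "__main__":
    for n in ("LZ_174", "LZ_116_c2t", "leela_c", "ray_ELF_12t", "gnugo_l1"):
        print(n, parse_engine_name(n))
```

Names without a recognised suffix (for example gnugo_l1 or oakfoam_nn) are returned unchanged, with the thread count left as None unless the post states a default for that engine family.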

You'll also notice two 1-minute games for AQ. It won one and crashed in the other. I decided that AQ was too unstable on my machine for further testing, so it doesn't appear in the 5-minute series.

Results so far at 1 minute time limit:

Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF          3574  176   175   12     67%    3481
LM_GX47         3568  174   156   28     86%    3179
LZ_174          3441  165   156   18     61%    3359
LZ_ELF_6t       3433  174   199   12     33%    3528
ray_ELF_12t     3325  143   131   24     63%    3244
LZ_173          3320  117   112   42     62%    3218
LZ_141          3249  140   133   30     67%    3096
LM_E8           3214  150   142   32     72%    2974
LZ_116          3162  97    95    58     55%    3123
LM_W11          3135  159   162   24     58%    3039
LZ_174_6t       3129  148   160   24     42%    3177
LM_Z2           3087  116   107   42     69%    2931
ray_173_6t      3069  161   169   16     44%    3107
ray_W11_12t     3022  215   275   8      25%    3168
ray_173_12t     3021  182   196   12     42%    3072
LM_B5           2966  110   109   44     57%    2903
ray_173_2t      2963  172   164   16     56%    2923
LZ_91           2826  104   114   60     27%    3049
leela           2813  141   137   38     53%    2780
ray_ELF         2785  219   280   10     20%    3020
ray_W11         2646  175   176   16     44%    2707
ray_173         2605  197   240   12     25%    2785
oakfoam_nn      2577  122   122   74     66%    2308
LM_B5_c         2574  194   171   12     67%    2484
LZ_116_c2t      2574  120   124   44     34%    2756
LM_GX47_c       2535  184   173   12     58%    2491
LM_E8_c         2501  194   194   10     50%    2502
LZ_57           2491  197   197   26     65%    2239
LZ_116_c6t      2468  185   189   12     50%    2460
LM_W11_c        2417  171   194   12     33%    2510
LM_Z2_c         2410  186   220   10     30%    2520
LZ_91_c2t       2169  178   166   18     56%    2132
AQ              2116  383   383   2      50%    2116
leela_c         2116  110   106   78     63%    1948
leela_c1t       2100  210   187   12     67%    1984
leela_c2t       2083  174   180   16     38%    2204
pachi_nn        2022  109   105   74     66%    1826
pachi           1816  125   122   66     56%    1772
leela_nonet     1782  104   101   88     58%    1717
gnugo           1500  88    83    84     64%    1402
gnugo_l7        1498  119   121   52     38%    1631
LZ_57_c2t       1492  246   218   8      63%    1420
gnugo_M         1470  139   133   34     53%    1471
gnugo_l1        1451  90    88    84     48%    1508
gnugo_l4        1435  140   139   32     47%    1489
leela_nonet_1t  1390  240   349   10     10%    1811
oakfoam1        1363  126   122   32     56%    1320
pachi_pat       1339  383   368   2      50%    1339
fuego           1339  90    90    78     37%    1571
oakfoam_book    1256  112   119   40     38%    1359
pachi_1t        1214  197   267   14     14%    1521
oakfoam         1195  92    101   72     25%    1433
oakfoam2        1152  129   153   30     23%    1353
pachi_plain     1151  345   325   2      0%     1339
pachi_monte     1151  345   325   2      0%     1339
michi           1135  301   311   4      0%     1420
matilda         1065  138   181   44     9%     1504

Results so far at 5 minute time limit:

Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF          4561  -71   184   28     82%    4314
LM_GX47         4436  51    128   34     68%    4287
LZ_ELF_6t       4410  74    119   36     64%    4302
LZ_173          4270  111   112   48     54%    4197
LZ_141          4252  98    96    68     62%    4126
LZ_174          4239  110   107   52     62%    4110
ray_ELF_12t     4133  159   172   18     44%    4159
LZ_174_6t       4054  92    89    78     55%    4026
ray_173_12t     3957  164   165   17     47%    3981
LM_Z2           3887  98    99    65     51%    3843
ray_173_6t      3855  157   175   18     33%    3975
LZ_116          3816  92    93    84     51%    3793
LM_B5           3808  148   158   26     46%    3820
LM_E8           3784  134   129   36     58%    3718
LM_W11          3762  117   110   50     64%    3646
ray_W11_12t     3593  157   157   24     54%    3553
ray_173_2t      3585  186   212   16     31%    3750
ray_ELF         3484  149   157   36     28%    3750
ray_173         3475  175   189   20     30%    3676
leela           3380  98    104   82     40%    3438
LZ_91           3274  128   136   42     31%    3466
ray_W11         3175  162   161   24     46%    3234
LM_E8_c         3037  142   120   34     76%    2841
LZ_116_c2t      3000  115   111   56     63%    2883
LM_W11_c        2870  117   115   34     53%    2849
LM_GX47_c       2853  104   104   44     48%    2872
oakfoam_nn      2851  94    90    66     56%    2814
leela_c         2784  95    97    66     52%    2751
leela_c2t       2747  82    83    78     49%    2752
LM_B5_c         2631  173   192   14     36%    2731
LZ_91_c2t       2625  115   112   48     56%    2566
LM_Z2_c         2597  139   141   24     46%    2634
LZ_57           2589  129   130   30     43%    2651
leela_c1t       2519  104   111   60     40%    2626
pachi_nn        2419  107   114   60     38%    2530
pachi           2130  116   112   76     61%    2003
leela_nonet     2093  142   152   40     38%    2211
LZ_57_c2t       2042  157   145   28     71%    1830
fuego           1878  119   118   50     62%    1751
pachi_1t        1790  126   124   42     57%    1720
leela_nonet_1t  1788  131   126   40     63%    1677
gnugo           1500  122   7     66     26%    1755
oakfoam         1437  334   -58   10     20%    1772
michi           1435  210   -57   24     25%    1675
oakfoam1        1272  406   -221  12     0%     1836
matilda         1260  420   -232  10     0%     1808
oakfoam_book    1241  356   -251  16     6%     1628

Edited 24th September: crosstables attached in CSV format, with a count of how many games each engine has played against each opponent.

The "Elo" column is the rating. Elo- and Elo+ are error bounds (to be pedantic, they're Bayesian credible intervals, not to be confused with frequentist confidence intervals). So for example, in the 1-minute ratings with LZ_ELF at Elo=3574, Elo- = 175, Elo+ = 176, this means that BayesElo thinks there's a 95% chance of the true rating being between 3399 and 3750. In the 5-minute ratings, you'll see some negative numbers near the top and bottom. I think this is a symptom of a skewed probability distribution: BayesElo can tell that LZ_ELF is "a lot stronger" than the other engines, but there isn't enough data to measure exactly how much stronger. This time last week I was seeing a lot more minus signs, and they're gradually going away as I add more data.
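The interval arithmetic in that example is simple enough to script when reading the tables. A small sketch, where `credible_interval` is a hypothetical helper name:

```python
from typing import Tuple


def credible_interval(elo: float, elo_plus: float, elo_minus: float) -> Tuple[float, float]:
    """Return (low, high) from one table row: the Elo, Elo+ and Elo- columns."""
    return elo - elo_minus, elo + elo_plus


# LZ_ELF in the 1-minute table: Elo 3574, Elo+ 176, Elo- 175.
low, high = credible_interval(3574, elo_plus=176, elo_minus=175)
print(low, high)  # 3399 3750
```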

I've offset the ratings to put gnugo at 1500 each time, on the principle that gnugo is theoretically around 5K. This should mean that the BayesElo ratings are more or less in line with EGF ratings (plus or minus a couple of hundred rating points), and also not too far away from Rémi Coulom's ratings for pros. Looking at fuego and pachi, this seems to be in the right ballpark. So we have some weak evidence that the strongest CPU-only engines on a home PC can play at around top amateur or low pro level, and the good GPU-accelerated engines are already superhuman, at least in 5-minute games.
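Offsetting works because only rating differences matter, so every rating can be shifted by the same constant. A minimal sketch of that step follows; the raw numbers are hypothetical, chosen so that the shifted values match three rows of the first 1-minute table, and `anchor_ratings` is an illustrative name.

```python
from typing import Dict


def anchor_ratings(ratings: Dict[str, float], anchor: str,
                   anchor_value: float = 1500.0) -> Dict[str, float]:
    """Shift every rating by one offset so that `anchor` lands on `anchor_value`."""
    offset = anchor_value - ratings[anchor]
    return {name: elo + offset for name, elo in ratings.items()}


# Hypothetical raw values (not actual BayesElo output); after shifting they
# reproduce gnugo 1500, pachi 1816 and fuego 1339 from the first 1-minute table.
raw = {"gnugo": 0.0, "pachi": 316.0, "fuego": -161.0}
print(anchor_ratings(raw, "gnugo"))
```

The gaps between engines are unchanged by the shift, which is why the choice of anchor affects only the absolute scale.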

It will take a couple of months for me to get similar data for 20-minute games. I'll update here some time (don't hold your breath, I'm very good at procrastination!)
Attachments
5_min_crosstable-2018-09-19.csv
(7.05 KiB) Downloaded 812 times
1_min_crosstable-2018-09-19.csv
(9.62 KiB) Downloaded 822 times
Last edited by xela on Sat Nov 10, 2018 4:02 am, edited 3 times in total.
EdLee
Honinbo
Posts: 8859
Joined: Sat Apr 24, 2010 6:49 pm
GD Posts: 312
Location: Santa Barbara, CA
Has thanked: 349 times
Been thanked: 2070 times

Post by EdLee »

xela, Thanks. :tmbup:
Uberdude
Judan
Posts: 6727
Joined: Thu Nov 24, 2011 11:35 am
Rank: UK 4 dan
GD Posts: 0
KGS: Uberdude 4d
OGS: Uberdude 7d
Location: Cambridge, UK
Has thanked: 436 times
Been thanked: 3718 times

Post by Uberdude »

Very nice, thanks xela. Could you please add in LZ #157, the best 15 block network? I use that a lot as I think it gives better performance at shortish time limits than the deeper networks (to read a ladder, the superior judgement of a 40 block network doesn't help if it only has 200* playouts, but 800 playouts of a less-skilled network enable the ladder to be read).

* exact numbers not guaranteed, and with more training a deeper network may be able to read ladders with few playouts, but we aren't there yet afaik.

Post by xela »

Uberdude wrote:Could you please add in LZ #157, the best 15 block network?
Good suggestion. I chose the 15-block network LZ 141 because it's the same one used for the engine tournament. It'll be interesting to see how much stronger 157 is. I'll include it in the next update.

Post by EdLee »

Hi xela,

An engine (Taiwan-based?) on IGS, the username is
leelazero ( one word, all lowercase ).
Its info includes "GTX970. zero.sjeng.org "

Any possibility to extrapolate or guesstimate its Elo range?
( It has the (small avalanche) ladder problem that people exploit. ) Thanks.

This page has an Elo graph, roughly at 12,700 ?
mb76
Dies in gote
Posts: 23
Joined: Sun Mar 12, 2017 12:43 am
GD Posts: 0
DGS: embee
Has thanked: 4 times
Been thanked: 3 times

Post by mb76 »

Could you please add in LZ zediir, based "Supervised. From the TYGEM dataset"? https://github.com/gcp/leela-zero/issues/884 .

Post by xela »

EdLee wrote:Hi xela,

An engine (Taiwan-based?) on IGS, the username is
leelazero ( one word, all lowercase ).
Its info includes "GTX970. zero.sjeng.org "

Any possibility to extrapolate or guesstimate its Elo range?
Sorry, that's not enough information to work with. At https://zero.sjeng.org/ there is a list of 178 different networks that leelazero can use. If you can find out which network this engine uses, then we can make some guesses.

Post by xela »

mb76 wrote:Could you please add in LZ zediir, based "Supervised. From the TYGEM dataset"? https://github.com/gcp/leela-zero/issues/884 .
Will do. I'll have some results to show in a couple of days. Thanks for the suggestion.

Post by xela »

New this week:
  • Added DreamGo. I was hoping for a low dan level bot that would fill the large rating gap between pachi and pachi_nn. Unfortunately (!) the latest DreamGo is actually quite a lot stronger than that!
  • Added LZ 157, the strongest 192x15 network. In fast games, it turns out to be a bit stronger than some of the bigger networks. I'd expect this to change in slower games.
  • Added LZ zediir weights (LZ_zed). This turned out to be weaker than I expected, but again we might see a different story with slower games.
  • Played a few more games with the other engines, to try and reduce some of the error margins on the ratings.
Results so far at 1 minute time limit, based on 986 games with 59 engines:

Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF          3594  165   154   16     63%    3521
LM_GX47         3559  156   140   32     78%    3239
LZ_157          3550  134   124   28     64%    3451
LZ_174          3490  149   143   22     59%    3420
LZ_ELF_6t       3483  146   158   18     39%    3548
LZ_173          3350  111   108   48     60%    3245
ray_ELF_12t     3347  136   129   26     58%    3297
LZ_141          3294  141   135   32     66%    3132
LM_E8           3239  143   137   36     69%    3017
LZ_116          3186  97    95    58     55%    3141
LZ_174_6t       3162  148   157   26     46%    3176
LM_W11          3152  162   167   24     58%    3048
LM_Z2           3096  120   111   42     69%    2931
ray_173_6t      3096  149   149   20     50%    3096
ray_173_12t     3059  164   164   16     50%    3061
LM_B5           2999  106   103   50     60%    2911
ray_W11_12t     2961  123   129   30     40%    3032
ray_173_2t      2952  139   135   24     54%    2926
leela           2862  118   113   50     58%    2785
LZ_zed          2831  136   152   26     31%    2981
LZ_91           2804  105   113   64     27%    3040
ray_ELF         2730  133   138   28     39%    2834
ray_173         2678  186   202   14     36%    2785
ray_W11         2659  148   148   22     45%    2703
dream           2627  151   160   24     54%    2525
oakfoam_nn      2571  120   122   76     64%    2325
LM_B5_c         2554  194   171   12     67%    2464
LZ_116_c2t      2551  119   123   46     33%    2754
LM_GX47_c       2514  184   173   12     58%    2470
LZ_57           2483  199   197   26     65%    2239
LM_E8_c         2480  194   194   10     50%    2481
LZ_116_c6t      2451  184   189   12     50%    2444
LM_W11_c        2397  171   195   12     33%    2490
LM_Z2_c         2389  186   220   10     30%    2499
LZ_91_c2t       2166  177   165   18     56%    2129
leela_c         2114  116   111   78     62%    1962
leela_c1t       2099  210   187   12     67%    1983
leela_c2t       2082  174   180   16     38%    2202
pachi_nn        2022  109   105   76     64%    1846
pachi           1817  125   122   68     54%    1796
leela_nonet     1781  104   101   88     58%    1717
gnugo           1500  88    83    84     64%    1401
gnugo_l7        1498  119   122   52     38%    1630
LZ_57_c2t       1492  246   218   8      63%    1419
gnugo_M         1469  139   133   34     53%    1471
gnugo_l1        1451  90    88    84     48%    1508
gnugo_l4        1435  140   139   32     47%    1489
leela_nonet_1t  1390  240   346   10     10%    1809
oakfoam1        1363  126   122   32     56%    1319
pachi_pat       1339  384   360   2      50%    1339
fuego           1339  90    90    78     37%    1571
oakfoam_book    1256  112   119   40     38%    1358
pachi_1t        1213  198   262   14     14%    1522
oakfoam         1195  92    101   72     25%    1433
oakfoam2        1151  129   154   30     23%    1353
pachi_monte     1151  348   288   2      0%     1339
pachi_plain     1151  348   288   2      0%     1339
michi           1134  304   274   4      0%     1419
matilda         1064  139   172   44     9%     1504
Results so far at 5 minute time limit, based on 1310 games with 50 engines:

Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF          4556  -23   123   46     67%    4427
LM_GX47         4515  16    117   44     66%    4378
LZ_ELF_6t       4506  25    109   48     65%    4390
LZ_157          4463  64    130   28     54%    4437
LZ_173          4328  97    97    62     55%    4254
LZ_141          4309  94    94    74     59%    4190
LZ_174          4279  106   103   60     63%    4136
ray_ELF_12t     4148  109   113   42     45%    4178
LZ_174_6t       4114  89    86    88     57%    4057
ray_173_12t     3944  109   114   42     38%    4040
LM_Z2           3932  99    100   68     53%    3860
LM_B5           3918  113   111   44     57%    3850
ray_173_6t      3860  99    101   48     44%    3911
LZ_116          3854  89    91    90     51%    3826
LM_E8           3778  115   112   46     54%    3748
LM_W11          3770  111   107   56     61%    3668
ray_173_2t      3691  119   123   44     50%    3680
ray_W11_12t     3633  125   114   44     68%    3487
ray_ELF         3472  113   114   54     35%    3667
leela           3400  97    99    88     44%    3436
ray_173         3377  107   115   50     30%    3571
LZ_zed          3359  107   106   52     48%    3399
LZ_91           3272  99    104   66     32%    3456
dream           3218  126   128   34     56%    3107
ray_W11         3156  111   114   44     43%    3225
LM_E8_c         3048  111   106   50     60%    2971
LZ_116_c2t      3002  115   110   56     63%    2888
LM_W11_c        2868  118   116   34     53%    2849
oakfoam_nn      2849  93    90    68     54%    2829
LM_GX47_c       2848  105   105   44     48%    2869
leela_c         2771  91    91    72     51%    2746
leela_c2t       2752  82    83    78     49%    2757
LM_B5_c         2662  129   135   26     42%    2718
LZ_91_c2t       2638  116   113   48     56%    2576
LM_Z2_c         2618  123   124   30     47%    2646
LZ_57           2600  128   131   30     43%    2660
leela_c1t       2526  104   111   60     40%    2631
pachi_nn        2430  106   112   64     39%    2543
pachi           2137  111   108   80     58%    2033
LZ_57_c2t       2093  132   121   40     70%    1900
leela_nonet     2086  137   149   42     36%    2185
fuego           1865  107   105   72     65%    1691
pachi_1t        1858  119   115   54     65%    1690
leela_nonet_1t  1856  124   117   52     69%    1652
gnugo           1500  120   -33   106    20%    1791
michi           1466  206   -68   40     55%    1432
oakfoam1        1287  341   -246  28     43%    1430
oakfoam         1068  521   -465  26     27%    1386
oakfoam_book    998   573   -534  32     13%    1435
matilda         975   603   -557  26     15%    1407
5_min_crosstable-2018-09-26.csv
(7.95 KiB) Downloaded 747 times
Next I want to try LZ with the Phoenix weights. After that, I might start the 20-minute series.
Attachments
1_min_crosstable-2018-09-26.csv
(10.27 KiB) Downloaded 768 times
Last edited by xela on Sat Nov 10, 2018 4:02 am, edited 1 time in total.

Post by xela »

New this week:
  1. Added LZ with Phoenix weights. This doesn't do so well in fast games, but I'd expect it to overtake LZ 157 in slower games. I'll get to that in a few weeks...
  2. I realised that DreamGo has pondering turned on by default. So I've renamed dream from last week's update to dream_ponder, and added a new dream with pondering off. It looks like pondering is worth about 100 rating points in 5-minute games, and less in fast games.
  3. Played some ray vs ray games, now that I know how to make ray play against itself with two different weight files. This helps with making the ratings more accurate (no more negative errors at the top of the table now).
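When comparing two entries such as dream and dream_ponder, it helps to look at the gap alongside the error bounds. The sketch below uses the two rows from the 5-minute table in this post; checking whether the intervals overlap is a crude test rather than a proper significance calculation, and `Row` and `compare` are names invented for the example.

```python
from typing import NamedTuple, Tuple


class Row(NamedTuple):
    elo: int
    plus: int    # "Elo+" column
    minus: int   # "Elo-" column

    @property
    def low(self) -> int:
        return self.elo - self.minus

    @property
    def high(self) -> int:
        return self.elo + self.plus


def compare(a: Row, b: Row) -> Tuple[int, bool]:
    """Return (rating gap a - b, whether the two intervals overlap)."""
    overlap = a.low <= b.high and b.low <= a.high
    return a.elo - b.elo, overlap


# Rows copied from the 5-minute table in this post.
dream_ponder = Row(3232, 117, 116)
dream = Row(3130, 129, 129)
print(compare(dream_ponder, dream))  # (102, True)
```

The gap is about 100 points, but the intervals overlap, so the size of the pondering effect is only loosely pinned down by the games played so far.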
Results so far at 1 minute time limit, based on 1326 games with 61 engines:
(edited 9th October: subtract 372 from all ratings to put gnugo at 1500, consistent with my other rating lists)

Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_157          3780  98    91    60     72%    3614
LM_GX47         3758  102   95    62     73%    3517
LZ_ELF          3717  101   98    48     58%    3650
LZ_ELF_6t       3636  96    101   48     42%    3693
LZ_174          3590  87    88    64     47%    3611
ray_ELF_12t     3542  93    95    58     45%    3578
LZ_173          3519  115   112   48     60%    3411
LZ_141          3471  115   114   44     59%    3359
LM_E8           3450  116   115   50     64%    3249
LZ_116          3369  99    97    58     55%    3318
LZ_174_6t       3325  112   113   42     50%    3315
ray_173_6t      3311  112   110   36     53%    3293
LM_Z2           3308  98    94    60     67%    3137
ray_173_12t     3289  115   113   34     53%    3269
LM_W11          3259  111   115   44     50%    3230
LM_B5           3178  102   101   56     59%    3076
LZ_phoenix      3156  119   123   36     39%    3265
ray_W11_12t     3150  114   118   36     42%    3212
ray_173_2t      3103  129   129   28     50%    3104
LZ_zed          3010  122   126   34     41%    3088
leela           2990  116   116   54     54%    2925
LZ_91           2935  100   106   74     30%    3149
ray_ELF         2899  129   127   32     47%    2949
ray_173         2887  133   128   32     56%    2843
ray_W11         2778  110   108   44     52%    2767
dream_ponder    2759  117   121   40     53%    2679
dream           2637  121   122   34     47%    2658
oakfoam_nn      2624  121   123   82     62%    2406
LM_GX47_c       2561  130   124   30     57%    2520
LZ_116_c2t      2497  111   113   58     31%    2768
LM_E8_c         2472  140   140   22     50%    2472
LM_B5_c         2471  134   134   24     50%    2471
LZ_116_c6t      2450  137   138   24     50%    2446
LZ_57           2373  116   117   50     52%    2336
LM_Z2_c         2370  121   114   36     61%    2291
LM_W11_c        2306  125   133   28     39%    2377
leela_c1t       2202  108   108   42     52%    2177
leela_c2t       2135  128   136   30     37%    2248
LZ_91_c2t       2133  137   140   26     46%    2160
leela_c         2126  103   101   88     59%    2004
pachi_nn        2028  110   107   76     64%    1855
pachi           1818  126   123   68     54%    1807
leela_nonet     1783  105   102   88     58%    1722
gnugo           1500  88    83    84     64%    1402
gnugo_l7        1498  120   122   52     38%    1633
LZ_57_c2t       1492  246   218   8      63%    1419
gnugo_M         1470  139   133   34     53%    1472
gnugo_l1        1451  91    89    84     48%    1509
gnugo_l4        1435  140   139   32     47%    1490
leela_nonet_1t  1386  241   335   10     10%    1788
oakfoam1        1363  126   122   32     56%    1319
pachi_pat       1338  385   332   2      50%    1338
fuego           1338  90    90    78     37%    1573
oakfoam_book    1256  113   119   40     38%    1359
pachi_1t        1213  198   235   14     14%    1522
oakfoam         1195  92    101   72     25%    1434
oakfoam2        1152  130   151   30     23%    1353
pachi_monte     1151  356   211   2      0%     1338
pachi_plain     1151  356   211   2      0%     1338
michi           1134  311   197   4      0%     1419
matilda         1064  140   127   44     9%     1505
1_min_crosstable-2018-10-03.csv
(11.05 KiB) Downloaded 756 times
Results so far at 5 minute time limit, based on 1396 games with 52 engines:

Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF          4544  11    112   46     67%    4415
LM_GX47         4504  49    111   44     66%    4370
LZ_ELF_6t       4496  56    104   48     65%    4381
LZ_157          4457  89    123   32     59%    4385
LZ_173          4307  94    93    66     55%    4242
LZ_174          4291  102   97    64     66%    4135
LZ_141          4280  90    90    78     58%    4188
ray_ELF_12t     4145  102   105   46     46%    4173
LZ_phoenix      4133  121   122   36     47%    4154
LZ_174_6t       4120  84    82    92     57%    4066
ray_173_12t     3969  103   108   46     39%    4054
LM_Z2           3933  95    97    72     50%    3885
LM_B5           3932  112   110   44     57%    3862
LZ_116          3881  86    88    94     51%    3848
ray_173_6t      3874  99    101   48     44%    3924
LM_E8           3794  114   112   46     54%    3761
LM_W11          3786  110   107   56     61%    3682
ray_173_2t      3706  119   123   44     50%    3693
ray_W11_12t     3649  125   114   44     68%    3503
ray_ELF         3487  113   114   54     35%    3679
leela           3414  98    100   88     44%    3446
ray_173         3392  107   115   50     30%    3585
LZ_zed          3372  108   106   52     48%    3410
LZ_91           3289  95    98    71     35%    3446
dream_ponder    3232  117   116   40     58%    3118
ray_W11         3172  105   105   50     46%    3218
dream           3130  129   129   29     52%    3114
LM_E8_c         3035  109   104   52     58%    2977
LZ_116_c2t      2988  107   105   62     58%    2912
LM_W11_c        2874  116   114   36     53%    2856
LM_GX47_c       2842  105   105   44     48%    2864
oakfoam_nn      2839  92    89    70     53%    2833
leela_c         2759  91    92    72     51%    2738
leela_c2t       2740  82    83    78     49%    2746
LZ_91_c2t       2648  105   101   56     59%    2568
LM_Z2_c         2601  109   110   38     47%    2624
LM_B5_c         2599  114   121   34     38%    2683
LZ_57           2597  113   114   38     45%    2645
leela_c1t       2535  95    100   68     41%    2624
pachi_nn        2429  106   113   64     39%    2542
pachi           2137  112   108   80     58%    2034
LZ_57_c2t       2093  132   121   40     70%    1900
leela_nonet     2086  137   150   42     36%    2186
fuego           1865  107   105   72     65%    1691
pachi_1t        1858  119   115   54     65%    1690
leela_nonet_1t  1856  124   117   52     69%    1652
gnugo           1500  131   -57   106    20%    1791
michi           1466  219   -92   40     55%    1432
oakfoam1        1287  360   -270  28     43%    1430
oakfoam         1068  543   -489  26     27%    1386
oakfoam_book    998   595   -558  32     13%    1435
matilda         975   625   -581  26     15%    1407
5_min_crosstable-2018-10-03.csv
(8.51 KiB) Downloaded 806 times
This week I'm going to start playing some matches with 20 minutes per player per game. This will be with a smaller collection of engines, so that we'll get some results this year.
Last edited by xela on Sat Nov 10, 2018 4:03 am, edited 2 times in total.

Post by Uberdude »

157 hero! :clap:
pnprog
Lives with ko
Posts: 286
Joined: Thu Oct 20, 2016 7:21 am
Rank: OGS 7 kyu
GD Posts: 0
Has thanked: 94 times
Been thanked: 153 times

Post by pnprog »

Hi!
Very interested in the thread :)
xela wrote:
  • I realised that DreamGo has pondering turned on by default. So I've renamed dream from last week's update to dream_ponder, and added a new dream with pondering off. It looks like pondering is worth about 100 rating points in 5-minute games, and less in fast games.
But when DreamGo is playing with pondering on, my understanding is that:
  • Not only will it increase its level
  • But it will also decrease its opponent's level, by taking away some of the computing power the opponent needs, no? This in turn will make the opponent appear weaker, which would explain the big difference in Elo?
Like, imagine I run the tournament on a simple computer: 1000MHz CPU, one thread, no GPU; then it's like comparing:
  • DreamGo (1000MHz) VS Pachi (1000MHz)
  • DreamGo (1000MHz + pondering at 500MHz) VS Pachi (500MHz)
We can expect the level of Pachi to be significantly weaker, while facing a DreamGo boosted a little by pondering?

For the dream_ponder entry, what you would like to have is:
  • DreamGo (1000MHz + pondering at 1000MHz) VS Pachi (1000MHz)
Or am I misunderstanding something?
I am the author of GoReviewPartner, a small software aimed at assisting reviewing a game of Go. Give it a try!

Post by xela »

pnprog wrote: But when DreamGo is playing with pondering on, my understanding is that:
  • Not only will it increase its own level
  • But it will also decrease its opponent's level, by taking away some of the computing power the opponent needs, no? This in turn will make the opponent appear weaker, which would explain the big difference in Elo?
Correct. In fact, the difference in Elo ratings between dream and dream_ponder is smaller than I expected.
pnprog wrote: For the dream_ponder entry, what you would like to have is:
  • DreamGo (1000MHz + pondering at 1000MHz) VS Pachi (1000MHz)
Or am I misunderstanding something?
Yes, that would be a better way to test it. The fact is that I intended to run all engines without pondering, to avoid this type of complication. The inclusion of dream_ponder was an accident! I decided to leave it in the ratings list, rather than deleting it, because it's an interesting accident :-)
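
For a rough sense of what a gap of about 100 rating points means in practice, here is a minimal sketch using the standard Elo logistic curve. This is an assumption for illustration: BayesElo's own model may differ in details, and the function name is made up here. Under this curve, a 100-point edge corresponds to an expected score of roughly 64%.

```python
def expected_score(rating_diff: float) -> float:
    """Expected score of the higher-rated side under the standard Elo
    logistic curve (assumed here; BayesElo's model may differ in details)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


# Print the expected score for a few illustrative rating gaps.
for diff in (0, 100, 200, 400):
    print(f"{diff:+4d} Elo -> expected score {expected_score(diff):.2f}")
```

So if pondering really is worth about 100 points, dream_ponder would be expected to score somewhere around 64% against dream in a direct match, under these assumptions.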
pnprog
Lives with ko
Posts: 286
Joined: Thu Oct 20, 2016 7:21 am
Rank: OGS 7 kyu
GD Posts: 0
Has thanked: 94 times
Been thanked: 153 times

Re: Home-made Elo ratings for some engines

Post by pnprog »

So now, reading the EGF rating system page on Sensei's Library, they define one stone of strength as equivalent to 100 Elo, and that would make LeelaZero around 27 stones stronger than Gnugo. Something like a 21 dan amateur player :bow:

More seriously, if we fix Gnugo at a certain level (like 1500 Elo/5k), what other data do we need to make our Elo scale comparable to the EGF rating?
xela wrote: The inclusion of dream_ponder was an accident! I decided to leave it in the ratings list, rather than deleting it, because it's an interesting accident :-)
I am not really knowledgeable about those Elo ratings, but if we introduce a bot with skewed performance/rating, won't it affect the rating of all the bots on the scale? Like decrease the rating of the bots weaker than dream_ponder, and increase the rating of the bots stronger than dream_ponder?
xela
Lives in gote
Posts: 652
Joined: Sun Feb 09, 2014 4:46 am
Rank: Australian 3 dan
GD Posts: 200
Location: Adelaide, South Australia
Has thanked: 219 times
Been thanked: 281 times

Re: Home-made Elo ratings for some engines

Post by xela »

pnprog wrote: So now, reading the EGF rating system page on Sensei's Library, they define one stone of strength as equivalent to 100 Elo, and that would make LeelaZero around 27 stones stronger than Gnugo. Something like a 21 dan amateur player :bow:

More seriously, if we fix Gnugo at a certain level (like 1500 Elo/5k), what other data do we need to make our Elo scale comparable to the EGF rating?
I think BayesElo is similar to EGF ratings, but not exactly the same. For a good comparison, we'd need to run the EGF rating algorithm on my engine vs engine games, or else collect some EGF tournament results and run BayesElo on those results to compare with EGF ratings. That's a whole other research project that I'm not going to start this year :-)

I think "LeelaZero around 27 stones stronger than Gnugo" is about right, but it could be anywhere between 20 and 35 stones really.
pnprog wrote: I am not really knowledgeable about those Elo ratings, but if we introduce a bot with skewed performance/rating, won't it affect the rating of all the bots on the scale? Like decrease the rating of the bots weaker than dream_ponder, and increase the rating of the bots stronger than dream_ponder?
No, I don't think it matters. What does a "skewed performance rating" mean anyway? The bot was stronger than I thought it would be? But the BayesElo software doesn't read my mind, it only looks at the game results. Dream_ponder beats weaker bots, and loses against stronger ones, same behaviour as any other bot. I don't think it makes a difference to the ratings whether it gets those results by playing good moves, or by sabotaging the opponents (stealing memory or CPU cycles).

In any case, I've been anchoring the ratings to put GnuGo at 1500 every time, so this should help to keep things stable.
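
To illustrate what anchoring does, here is a small sketch. Elo-style ratings only fix the differences between players, so anchoring amounts to shifting every rating by the same constant until the reference engine lands on the chosen value. The raw numbers below are hypothetical, not taken from my results, and the helper names are made up for this example. The stone conversion assumes the 100 Elo per stone figure quoted above.

```python
def anchor(ratings, anchor_name, anchor_value=1500.0):
    """Shift all ratings by one constant so that `anchor_name` sits at
    `anchor_value`. Differences between engines are unchanged."""
    shift = anchor_value - ratings[anchor_name]
    return {name: rating + shift for name, rating in ratings.items()}


def stones_gap(rating_a, rating_b, elo_per_stone=100.0):
    """Convert a rating difference to stones, assuming a fixed
    Elo-per-stone value (100 is the figure quoted in this thread)."""
    return (rating_a - rating_b) / elo_per_stone


# Hypothetical raw ratings, for illustration only.
raw = {"gnugo": -1200.0, "leelazero": 1500.0}

anchored = anchor(raw, "gnugo")
print(anchored)
print(stones_gap(anchored["leelazero"], anchored["gnugo"]))
```

Because the shift is the same for everyone, the gap between any two engines (and so the estimate in stones) is the same before and after anchoring.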