OK, time for some actual results at 20 minutes per game. To start with, I decided to do this as a "win and continue" series of 8-game matches, starting with pachi_nn and introducing opponents about 100-200 Elo points above the previous winner (going by my dodgy projected ratings, to see just how bad they are). I'd expect each new engine to win 5-3 or 6-2. I also decided that if an engine wins its match 8-0, I should backtrack and look for something slightly weaker, in the interest of making the ratings a bit more accurate. Once I get to the top of the list, then I'll go back and add some more games to try and reduce the error margins, and maybe add a couple more engines if anything looks especially interesting. Without gnugo in the list, I've decided to anchor pachi_nn's rating at 2400, so the ratings are still in the same ballpark as my other lists.
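As a rough sanity check on the "5-3 or 6-2" expectation, here is a minimal sketch assuming the standard logistic Elo formula (BayesElo's own model may differ in details such as its handling of draws and colour advantage). The function names are just for illustration.

```python
def expected_score(elo_diff: float) -> float:
    """Expected per-game score for the side rated elo_diff points higher,
    under the standard logistic Elo model (an assumption, see above)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def expected_match_wins(elo_diff: float, games: int = 8) -> float:
    """Expected number of wins for the stronger side in a match of `games` games."""
    return games * expected_score(elo_diff)


if __name__ == "__main__":
    for diff in (100, 200):
        wins = expected_match_wins(diff)
        print(f"{diff} Elo stronger: about {wins:.1f} wins out of 8")
```

Under that assumption, a 100-point edge works out to roughly 5 wins out of 8 and a 200-point edge to roughly 6, which is where the 5-3 / 6-2 guess comes from.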
Round 1: pachi_nn vs oakfoam_nn. This was the first surprise: pachi_nn won the match 5-3. It seems that oakfoam has some problems with time management. It plays at about the right pace in 1-minute or 5-minute games, but in 20-minute games it only uses 6 or 7 minutes total, so it's giving pachi a bit of an advantage. It seemed to be ahead in the opening and early middlegame of each game, and then managed to misread something and lose.
Round 2: LZ_91_c2t d pachi_nn 6-2.
Round 3: leela_c2t d LZ_91_c2t 5-3
Round 4: LM_W11_c d leela_c2t 5-3
Round 5: LM_W11_c d dream 6-2. Another surprise: I was expecting dream to do a bit better against a CPU-only engine. There were two games with "disputed" scores (both engines agreed that LM_W11_c had won, but gave different winning margins), and each of them involved a seki.
Round 6: LZ_zed d LM_W11_c 5-3
Round 7: leela vs LZ_zed was a 4-4 tie. I decided to add a tiebreaker match: leela d LM_W11_c 6-2. This put leela on top of the list, because it did a better job of beating up LM.
Round 8: ray_ELF d leela 8-0. Backtrack: ray_173 d leela 6-2. Again there was one game with a disputed score, another seki. Then ray_173 d ray_ELF 5-3, not what I expected! At this point, BayesElo had both ray_173 and ray_ELF on exactly the same rating (3082): 173 had won the head-to-head, but ELF did better against leela, and these factors cancelled out. There was also another game with a disputed result (both agreed that ELF won, but disagreed on the margin), but not a seki this time; instead, the scoring was messed up because ray_173 passed before the game was actually finished. (No harm done, it was losing anyway.)
Round 9: Instead of running another tiebreaker match, I decided to just give the next engine 6 games each against the tied leaders:
- ray_173 vs LM_W11 3-3
- ray_ELF d LM_W11 4-2
Another surprise: I'd expected LM_W11 to be stronger.
Round 10: ray_173_6t d ray_ELF 8-0
Remember that ray defaults to one thread. In 1-minute or 5-minute games it gets a little bit stronger given extra threads, but not by a huge amount. It looks like it gets a lot more benefit from those extra threads in slower games! (LZ uses two threads by default, and seems to actually get weaker given extra threads, at least in short games. But maybe it's worth retesting this theory in longer games?)
Backtrack: ray_173_2t d ray_ELF 5-3; ray_173_6t d ray_173_2t 7-1. In the 6t vs 2t match, there were two games with disputed results: both players passed early and disagreed on who was ahead. I decided to step in as referee and, looking at the positions, awarded one game to each player.
Round 11: LM_Z2 d ray_173_6t 7-1. Two more games where ray passed early from a losing position.
And throwing all this into BayesElo, the rating list so far is:
Name         Elo  Elo+  Elo-  games  score  avg_opp
LM_Z2       3726   312   203      8    88%     3505
ray_173_6t  3505   165   154     24    67%     3350
ray_173_2t  3212   169   177     16    38%     3309
ray_ELF     3113   112   111     38    47%     3140
ray_173     3087   143   133     22    64%     2991
LM_W11      3053   166   179     12    42%     3100
leela       2823   119   125     32    38%     2922
LZ_zed      2796   156   149     16    56%     2758
LM_W11_c    2692   109   109     32    50%     2697
leela_c2t   2618   150   150     16    50%     2620
dream       2553   197   256      8    25%     2692
LZ_91_c2t   2547   156   150     16    56%     2509
pachi_nn    2400   146   151     16    44%     2442
oakfoam_nn  2336   193   217      8    38%     2400
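
For anyone who wants to cross-check the games and score columns against the match results listed in the rounds above, here is a small tally sketch. It only counts games and wins; it does not attempt to reproduce BayesElo's ratings or error margins. The `MATCHES` list is my transcription of the rounds (with the two refereed games in the 6t vs 2t match counted as awarded), and `tally` is just an illustrative helper name.

```python
from collections import defaultdict

# (engine_a, engine_b, wins_a, wins_b), transcribed from the rounds above.
MATCHES = [
    ("pachi_nn", "oakfoam_nn", 5, 3),
    ("LZ_91_c2t", "pachi_nn", 6, 2),
    ("leela_c2t", "LZ_91_c2t", 5, 3),
    ("LM_W11_c", "leela_c2t", 5, 3),
    ("LM_W11_c", "dream", 6, 2),
    ("LZ_zed", "LM_W11_c", 5, 3),
    ("leela", "LZ_zed", 4, 4),
    ("leela", "LM_W11_c", 6, 2),
    ("ray_ELF", "leela", 8, 0),
    ("ray_173", "leela", 6, 2),
    ("ray_173", "ray_ELF", 5, 3),
    ("ray_173", "LM_W11", 3, 3),
    ("ray_ELF", "LM_W11", 4, 2),
    ("ray_173_6t", "ray_ELF", 8, 0),
    ("ray_173_2t", "ray_ELF", 5, 3),
    ("ray_173_6t", "ray_173_2t", 7, 1),
    ("LM_Z2", "ray_173_6t", 7, 1),
]


def tally(matches):
    """Return {engine: (games_played, games_won)} summed over all matches."""
    games = defaultdict(int)
    wins = defaultdict(int)
    for a, b, wins_a, wins_b in matches:
        total = wins_a + wins_b
        games[a] += total
        games[b] += total
        wins[a] += wins_a
        wins[b] += wins_b
    return {name: (games[name], wins[name]) for name in games}


if __name__ == "__main__":
    for name, (played, won) in sorted(tally(MATCHES).items()):
        print(f"{name:<10}  {played:>3} games  {100 * won / played:.0f}%")
```
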
To be continued...