Inspired by the excellent Engine Tournament, I'm trying to calculate some Elo ratings for a few engines. (I know that CGOS has already done this, but it's very hard to get information about exactly what hardware, software and configuration was used for those engines.)
The good news about using Elo, compared with a league tournament format: it doesn't just tell you "this engine is stronger than that one", it also measures how big the difference is. Using BayesElo, you can even get error bounds, so you can see roughly how accurate the ratings are.
The bad news: you need a much larger number of games to get accurate ratings. I won't be able to run 50 engines at an hour per player per game and play the 1000 or so games you'd need for high quality data.
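To give a feel for why so many games are needed, here's a rough back-of-the-envelope sketch (my own illustration, not anything BayesElo does internally): treat each game as a coin flip, take the binomial standard error of the score, and convert it to Elo points via the logistic Elo curve. The formula and numbers are assumptions for illustration only.

```python
import math

def elo_standard_error(n_games, p=0.5):
    """Rough standard error (in Elo points) of a rating difference
    estimated from n_games head-to-head results, assuming the true
    win probability is p. Uses the logistic Elo model:
        p = 1 / (1 + 10 ** (-d / 400))
    so dp/dd = ln(10)/400 * p * (1 - p), which we invert to turn
    uncertainty in the score into uncertainty in Elo."""
    se_p = math.sqrt(p * (1 - p) / n_games)  # binomial SE of the score
    return se_p * 400 / (math.log(10) * p * (1 - p))

for n in (100, 1000, 10000):
    print(n, round(elo_standard_error(n), 1))
```

With evenly matched players this comes out around 35 Elo of noise at 100 games and about 11 Elo at 1000 games, which is roughly why four-digit game counts are needed before the ratings mean much.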
What I've done so far: play a bunch of games at 1 minute absolute time, for a quick check that I've configured everything correctly (actually I caught a few mistakes this way), and to get a ballpark estimate of the ratings. Then more games at 5 minutes, for something that I hope is slightly more accurate.
Soon I plan to start a series at 20 minutes per player per game, so we have some data at roughly human-like time controls. I'll have to limit this series to about 15 or 20 engines, otherwise it will take years to generate enough data. But first there are a few more engines and configurations I want to try out.
My system:
The "Elo" column is the rating. Elo- and Elo+ are error bounds (to be pedantic, they're Bayesian credible intervals, not to be confused with frequentist confidence intervals). So for example, in the 1-minute ratings with LZ_ELF at Elo=3574, Elo- = 175, Elo+ = 176, this means that BayesElo thinks there's a 95% chance of the true rating being between 3399 and 3750. In the 5-minute ratings, you'll see some negative numbers near the top and bottom. I think this is a symptom of a skewed probability distribution: BayesElo can tell that LZ_ELF is "a lot stronger" than the other engines, but there isn't enough data to measure exactly how much stronger. This time last week I was seeing a lot more minus signs, and they're gradually going away as I add more data.
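Spelled out as code, the interval calculation from the BayesElo columns is just addition and subtraction (the asymmetric bounds are the point; the function name here is my own):

```python
def credible_interval(elo, minus, plus):
    """BayesElo reports asymmetric error bounds: Elo- below and
    Elo+ above the point estimate. The 95% credible interval is
    simply [elo - minus, elo + plus]."""
    return (elo - minus, elo + plus)

# The LZ_ELF example from the 1-minute ratings:
print(credible_interval(3574, 175, 176))  # (3399, 3750)
```

The asymmetry matters exactly in the skewed cases described above: when one bound is much larger than the other (or even negative), the posterior is lopsided and the point estimate alone is misleading.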
I've offset the ratings to put gnugo at 1500 each time, on the principle that gnugo is theoretically around 5K. This should mean that the BayesElo ratings are more or less in line with EGF ratings (plus or minus a couple of hundred rating points), and also not too far away from Rémi Coulom's ratings for pros. Looking at fuego and pachi, this seems to be in the right ballpark. So we have some weak evidence that the strongest CPU-only engines on a home PC can play at around top amateur or low pro level, and the good GPU-accelerated engines are already superhuman, at least in 5-minute games.
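The offsetting step is worth making explicit, since Elo is only defined up to an additive constant: shifting every rating by the same amount changes nothing about relative strengths, so anchoring gnugo at 1500 is a free choice of origin. A minimal sketch (the raw gnugo number below is hypothetical, not from my actual BayesElo output; only the LZ_ELF figure appears in the tables above):

```python
def anchor_ratings(ratings, anchor="gnugo", target=1500):
    """Shift every rating by a constant so the anchor engine sits
    at target. Relative differences are unchanged."""
    shift = target - ratings[anchor]
    return {name: r + shift for name, r in ratings.items()}

raw = {"LZ_ELF": 3574, "gnugo": 1643}  # gnugo value is hypothetical
print(anchor_ratings(raw))  # gnugo pinned to 1500, gaps preserved
```

Any engine could serve as the anchor; gnugo is convenient only because its human-scale strength is reasonably well agreed on.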
It will take a couple of months for me to get similar data for 20-minute games. I'll update here sometime (don't hold your breath, I'm very good at procrastination!).