Open source platform to improve fuego or pachi: fishtest

mohzus · #1

Hi guys,
In the chess programming world there is currently a revolution under way: an open source platform used to improve a particular chess engine (stockfish) helped it gain enough strength within a few months to become number 2 in the world, very close to overtaking number 1 (which is a commercial engine, Houdini 3).
The platform's goal is to let people have their computers run games between 2 different versions of, say, pachi. The data (wins, losses, number of games, etc.) are sent automatically to the server, which calculates the rating difference between the 2 programs/versions being tested. If a test passes (i.e. shows an improvement of at least X elo points) at a particular time control, it must still be run at a different time control. If the test passes at both time controls, there's a high likelihood that the change is an improvement on the original code, and so the original code gets upgraded.
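
For readers wondering how a rating difference falls out of win/loss counts, here is a minimal sketch using the standard logistic Elo formula. It is not taken from the fishtest code, and the function name `elo_diff` is made up for illustration.

```python
import math


def elo_diff(wins, losses, draws=0):
    """Estimate the Elo difference implied by a match result.

    Uses the standard logistic Elo model; a draw counts as half a win.
    This is an illustrative helper, not fishtest's actual calculation.
    """
    games = wins + losses + draws
    if games == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / games
    if score <= 0.0 or score >= 1.0:
        raise ValueError("Elo difference is unbounded for a 0% or 100% score")
    return -400.0 * math.log10(1.0 / score - 1.0)


# A version that wins 2 games out of 3 against the old one.
print(round(elo_diff(wins=200, losses=100), 1))
```

A point estimate like this says nothing about confidence; how many games are needed before the difference can be trusted is a separate question.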
As I am no programmer, I am not really sure how to create a fishtest platform (the entire code can be found here: https://github.com/glinscott/fishtest) but since everything needed is open source, I am sure a programmer could set it up in a few minutes.
I am wondering if there is anyone interested in setting this up for pachi and/or fuego.
It would be good for the programmers of pachi and fuego since they could accurately test any small changes made in their code. They could tell with high confidence whether the changes are "bad" or "good" ones.

P.S.: Another page of interest is the stockfish testing framework: http://tests.stockfishchess.org/tests which currently shows that 44 different people are "giving" their computers to test different versions of stockfish.
And their google group forums: https://groups.google.com/forum/?fromgr ... ishcooking.

Thanks for any input.

emeraldemon · #2

Sounds like a very interesting idea. Maybe worth mentioning on the fuego or pachi development lists? Makes me want to play some chess vs. stockfish and see how easily it crushes me. I wonder if it can still win with a no-queen handicap...

LGolem · #3

By the way, a 4-game match between Pachi and Fuego (both latest versions) finished today on LittleGolem, at about 20 min per move. Pachi won all 4 games:

http://www.littlegolem.net/jsp/game/gam ... id=1582850
http://www.littlegolem.net/jsp/game/gam ... id=1582851
http://www.littlegolem.net/jsp/game/gam ... id=1582852
http://www.littlegolem.net/jsp/game/gam ... id=1582853

Mike Novack · #4

Probably not needed.

The situation may be different than with chess. The question is how large the sample size must be in games between version A and version B for a difference in results of size X to give us confidence Y that A is better than B (or vice versa). Because go games are much longer than chess games (in number of moves), there might be a much larger chance that the situation where the improvement applies will show up in at least some of the games. There is no equivalent of the part of a chess playing program that must "evaluate who is ahead at this point of the game" (at the limit of its "look ahead"). These programs aren't making mistakes about "who won that game", and that is what MCTS is using to build its statistics.

a) Trials at different time controls probably not needed. All of these programs are using MCTS for their evaluator and all will be over the threshold for this algorithm to work (on even obsolescent hardware at short time controls). Giving more time to each will improve the chances of MCTS selecting the correct move but that probability of improvement will be the same for both. In any case, on up to date hardware (a current workstation class machine) all of them will be nowhere near a "knee of the curve" even at controls of 10-20 seconds per move.

b) So you would need to argue that much more than a sample size of, say, 1000 games is needed. You don't need a distributed testing system for that. Sure, if you needed 100,000 games or even 10,000 games. But 1000 could easily be done in a month on an up-to-date workstation. And sorry (ROFLOL), coding isn't faster in the open source environment, and it takes us programmers a lot more than a few minutes to do anything worthwhile. These projects might be coding a new version over a month, testing/debugging another month, and then spending another month testing whether it is an improvement (of course they could have several changes in development at the same time -- maybe once a year merging those that appear to be an improvement, retesting, and making a release). Nobody is going to bother with a new release that isn't at least a stone or so better << and remember, with go, the difference between two equals and one a stone better than the other is them winning about the same number each or one winning about 2 out of 3 games. You don't need a huge sample size of games to determine that difference with darned good certainty >>
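
The sample-size question can be made concrete with a rough calculation. The sketch below uses a textbook one-sided test with the normal approximation to the binomial; the 5% error rate and 95% power are illustrative assumptions, draws are ignored, and `games_needed` is a made-up name.

```python
import math
from statistics import NormalDist


def games_needed(p_better, p_equal=0.5, alpha=0.05, power=0.95):
    """Approximate number of games to tell win rate p_better from p_equal.

    One-sided test, normal approximation to the binomial, no draws.
    alpha (false positive rate) and power are illustrative choices.
    """
    if not p_equal < p_better < 1:
        raise ValueError("need p_equal < p_better < 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    sd_equal = math.sqrt(p_equal * (1 - p_equal))
    sd_better = math.sqrt(p_better * (1 - p_better))
    n = ((z_alpha * sd_equal + z_power * sd_better) / (p_better - p_equal)) ** 2
    return math.ceil(n)


print(games_needed(2 / 3))   # "one stone better": wins about 2 out of 3
print(games_needed(0.52))    # a small tweak: wins 52% of games
```

Under these assumptions a 2-out-of-3 win rate needs on the order of a hundred games, while a 52% win rate needs several thousand, so the answer depends heavily on how small an improvement one wants to detect.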

BTW -- all this might not be so for one of the better go playing programs (it might be more susceptible to improvement by minute steps in its AI "create set of plausible moves" stage) but since that isn't one of the "free software" programs there would probably be little interest in providing "help from the field".

pasky · #5

Hi! I'd just like to note that time settings matter tremendously; you could say that the effect of most enhancements on strength is a + b * log(time). Ideally, b=1, but it is actually very common to have a=high but b negative, etc. I.e., some enhancements are almost invisible with short time but tremendous with long time, others are great with short time settings but *detrimental* with long time settings. See also http://pasky.or.cz/go/acg13-pachi.pdf where we touch some of these issues.
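
To illustrate the a + b * log(time) model, here is a small sketch. The coefficients are invented for illustration (they are not measurements from Pachi), the natural logarithm is assumed, and the strength units are left abstract.

```python
import math


def effect(a, b, time_s):
    """Strength effect of an enhancement under the a + b * log(time) model."""
    return a + b * math.log(time_s)


def break_even_time(a, b):
    """Thinking time at which the modelled effect crosses zero (b != 0)."""
    return math.exp(-a / b)


# Hypothetical enhancement: helps at short time, hurts at long time.
a, b = 50.0, -10.0
for t in (10, 100, 1000):
    print(f"{t:>5} s: {effect(a, b, t):+.1f}")
print(f"break-even at about {break_even_time(a, b):.0f} s")
```

With a high `a` and a negative `b`, the modelled effect is positive at 10 s but negative at 1000 s, which is the "great with short time, detrimental with long time" case described above.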

Regarding playtesting, you do not want to play 1000 games just once to verify an improvement; usually you want to play 1000 games many times to confirm that various settings work as expected and to find the right mix. (Then, with CLOP to tune various parameters to their best values, the question of the number of games required gets even more complex.)

Boidhre · #6

The issue with chess engines is that results against professional human players stopped being useful for judging their strength quite some time ago. Whereas with go engines you can just stick a new version up on KGS or wherever and get your feedback from just its server rating still.

Edit: Thinking about it, human opponents might be better too in go, just because some will throw more curveballs when facing an AI and may expose odd flaws in the engine that an AI playing go "honestly" may not find.

Mike Novack · #7

pasky wrote:

Hi! I'd just like to note that time settings matter tremendously; you could say that the effect of most enhancements on strength is a + b * log(time). Ideally, b=1, but it is actually very common to have a=high but b negative, etc. I.e., some enhancements are almost invisible with short time but tremendous with long time, others are great with short time settings but *detrimental* with long time settings. See also http://pasky.or.cz/go/acg13-pachi.pdf where we touch some of these issues.

Sorry, can't access that (rural location; I have little bandwidth from here).

But what is the sort of time differences under discussion? We aren't (or shouldn't be) considering time differences over an order of magnitude as that is the approximate difference between the fast and the slow time controls used by humans playing go. Behavior under hugely wider ranges of time may have great theoretical interest. These enhancements (to the basic algorithm) are pretty much all because of the difference between finite and unlimited time.

pasky · #8

First, one must realize that difference in time settings is (almost) equivalent to difference in CPU power. Usually, we playtest Pachi with 500s S.D. on 15x15, single threaded. On a beefier 8-thread desktop with time settings equivalent to 20 minutes S.D. (fast-paced game on 15x15), Pachi will have effectively ~20 times more thinking time. During a serious slow game tournament on 32-thread computation server, Pachi will have effectively ~650 times more thinking time.
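
The arithmetic behind these ratios can be sketched as follows, assuming thinking time scales linearly with both clock time and thread count. That linear scaling is a simplifying assumption (it ignores parallel search losses and per-core speed differences), and the function names are made up.

```python
def effective_time_ratio(time_s, threads, base_time_s=500.0, base_threads=1):
    """Effective thinking time relative to the single-threaded 500 s baseline.

    Assumes thinking time scales linearly with clock time and threads.
    """
    return (time_s * threads) / (base_time_s * base_threads)


def wall_time_for_ratio(ratio, threads, base_time_s=500.0, base_threads=1):
    """Clock time that would give the requested ratio on `threads` threads."""
    return ratio * base_time_s * base_threads / threads


# 20 minutes sudden death on 8 threads versus 500 s on 1 thread.
print(effective_time_ratio(20 * 60, 8))
# Clock time implied by a ~650x ratio on a 32-thread server.
print(wall_time_for_ratio(650, 32))
```

The desktop case comes out at 19.2, in line with the "~20 times" figure; under the same linear assumption, the 650x figure would correspond to roughly 10,000 s of sudden-death time on 32 threads.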

These are massive differences and testing a feature with fast time settings and finding it's (even massively) detrimental with normal time settings (or has a different optimum) is not uncommon. The opposite is probably also true but it's very hard to find features that don't pass the fast game test but are in fact good.

Mike Novack · #9

Yes of course, we should have pointed out to the rest of the people here that when you and I are talking about time we don't mean absolute time but time adjusted for machine power -- so the power of the machine being used comes into play.

But let's not confuse "number of threads" with crunch power. Instead perhaps compare in terms of any of the standard "total crunch" benchmarks in common use. For example, your 32 thread machine, what would that be in those terms? (the name of a benchmark and a number)

For the proposed project, it would be necessary to accumulate data partitioned by "equivalent time" rather than by the time control used (combining the time control and the power of the machine being used).

When I was referring to "within an order of magnitude" that would be for any given machine. Using a weak machine to test for behavior on a strong machine means using longer "real time" time settings in the ratio of the strength of the machines.

pasky · #10

A natural computational power metric to use when working on an MCTS algorithm is the number of simulations per second; you can assume these are roughly comparable. (In fact, I often do use the same server with 32 single-thread test games in parallel or for a 32-threaded tournament Pachi instance.)

Mike Novack · #11

pasky wrote:

A natural computational power metric to use when working on an MCTS algorithm is the number of simulations per second; you can assume these are roughly comparable. (In fact, I often do use the same server with 32 single-thread test games in parallel or for a 32-threaded tournament Pachi instance.)

Misunderstanding?

This is probably going to be too technical for most here so maybe we should take it elsewhere. I thought I was asking a simple question about the power of your hardware (so it could be compared to other hardware). Not about the practical performance of a particular software implementation on it, which is another matter entirely. Number of simulations per unit time achieved is a measure of the hardware and the software. Go playing software is not my line of country, but making programs achieve better performance* was.

If we wanted to know how Pachi could be expected to perform on machine X (or how time needed to be adjusted to have the same performance as on your 32 thread machine) it is the hardware we need to compare.

* Or more to the point, fixing whatever was causing the program to have unacceptable performance. Sometimes what you are trying to do is necessarily expensive of time. But sometimes somebody made an unwise choice of data representation or used an unnecessarily time expensive way of doing something.

pasky · #12

Well, frankly, I don't have results from any other benchmarks on that hardware at hand, because measuring #simulations of Pachi is literally the benchmark I use, and it's the most relevant benchmark for comparing Pachi performance on various hardware as it measures exactly the aspects that matter for Pachi, unsurprisingly. ;-)

(Really - when I buy new hardware, the first thing I do to check its performance is run Pachi on it.)

Mike Novack · #13

OK then, could do this in reverse (and use Pachi performance as the metric).

Do you have that sort of information? The number of simulations per second for desktops/workstations with various different CPUs in them? I was simply assuming that you wouldn't have that data since it would require having been able to run Pachi on those machines and just how many of them could you have had access to and time to try.

But even reporting that for a couple of them might make it possible for us to estimate what would be likely for others (as obtaining relative crunch benchmark data for those chips would be easy).

pasky · #14

I'll try to dig out some data and update this topic a little later.

In general, however, if you just want to have comparable results from different hardware, the most natural thing to do is to specify a fixed number of simulations - then, Pachi will have equal strength on any hardware where you run it (the game will just take longer on slower hardware).
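
As a small illustration of "the game will just take longer on slower hardware": with a fixed simulation budget, time per move is simply the budget divided by the machine's playout rate. The rates below are hypothetical, not measured Pachi figures.

```python
def seconds_per_move(simulations, sims_per_second):
    """Clock time one move takes for a fixed simulation budget."""
    if sims_per_second <= 0:
        raise ValueError("sims_per_second must be positive")
    return simulations / sims_per_second


# Hypothetical playout rates, not measured Pachi figures.
machines = {"old laptop": 500.0, "desktop": 4000.0}
budget = 2000  # simulations per move, same on every machine

for name, rate in machines.items():
    print(f"{name}: {seconds_per_move(budget, rate):.1f} s per move")
```

The strength is tied to `budget`, so it stays the same on both machines; only the waiting time changes.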

Mike Novack · #15

True, but I'm looking at the reverse question.

For example, what hardware is necessary for Pachi to have strength X at normal time controls?

leichtloeslich · #16

pasky wrote:

specify a fixed number of simulations

May I ask how to do that?

I took a quick glance at uct_state_init() in uct/uct.c but couldn't find anything.

pasky · #17

leichtloeslich wrote:

pasky wrote:

specify a fixed number of simulations

May I ask how to do that?

I took a quick glance at uct_state_init() in uct/uct.c but couldn't find anything.

Using -t, you can either specify number of seconds, or number of simulations prepended by =. E.g. -t =2000 will make Pachi spend 2000 simulations per move.
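
For anyone scripting test games, a tiny helper that builds the `-t` argument in either form might look like the sketch below. Only the `-t` syntax comes from the post above; the helper name and the `./pachi` path are assumptions for illustration.

```python
def pachi_time_args(seconds=None, simulations=None):
    """Build the -t argument: plain number = seconds, '=' prefix = simulations."""
    if (seconds is None) == (simulations is None):
        raise ValueError("give exactly one of seconds or simulations")
    if simulations is not None:
        return ["-t", f"={int(simulations)}"]
    return ["-t", str(seconds)]


# "./pachi" is a hypothetical path to the binary; adjust for your build.
command = ["./pachi"] + pachi_time_args(simulations=2000)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run` or to whatever match runner is in use.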
