Post subject: Paper discussion: Mastering the Game of Go
Post #1 Posted: Thu Jan 28, 2016 10:38 am
Author: emeraldemon

Full title is actually "Mastering the Game of Go with Deep Neural Networks and Tree Search"

I know there are other programming and AI enthusiasts on the forum, so I'd like to start a thread specifically to discuss the research paper.

The paper is here: https://storage.googleapis.com/deepmind ... ing-go.pdf

I read through the paper and I wanted to give my summary of what they did and my thoughts, to start us off.
I'm intentionally skimming the "deep convolutional neural network" part, and most of the learning algorithms.
It seems like I would have to read the previous work to understand how this is different from an "ordinary" multi-layer neural network.

Phase 1: Train SL Policy Network on 30 million board positions from KGS 6 - 9 dans.

The neural network takes input features from the board position, and outputs the probability of each move on the board being the actual next move.

The feature set is not just the board position, but also features like how many liberties that move would have, how many stones it would capture or put in self-atari, and whether the ladder works. The network predicts KGS moves correctly 57% of the time (and 55% with only board state and move history, so maybe the extra features don't help that much).
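
To make the input features a bit more concrete, here's my rough sketch of how you might compute just one of them, liberty-count planes, with a flood fill over groups. This is my own toy encoding, not the paper's exact feature set (they feed in 48 planes), and liberty_planes is my invention:

[code]
import numpy as np

SIZE = 19

def group_liberties(board, x, y):
    """Flood-fill the group containing (x, y) and count its liberties.
    board: SIZE x SIZE array, 0 = empty, 1 = black, 2 = white."""
    color = board[x, y]
    stack, seen, libs = [(x, y)], {(x, y)}, set()
    while stack:
        cx, cy = stack.pop()
        for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
            if 0 <= nx < SIZE and 0 <= ny < SIZE:
                if board[nx, ny] == 0:
                    libs.add((nx, ny))
                elif board[nx, ny] == color and (nx, ny) not in seen:
                    seen.add((nx, ny))
                    stack.append((nx, ny))
    return len(libs)

def liberty_planes(board, max_libs=8):
    """One-hot planes: plane k is 1 where a stone's group has k+1 liberties
    (capped at max_libs). The real feature set also has planes for captures,
    self-atari, ladder status, turns since a move was played, and so on."""
    planes = np.zeros((max_libs, SIZE, SIZE), dtype=np.float32)
    for x in range(SIZE):
        for y in range(SIZE):
            if board[x, y] != 0:
                k = min(group_liberties(board, x, y), max_libs) - 1
                planes[k, x, y] = 1.0
    return planes
[/code]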

The learning algorithm for this part is "stochastic gradient ascent". I think I sort of know what that means but I don't want to try to figure it out right now.
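
Since I hand-waved there: as far as I understand it, "stochastic gradient ascent" here just means maximizing the log-likelihood of the human move, one small batch (or here, one position) at a time. A toy version of a single update step, with a plain linear model standing in for the deep network and a made-up learning rate:

[code]
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sl_update_step(W, features, human_move, lr=0.003):
    """One stochastic gradient ascent step on log p(human_move | position).
    W: (num_moves, num_features) weights of a toy linear policy.
    features: (num_features,) input vector for the position.
    human_move: index of the move that was actually played (0..360)."""
    probs = softmax(W @ features)               # p(a | s) for every move
    grad_logits = -probs                        # d log p(a*) / d logits ...
    grad_logits[human_move] += 1.0              # ... = onehot(a*) - p
    W += lr * np.outer(grad_logits, features)   # step uphill on log-likelihood
    return W
[/code]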

They also trained a "fast policy" using only the local diamond-shaped pattern around the move and some other features like how far it is from the previous move.

[go]$$ Fast policy pattern
$$ | . . . . . .
$$ | . . a . . .
$$ | . a a a . .
$$ | a a X a a .
$$ | . a a a . .
$$ | . . a . . .
$$ -------------[/go]


The fast policy only guesses 24% of the KGS moves correctly, but is about 1000 times faster.
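
As I understand it, the speed comes from the fast policy being little more than a table of pattern weights, so choosing a move is a lookup rather than a full network evaluation. Here's my guess at the pattern-extraction idea; the hashing scheme and PATTERN_OFFSETS are my own simplification, not their exact encoding:

[code]
# The 12 points at Manhattan distance 1 or 2 from the candidate move
# (the 'a' points in the diagram above).
PATTERN_OFFSETS = [(dx, dy) for dx in range(-2, 3) for dy in range(-2, 3)
                   if 0 < abs(dx) + abs(dy) <= 2]

def pattern_key(board, x, y, size=19):
    """Encode the diamond around candidate move (x, y) as an integer key.
    Cell values: 0 empty, 1 black, 2 white, 3 off-board."""
    key = 0
    for dx, dy in PATTERN_OFFSETS:
        nx, ny = x + dx, y + dy
        cell = board[nx][ny] if 0 <= nx < size and 0 <= ny < size else 3
        key = key * 4 + cell
    return key

# The fast policy is then roughly: a weight looked up by pattern_key, combined
# with a few cheap features (e.g. distance to the previous move), fed through
# a softmax over the legal moves.
[/code]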

Phase 2: Reinforcement Learning

We have a network that predicts KGS dan moves. But what we want is a network that predicts the BEST move. So using the SL policy network from the last phase as the starting seed, the researchers play the network against itself many times and use reinforcement learning to adjust the network weights towards winning moves.

Again I'm skimming the reinforcement learning algorithm.

One important idea is that each change in the network weights creates a new "bot" to play against, and the self-play is against a random previous network, to keep it from learning to beat a specific network instead of getting generally stronger.

This "RL" network beats the SL network 80% of the time, and beats pachi 85% of the time without any search at all. Clearly the network has learned some very powerful shape information at this point.

Phase 3: Value Network
Instead of predicting the next move, this network needs to predict who is winning. They first trained on the KGS dataset, but the network overfit the data and essentially memorized who won each game. So they had the RL policy play against itself and sampled 30 million positions, each taken from a distinct self-play game, and trained on those, which solved the overfitting problem.
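
The training itself is just regression from a position to the final result of the game it came from. A toy version of one update step, with a linear model standing in for the deep value network:

[code]
import numpy as np

def value_update_step(w, features, outcome, lr=0.01):
    """One SGD step on the squared error between the predicted value
    v(s) = tanh(w . features) and the final game result z (+1 or -1).
    A toy linear model standing in for the deep value network."""
    v = np.tanh(w @ features)
    error = v - outcome                       # d(squared error)/dv, up to a factor of 2
    grad = error * (1.0 - v ** 2) * features  # chain rule through tanh
    return w - lr * grad

# The crucial data trick: take only ONE training position from each self-play
# game, so examples are (nearly) independent and the network can't get away
# with memorizing whole games.
[/code]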

They compare the accuracy of this Value Network to Monte Carlo evaluation using the Fast Policy network, which I think is more or less how Pachi and Fuego evaluate board positions. The value network is more accurate.

Phase 4: Search

AlphaGo traverses the move tree, at each node choosing the move based on a combination of its estimated value so far, the prior probability predicted by the policy network, and how often that move has already been visited.

When we hit a node in the tree that we haven't visited before,
1) Evaluate it with SL policy network (not RL, more on this later). This gives the initial probabilities of choosing those moves in the search
2) Evaluate it with Value network to estimate who is winning
3) Evaluate it with a single fast policy rollout, i.e. play to the end of the game using the fast network

2 and 3 are averaged together to get a value estimate for the position, and each node up in the tree has this estimate averaged into its overall move value.

Not sure if I did a great job explaining this; there's a helpful diagram in figure 3 of the paper.
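
Here's my attempt at the per-simulation logic in code, to make the averaging concrete. It's a condensed, single-threaded sketch with the networks stubbed out as function arguments; the constants (c_puct = 5, a 50/50 mix) are what I believe the paper uses, but treat the whole thing as illustrative rather than their actual algorithm:

[code]
import math

class Node:
    """One node per move leading here. The paper stores these statistics on
    edges; I'm hanging them on the child node for simplicity."""
    def __init__(self, prior):
        self.prior = prior        # P(s, a) from the SL policy network
        self.visits = 0           # N(s, a)
        self.value_sum = 0.0      # W(s, a);  Q(s, a) = W / N
        self.children = {}        # move -> Node, filled in on expansion

def expand(node, state, sl_policy):
    """First visit to a leaf: create children with SL-policy priors."""
    for move, p in sl_policy(state):          # iterable of (move, prior) pairs
        node.children[move] = Node(prior=p)

def select_move(node, c_puct=5.0):
    """Pick the child maximizing Q(s,a) + u(s,a); the exploration bonus u is
    proportional to the prior and decays as the child gets visited more."""
    total_visits = sum(ch.visits for ch in node.children.values())
    def score(ch):
        q = ch.value_sum / ch.visits if ch.visits else 0.0
        u = c_puct * ch.prior * math.sqrt(total_visits) / (1 + ch.visits)
        return q + u
    return max(node.children.items(), key=lambda kv: score(kv[1]))

def evaluate_leaf(state, value_net, fast_rollout, mix=0.5):
    """Leaf value = (1 - mix) * value network + mix * fast-policy rollout."""
    v = value_net(state)       # in [-1, 1], from the leaf player's point of view
    z = fast_rollout(state)    # play to the end with the fast policy; +1 or -1
    return (1 - mix) * v + mix * z

def backup(path, leaf_value):
    """Propagate the leaf evaluation up the visited nodes. The value is from
    the viewpoint of the player to move at the leaf, so the move that led into
    the leaf (played by the other side) gets the negated value, and the sign
    alternates on the way back to the root."""
    value = -leaf_value
    for node in reversed(path):
        node.visits += 1
        node.value_sum += value
        value = -value
[/code]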

A few interesting things about this. The "better" RL policy that was trained to win is apparently worse as a search heuristic than the SL policy trained directly from the KGS data. The authors guess that this is because the SL policy is more diverse and gives the search more options.

The other interesting thing is the mixing of the learned network's evaluation with the policy rollout. The authors do experiments on just one or the other evaluation and find that mixing both is best, although either alone is still enough to be competitive.

Evaluating the networks is expensive, and AlphaGo runs the network evaluations in parallel on the GPUs while the tree search is done on the CPUs.

I can't help but notice that the reinforcement learning only ends up helping the value network, and the value network is only one part of the evaluation, alongside the policy rollout.
If you wanted to try a simpler version of what they did, you could do something like this:
1) Train SL network and fast playout network on some training data, same as the paper
2) Skip the reinforcement learning
3) Skip the value network
4) Use their monte-carlo algorithm with SL network priors and fast playout network only

According to their results, that version would still be a little stronger than Zen or Crazy Stone.

I also wonder what would happen if you tried just the reinforcement learning without using any human training data, starting from random networks playing each other. Would it ever get off the ground?

I am seriously tempted to try a mini version of their work. Google recently released an open source library for machine learning and neural networks:
https://www.tensorflow.org/

I couldn't find anything saying whether this specific library was used in AlphaGo, but it seems to use the same techniques. Seems like it could be doable.
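
To make "doable" a bit more concrete: a mini policy network in TensorFlow's Keras-style layers API might start out something like this. This is purely my own sketch of the general shape described in the paper (a stack of convolutions ending in a softmax over the 361 intersections), scaled way down from their 13-layer, 192-filter network, and it is not their actual code:

[code]
import tensorflow as tf

NUM_FEATURE_PLANES = 8   # whatever planes you feed in; the paper uses 48

def build_policy_net(filters=64, conv_layers=5):
    """Small convolutional policy network: board-shaped feature planes in,
    a probability distribution over the 19x19 = 361 intersections out."""
    board = tf.keras.Input(shape=(19, 19, NUM_FEATURE_PLANES))
    x = board
    for _ in range(conv_layers):
        x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(1, 1, padding="same")(x)   # one logit per point
    logits = tf.keras.layers.Flatten()(x)
    probs = tf.keras.layers.Softmax()(logits)
    return tf.keras.Model(board, probs)

model = build_policy_net()
model.compile(optimizer="sgd", loss="categorical_crossentropy")
# model.fit(feature_planes, one_hot_human_moves, batch_size=16)  # training sketch
[/code]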


Post subject: Re: Paper discussion: Mastering the Game of Go
Post #2 Posted: Thu Jan 28, 2016 11:10 am

Does anyone know how many games AlphaGo plays against itself in one day? The Google blog says thousands, but the BBC, Nature, and the video clip all have David Silver himself saying 'millions' of games in a single day. I'm trying to find it in the paper but couldn't.

Post subject: Re: Paper discussion: Mastering the Game of Go
Post #3 Posted: Thu Jan 28, 2016 11:28 am
Author: emeraldemon

The full AlphaGo doesn't play against itself, at least in this paper. For the reinforcement learning part, only the policy networks play against each other, so each move is just one evaluation of the neural network, with no search. On p. 25 it says "The policy network was trained in this way for 10,000 mini-batches of 128 games, using 50 GPUs, for one day." So that's about 1.28 million games in a day, but again that's just the network, not the full AlphaGo.

Post subject: Re: Paper discussion: Mastering the Game of Go
Post #4 Posted: Sat Jan 30, 2016 7:55 am
Author: pasky

Just to train the SL network as in the paper, you'd need 150 GPU-weeks of compute. And this extra accuracy grinding matters a lot; see Fig. 2. Of course, maybe after one GPU-week you'd already be quite a bit above 50%, but I think the best open-source solution right now is just about 50%.

I think that if you just train a good SL network and use it in MCTS instead of, or in addition to, playouts, you are actually reproducing the recent Facebook paper, which does exactly that. (It's the "darkforest" bot, KGS 5d.)

See also http://computer-go.org/pipermail/comput ... 08524.html for more comments on reproducing this result.

_________________
Go programmer and researcher: http://pasky.or.cz/~pasky/go/
EGF 1921, KGS ~1d and getting weaker


 
Post subject: Re: Paper discussion: Mastering the Game of Go
Post #5 Posted: Sat Jan 30, 2016 8:02 am
Author: emeraldemon

Thanks pasky, I was also wondering about the asynchronous evaluation, which you mentioned in the computer-go post. Would it be possible to run something like AlphaGo on a 1-GPU machine, or is there no way to get it to scale down?

Post subject: Re: Paper discussion: Mastering the Game of Go
Post #6 Posted: Sun Jan 31, 2016 1:47 am

emeraldemon wrote:
Thanks pasky, I was also wondering about the asynchronous evaluation, which you mentioned in the computer-go post. Would it be possible to run something like AlphaGo on a 1-GPU machine, or is there no way to get it to scale down?


From the paper:

[figure from the paper: AlphaGo's playing strength (Elo) for different numbers of search threads and GPUs]

It's possible. It's just less effective.

Top
 Profile  
 
Post subject: Re: Paper discussion: Mastering the Game of Go
Post #7 Posted: Sun Jan 31, 2016 6:38 pm
Author: pasky

First, there's a huge difference between *training* the networks and just *running* the program with pretrained networks. The requirements I mentioned are for training.

For running the network, you can certainly scale it down; it will just search to a smaller depth. The graph above shows a big drop between 1 and 2 GPUs, but I wouldn't worry about that too much; I think smoothing that out won't be so hard. Based on how irregular it is, it might have to do with the expansion-node queues getting too long or whatever, and you aren't going to have 40 threads anyway. Guessing (wild but educated), I'd place an 8-thread, 1-GPU version at around 2500 Elo.

_________________
Go programmer and researcher: http://pasky.or.cz/~pasky/go/
EGF 1921, KGS ~1d and getting weaker
