The full title is actually "Mastering the Game of Go with Deep Neural Networks and Tree Search".
I know there are other programming and AI enthusiasts on the forum, so I'd like to start a thread to discuss specifically the research paper.
The paper is here:
https://storage.googleapis.com/deepmind ... ing-go.pdf
I read through the paper and I wanted to give my summary of what they did and my thoughts, to start us off.
I'm intentionally skimming the "deep convolutional neural network" part, and most of the learning algorithms.
It seems like I would have to read the previous work to understand how this is different from an "ordinary" multi-layer neural network.
Phase 1: Train SL Policy Network on 30 million board positions from KGS 6-9 dan games. The neural network takes input features from the board position and outputs the probability of each move on the board being the actual next move.
The feature set is not just the board position, but also features like how many liberties that move would have, how many stones it would capture or put in self-atari, and whether the ladder works. The network predicts KGS moves correctly 57% of the time (and 55% with only board state and move history, so maybe the extra features don't help that much).
The learning algorithm for this part is "stochastic gradient ascent". I think I sort of know what that means but I don't want to try to figure it out right now.
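Since I brought it up: here's a toy sketch of what "stochastic gradient ascent" amounts to for the SL policy, in plain Python/NumPy. Everything here is made up for illustration (a single linear layer over the raw board instead of the paper's 13-layer convolutional network, arbitrary shapes and learning rate); the idea is just to sample a position, take the gradient of the log-probability of the expert's move, and step uphill.
[code]
import numpy as np

# Toy stochastic gradient ascent for the SL policy: maximize the log-probability
# of the expert's move. A single linear layer stands in for the real deep network.

rng = np.random.default_rng(0)
n_features, n_moves = 19 * 19, 19 * 19    # toy: raw board only; the real input has 48 feature planes
W = np.zeros((n_features, n_moves))
learning_rate = 0.01

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sgd_ascent_step(features, expert_move):
    """One stochastic step: nudge W toward predicting the expert's move."""
    global W
    p = softmax(features @ W)            # predicted probability of every move
    grad = np.outer(features, -p)        # gradient of log p[expert_move] w.r.t. W,
    grad[:, expert_move] += features     # i.e. outer(features, onehot(move) - p)
    W += learning_rate * grad            # "ascent": step uphill on the log-likelihood

# Fake (position features, expert move) pairs, just to exercise the loop.
for _ in range(16):
    x = rng.random(n_features)
    move = int(rng.integers(n_moves))
    sgd_ascent_step(x, move)
[/code]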
They also trained a "fast policy" using only the local diamond-shaped pattern around the move and some other features like how far it is from the previous move.

[go]$$ Fast policy pattern
$$ | . . . . . .
$$ | . . a . . .
$$ | . a a a . .
$$ | a a X a a .
$$ | . a a a . .
$$ | . . a . . .
$$ -------------[/go]
The fast policy only guesses 24% of the KGS moves correctly, but is about 1000 times faster.
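To make the pattern idea concrete, here's a rough sketch of extracting that diamond (every point within Manhattan distance 2 of the candidate move, the 'a' points in the diagram). The board encoding and the EDGE marker are my own choices, and the real fast policy also folds in things like liberty counts and distance from the previous move.
[code]
EMPTY, BLACK, WHITE, EDGE = 0, 1, 2, 3

def diamond_pattern(board, row, col, size=19):
    """Return the colours of the 12 points around (row, col), reading
    top-to-bottom, left-to-right; off-board points are marked EDGE."""
    pattern = []
    for dr in range(-2, 3):
        for dc in range(-2, 3):
            if (dr, dc) == (0, 0) or abs(dr) + abs(dc) > 2:
                continue
            r, c = row + dr, col + dc
            if 0 <= r < size and 0 <= c < size:
                pattern.append(board[r][c])
            else:
                pattern.append(EDGE)
    return tuple(pattern)   # hashable, so it can index a table of move probabilities

# example: an empty 19x19 board
empty = [[EMPTY] * 19 for _ in range(19)]
print(diamond_pattern(empty, 3, 3))
[/code]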
Phase 2: Reinforcement Learning
We have a network that predicts KGS dan moves. But what we want is a network that predicts the BEST move. So, using the SL policy network from the last phase as the starting seed, the researchers play the network against itself many times and use reinforcement learning to adjust the network weights towards winning moves.
Again I'm skimming the reinforcement learning algorithm.
One important idea is that each change in the network weights creates a new "bot" to play against, and the self-play is against a random previous network, to keep it from learning to beat a specific network instead of getting generally stronger.
This "RL" network beats the SL network 80% of the time, and beats pachi 85% of the time without any search at all. Clearly the network has learned some very powerful shape information at this point.
Phase 3: Value Network
Instead of predicting the next move, this network needs to predict who is winning. They first trained it on the KGS dataset, but the network overfit the data and essentially memorized who won each game. So they had the RL policy play against itself to generate 30 million distinct self-play games, trained on one position from each, and that solved the overfitting problem.
They compare the accuracy of this Value Network to Monte Carlo evaluation using the Fast Policy network, which I think is more or less how Pachi and Fuego evaluate board positions. The value network is more accurate.
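If you're curious what the value training looks like in miniature: predict the game outcome z (+1 or -1) from a position and do gradient descent on the squared error. A linear model with a tanh output stands in for the real deep network here, and the shapes are made up; the one-position-per-game sampling in the loop is the part that fixed the overfitting, since positions from the same game are so correlated that training on all of them amounts to memorizing that game.
[code]
import numpy as np

rng = np.random.default_rng(0)
n_features = 19 * 19
w = np.zeros(n_features)
learning_rate = 0.01

def value(features):
    return np.tanh(features @ w)               # squashed to (-1, 1) like a win estimate

def sgd_value_step(features, outcome):
    """One stochastic step on the loss (value(s) - z)^2."""
    global w
    v = value(features)
    # chain rule: d/dw (v - z)^2 = 2 (v - z) * (1 - v^2) * features
    w -= learning_rate * 2 * (v - outcome) * (1 - v ** 2) * features

# One (position, outcome) pair per self-play game.
for _ in range(16):
    x = rng.random(n_features)
    z = rng.choice([-1.0, 1.0])
    sgd_value_step(x, z)
[/code]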
Phase 4: Search
AlphaGo traverses the move tree, choosing moves based on the estimated value of the move, the prior probability predicted by the network, and how often that move has already been visited.
When we hit a node in the tree that we haven't visited before,
1) Evaluate it with the SL policy network (not the RL one; more on this below). This gives the prior probabilities for choosing its child moves during the search
2) Evaluate it with Value network to estimate who is winning
3) Evaluate it with a single fast policy rollout, i.e. play to the end of the game using the fast network
2 and 3 are averaged together to get a value estimate for the position, and each node up in the tree has this estimate averaged into its overall move value.
I'm not sure I did a great job explaining this; there's a helpful diagram in Figure 3 of the paper.
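Here's my attempt at a loose sketch of one simulation of that search (selection, expansion, evaluation, backup). The evaluators at the bottom are stubs, the selection formula is only the same flavour as the paper's (I didn't match their exact constants), and I'm ignoring the sign flip between the two players' perspectives that a real implementation needs.
[code]
import random

LAMBDA = 0.5      # mixing weight between value net and rollout (the paper uses 0.5)
C_PUCT = 1.0      # exploration constant, arbitrary here

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a) from the SL policy network
        self.visits = 0           # N(s, a)
        self.value_sum = 0.0      # sum of leaf evaluations backed up through here
        self.children = {}        # move -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_move(node):
    # pick the child maximizing value-so-far plus an exploration bonus that is
    # proportional to the prior and shrinks as the child gets visited more often
    def score(child):
        return child.q() + C_PUCT * child.prior / (1 + child.visits)
    return max(node.children.items(), key=lambda mc: score(mc[1]))

def simulate(node, position):
    """One simulation from the root: walk down, expand a leaf, evaluate, back up."""
    path = [node]
    while node.children:                                   # selection
        move, node = select_move(node)
        position = play(position, move)
        path.append(node)
    for move, p in sl_policy_priors(position).items():     # expansion: SL priors
        node.children[move] = Node(prior=p)
    # evaluation: mix the value network with one fast-policy rollout
    leaf_value = (1 - LAMBDA) * value_net(position) + LAMBDA * fast_rollout(position)
    for n in path:                                         # backup
        n.visits += 1
        n.value_sum += leaf_value

# --- stubs so the sketch runs; the real versions are the trained networks ---
def play(position, move): return position
def sl_policy_priors(position): return {m: 1.0 / 4 for m in range(4)}
def value_net(position): return random.uniform(-1, 1)
def fast_rollout(position): return random.choice([-1.0, 1.0])

root = Node(prior=1.0)
for _ in range(100):
    simulate(root, position="empty board")
best_move = max(root.children.items(), key=lambda mc: mc[1].visits)[0]
[/code]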
A few interesting things about this. The "better" RL policy that was trained to win is apparently worse as a search heuristic than the SL policy trained directly from the KGS data. The authors guess that this is because the SL policy is more diverse and gives the search more options.
The other interesting thing is the mixing of the learned network's evaluation with the policy rollout. The authors do experiments on just one or the other evaluation and find that mixing both is best, although either alone is still enough to be competitive.
Evaluating the networks is expensive, and AlphaGo runs the network evaluations in parallel on the GPUs while the tree search is done on the CPUs.
I can't help but notice that the reinforcement learning only ends up helping the value network, and the value network is only half of the evaluation, the other half being the policy rollout.
If you wanted to try a simpler version of what they did, you could do something like this:
1) Train SL network and fast playout network on some training data, same as the paper
2) Skip the reinforcement learning
3) Skip the value network
4) Use their Monte Carlo tree search with SL network priors and fast-playout rollouts only
According to their results, that version would still be a little stronger than Zen or Crazy Stone.
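In terms of my search sketch above, that simpler version is basically one knob:
[code]
LAMBDA = 1.0   # leaf evaluation = fast-policy rollout only, no value network
# ...and sl_policy_priors() comes straight from the supervised network; no RL phase at all.
[/code]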
I also wonder what would happen if you tried just the reinforcement learning without using any human training data, starting from random networks playing each other. Would it ever get off the ground?
I am seriously tempted to try a mini version of their work. Google recently released an open source library for machine learning and neural networks:
https://www.tensorflow.org/
I couldn't find anything saying whether this specific library was used in AlphaGo, but it seems to use the same techniques. Seems like it could be doable.
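For a sense of what that would look like, here's about the smallest TensorFlow (1.x-era API) version of the policy idea I can write: a single linear layer plus softmax over the 361 board points, trained to predict the expert's move. Nowhere near the paper's 13-layer convolutional network, but the pieces are all in the library.
[code]
import tensorflow as tf

board = tf.placeholder(tf.float32, shape=[None, 19 * 19])     # flattened positions
expert_move = tf.placeholder(tf.int32, shape=[None])          # index of the move played

W = tf.Variable(tf.zeros([19 * 19, 19 * 19]))
b = tf.Variable(tf.zeros([19 * 19]))
logits = tf.matmul(board, W) + b                              # one linear layer

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=expert_move, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # sess.run(train_step, feed_dict={board: ..., expert_move: ...}) with real KGS data
[/code]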