Leela Zero Stuck

Javaness2 · #1

I guess most people are following http://zero.sjeng.org/
At the minute it is still only using 6 blocks, whereas AlphaGo Zero used 20, and it is presumably still in its trial mode.
Since 2018 the AI hasn't been able to improve, which is interesting. Could AlphaZero have been more complicated than GCP imagined?

(I am only posting this so as to make sure a successful training run completes just before I hit Submit)

Aram · #2

There's a new network now, which is about 200 Elo stronger in self-play, so a good-sized improvement.

LeelaZero is already 4-dan on Fox server with over 200 games played against humans. (With an older network)
https://www.reddit.com/r/cbaduk/comment ... go_server/

The network is indeed much, much smaller than AlphaGo Zero's, at only 5x64.

This means that the network will eventually stall and stop improving, but it still seems to be going strong.
If you look at the AlphaGo Zero paper you'll see that it also zigzagged a lot while trying to find the correct network.
You have to remember that the zigzag movement actually represents "pauses", or stretches of games where it was "stuck".
DeepMind just had a lot more resources to put into it, and the scale is a bit different, so it's hard to see that from their graphs.
1-3 days of no new networks is not stuck in my opinion.

The good thing with a small network is that it'll make game generation quicker (needed for making it stronger), but also,
when the game generation is done, this network will run very well on even modest GPU hardware.

You have to remember that the self-play games are played with randomness and noise and only 1600 playouts, so their quality isn't very high if you look at them. Even the match games between networks are at 1600 playouts. Increase the playouts and the strength of the bot goes up by a lot. If the network were a huge Google DeepMind type network, extremely few people would ever be able to run it on their home hardware.

Also, the current 5x64 network has been stated to be a first trial run, to see how far 5x64 can go before switching to a larger network. The larger the network, the more resources (self-play games) you need to train it, and the slower it'll go. If you have bugs along the way, it becomes extremely painful. So in my opinion it's extremely smart that they started with a smaller network.

We'll see how far it can get, and once its strength is maxed out, people can run it at a high dan level on their own home hardware.
The project will have ironed out all the bugs (one for example made the progress over the first 800k games extremely slow), and then they can restart on a larger network which will have more headroom to improve.

Please do consider running the exe file (or the Linux binary) to help generate self-play games. It's extremely easy, just download it here:
http://zero.sjeng.org/

Javaness2 · #3

Does the lower number of blocks prevent the reading of larger-scale ladders? I wondered if it introduced a tactical horizon in that respect, but I may misunderstand the theory rather a lot there.

Aram · #4

The neural net, maybe, maybe not? I'm not qualified to say; I've heard opinions in both directions.
But, it's not only the neural net it has, it also has the playouts to help it.

We already know it plays long ladders, and in the recent networks there's already proof that it plays ladder breakers on a shorter scale.

How about longer ladders?

Take a look at this match game between the last network (which loved ladders) and a network candidate that failed to conquer the old net. Look from move 51 onward.

http://zero.sjeng.org/viewmatch/1300be6 ... viewer=wgo

That's a ladder breaker on a long ladder, no?

Who knows how much information it can cram into that "small" neural net, and how it'll use the different parts? It will be interesting to see.

Vesa · #5

Aram wrote:

Take a look at this match game between the last network (which loved ladders) and a network candidate that failed to conquer the old net. Look from move 51 onward.

http://zero.sjeng.org/viewmatch/1300be6 ... viewer=wgo

That's a ladder breaker on a long ladder, no?

No! (It's not a ladder breaker...)

Cheers,
Vesa

Uberdude · #6

Funny how that "ladder breaker" isn't actually one; seems like the networks have learned the idea of ladder breakers as attachments in the diagonal path of a ladder but don't actually read if they work. But what is white's best response? Nobi up or down keeps a strong shape and you can then tenuki the next black move without worrying much, but is it soft? Hane on top or below refuses to be pushed around, but then counter hane and the local fight gets hotter and the ladder might get broken soon so you'll have to complete the ladder capture or suffer a local loss from the tenuki. Also I've not been following LeelaZero much but it does seem to like capturing stones unnecessarily like 43 here.

moha · #7

Uberdude wrote:

Funny how that "ladder breaker" isn't actually one; seems like the networks have learned the idea of ladder breakers as attachments in the diagonal path of a ladder but don't actually read if they work.

This seems to be a range issue. This is not a long ladder (around 8 spaces), just in the vision range of the current shallow network. The consequences of the "ladder breaker" would probably extend beyond.

Uberdude · #8

I remember a year and a bit ago, when Zen/CS added neural networks and struggled with life and death of big dragons, people suggested this was because the two eyes of the dragon were distance x (Manhattan or with diagonals?) apart but the network only had depth y, which was less than x, and you can only respond to a feature an extra space away by going down one level of the network to get that neighbouring information. Is this the gist of it? So if Leela Zero is only 6 layers deep, does that mean it can't respond to effects across a distance greater than 6? I'd find that surprising, as I'd expect (maybe incorrectly) it to do things like change whether to pincer or back off based on the colour of the neighbouring corner, which is about 12 spaces away. (Also, wouldn't some emergent features appear not in the top layer but only several layers down? If it starts to understand eyes, say, 3 layers down, then you've only got 3 more layers, so a distance of 3, for making relationships about eyes.)

AlphaGo Zero used 20 and 40 ~~layers~~ blocks (1 block = 2 layers), both numbers nicely larger than 19. These were residual rather than normal layers, which I don't fully understand but I think means values/information from upper layers can be more effectively passed down to lower layers without the lower ones needing to train their weights which helps somehow. Not sure what the original AlphaGo Fan/Lee used, maybe 48 regular convolutional layers?

moha · #9

Well, the essence is similar. The distances are in diagonals, and proportional to the number of layers, not blocks (usually two layers per block, except early AlphaGo, which only used about 12 plain convolutional layers IIRC, no residuals). You are also right that the more complex your function is, the more layers are necessary to process it, so the earlier the information needs to be present. So these things are always plus-minus a few layers, i.e. just rough guesses.

But I think the issue with life and death is a different one, at least besides this. In that case there is also the problem that L&D needs accurate reading, with often only a single correct move at each point. Neural networks work with probabilities, and if the correct answer is within the top five guesses (in image classifications for example) that is often deemed correct. But to read out even a local position, whether the correct move is the 2nd or 3rd guess can make the difference.
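A rough way to see why "within the top five guesses" is not good enough for reading is the sketch below. It assumes, purely for illustration, that the network's first guess is the single correct move with the same probability at every step and that the steps are independent; the function name and the 0.9 figure are hypothetical, not measured values.

```python
def chance_of_reading_line(p_top1: float, n_moves: int) -> float:
    """Chance that the top guess is the correct move at every one of n_moves steps.

    Assumes (hypothetically) the same top-1 accuracy at each step and
    independence between steps.
    """
    if not 0.0 <= p_top1 <= 1.0:
        raise ValueError("p_top1 must be a probability between 0 and 1")
    if n_moves < 0:
        raise ValueError("n_moves must not be negative")
    return p_top1 ** n_moves


if __name__ == "__main__":
    # Illustrative accuracy only; real networks differ from position to position.
    for n in (1, 5, 10, 20):
        print(n, round(chance_of_reading_line(0.9, n), 3))
```

Under these assumptions the chance of following a forced line falls off geometrically with its length, which is why a move being the 2nd or 3rd guess instead of the 1st matters so much more in L&D than in image classification.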

lightvector · #10

Uberdude - Leela Zero's theoretical radius of influence in its convolutions is 13, not 6.

* One 3x3 convolution at the start (radius 1)
* 6 residual blocks each containing two 3x3 convolutions (radius 6 * 2)

This is theoretically *barely* enough to solve some ladders that cross the whole board, since the distance from a corner star point to the diagonally opposite corner star point is 12, but it's probably not trivial for the neural net to learn the right weights across all these layers to do this. Having some extra wiggle room would make it much easier.
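The arithmetic above can be written out as a small sketch. The function names are made up for illustration, and the only assumptions are the ones already stated: one 3x3 convolution at the input, two per residual block, and each 3x3 convolution reaching one point further in every direction, diagonals included.

```python
def conv_radius(residual_blocks: int, convs_per_block: int = 2, input_convs: int = 1) -> int:
    """Theoretical radius of influence of a stack of 3x3 convolutions.

    Each 3x3 convolution adds 1 to the radius, so the radius equals the
    number of 3x3 convolutions in the stack.
    """
    return input_convs + residual_blocks * convs_per_block


def diagonal_distance(a: tuple, b: tuple) -> int:
    """Distance in king moves (Chebyshev distance) between two board points."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


# 4-4 star points in opposite corners of a 19x19 board, 0-based coordinates.
STAR_A = (3, 3)
STAR_B = (15, 15)
LADDER_SPAN = diagonal_distance(STAR_A, STAR_B)

if __name__ == "__main__":
    print("star point to opposite star point:", LADDER_SPAN)
    for blocks in (5, 6, 20, 40):
        radius = conv_radius(blocks)
        print(f"{blocks} blocks: radius {radius}, spans the ladder: {radius >= LADDER_SPAN}")
```

This only gives an upper bound on how far information can travel through the convolutions; it says nothing about whether training actually finds weights that use the full range.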

There's also some further dense fully-connected layer in the policy head, which could theoretically do arbitrary-range computations, but that part of the neural net is heavily channel-constrained - only 2 channels - so I don't expect it to do much with ladders.

In traditional non-residual-net layer counting, the convolutional parts of the 20-block and 40-block AlphaGo Zero networks would be 41 and 81 layers.

Gomoto · #11

Thinking the depth of the neural network is DIRECTLY related to the distance of the stones is akin to magical thinking.

yoyoma · #12

Small correction: LZ currently uses a 5x64 network -- 5 residual blocks, 64 filters. So redoing the calculation lightvector did: 1+5*2=11 instead of 1+6*2=13. Also it's possible for it to propagate information from each corner/side of the board towards the middle and then do some summarizing in the middle.

That's some theory. What is possible in practice is harder to say.

Another thing to point out: the same problem applies to counting liberties on large chains. It's just as difficult for the NN to propagate "I have 2 liberties over here" vs "I have 1 liberty here" through a very long chain as it is to read a ladder.

lightvector · #13

By the way, just for fun, I've run some experiments with supervised learning of a policy net on pro games. It turns out if the training target is "identify all stones in inescapable atari" rather than "identify the location of the next pro move", the neural net does learn to generalize and solve long-distance ladders, and when the neural net is fully convolutional (no fully-connected layers in the output head) the distance over which the neural net can solve the ladders is indeed close to the max theoretical radius of influence of the convolutions.

Github repo and some summary notes here. In my experiments, I *did* use things like liberties as input features though. https://github.com/lightvector/GoNN

johnsmith · #14

Issue on github: https://github.com/gcp/leela-zero/issues/514

Question: can the network only look at max 5 stones away?

gcp:
A 3x3 convolution can propagate information spatially at most 1 stone away.
There are 11 3x3 convolutions in the 64 x 5 stack. (Don't forget the one in the input layer!)
This means the stack is big enough so that in the final layers it can correlate information from 2 opposite ends of the board (2 x 11 away) in the central squares.
The policy and value heads contain a fully connected layer that spans the entire 19 x 19 board.
So the answer to the original question is a clear no: she can see much further.
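The propagation argument in the quote can be checked with a small simulation. This is only a sketch of which board points *could* carry information after 11 stacked 3x3 convolutions (the 64 x 5 stack as counted above); it does not model the real network, its weights, or the fully connected heads, and all names in it are made up for illustration.

```python
BOARD = 19  # board size
CONVS = 11  # 1 input convolution + 5 residual blocks * 2 convolutions each


def spread(points, steps):
    """Grow a set of board points by one step in all 8 directions per 3x3 convolution."""
    reached = set(points)
    for _ in range(steps):
        grown = set()
        for (row, col) in reached:
            for d_row in (-1, 0, 1):
                for d_col in (-1, 0, 1):
                    n_row, n_col = row + d_row, col + d_col
                    if 0 <= n_row < BOARD and 0 <= n_col < BOARD:
                        grown.add((n_row, n_col))
        reached = grown
    return reached


# One stone on the left edge and one on the right edge, 18 points apart.
left = spread({(9, 0)}, CONVS)
right = spread({(9, 18)}, CONVS)

# Points whose final-layer value can depend on both stones at once.
both = left & right

if __name__ == "__main__":
    print("left stone reaches the right edge:", (9, 18) in left)
    print("columns that can see both stones:", sorted({col for (_, col) in both}))
```

Neither edge stone can reach the opposite edge directly, but the central columns lie within range of both, which matches the point about correlating information from two opposite ends of the board in the central squares.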

Aram · #15

New 0.10.0 version released!!

Speedups and bugfixes to the Leela Zero engine.
GPU version now checks OpenCL implementation for correctness.
CPU-only version for people without usable GPU.
AutoGTP now fetches self-play or match tasks and parameters from the server.
Updated OpenBLAS to 0.3.0-dev with support for newer CPUs.

https://github.com/gcp/leela-zero/releases

Aram · #16

Worth reading for those wondering about what GCP has planned for the future:

https://github.com/gcp/leela-zero/issues/591

Uberdude · #17

LeelaZero having ladder troubles against a human on OGS:

moha · #18

Uberdude wrote:

LeelaZero having ladder troubles against a human on OGS

I would expect the network to realize its own vision range problem (from self-play), and preemptively avoid unclear ladders. It's interesting that this does not happen.

Bill Spight · #19

moha wrote:

Uberdude wrote:

LeelaZero having ladder troubles against a human on OGS

I would expect the network to realize its own vision range problem (from self-play), and preemptively avoid unclear ladders. It's interesting that this does not happen.

Perhaps mutual blindness? Neither version of LeelaZero knows whether the ladder works, and so it is never played out, and LeelaZero never learns. (Or it will take a long time.) As for avoiding unclear ladders, not much is clear, is it? Especially in the opening.

moha · #20

Bill Spight wrote:

moha wrote:

Uberdude wrote:

LeelaZero having ladder troubles against a human on OGS

I would expect the network to realize its own vision range problem (from self-play), and preemptively avoid unclear ladders. It's interesting that this does not happen.

Perhaps mutual blindness? Neither version of LeelaZero knows whether the ladder works, and so it is never played out, and LeelaZero never learns. (Or it will take a long time.) As for avoiding unclear ladders, not much is clear, is it? Especially in the opening.

There is a point in playing out unread ladders though, for randomization. So during self-play the side that is behind may play it.
