 Post subject: Re: Exploring LZ's search algorithm: worked examples
Post #41 Posted: Thu Jan 16, 2020 6:39 am 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
dfan wrote:
Bill Spight wrote:
xela wrote:
LZ's neural nets are not symmetric.

Why not? I can guess, but I'd rather not.

How could they be? Can you define the mathematical constraint you'd like to put on the convolutional kernels?


Well, considering that the square go board, unlike the chess board, has no essentially identifiable sides, or quadrants, or semi-quadrants, and thus has eightfold symmetry, why shouldn't the neural nets have eightfold symmetry?

Quote:
One thing you could theoretically do to handle symmetries is to canonicalize every input position to the network. Create a total ordering on board positions (e.g., sort positions by the state of the A1 point, then the A2 point if the two positions are the same at A1, then the A3 point, etc.). Then when you want to evaluate a position, take the eight transformations of it (including the identity), select the one that results in the "smallest" position, apply that transformation, send it through the network, and then apply the inverse transformation to the policy. That sounds like kind of a waste of energy, though.

Similarly, you could define the output of the policy network to be the average of all eight policies on the transformed positions, but now you're effectively doing eight times as much work to evaluate each position.


I don't know what assumptions you are making, to make things so difficult.

_________________
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.

 Post subject: Re: Exploring LZ's search algorithm: worked examples
Post #42 Posted: Thu Jan 16, 2020 6:47 am 
Lives in sente

Posts: 757
Liked others: 114
Was liked: 916
Rank: maybe 2d
There are known ways of enforcing symmetry (search for "equivariant neural nets"). They are quite bad compared to simply randomly applying one of the 8 symmetries to your training data on every data sample as you feed the data to the neural net. With the latter method, the neural net of course won't be fully symmetric in practice, as it will start off not symmetric, and will randomly be learning the different symmetries for each position in a different order, etc. The "true" ideal target you're training it to converge to in the limit of infinite training is symmetric though.
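
For concreteness, here's a minimal sketch of that second method - randomly transforming each training sample - assuming boards and policy targets are stored as NumPy arrays. This is just an illustration of the idea, not LZ's actual training code:

Code:
import numpy as np

def random_symmetry(planes, policy):
    # planes: (C, 19, 19) input feature planes; policy: (19, 19) move targets.
    # Pick one of the 8 dihedral symmetries at random and apply the SAME
    # transform to both the inputs and the training targets.
    k = np.random.randint(8)
    rot, flip = k % 4, k >= 4
    planes = np.rot90(planes, rot, axes=(1, 2))
    policy = np.rot90(policy, rot)
    if flip:
        planes = planes[:, :, ::-1]   # mirror the board columns
        policy = policy[:, ::-1]
    return planes, policy

# Call this on every sample as it is fed to the net; over a long training run
# each position is seen in all 8 orientations.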

A fun consequence of this is that usually the more thoroughly the neural net "understands" a position, the more closely all 8 symmetries will agree on that position, so you can use the divergence between the 8 symmetries as a crude indicator of the neural net's certainty on a position.
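
As a rough sketch of how you might measure that (net.value below is just a stand-in name for however you query the value head, not a real LZ interface):

Code:
import numpy as np

def symmetry_spread(net, planes):
    # Evaluate all 8 symmetries of a position; a large spread in the
    # returned winrates suggests the net is less certain here.
    values = []
    for k in range(8):
        p = np.rot90(planes, k % 4, axes=(1, 2))
        if k >= 4:
            p = p[:, :, ::-1]
        values.append(net.value(p))
    return max(values) - min(values)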

Why is enforcing symmetry bad?

Intuition: Perhaps it's better to allow the neural net to learn potentially asymmetric things on its way to converging to the ideal symmetric function; it's less likely to get stuck on a local hill, since the asymmetric states can act as extra bridges that let the neural net transition between different nearly-symmetric states.

Also a practical issue: The current specific ways of enforcing symmetry via equivariant nets enforce a certain property of "equivariance" in the weights of the neural net at each layer. This construction has the neural net apply symmetry-rotated/reflected versions of every set of weights internally at every layer. If any isolated bit of the weights that the neural net would have ended up learning would have been roughly symmetric anyway (e.g. maybe in an early layer the neural net has a "pseudoeye" detector that looks for an empty point surrounded by four adjacent stones of the same color - a fully symmetric local shape), then, holding the computational cost of the net fixed, this results in a 2-, 4-, or 8-fold waste of that bit of the neural net's capacity on symmetrized copies of weights that were already naturally symmetric.

Maybe you think: instead of trying some complicated equivariance thing involving mirrored copies of asymmetric weights, why not just directly enforce that the convolutional filters at every layer are symmetric? This is even worse. Many symmetric functions are most naturally expressed as combinations of asymmetric functions - e.g. count liberties northward, count liberties westward, count liberties eastward, count liberties southward, then add them all together. All four operations individually are asymmetric, since each one only counts in a specific direction, but their combination is symmetric. If you enforce literal symmetry at every internal layer, you prevent the neural net from doing things like this. "Equivariance" is what we call it when you still allow the net to do this, where north, west, east, and south can all individually compute asymmetric things temporarily, even over many layers, but are symmetric to *each other*.
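
As a toy illustration of that north/west/east/south point (plain NumPy, nothing to do with the net's real internals): each directional feature below is asymmetric on its own, but their sum transforms with the board - rotate or reflect the position and the combined feature map rotates or reflects the same way, which is exactly the "equivariance" being described:

Code:
import numpy as np

def empty_to_the_north(board):
    # Toy directional feature: 1 where the point immediately above is empty.
    # Asymmetric by itself - it only looks north.
    feat = np.zeros(board.shape)
    feat[1:, :] = (board[:-1, :] == 0)
    return feat

def four_direction_sum(board):
    # Apply the same northward feature to the four rotations of the board and
    # rotate each result back before summing. Every term is asymmetric, but
    # the sum counts all four directions, so it respects the board symmetries.
    total = np.zeros(board.shape)
    for r in range(4):
        total += np.rot90(empty_to_the_north(np.rot90(board, r)), -r)
    return total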


This post by lightvector was liked by 5 people: Bill Spight, dfan, ez4u, Maharani, marvin
 Post subject: Re: Exploring LZ's search algorithm: worked examples
Post #43 Posted: Thu Jan 16, 2020 7:55 am 
Gosei

Posts: 1590
Liked others: 886
Was liked: 528
Rank: AGA 3k Fox 3d
GD Posts: 61
KGS: dfan
Bill Spight wrote:
dfan wrote:
Bill Spight wrote:
Why not? I can guess, but I'd rather not.

How could they be? Can you define the mathematical constraint you'd like to put on the convolutional kernels?

Well, considering that the square go board, unlike the chess board, has no essentially identifiable sides, or quadrants, or semi-quadrants, and thus has eightfold symmetry, why shouldn't the neural nets have eightfold symmetry?

I was speaking from an implementation point of view (thus my question about the form of the constraint on the convolutional kernels), rather than the mathematics of the desired output (which I agree "should be" symmetrical). Anyway, lightvector answered this (at least) as well as I can.

Quote:
Quote:
One thing you could theoretically do to handle symmetries is to canonicalize every input position to the network. Create a total ordering on board positions (e.g., sort positions by the state of the A1 point, then the A2 point if the two positions are the same at A1, then the A3 point, etc.). Then when you want to evaluate a position, take the eight transformations of it (including the identity), select the one that results in the "smallest" position, apply that transformation, send it through the network, and then apply the inverse transformation to the policy. That sounds like kind of a waste of energy, though.

Similarly, you could define the output of the policy network to be the average of all eight policies on the transformed positions, but now you're effectively doing eight times as much work to evaluate each position.

I don't know what assumptions you are making, to make things so difficult.

I'm not sure what this sentence means, but I'm also not sure it's worth pursuing! If you want to spell it out more I will try to respond, but I also don't mind just dropping it.
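
For the record, here's roughly what the canonicalization idea quoted above would look like as a sketch. The helper names (transform, net.policy) are hypothetical, and for simplicity it orders positions by comparing the raw byte encodings of the transformed planes rather than point by point:

Code:
import numpy as np

def transform(planes, k):
    # Apply the k-th of the 8 board symmetries to (C, 19, 19) feature planes.
    p = np.rot90(planes, k % 4, axes=(1, 2))
    return p[:, :, ::-1] if k >= 4 else p

def canonical_policy(net, planes):
    # Pick whichever of the 8 transformed positions has the "smallest"
    # encoding, evaluate only that one, then map the policy back.
    k = min(range(8), key=lambda j: transform(planes, j).tobytes())
    policy = net.policy(transform(planes, k))   # (19, 19) move probabilities
    if k >= 4:                                  # undo the reflection first,
        policy = policy[:, ::-1]
    return np.rot90(policy, -(k % 4))           # then undo the rotation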


This post by dfan was liked by: Bill Spight
 Post subject: Re: Exploring LZ's search algorithm: worked examples
Post #44 Posted: Thu Jan 16, 2020 8:12 am 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
Thanks to you, lightvector, and dfan for your replies. :)

I said I'd rather not guess, but here is my guess, anyway. Enforcing symmetry is quite likely to slow learning. For instance, suppose that altering the weight for Q-17 on the full board gives the most improvement for a training game or set of training games. But if you applied the same alteration to all 8 3-4 points, it might even make the player play worse. OTOH, slower learning might not be, in the end, a bad thing. For instance, enforcing symmetry would reduce path dependency by joining many paths into one.


lightvector wrote:
There are known ways of enforcing symmetry (search for "equivariant neural nets"). They are quite bad compared to simply randomly applying one of the 8 symmetries to your training data on every data sample as you feed the data to the neural net. With the latter method, the neural net of course won't be fully symmetric in practice, as it will start off not symmetric, and will randomly be learning the different symmetries for each position in a different order, etc. The "true" ideal target you're training it to converge to in the limit of infinite training is symmetric though.

A fun consequence of this is that usually the more thoroughly the neural net "understands" a position, the more closely all 8 symmetries will agree on that position, so you can use the divergence between the 8 symmetries as a crude indicator of the neural net's certainty on a position.


Very interesting. :cool:

Quote:
Why is enforcing symmetry bad?

Intuition: Perhaps it's better to allow the neural net to learn potentially asymmetric things on its way to converging to the ideal symmetric function; it's less likely to get stuck on a local hill, since the asymmetric states can act as extra bridges that let the neural net transition between different nearly-symmetric states.


Interesting. On its face that is quite different from my intuition that enforcing symmetry would be likely to smooth out the fitness landscape. ;)

Quote:
Also a practical issue: The current specific ways of enforcing symmetry via equivariant nets enforce a certain property of "equivariance" in the weights of the neural net at each layer. This construction has the neural net apply symmetry-rotated/reflected versions of every set of weights internally at every layer. If any isolated bit of the weights that the neural net would have ended up learning would have been roughly symmetric anyway (e.g. maybe in an early layer the neural net has a "pseudoeye" detector that looks for an empty point surrounded by four adjacent stones of the same color - a fully symmetric local shape), then, holding the computational cost of the net fixed, this results in a 2-, 4-, or 8-fold waste of that bit of the neural net's capacity on symmetrized copies of weights that were already naturally symmetric.

Maybe you think: instead of trying some complicated equivariance thing involving mirrored copies of asymmetric weights, why not just directly enforce that the convolutional filters at every layer are symmetric? This is even worse. Many symmetric functions are most naturally expressed as combinations of asymmetric functions - e.g. count liberties northward, count liberties westward, count liberties eastward, count liberties southward, then add them all together. All four operations individually are asymmetric, since each one only counts in a specific direction, but their combination is symmetric. If you enforce literal symmetry at every internal layer, you prevent the neural net from doing things like this. "Equivariance" is what we call it when you still allow the net to do this, where north, west, east, and south can all individually compute asymmetric things temporarily, even over many layers, but are symmetric to *each other*.


Now that you bring up counting liberties, I have wondered how the networks learn to do so. I kind of doubt whether they learn a counting operation. And even if they did, how do they learn how to order the counting? After all, there are 120 ways to count 5 dame. ;) Ordering the counting is necessary for efficiency. I wondered if they count somewhat like parrots do. Parrots can count to 6 in the sense that they can distinguish between there being 5 objects in a small enough space and there being 6 objects there, but the difference between 6 objects and 7 objects gives them trouble. I have imagined that the bots gradually learn to distinguish between N and N-1 dame as N increases.

_________________
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.

 Post subject: Re: Exploring LZ's search algorithm: worked examples
Post #45 Posted: Thu Jan 16, 2020 8:37 am 
Lives in sente

Posts: 1037
Liked others: 0
Was liked: 180
Bill Spight wrote:
I wondered if they count somewhat like parrots do. Parrots can count to 6 in the sense that they can distinguish between there being 5 objects in a small enough space and there being 6 objects there, but the difference between 6 objects and 7 objects gives them trouble. I have imagined that the bots gradually learn to distinguish between N and N-1 dame as N increases.


Not just parrots. We humans can count small numbers of things if we want to, because we have learned counting as a method for when the number of things is too great for "number recognition". But we do not ordinarily count small numbers of objects. We recognize how many (just like the parrots). I believe that there have been recent papers about AI neural nets being able to learn this ability. Keep in mind, OUR brains ARE neural nets.

 Post subject: Re: Exploring LZ's search algorithm: worked examples
Post #46 Posted: Thu Jan 16, 2020 8:38 am 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
dfan wrote:
Bill Spight wrote:
Well, considering that the square go board, unlike the chess board, has no essentially identifiable sides, or quadrants, or semi-quadrants, and thus has eightfold symmetry, why shouldn't the neural nets have eightfold symmetry?

I was speaking from an implementation point of view (thus my question about the form of the constraint on the convolutional kernels), rather than the mathematics of the desired output (which I agree "should be" symmetrical).
Quote:
I don't know what assumptions you are making, to make things so difficult.

I'm not sure what this sentence means, but I'm also not sure it's worth pursuing! If you want to spell it out more I will try to respond, but I also don't mind just dropping it.

I've got the general idea, thanks. :)

As far as I am concerned, going into detail about implementation would not mean much to me until I get my new axe. Then I can play around with different ideas. :)

_________________
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.


This post by Bill Spight was liked by: dfan
 Post subject: Re: Exploring LZ's search algorithm: worked examples
Post #47 Posted: Thu Jan 16, 2020 8:40 am 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
Mike Novack wrote:
Bill Spight wrote:
I wondered if they count somewhat like parrots do. Parrots can count to 6 in the sense that they can distinguish between there being 5 objects in a small enough space and there being 6 objects there, but the difference between 6 objects and 7 objects gives them trouble. I have imagined that the bots gradually learn to distinguish between N and N-1 dame as N increases.


Not just parrots. We humans can count small numbers of things if we want to, because we have learned counting as a method for when the number of things is too great for "number recognition". But we do not ordinarily count small numbers of objects. We recognize how many (just like the parrots). I believe that there have been recent papers about AI neural nets being able to learn this ability. Keep in mind, OUR brains ARE neural nets.


Rainman excepted, I think humans may not in general be as good as parrots. For instance, there are human languages with no numbers greater than three. :o

_________________
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.

 Post subject: Re: Exploring LZ's search algorithm: worked examples
Post #48 Posted: Thu Jan 16, 2020 10:04 pm 
Lives in gote

Posts: 586
Location: Adelaide, South Australia
Liked others: 208
Was liked: 265
Rank: Australian 2 dan
GD Posts: 200
Bill Spight wrote:
Rainman excepted, I think humans may not in general be as good as parrots. For instance, there are human languages with no numbers greater than three. :o

Can't think of too many parrot languages with numbers greater than three either.


This post by xela was liked by 3 people: Bill Spight, ez4u, Waylon
 Post subject: Re: Exploring LZ's search algorithm: worked examples
Post #49 Posted: Thu Jan 16, 2020 10:39 pm 
Lives with ko

Posts: 237
Location: Pasadena, USA
Liked others: 79
Was liked: 12
Rank: OGS 9 kyu
OGS: Maharani
xela wrote:
Bill Spight wrote:
Rainman excepted, I think humans may not in general be as good as parrots. For instance, there are human languages with no numbers greater than three. :o

Can't think of too many parrot languages with numbers greater than three either.

Lol :D

 Post subject: Re: Exploring LZ's search algorithm: worked examples
Post #50 Posted: Fri Jan 17, 2020 12:49 am 
Lives in gote

Posts: 445
Liked others: 0
Was liked: 37
NNs like to start from random (non-random inits learn slower, if at all), and enforcing symmetries reduces randomness. Some papers have shown that most NN convergences are direct results of separable, luckily initialized weight subsets (the rest may even be left out). And because of the randomised use of the net in self-play, if just one symmetry learns or finds out something useful, that will soon also be taught to the other symmetries.


This post by jann was liked by 2 people: Bill Spight, xela
 Post subject: Re: Exploring LZ's search algorithm: worked examples
Post #51 Posted: Fri Jan 17, 2020 9:12 pm 
Lives in gote

Posts: 586
Location: Adelaide, South Australia
Liked others: 208
Was liked: 265
Rank: Australian 2 dan
GD Posts: 200
Bill has suggested an interesting position for analysis. It's an example of "LZ ignores a good move".

The game is Sakaguchi-Hayashi, GoGoD 1861-12-18e. After :w30: there's an interesting fight happening in the bottom left.

[go]$$Bc Black to play
$$ +---------------------------------------+
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . X . . . . . . . . . O . . . . |
$$ | . . . . . . . . . . . . . . . . X . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . O O O X . . . . . . . . . . . . |
$$ | . O b O X X X X . . . . . . . . . . . |
$$ | . a X X O O . . . . . . . . . . X . . |
$$ | . . X O . O . O X . . . . . O . . . . |
$$ | . . X O O . c . X . . . . . . O . . . |
$$ | . . e . . f d . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ +---------------------------------------+[/go]

Looking at the policy network, a is the "first instinct" move (70-80%), and all of b through d are interesting (3-10%). Possibly e and f are worth a quick look too (around 1%). LZ-258, ELF and KataGo 1.2 all broadly agree on the policy values. (Installing KataGo 1.3 is on my to-do list!)

What would you play here? Spoilers behind the cut.
Hayashi actually played a. He went on to lose the game: black soon tenukis, the bottom left is left either unresolved or beyond my reading ability, and there's a big fight at the top left.
According to the ELF commentary, c is a bit better (assuming 7.5 komi, whereas in fact it's a no komi game). This is based on about 2,800 playouts for a (winrate 41.6%) and 155,000 playouts for c (winrate 50.5%).

LZ-258 on a few thousand playouts from the starting position will choose a as the best move, but if you put c on the board it will get a higher evaluation. So here's a case where it can see in hindsight that there's a better move but it won't find the move on its own. Although for LZ-258 the difference isn't quite as big, 46% for c versus 41% for a.

(Actually, LZ-258 may find the right move after about half a million playouts, depending on the random number seed. With a "bad" seed, it will in principle still get there, but I haven't checked how many million playouts are needed.)

LZ using the ELF network will replicate what's in the official ELF commentary, and goes to c almost right away.

For what it's worth, I'm not convinced that Hayashi's move is a "mistake" as such. KataGo 1.2 analysing without komi has black about 8 or 9 points ahead and sees both a and c as winning moves. It's like LZ-258 in that left to its own devices it will prefer a, whereas c gets a higher score in hindsight. But the winrate difference is only about 3%. KataGo would have us believe that black lost the game later, with a blunder at move 55.

Here's the full game for your entertainment. More detailed AI analysis to come soon...



Attachments:
sakaguchi-hayashi-1861-12-18.sgf [763 Bytes]
 Post subject: Re: Exploring LZ's search algorithm: worked examples
Post #52 Posted: Sat Jan 18, 2020 6:12 am 
Lives in gote

Posts: 586
Location: Adelaide, South Australia
Liked others: 208
Was liked: 265
Rank: Australian 2 dan
GD Posts: 200
Here's a trace of 10,000 playouts (SGF summary only; the CSV files for this many playouts are huge!). I'll give you three versions:
  • LZ-258 from the starting position: it spends most of its time exploring B5, and only gives 43 playouts to G3. (Playing with different random number seeds, I can persuade it to give over 60 visits to G3, but the evaluations don't change much.)
  • LZ-258 analysing the position after G3 has been played.
  • LZ with the ELF network from the starting position. It still looks at B5 first, but starts paying serious attention to G3 after about 2,000 playouts, and looks at G3 exclusively from playout 4,319 onwards.







Remember that the difference between LZ-258 and LZ-ELF here is only the network. It's still the same software running the same search algorithm. So how do the different networks affect the choice of B5 or G3?

The policy values tell us that both networks will look at B5 first. But that's not the full story. LZ-258 gets to G3 at playout 93, and on a first look thinks that it's a 60% winrate for white, not that much different from the initial 59% for B5. But after a few playouts, G3 starts to look worse, so LZ-258 gives up on it. LZ-ELF starts off similarly, looking at G3 from playout 21, and scoring G3 at 46% for white compared with 49% for B5, again in the same ballpark. But on more playouts, LZ-ELF rates B5 as getting a lot worse for black, while the rating for G3 doesn't change much. So the difference isn't between those two positions specifically, but further down the tree.

Looking at the variations after B5, I haven't yet found a massive difference between 258 and ELF. But here's something interesting for the G3 variations:

[go]$$Bc
$$ | . . . . . . . . . .
$$ | . . . O O O X . . .
$$ | a O 6 O X X X X . .
$$ | . 5 X X O O . . . .
$$ | b . X O . O . O X .
$$ | . . X O O . 1 . X .
$$ | . 3 2 4 . . . . . .
$$ | . c . . . . . . . .
$$ +--------------------[/go]


Both networks have a as the first instinct. But it's not actually a good move. ELF spends 15 playouts on it, and then gives 994 playouts to c (and has another look at a later on). Meanwhile it's been spending a lot of time on B5 at the start, so it's not until around playout number 3,000 that c is seen to be clearly better than a in this diagram.

LZ-258 gives 18 playouts to a, but then its second choice is b, which also doesn't work well (and is discarded after four playouts). Starting from the initial position, LZ-258 will never even look at c (the policy value is around 1%). By playout number 5,000 it's given up on G3 (at least for the time being; it will come back by playout number 100,000 if you let it run that long). Maybe this is a little blind spot in LZ-258's network that skews the evaluations slightly. That's not the full story, though; there are still a lot of other variations to look at.

Finally, G3 in the initial position has a policy value of around 7% for LZ-258, or around 10% for ELF. Does that 3% difference have a big influence on the search? I suspect not -- we've seen examples where 1% moves can get a lot of playouts if the evaluations are promising -- but it's hard to say for sure (I've tried scribbling down some equations based on the UCT formulae, and the maths gets pretty messy). I might come back to that later. Or I might get distracted by something else...
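
For reference, the move-selection rule behind those equations is roughly the AlphaGo Zero style "PUCT" formula. This is a simplified sketch (LZ's real code adds details like first-play urgency), but it shows how the policy prior enters the search:

Code:
import math

def puct_score(q, child_visits, parent_visits, prior, c_puct=1.0):
    # q:     average winrate of this child so far
    # prior: the network's policy value for the move (e.g. 0.07 vs 0.10)
    # The prior is multiplied by a term that shrinks as the child accumulates
    # visits, so a small prior mostly just delays exploration; a move with a
    # 1% prior can still soak up playouts later if its winrate stays high.
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + exploration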


Attachments:
trace_1861-12-18e-move31-LZ-ELF.sgf [1.6 MiB]
trace_1861-12-18e-move32-LZ258.sgf [1.7 MiB]
trace_1861-12-18e-move31-LZ258.sgf [1.75 MiB]

This post by xela was liked by: Bill Spight
 Post subject: Re: Exploring LZ's search algorithm: worked examples
Post #53 Posted: Tue Mar 03, 2020 5:01 pm 
Lives in gote

Posts: 586
Location: Adelaide, South Australia
Liked others: 208
Was liked: 265
Rank: Australian 2 dan
GD Posts: 200
I was going to post some ladder positions here, but it got complicated, so I started a separate thread for them.
