 Post subject: Neural networks optimising mapping functions
Post #1 Posted: Sun Oct 17, 2021 6:10 am 
Lives in gote

Posts: 312
Liked others: 47
Was liked: 249
Rank: UK 2d Dec15
KGS: mathmo 4d
IGS: mathmo 4d
This post is aimed at a mathematical + compsci audience only. It is meant to demonstrate my understanding and to share some thoughts.

As far as I recall:
There are around 3^361 possible Go board configurations, of which roughly 10^170 are legal positions according to Tromp.
The number of weights in a 40-block neural net based on AlphaGo (such as KataGo) is around 10^8.
The number of neurons in a human brain is around 10^11.

Do the dimensions match? Are dimensions the right way to think about the problem?

For example, given n input bits (e.g. to represent board positions), you want to output one number representing the value of that position. The dimension of the function space is 2^n * 1 = 2^n.
However, does the number of weights in a neural network roughly correspond to dimensions? The key thing is that neural network outputs are not linear in their weights or inputs, so thinking of the functions as a vector space probably doesn't lead very far. Non-linearity is critical: linearity would imply that if a stone A helps player X and a stone B helps player X, then having both stones A and B on the board also helps player X. This is false, for example when A and B together make a self-atari or suicide.

On the other hand, matrices still play a key role in neural networks. The non-linearity point is why I was surprised when a DeepMind employee first told me that the AI was based on matrices. However, the study of matrices is very deep, with highly optimised (e.g. parallelised) algorithms for working with them - adding, multiplying etc.
It is also easier to think in terms of linear things (since we are familiar with adding up ordinary numbers, and matrices are just collections of numbers), and the calculus formulae are much simpler to write down with matrices (essentially just the chain rule: d(Ax)/dx = A).

Neural networks in 5 paragraphs:

If there are n input bits, these n numbers are called the zeroth layer (with n input neurons). Each subsequent layer consists of, say, m neurons, each of which is a function of the neurons in the previous layer. After a good number of layers (I think around 80 in a 40-block net, since each residual block contains two convolutional layers?), the output layer consists of your outputs, e.g. the value and policy numbers (think of just 2 neurons for simplicity, although in practice the policy head has one output per board point). There is a weight (a real number) for every connection between a neuron in one layer and a neuron in the next, indicating how they are correlated.
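
To make that concrete, here is a throwaway numpy sketch of a forward pass through one hidden layer (the sizes, encoding and random weights are invented for illustration and have nothing to do with the real nets):

Code:
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

n, m = 361, 64                                  # input neurons and hidden neurons (made up)
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=n).astype(float)    # zeroth layer: the n input bits

W1, b1 = 0.05 * rng.normal(size=(m, n)), np.zeros(m)   # one weight per pair of linked neurons, plus biases
W2, b2 = 0.05 * rng.normal(size=(2, m)), np.zeros(2)

h = relu(W1 @ x + b1)          # first hidden layer
value, policy = W2 @ h + b2    # output layer: just two numbers in this toy picture
print(value, policy)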

In image processing and in CNNs (which AlphaGo uses), not every neuron in one layer is linked to every neuron in the next. Instead only "nearby" neurons are linked. There are well developed mathematical tricks to make the processing as simple as possible while still preserving good enough links. In Go, this mirrors the ruleset: a stone's immediate impact on the board is only on the liberty count of its 4 neighbours. Only through the connectedness of the board does this influence slowly propagate outwards. Note that a ladder is the only Go shape that can propagate influence arbitrarily far through empty space without decreasing in value, but that is a special tactical case where both sides have only one reasonable move at each step in order to win that fight. Normally the influence of a stone/position decreases roughly exponentially with distance. In this way, Go is a primarily local game.
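
Here is a toy illustration of that locality (a hand-rolled 3x3 convolution in numpy with made-up uniform weights - the real nets learn their kernels): a single stone on tengen, and each layer can only spread its information one point further.

Code:
import numpy as np

def conv3x3(board, kernel):
    # apply a 3x3 kernel at every point, zero-padded at the edges
    padded = np.pad(board, 1)
    out = np.zeros_like(board)
    for i in range(board.shape[0]):
        for j in range(board.shape[1]):
            out[i, j] = np.sum(padded[i:i+3, j:j+3] * kernel)
    return out

board = np.zeros((19, 19))
board[9, 9] = 1.0                    # one stone on tengen
kernel = np.ones((3, 3)) / 9.0       # placeholder weights, just for the demo

layer = board
for _ in range(3):
    layer = conv3x3(layer, kernel)   # after k layers, influence has spread k points
print(np.count_nonzero(layer))       # 49: only the 7x7 block around the stone is nonzero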

The non-linearity is inserted by an "activation function" F at each neuron. So each neuron first forms a linear sum of the previous layer's neurons (the sum of weight w_k times neuron k's output), which is then passed through F to produce the neuron's output. There are various choices such as tanh, the sigmoid (whose derivative is very simple; the same function appears in Elo ratings), ReLU and many more variants.

The training is done by applying a loss function (e.g. cross entropy for a sigmoid output, or least squares) to training games, together with gradient descent. If a position resulted in a loss, the value output for that position should be pushed closer to 0% than to 100%. The loss functions used have the property that if a position results in a win p% of the time, the value output minimising the expected loss is p%, so (if training only on that position) the output should converge to p%. The weights are adjusted by a size and direction (+ or -) according to how much they contribute, in the current net, to changing the value output for that position. Since large weights have the most impact on the output, they will also be modified the most. This leads to problems such as exploding or vanishing gradients, and dead neurons. However, normally no neuron is completely dead (that view is too focused on the individual neurons we can see and forgets the relations between them); with different inputs, different neurons will be more or less important.
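
A rough sketch of a single such update (numpy; a one-layer "value net" with a sigmoid output and cross-entropy loss, all sizes invented):

Code:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=361).astype(float)   # one training position (made-up encoding)
result = 0.0                                      # this game was lost, so the target value is 0

w = 0.01 * rng.normal(size=361)                   # a one-layer "value net" for simplicity
lr = 0.1

value = sigmoid(w @ x)             # predicted win probability
# with cross entropy, the gradient w.r.t. the pre-sigmoid sum is simply (value - result)
grad = (value - result) * x        # weights attached to active inputs change the most
w -= lr * grad                     # nudge the prediction towards the observed result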

There is a constant feedback loop of adapting neurons as more training is done, learning and forgetting. Sometimes good simple rules learnt earlier are replaced by more efficient, better ones. The random history of the net's past weights can affect how it fits the mapping function, so keeping several different nets around at once can be a better idea than pure self-play. The aim is to get the net to converge as close to optimal play as possible.


ReLU, gradients
Thinking in terms of sigmoid activation functions is quite confusing. However, AFAIK ReLU seems to work just as well or even better, often with faster training times.
ReLU(x) = x if x > 0, else 0
The point is that this keeps most of the linearity but provides a cut-off point below which everything is a flat zero. This is a lot easier to analyse carefully.
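
In numpy terms (LeakyReLU, which comes up in the toy experiment further down, keeps a small slope below the cutoff instead of a flat zero):

Code:
import numpy as np

def relu(x):
    return np.where(x > 0, x, 0.0)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)   # a small gradient survives below the cutoff

xs = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(xs))         # [0.  0.  0.  1.5]
print(leaky_relu(xs))   # [-0.02  -0.005  0.  1.5]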

We can think of each neuron as a function on the input space (extend {0,1}^n to R^n, since the functions can take any real number as input). It is convenient to represent this scalar function by its gradient field on R^n.

In the zeroth layer, each neuron represents a dimension, so the n neurons represent the n basis vectors.

In the first layer, if activation is not used, then each neuron is just a linear sum of the inputs, so as a function its graph is a hyperplane in the (input x output) space. The gradient is constant over the whole space (picture n=2: a plane where every point maps to a fixed vector). A ReLU activation introduces a cutoff line, perpendicular to the gradient, below which the gradient is zeroed out. Crossing a cutoff line leads to a fixed jump in the gradient, depending on which direction you cross it in. When n>2, the cutoff is a hyperplane of dimension n-1, but for now we stick to visualising n=2.

In the second layer, you have a linear combination of outputs from the first layer, each with its own cutoff. In this way (pre-activation), you can have a function with several different cutoff lines inherited from the first layer. These don't have to be parallel, and the gradients can point in opposite directions as well, cancelling each other out. The activation step leads to a new cutoff "curve", which is always perpendicular to the gradient there, separating the space into >0 and <0 regions and replacing the gradient with zero where the function is <0. Many segments of this curve will be parallel to cutoff lines from the first layer, but where nonzero gradients from different neurons overlap, the curve may bend to smooth out the differences in gradient.

Is there still a sense in which, if a neuron is created with m weights, varying those m weights gives a function space of dimension m? I think so, yes. Especially with the standard technique of adding a constant bias neuron at each layer and connecting it with a weight to all neurons in the next layer: this makes addition of an arbitrary constant possible in each neuron, meaning the cutoff line can be shifted back and forth as well.
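
A tiny illustration of the bias shifting the cutoff (numpy, 1-dimensional input for simplicity - the cutoff sits at x = -b/w):

Code:
import numpy as np

xs = np.linspace(-3.0, 3.0, 7)
w = 1.0
for b in (-1.0, 0.0, 1.0):
    print(b, np.maximum(w * xs + b, 0.0))   # sliding b moves the point where the output hits zero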


Metrics and simplifying models
Go has its Elo ratings as a model, but as KataGo helps demonstrate, good strategy against weaker players can be very different from good strategy against stronger players.
This is because it is a two player game and your results depend very much on your opponent. With three player games, things tend to get even more complicated, and standard game-theoretic notions of "optimal play" regardless of your opponents' actions stop making sense, because if your opponents always cooperate against you, you always lose.

In a one player game, the final score works as a good metric.

Let us consider a one player perfect information game with two choices at each of n-1 moves and no repeated positions. The game ends after those n-1 moves and is scored. Then there are 2^0+2^1+...+2^(n-1) = 2^n - 1 possible states. It is natural to use n bits to encode the state.

How much difference does a well designed encoding make compared to an arbitrary one? For example, on a go board similar local shapes crop up a lot, but is this enough to make a big impact?

To simplify things, we assume that the neural network only outputs a value number in R, which is trained to match the expected score. During game play, the bot reads ahead one move and uses a softmax over the children's values to choose the next move. Say the two possible expected scores from the current position are a and b. Then the probability of playing towards a is e^a/(e^a+e^b). A multiplier alpha can be inserted so it becomes e^(alpha*a)/(e^(alpha*a)+e^(alpha*b)).
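
In code, that selection rule is just the following (alpha acts like an inverse temperature; larger alpha means greedier play):

Code:
import numpy as np

def move_probability(a, b, alpha=1.0):
    # probability of playing towards the child with expected score a rather than b
    ea, eb = np.exp(alpha * a), np.exp(alpha * b)
    return ea / (ea + eb)

print(move_probability(1.0, 0.0))             # ~0.73
print(move_probability(1.0, 0.0, alpha=5.0))  # ~0.99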

If the network is big enough (say 2^n weights), then you expect this to train to perfect accuracy with zero loss (but not necessarily perfect play). If there were 2^n independent weights, one per position, this would be guaranteed to converge to perfection (provided the weights don't bomb out and die with zero gradient). This is because all positions are reached with nonzero probability, and the 2^(n-1) final scored states will be trained to perfection, since they are trained to match a fixed number. Other states will fluctuate more, since the expected score from such positions depends on the probability distribution of moves chosen from that position, which in turn depends on the imperfect neural network. However, once the 2^(n-1) final states are scored perfectly (within eps << 1), the distribution of play is fairly stable, so the 2^(n-2) states one move before the end of the game are then training to match an essentially fixed number, and so on, gradually making the expected score for earlier states, all the way back to the start, more accurate. It is unclear to me how the non-independence of the weights in a neural network affects training accuracy.


A simple example
Using the assumptions above, let's set n=3 and use a very simple network with no hidden neurons, just 3 weights from input to output (in theory we would need 7 weights to fit arbitrary values on all 7 states).

If we ignore (0,0,0), let's say the game tree is

root = (1,0,0)
(parent: children) pairs:
(1,0,0) : (0,1,0) (1,1,0)
(0,1,0) : (0,0,1) (0,1,1)
(1,1,0) : (1,0,1) (1,1,1)

The rightmost nonzero bit indicates the level, and the bits to the left of that indicate the parent.
The final scores are A, B, C, D at (0,0,1), (0,1,1), (1,0,1), (1,1,1) respectively.

I wrote a simple program (I just started learning tensorflow yesterday) to train on this model. ReLU led to a dead network around 10% of the time, so I switched to LeakyReLU. I can't make any general observations yet, but I do note that nodes which lead to bad scores don't get many playouts, and tend to have much more inaccurate value estimates. This is because they occur less often and so contribute less to the loss function.
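
My actual tensorflow script isn't worth posting, but a plain-numpy sketch of the same kind of experiment looks like the following (the scores A, B, C, D, the learning rate and the number of games are placeholders, not what I used):

Code:
import numpy as np

rng = np.random.default_rng(0)

# states as bit-tuples; tree structure as in the post
states = [(1,0,0), (0,1,0), (1,1,0), (0,0,1), (0,1,1), (1,0,1), (1,1,1)]
vec = {s: np.array(s, dtype=float) for s in states}
children = {(1,0,0): [(0,1,0), (1,1,0)],
            (0,1,0): [(0,0,1), (0,1,1)],
            (1,1,0): [(1,0,1), (1,1,1)]}
# A, B, C, D below are placeholder numbers
final_score = {(0,0,1): 1.0, (0,1,1): 0.0, (1,0,1): 2.0, (1,1,1): -1.0}

def leaky_relu(z, slope=0.01):
    return z if z > 0 else slope * z

def d_leaky_relu(z, slope=0.01):
    return 1.0 if z > 0 else slope

w = 0.1 * rng.normal(size=3)      # the whole "net": 3 input-to-output weights
lr, alpha = 0.05, 1.0

for game in range(5000):
    # play one game, choosing moves by softmax over the net's value of each child
    s, visited = (1,0,0), [(1,0,0)]
    while s in children:
        vals = np.array([leaky_relu(w @ vec[c]) for c in children[s]])
        p = np.exp(alpha * vals)
        p /= p.sum()
        s = children[s][rng.choice(2, p=p)]
        visited.append(s)
    # least-squares update of every visited state towards the game's final score
    score = final_score[s]
    for t in visited:
        z = w @ vec[t]
        w -= lr * (leaky_relu(z) - score) * d_leaky_relu(z) * vec[t]

for s in states:
    print(s, leaky_relu(w @ vec[s]))   # compare the leaf values with A, B, C, D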


Have I really learnt anything yet? I seem to have made a small step towards answering my question on dimensions, but otherwise I still don't really understand the interaction between network architecture and ability to train to the right answer. I also don't much understand what changes for a 2 player game, or how the input data format can help or hinder information content/flow.

I suppose that strength in a 2 player game is about the confidence and skill to choose a line where you have to play m moves perfectly in a row in order to profit (compared to a simpler variation), as well as lots of ways to filter out bad moves to raise the lower bound on how bad a move you might play. Much of "aggression" is probably the assumption that you will play such tight (紧凑) sequences more accurately than your opponent - in particular, being able to live inside your opponent's moyos and save weak cutting stones.

What I really want to see is some relation between the network structure (e.g. neurons in a bottleneck) and a small number of key concepts (human or otherwise) that we can measure and interpret with some kind of symbolic logic/arithmetic. But I don't know if that is reasonable, or what it would look like, or how to make progress.


Last edited by dhu163 on Mon Oct 18, 2021 11:18 am, edited 1 time in total.
 Post subject: Re: Neural networks optimising mapping functions
Post #2 Posted: Sun Oct 17, 2021 4:23 pm 
Lives in gote

Posts: 686
Liked others: 109
Was liked: 836
Rank: maybe 2d
It sounds like you've dug into this quite a lot! There are lots of minor things that are stated a little inaccurately, or ways things are phrased that are oversimplified, but also yes a lot of your understanding is reasonable and good.

The most notable of these minor things: the total number of possible Go board positions is not a relevant number; it's a vast, vast overestimate that you shouldn't pay much attention to.

Note: technically we should talk about game tree complexity, not state space size, but for simplicity I'm going to conflate and ignore that for this post.

For starters, even if you're trying to rigorously prove a solution to the game or to play optimally, with good algorithms you generally only care about the union of the set of positions reachable by either:

* Black plays optimal or near-optimal moves, while white plays anything. (To prove black's best result >= x, you must consider every white resistance, but for each of black's turns you need only demonstrate ONE good enough move).
* White plays optimal or near-optimal moves, while black plays anything. (To prove black's best result <= x, you must consider every black resistance, but for each of white's turns you need only demonstrate ONE good enough move).

You probably have this intuition as a Go player too - when analyzing a position, subvariations that can only result from multiple obvious mistakes by both sides are irrelevant to determining the overall result. So since at any given moment we entirely drop the branching factor from black, or entirely drop the branching factor from white, we roughly square-root the number of reachable positions that we care about (and good algorithms can obtain nearly a sqrt speedup as they go, even if they don't know for sure at the outset which moves will be optimal).

Now we can go further. In the context of neural nets, normally we also abandon the idea of rigorously solving the game, because with a neural net, even if it does learn to play optimally, you don't have much ability to rigorously prove that those particular millions of numeric parameters happen to encode an optimal player, other than to just resort back to exhaustive search anyway. So rather than a rigorous solution, we simply care that *in practice* it appears to be unbeatable by anyone or anything. Or sometimes we care only about the lower standard of merely beating the best human players.

Then you can get another *vast* improvement beyond even the square root. If the game tree were absolutely arbitrary, with no correlation whatsoever between the outcomes of different moves, no repeated patterns or strategic regularities, etc, then the square root via exhaustive search is the best you can do. But game trees for real-life games are not arbitrary - they arise out of particular rules and mechanics, resulting in exponentially vast amounts of repeated patterns, correlations, regularities, strategic principles, etc. Just like humans, neural nets will learn these principles ("eyes", "good shape vs bad shape", etc) that generalize across vast numbers of positions, and then refine them and learn the exceptions, and then refine them further and learn the exceptions to the exceptions, and so on.

Again, this should be intuitive to you as a Go player. Consider a 5x5 board. The optimal result is black opening in the center and then owning the whole board. If we do the same ultra-crude estimation, we have sqrt(3^25) ~= 900k. Do you have to memorize the status of 900k completely distinct and independent things in order to always win the entire board on 5x5 as black? No, of course not. While fully exhaustive proof of black's win would require showing how black refutes :w2: A1, :w2: A2, :w2: A3,... all separately and exhaustively, and for each of them, how you refute every :w4: and then every :w6: and so on... if we only want optimal-in-practice rather than rigorous proof, we only need to read out or study about a limited tree of critical lines where White resists strongly, and for refuting all the other obviously bad moves like A1, A2,... it's enough to play according to simple strategic shapes and principles that *generalize* across all these exponentially many possibilities, and by even the most pessimistic ways of quantifying and counting this fuzzy value, there are going to be far fewer than 900k shapes and principles to learn.

I would guess with a good algorithm and architecture, you could get to optimal play on 5x5 from the opening position with a net that has somewhere between 100 and 5000 parameters. Which is a pretty massive improvement over sqrt(3^25), much less 3^25.

Anyways, the point is - the size of the state space or game tree is really more of an idle curiosity than a useful number, if all you care about is practical play. It's such an absurdly vast overestimate, and the improvements that neural nets can give you are so dependent on the nature of the game and the details of your algorithm and architecture, that even using it as a starting number from which you quantify what improvements you can get is not useful - you're better off ignoring it entirely and just looking at or trying small practical experiments to empirically measure what you care about directly.

 Post subject: Re: Neural networks optimising mapping functions
Post #3 Posted: Tue Oct 19, 2021 7:38 am 
Lives with ko

Posts: 242
Liked others: 23
Was liked: 148
Rank: DGS 2 kyu
Universal go server handle: Polama
Even with linear activation functions, it doesn't have to follow that (A is a good move) + (B is a good move) = (playing A and B is good). At the input to the network, we enter the state of the game. We then multiply the representation of that state against various weights and sum into the next layers. It would be completely reasonable for a network to have learned "if I played at A, then the positions close by are less valuable now", which might include B.

Imagine an input neuron X, and then a complicated network, and then an output neuron Y. The value of X (1 or 0) gets multiplied by various weights and summed into other values and multiplied and summed and smears out through the network, eventually influencing the value of Y. But since everything is linear, we can look at all the paths X took to get to Y, and sum them up, and get a function like Y = 5X - 8Z + 3W + ...

So essentially, a multi-layer network with linear weights ends up collapsing down into a simple linear equation anyways.
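
You can check this collapse numerically (numpy, sizes arbitrary):

Code:
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 5))    # "layer 1" weights
W2 = rng.normal(size=(3, 8))    # "layer 2" weights
x = rng.normal(size=5)

deep = W2 @ (W1 @ x)            # two layers with no activation in between
flat = (W2 @ W1) @ x            # a single precomputed matrix does the same job
print(np.allclose(deep, flat))  # True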

Once you add the nonlinearities, the paths the value of X takes are all complicated and don't just sum directly into a final influence weight, so you can represent much more intricate decision boundaries.

Quote:
Do the dimensions match? Are dimensions the right way to think about the problem?


Not really? First, the weights in a network are analogous to synapses in the brain. So although the "neuron" counts are getting close, it's not a meaningful comparison.
Second, you only need to represent the full dimensionality of your problem if your problem is pure noise. If every Go position were a 100% unique problem where you could apply absolutely nothing from your previous experience, you'd need enough capacity to memorize the right answer to every problem. In actuality, there are general concepts, and common shapes, and locality where a corner group is dead even if we change stones around in the far corner. So playing at some level X (even if X is perfect play) needs only a tiny fraction of the space it would take to encode the right answer for every single possible position.


Quote:
I still don't really understand the interaction between network architecture and ability to train to the right answer.


Nonlinearities make the math work. Empirically, the larger the network, the better it performs. The theoretical underpinnings of that aren't great, but it shows up in practice. If a network were just barely large enough to solve a problem, then only a very specific, small set of weight values would work. If the network is far larger than needed, then there are lots of ways to parametrize it to solve the problem, and so it's "easier" to train. Why this continues to matter so much at ever increasing sizes is harder to explain.

The architecture is also about the way you connect up the layers. This usually encodes some human intuition about a problem, like: "if I rotate a picture, it should still be the same picture" or "it should be easy for a network to throw away unneeded information as it processes the words in a sentence". You could think about information theory, and the complexity needed to represent various decision boundaries in a given architecture, but there's quite a bit of "throw stuff at the wall and see what sticks".

Quote:
What I really want to see is some relation between the network structure (e.g. neurons in a bottleneck) and a small number of key concepts (human or otherwise) that we can measure and interpret with some kind of symbolic logic/arithmetic. But I don't know if that is reasonable, or what it would look like, or how to make progress.


Explainable AI, and connecting symbolic logic with deep learning, are major outstanding challenges in the field. People try lots of things (like looking for statistical correlations between individual neurons firing and some feature of the board, like the presence of a dead group), but it's really still early days. So if you care enough to dig to the cutting edge, you can advance human knowledge! But on the flip side, it's not the sort of question where there are ready answers.

 Post subject: Re: Neural networks optimising mapping functions
Post #4 Posted: Tue Oct 19, 2021 9:15 am 
Lives in gote

Posts: 686
Liked others: 109
Was liked: 836
Rank: maybe 2d
Yep, that's a pretty nice overview. One quibble though:

Quote:
Even with linear activation functions, it doesn't have to follow that (A is a good move) + (B is a good move) = (playing A and B is good). At the input to the network, we enter the state of the game. We then multiply the representation of that state against various weights and sum into the next layers. It would be completely reasonable for a network to have learned "if I played at A, then the positions close by are less valuable now", which might include B.


If you are talking about the "policy" function (i.e. predicting where a good next move is), then sure, you could have linear weights such that a stone at either of A or B reduces the policy output value at the other location for being the next move. Each location can have an initial positive value for being empty, and a negative weight for the presence of a stone at the other location. But dhu163 didn't use the phrase "good move"; he instead talked about whether having one or more stones is better or worse for a player, which is easier to interpret as a statement about the value function.

And for the value function, dhu163 is completely right. Suppose we have a board state X, and A (which is X plus one stone), B (which is X plus a different stone), and C (which is X plus both of the stones from A and B), and suppose we use a normal input encoding consisting of the presence/absence of a stone of each color at each location of the board, such that when encoded as vectors/tensors, C = X + (A-X) + (B-X).

Then if the value function f is linear, we have f(C) - f(X) = f(X + (A-X) + (B-X)) - f(X) = f(A-X) + f(B-X) = (f(A) - f(X)) + (f(B) - f(X)). In other words, the amount of value by which C is better than X is precisely the sum of the amount by which A is better than X and the amount by which B is better than X.
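
A quick numerical sanity check of that identity, with a random linear f and a made-up 100-dimensional encoding (the stone indices are arbitrary):

Code:
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=100)                  # a random linear value function: f(v) = f . v
X = rng.integers(0, 2, size=100).astype(float)
X[3] = X[7] = 0.0                         # make sure the two chosen points are empty in X
A = X.copy(); A[3] = 1.0                  # X plus one stone
B = X.copy(); B[7] = 1.0                  # X plus a different stone
C = X + (A - X) + (B - X)                 # X plus both stones

lhs = f @ C - f @ X
rhs = (f @ A - f @ X) + (f @ B - f @ X)
print(np.isclose(lhs, rhs))               # True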

 Post subject: Re: Neural networks optimising mapping functions
Post #5 Posted: Thu Oct 21, 2021 12:03 pm 
Lives with ko

Posts: 242
Liked others: 23
Was liked: 148
Rank: DGS 2 kyu
Universal go server handle: Polama
Excellent point lightvector, you are correct that a linear value network won't represent the idea that stones can be good or bad in combination with each other, and that that's closer to dhu163's point. In essence we're saying A XOR B is good (exclusive or), and one of the key observations during the development of neural networks was that linear functions can't represent XOR.
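
(For anyone curious how that failure shows up numerically: the best least-squares linear fit to XOR, bias included, is stuck predicting 0.5 for every input - a quick numpy check below.)

Code:
import numpy as np

# the four XOR inputs, with a constant bias column, and their targets
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

w, *_ = np.linalg.lstsq(X, y, rcond=None)   # best possible linear (affine) fit
print(X @ w)                                # [0.5 0.5 0.5 0.5] -- XOR is not linearly separable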

 Post subject: Re: Neural networks optimising mapping functions
Post #6 Posted: Fri Oct 22, 2021 6:07 am 
Lives in gote

Posts: 312
Liked others: 47
Was liked: 249
Rank: UK 2d Dec15
KGS: mathmo 4d
IGS: mathmo 4d
Thanks for all the explanations.
