Life In 19x19 - http://www.lifein19x19.com/
Neural networks optimising mapping functions - http://www.lifein19x19.com/viewtopic.php?f=18&t=18414
Page 1 of 1
Author: | dhu163 [ Sun Oct 17, 2021 6:10 am ] |
Post subject: | Neural networks optimising mapping functions |
This post is designed for a mathematical + comsci audience only. It is to demonstrate my understanding, and share some thoughts.

(edit 2022/09/11: I am trying to write some of this up more formally. Disclaimer: much of the below was speculative musing without enough thought to back it up. Don't take any of it too seriously.)

(edit 2022/09/25: On the other hand, while some was nonsense, much really was seeing how NNs are quite a general discrete potential model, especially for energy dissipating through a crystal, whose fractal nature can even be used to model Go. I am more confident much can be written up now. Even some that I said was difficult to quantify, I now think can be quantified and measured. Still, I expect that almost all of this is old news scientifically, though it is new to me. This thread now seems to be a demonstration that curious whimsical musings can turn into papers.)

As far as I recall: there are around 3^361 possible Go board positions, but more like 10^170 legal ones according to Tromp. The number of weights in a 40-block neural net based on AlphaGo (like KataGo) is around 10^8. The number of neurons in a human brain is around 10^11.

Do the dimensions match? Are dimensions the right way to think about the problem? For example, given n input bits (e.g. to represent board positions), you want to output 1 number to represent the value of that position. The dimension of the function space is 2^n * 1 = 2^n. However, does the number of weights in a neural network roughly correspond to dimensions?

The key thing is that neural network outputs are not linear in their weights or inputs, so thinking of the functions as a vector space probably doesn't lead very far. Non-linearity is critical because otherwise linearity would imply that if a stone A helps player X, and a stone B helps player X, then having both stones A and B on the board will help player X. This is false, for example if A and B together is self-atari or suicide. 
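The dimension count above can be made concrete: a real-valued function on n input bits is exactly a lookup table with one entry per input configuration, so the space of such functions has dimension 2^n. A minimal sketch (the toy size n = 3 and the example function are illustrative):

```python
from itertools import product

n = 3  # number of input bits (toy size)

# Every function {0,1}^n -> R is just a lookup table with one real
# entry per input configuration, i.e. a vector of length 2^n.
inputs = list(product([0, 1], repeat=n))

# Example: an arbitrary such function stored as its lookup table.
table = {bits: float(sum(bits)) for bits in inputs}

# The space of these functions is spanned by the 2^n indicator
# functions (one per configuration), so its dimension is 2^n.
dimension = len(inputs)
```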
On the other hand, matrices still play a key role in neural networks. The linearity point is why I was surprised when I was first told by a DeepMind employee that the AI was based on matrices. However, the study of matrices is very deep, with highly optimised (e.g. parallelised) algorithms to work with them - adding, multiplying etc. It is also easier to think in terms of linear things (since we are familiar with adding up ordinary numbers, and matrices are just collections of numbers), and the calculus formulae are much simpler to write down with matrices (merely the chain rule: dAx/dx = A).

- Neural networks in 5 paragraphs: RELU, gradients
- Metrics and simplifying models
- A simple example

Have I really learnt anything yet? I seem to have made a small step towards answering my question on dimensions, but otherwise I still don't really understand the interaction between network architecture and the ability to train to the right answer. I also don't much understand what changes for a 2-player game, or how the input data format can help or hinder information content/flow.

I suppose that strength in a 2-player game is about the confidence and skill to choose a line where you have to play m moves perfectly in a row in order to profit (compared to a simple variation), as well as lots of ways to filter out bad moves to raise the lower bound on how bad a move you might play. Much of "aggression" is probably to assume that you will play such tight (紧凑) sequences more perfectly than your opponent - in particular, being able to live in your opponent's moyos and save weak cutting stones.

What I really want to see is some relation between the network structure (e.g. neurons in a bottleneck) and a small number of key concepts (human or otherwise) that we can measure and interpret with some kind of symbolic logic/arithmetic. But I don't know if that is reasonable, or what that would look like, or how to make progress. 
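The chain-rule identity quoted above, d(Ax)/dx = A, can be checked numerically with finite differences; a small sketch (random A and x are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
x = rng.standard_normal(4)

# Jacobian of f(x) = A @ x, estimated column by column with
# central finite differences; analytically it is exactly A.
eps = 1e-6
jac = np.empty((3, 4))
for j in range(4):
    e = np.zeros(4)
    e[j] = eps
    jac[:, j] = (A @ (x + e) - A @ (x - e)) / (2 * eps)
```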
Author: | lightvector [ Sun Oct 17, 2021 4:23 pm ] |
Post subject: | Re: Neural networks optimising mapping functions |
It sounds like you've dug into this quite a lot! There are lots of minor things that are stated a little inaccurately, or ways things are phrased that are oversimplified, but also yes a lot of your understanding is reasonable and good. The most notable of these minor things: the total number of possible Go board positions is not a relevant number; it's a vast, vast overestimate that you shouldn't pay much attention to. Note: technically we should talk about game tree complexity, not state space size, but for simplicity I'm going to conflate them and ignore that for this post. For starters, even if you're trying to rigorously prove a solution to the game or to play optimally, with good algorithms you generally only care about the union of the set of positions reachable by either: * Black plays optimal or near-optimal moves, while white plays anything. (To prove black's best result >= x, you must consider every white resistance, but for each of black's turns you need only demonstrate ONE good enough move). * White plays optimal or near-optimal moves, while black plays anything. (To prove black's best result <= x, you must consider every black resistance, but for each of white's turns you need only demonstrate ONE good enough move). You probably have this intuition as a Go player too - when analyzing a position, subvariations that can only result from multiple obvious mistakes by both sides are irrelevant to determining the overall result. Since at any given moment we are either entirely losing the branching factor from black, or entirely losing the branching factor from white, we roughly square-root the number of reachable positions that we care about (and good algorithms can obtain nearly a sqrt speedup as they go, even if they don't know for sure at the outset which moves will be optimal). Now we can go further. 
In the context of neural nets, normally we also abandon the idea of rigorously solving the game, because with a neural net, even if it does learn to play optimally, you don't have much ability to rigorously prove that those particular millions of numeric parameters happen to encode an optimal player, other than to just resort back to exhaustive search anyways. So rather than a rigorous solution, we simply care that *in practice* it appears to be unbeatable vs anyone or anything. Or sometimes we care only about the lower standard of merely beating the best human players. Then you can get another *vast* improvement beyond even the square root. If the game tree were absolutely arbitrary, with no correlation whatsoever between the outcomes of different moves, no repeated patterns or strategic regularities, etc, then the square root via exhaustive search is the best you can do. But game trees for real-life games are not arbitrary - they arise out of particular rules and mechanics, resulting in exponentially vast amounts of repeated patterns, correlations, regularities, strategic principles, etc. Just like humans, neural nets will learn these principles ("eyes", "good shape vs bad shape", etc) that generalize across vast numbers of positions, and then refine them and learn the exceptions, and then refine them further and learn the exceptions to the exceptions, and so on. Again, this should be intuitive to you as a Go player. Consider a 5x5 board. The optimal result is black opening in the center and then owning the whole board. If we do the same ultra-crude estimation, we have sqrt(3^25) ~= 900k. Do you have to memorize the status of 900k completely distinct and independent things in order to always win the entire board on 5x5 as black? No, of course not. While a fully exhaustive proof of black's win would require showing how black refutes A1, A2, A3,... all separately and exhaustively, and for each of them, how you refute every response, and then every response to that, and so on... 
if we only want optimal-in-practice rather than rigorous proof, we only need to read out or study about a limited tree of critical lines where White resists strongly, and for refuting all the other obviously bad moves like A1, A2,... it's enough to play according to simple strategic shapes and principles that *generalize* across all these exponentially many possibilities, and by even the most pessimistic ways of quantifying and counting this fuzzy value, there are going to be far fewer than 900k shapes and principles to learn. I would guess with a good algorithm and architecture, you could get to optimal play on 5x5 from the opening position with a net that has somewhere between 100 and 5000 parameters. Which is a pretty massive improvement over sqrt(3^25), much less 3^25. Anyways, the point is - the size of the state space or game tree is really more of an idle curiosity than a useful number, if all you care about is practical play. It's such an absurdly vast overestimate, and the improvements that neural nets can give you are so dependent on the nature of the game and the details of your algorithm and architecture, that even using it as a starting number from which you quantify what improvements you can get is not useful - you're better off ignoring it entirely and just looking at or trying small practical experiments to empirically measure what you care about directly. |
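The crude estimate above is easy to verify; sqrt(3^25) does come out at roughly 900k:

```python
import math

# 3 states per point on a 5x5 board gives the crude state-space bound:
positions_5x5 = 3 ** 25

# The "only one side branches" estimate is its square root:
crude_sqrt = math.isqrt(positions_5x5)   # roughly 9.2e5, i.e. ~900k as stated
```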
Author: | Polama [ Tue Oct 19, 2021 7:38 am ] |
Post subject: | Re: Neural networks optimising mapping functions |
Even with linear activation functions, it doesn't have to follow that (A is a good move) + (B is a good move) = (playing A and B is good). At the input to the network, we enter the state of the game. We then multiply the representation of that state against various weights and sum into the next layers. It would be completely reasonable for a network to have learned "if I played at A, then the positions close by are less valuable now", which might include B.

Imagine an input neuron X, then a complicated network, and then an output neuron Y. The value of X (1 or 0) gets multiplied by various weights, summed into other values, multiplied and summed again, and smears out through the network, eventually influencing the value of Y. But since everything is linear, we can look at all the paths X took to get to Y, sum them up, and get a function like Y = 5X - 8Z + 3W + ... So essentially, a multi-layer network with linear weights collapses down into a simple linear equation anyways. Once you add the nonlinearities, the paths the value X takes are all complicated and don't just sum directly into a final influence weight, so you can represent much more intricate decision boundaries.

Quote: Do the dimensions match? Is dimensions the right way to think about the problem?

Not really? First, the weights in a network are analogous to synapses in the brain, so although the "neuron" counts are getting close, it's not a meaningful comparison. Second, you only need to represent the full dimensionality of your problem if your problem is pure noise. If every Go position were a 100% unique problem where you could apply absolutely nothing from your previous experience, you'd need enough capacity to memorize the right answer to every problem. In actuality, there are general concepts, and common shapes, and locality where a corner group is dead even if we change stones around in the far corner. 
So playing at some level X (even if X is perfect play) needs some tiny fraction of the space it would take to encode the right answer for every single possible position.

Quote: I still don't really understand the interaction between network architecture and ability to train to the right answer.

Nonlinearities make the math work. Empirically, the larger the network, the better it performs. The theoretical underpinnings of that aren't great, but it shows up in practice. If a network were just barely large enough to solve a problem, then only a very specific, small set of weights would work. If the network is far larger than needed, then there are lots of ways to parametrize it to solve the problem, and so it's "easier" to train. Why this continues to matter so much at ever-increasing sizes is harder to explain.

The architecture is also about the way you connect up the layers. This usually encodes some human intuition about a problem, like "if I rotate a picture, it should still be the same picture" or "it should be easy for a network to throw away unneeded information as it processes the words in a sentence". You could think about information theory, and the complexity needed to represent various decision boundaries in a given architecture, but there's quite a bit of "throw stuff at the wall and see what sticks".

Quote: What I really want to see is some relation between the network structure (e.g. neurons in a bottleneck) with a small number of key concepts (human or otherwise) that we can measure and interpret with some kind of symbolic logic/arithmetic. But I don't know if that is reasonable or what that would look like or how to make progress.

Explainable AI, and connecting symbolic logic with deep learning, are major outstanding challenges in the field. People try lots of things (like looking for statistical correlations between individual neurons firing and some feature of the board, like the presence of a dead group), but it's really still early days. 
So if you care enough to dig to the cutting edge, you can advance human knowledge! But on the flip side, it's not the sort of question where there are ready answers. |
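The collapse of a purely linear multi-layer network into a single matrix, described above, can be checked directly; a minimal sketch with three random "layers" (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((8, 5))   # layer 1: 5 inputs -> 8 units
W2 = rng.standard_normal((6, 8))   # layer 2: 8 -> 6
W3 = rng.standard_normal((1, 6))   # layer 3: 6 -> 1 output Y

x = rng.standard_normal(5)

# Three linear layers applied in sequence...
deep_out = W3 @ (W2 @ (W1 @ x))

# ...are exactly one linear layer with the collapsed matrix W3 W2 W1,
# so Y ends up a plain linear function of the inputs (Y = 5X - 8Z + ...).
W_collapsed = W3 @ W2 @ W1
flat_out = W_collapsed @ x
```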
Author: | lightvector [ Tue Oct 19, 2021 9:15 am ] |
Post subject: | Re: Neural networks optimising mapping functions |
Yep, that's a pretty nice overview. One quibble though:

Quote: Even with linear activation functions, it doesn't have to follow that (A is a good move) + (B is a good move) = (playing A and B is good). At the input to the network, we enter the state of the game. We then multiply the representation of that state against various weights and sum into the next layers. It would be completely reasonable for a network to have learned "if I played at A, then the positions close by are less valuable now", which might include B.

If you are talking about the "policy" function (i.e. predicting where a next good move is), then sure, you could have linear weights such that a stone at either of A or B reduces the policy output value at the other location for it being the next move. Each location can have an initial positive value for being empty, and a negative weight for the presence of a stone at the other location.

But dhu163 didn't use the phrase "good move"; he instead talked about whether having one or more stones is better or worse for a player, which is easier to interpret as a statement about the value function. And for the value function, dhu163 is completely right.

Suppose we have board states X, A (which is X plus one stone), B (which is X plus a different stone), and C (which is X plus both of the stones from A and B), and suppose we use a normal input encoding consisting of the presence/absence of a stone of each color on each location of the board, such that when encoded as vectors/tensors, C = X + (A-X) + (B-X).

Then if the value function f is linear, we have f(C) - f(X) = f(X + (A-X) + (B-X)) - f(X) = f(A-X) + f(B-X) = (f(A) - f(X)) + (f(B) - f(X)). In other words, the amount by which C is better than X is precisely the sum of the amount that A is better than X and the amount that B is better than X. |
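The identity above can be checked numerically: for any linear value function f and the stone-plane encoding with C = X + (A-X) + (B-X), the value gains add exactly. A minimal sketch with a random linear f (the weight vector and stone indices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 2 * 19 * 19                   # toy encoding: one plane per colour, flattened
w = rng.standard_normal(dim)

def f(state):
    return w @ state                # a linear value function

X = rng.integers(0, 2, dim).astype(float)   # some base position encoding
X[5] = X[7] = 0.0                   # make room for the two new stones

stone_a = np.zeros(dim); stone_a[5] = 1.0
stone_b = np.zeros(dim); stone_b[7] = 1.0

A = X + stone_a                     # X plus one stone
B = X + stone_b                     # X plus a different stone
C = X + stone_a + stone_b           # C = X + (A - X) + (B - X)

gain_C = f(C) - f(X)
gain_A_plus_B = (f(A) - f(X)) + (f(B) - f(X))
```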
Author: | Polama [ Thu Oct 21, 2021 12:03 pm ] |
Post subject: | Re: Neural networks optimising mapping functions |
Excellent point lightvector, you are correct that a linear value network won't represent the idea that stones can be good or bad in combination with each other, and that that's closer to dhu163's point. In essence we're saying A XOR B is good (exclusive or), and one of the key observations during the development of neural networks was that linear functions can't represent XOR. |
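The XOR point is easy to demonstrate: the best linear fit to XOR predicts 0.5 everywhere, while two ReLU units represent it exactly (this particular two-unit construction is just one well-known way to do it):

```python
import numpy as np

# XOR truth table: inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

# Best linear fit w1*x1 + w2*x2 + b via least squares (bias column appended).
design = np.hstack([X, np.ones((4, 1))])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
linear_pred = design @ coef       # comes out 0.5 on every row: linear can't do XOR

# Two ReLU units do it exactly: xor = relu(x1+x2) - 2*relu(x1+x2-1)
relu = lambda t: np.maximum(t, 0.0)
s = X.sum(axis=1)
relu_pred = relu(s) - 2.0 * relu(s - 1.0)
```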
Author: | dhu163 [ Fri Oct 22, 2021 6:07 am ] |
Post subject: | Re: Neural networks optimising mapping functions |
Thanks for all the explanations. |
Author: | dhu163 [ Tue Dec 28, 2021 4:50 pm ] |
Post subject: | Re: Neural networks optimising mapping functions |
I would like to note: if no bias is introduced, then with n input bits and only ReLU activation, the all-zero input is always mapped to zero output in every neuron. And I think it turns out that however many hidden neurons you have, you always get an output whose cut-off hyperplanes go through zero. Any two such planes intersect only at zero, so the gradient at x only depends on the ray from the origin (0 to x to infinity). But to produce such a neuron only one hidden layer is necessary. That is, a sequence of nets each with only one hidden layer can produce in the limit any output neuron that can be produced by an arbitrary number of hidden layers.

I don't know what happens with a bias, but I assume that changes things completely so that it is completely general. I can see you can create any function if you can create a delta function around zero (and then shift it with a bias), and something close to a delta function can be created in the limit in the 2nd hidden layer with only 2n offset neurons in the first hidden layer. If correct, it seems that theoretically only 2 hidden layers are required to produce any neuron you want. I have oversimplified, but perhaps this is not far from the truth; needs more thought.

edit: (more thought)

(edit 20220910: this first result is quite confusing but seems true (for input dimension at least 2) with a slight rewording (the key is about compact support). If I find a clearer proof, I will add it to a longer paper on my wordpress. Any function on input dimension 1 can be approximated with just 1 hidden layer, as my experiments suggest. Also "function" should be replaced by "continuous function with supremum norm".) 
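The no-bias observation above can be checked directly: with ReLU and no biases, every layer is positively homogeneous, so the whole network satisfies f(0) = 0 and f(ax) = a*f(x) for a >= 0, i.e. the function is determined by its values along rays through the origin. A small numpy sketch (layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
relu = lambda t: np.maximum(t, 0.0)

# A 3-hidden-layer ReLU net with NO bias terms anywhere.
Ws = [rng.standard_normal(s) for s in [(16, 4), (16, 16), (16, 16), (1, 16)]]

def net(x):
    h = x
    for W in Ws[:-1]:
        h = relu(W @ h)
    return Ws[-1] @ h            # linear output layer, also bias-free

x = rng.standard_normal(4)
alpha = 3.7

zero_out = net(np.zeros(4))      # all-zero input -> all-zero activations -> 0
scaled = net(alpha * x)          # positive homogeneity: equals alpha * net(x)
```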
But even then, I think the metric/topology of neural nets, and how different activation functions/architectures influence training, is surely still an interesting topic of study, because it helps map the net to what you want to use it for - but also a very hard one (though non-equilibrium thermodynamics seems hard too, has been a hot topic for a while, and has produced some results). For example:
|
Author: | dhu163 [ Wed Dec 29, 2021 8:49 am ] |
Post subject: | Re: Neural networks optimising mapping functions |
Some tensorflow experiments seemed initially to show I am wrong about convergence with 1 input and 1 hidden layer. For example, I tried an ideal function of sin(5x) and it still mapped 0 to 1. It didn't get much success mapping near x=0 or x=1, though it was fairly accurate in the middle.

My guess is a sort of prisoner's dilemma issue is occurring, because two hidden-layer neurons have to work together in order to get the initial kink, but each such neuron is close to orthogonal to the error term. Alternatively, we can say that a sum of at least two neurons is required, but training causes the coefficient of one to decrease, which is the wrong direction. This is presumably where a 2nd hidden layer will help a lot, by encouraging teamwork like a team captain. A caveat is that competition between team captains for team members' attention can cause trouble if they pull in opposing directions, since that makes the neuron freeze.

My experiments show that given n neurons per layer, there is a good minimum number of hidden layers h that works well:

n     | h
<4    | h<10: an RMS error of 0.2 is possible, but it can't fit the initial kink and maps 0 to 1; h>10: struggled to reduce RMS error below 0.6 even with many layers. I think there may be a vanishing gradient issue with more layers as the input area gets "colder", so it doesn't notice what the input even is and just outputs a constant value. Is this akin to temperature?
5     | 8
30    | 5 (6 converges faster)
100   | 3
10000 | 1 (with 5 times more rounds of training data), but even then RMS error was still 0.08, several orders of magnitude worse than other tries

I think that more neurons per layer always improves accuracy (with the same training data and a small enough increment), but more layers can make things worse, especially with few neurons per layer. My explanation is:

20220520 edit |
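A self-contained version of the experiment above can be run without tensorflow; this plain-numpy sketch trains one hidden ReLU layer by full-batch gradient descent on sin(5x). The layer size, learning rate, and step count are illustrative assumptions, not the thread's original settings:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 200)[None, :]   # 200 sample points in [0, 1]
y = np.sin(5.0 * x)                       # the target function from the post

n_hidden, lr, steps = 50, 0.01, 3000      # illustrative hyperparameters
W1 = rng.standard_normal((n_hidden, 1)); b1 = np.zeros((n_hidden, 1))
W2 = rng.standard_normal((1, n_hidden)) / np.sqrt(n_hidden); b2 = np.zeros((1, 1))

def forward(x):
    h = np.maximum(W1 @ x + b1, 0.0)      # single hidden ReLU layer
    return h, W2 @ h + b2

_, pred = forward(x)
loss0 = np.mean((pred - y) ** 2)          # mean squared error before training

for _ in range(steps):                    # full-batch gradient descent
    h, pred = forward(x)
    g = 2.0 * (pred - y) / y.size         # dLoss/dpred
    gW2 = g @ h.T; gb2 = g.sum(axis=1, keepdims=True)
    gz = (W2.T @ g) * (h > 0)             # backprop through the ReLU
    gW1 = gz @ x.T; gb1 = gz.sum(axis=1, keepdims=True)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

_, pred = forward(x)
loss_final = np.mean((pred - y) ** 2)
rms = np.sqrt(loss_final)                 # final RMS error
```

One can inspect `pred` near x=0 and x=1 to look for the boundary misfit described in the post.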
Author: | dhu163 [ Wed Dec 29, 2021 11:54 am ] |
Post subject: | Re: Neural networks optimising mapping functions |
Back to: is it useful to apply "energy/entropy" to neural net training? I try to compare some metaphors/analogies to familiar physics. Please note this is merely an outsider's perspective. Firstly, some popular-science-style thoughts on life/earth energy/entropy. Slightly scatter-brained.

- Entropy in a computer
- Energy/power in a neural net
- Loss satisfies a heat equation
- Circuit analogy of a neural net
- Supply/demand in a neural net
- Fragility
- Gradient vanishing
- Contribution of a neuron
- ReLU vs Sigmoid
- What happens at a bottleneck in a neural net? |
Author: | dhu163 [ Wed Dec 29, 2021 12:35 pm ] |
Post subject: | Re: Neural networks optimising mapping functions |
- Architecture: 1 neuron per layer, 1 input, 1 output
- Some quantifying results
- Entropy and non-equilibrium

If we continue the analogy to supercooled water freezing at a knock, this seems to be like a situation where we have a fairly flat set of nearby energy states (with many liquid molecules running around bumping into each other), which probably has higher entropy (more possible states) than the fixed frozen state (where some dimensions of kinetic energy are dropped). However, there is a vortex with an energy level just above, which leads to a wormhole down a well, at the bottom of which is the fully frozen state, with a lower energy than the liquid. Reaching the vortex requires some energy input first.

Time for a thought experiment. Consider dropping a stone into a pond. This is clearly a non-equilibrium situation, but why? How does the entropy increase after normal physics is followed?

- Legacy functionality needs protection of vital parts to prevent rot |
Author: | dhu163 [ Thu Dec 30, 2021 8:28 am ] |
Post subject: | Re: Neural networks optimising mapping functions |
I've just written up a few formulae trying to get conservation laws/inequalities during neural network training. I've gotten something I've called throughput, though unfortunately that is already an NN term. It is simple enough that I think it's likely people understand the theory quite well already, but I enjoy working things out myself anyway. https://dhu163go.wordpress.com/ai/

In the process, I've noticed that backpropagation produces wave effects that are quite akin to quantum double-slit effects.

Edit 20220102:

- Classical info theory vs info theory needed for NNs
- A note for myself

Sketchy thoughts on NN/entropy following on from the above: a layer has local degenerate symmetry if its input weights W1 and output weights W2 are such that vW1 = 0 and W2v = 0 have a common non-zero solution v. This symmetry might not be stable, but it will be locally maintained in the short term. It is interesting to work out when the left kernel of W1 and the right kernel of W2 come close to alignment, and how often that occurs. In combination with a bias, it is also relevant.

- Neural networks are like the surface of crumpled rope or proteins
- Am I going mad? + a message for the next generation
- tides |
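The degeneracy condition above (a common non-zero v with vW1 = 0 and W2v = 0) can be checked mechanically. Here is a toy construction with numpy, following the notation above (the sizes and the projector construction are illustrative, not from the post):

```python
import numpy as np

rng = np.random.default_rng(5)
k, n_in, n_out = 6, 4, 3                  # layer width and neighbour sizes (toy)

v = rng.standard_normal(k)
P = np.eye(k) - np.outer(v, v) / (v @ v)  # projector onto the complement of v

W1 = P @ rng.standard_normal((k, n_in))   # input weights built so that v W1 = 0
W2 = rng.standard_normal((n_out, k)) @ P  # output weights built so that W2 v = 0

left_kernel_check = v @ W1                # ~ 0: v lies in the left kernel of W1
right_kernel_check = W2 @ v               # ~ 0: v lies in the right kernel of W2
```

Measuring how close these two kernels come to alignment in a real trained net would amount to comparing the null spaces of the actual W1 and W2 (e.g. via SVD) rather than constructing them by hand as here.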
Author: | dhu163 [ Sun Mar 13, 2022 2:32 pm ] |
Post subject: | Re: Neural networks optimising mapping functions |
Some more poetic thoughts on neural networks.

- Review of thoughts so far
- Diagnostics
- Stability on one hidden layer with ReLU
- "Conceptual unit theory"

A remaining question I don't fully understand is how this works in a two-player game. What features are there that allow a network to keep progressing relentlessly? What is the healer's job? |
Author: | dhu163 [ Tue Mar 15, 2022 5:16 pm ] |
Post subject: | Re: Neural networks optimising mapping functions |
- Two-player game
- A silly encoding
- More metaphors
- Energy
- Misc
- Observations

Testable hypothesis: a higher learning rate produces a more even distribution of neuron flow (or at least a lower maximum), i.e. closer to Gaussian, whereas a lower learning rate allows things to settle and produces larger dominant neurons (like language: Zipf distributions). 
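The hypothesis above is testable in principle. One way to quantify "evenness of neuron flow" is a Gini coefficient over per-neuron flow scores; the flow measure sketched here (incoming weight norm times outgoing weight norm per hidden neuron) and all names are illustrative assumptions, not from the thread:

```python
import numpy as np

def gini(values):
    # Gini coefficient: 0 for a perfectly even distribution,
    # approaching 1 when a single entry dominates.
    v = np.sort(np.asarray(values, dtype=float))
    n = v.size
    cum = np.cumsum(v)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

def neuron_flow(W_in, W_out):
    # Hypothetical "flow" score per hidden neuron: norm of its
    # incoming weight row times norm of its outgoing weight column.
    return np.linalg.norm(W_in, axis=1) * np.linalg.norm(W_out, axis=0)

even = np.ones(100)               # perfectly even "flow" across neurons
zipf = 1.0 / np.arange(1, 101)    # Zipf-like: a few dominant neurons
# gini(even) is ~0 while gini(zipf) is much larger - the kind of summary
# statistic one could compare across nets trained at different learning rates.
```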
Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group http://www.phpbb.com/ |