Just to be clear on terminology: if the same position is reached by multiple move sequences, but that position has not occurred earlier in the game history, then this is normally called a "transposition" in the modern tree search and algorithms literature, not a "reversion". The word "reversion" shares its stem with "reverse", i.e. to go backwards, and would be a better fit if you were talking about a long cycle (e.g. superko) in which the game returns to a *prior* state. But it sounds like you are talking about transposition, in which no long cycles are involved; rather, two or more parallel paths of moves meet back together or cross over one another in the positions they lead to, hence the stem "trans", suggesting travel or crossing.
RobertJasiek wrote:
Suppose we are in position P with two move candidates PA and PB both leading to the same follow-up situation Q (same follow-up position with the same follow-up player having the next turn) via different intermediate reversal sequences. Call QA the situation Q reached from PA and call QB the situation Q reached from PB.
No. Unless there is some ko or superko-related reason why they cannot be combined (which is usually not the case), QA and QB will already be the same node Q in the search tree. This is *why* you see a higher visits/s. If the transposition were not detected, then PA and PB would be much slower to increase in visits because they would each be repeating the work instead of sharing it. Sharing the work is what causes the faster search.
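To make the sharing concrete, here is a minimal sketch of how a search can detect transpositions by keying nodes on a hash of the position (plus the side to move). This is not KataGo's actual code; the names (Node, nodes_by_hash, get_or_create_node) are invented purely for illustration:

```python
# Minimal sketch of transposition handling, NOT KataGo's actual code.
class Node:
    def __init__(self):
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}  # move -> Node

nodes_by_hash = {}  # key of (position, side to move) -> shared Node

def get_or_create_node(position_key):
    # Two different move orders that reach the same position get the same
    # Node object back, so all their playout statistics are shared.
    if position_key not in nodes_by_hash:
        nodes_by_hash[position_key] = Node()
    return nodes_by_hash[position_key]

# P -> A -> ... and P -> B -> ... both reach the same position:
q_via_A = get_or_create_node(("some position", "black to move"))
q_via_B = get_or_create_node(("some position", "black to move"))
assert q_via_A is q_via_B  # "QA" and "QB" are literally the same node Q
```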
RobertJasiek wrote:
During early search, PA and PB values differ somewhat and search is normal, low. After some time, PA and PB values are close. Thereafter, they permanently fight for being blue with either being blue changing frequently and indefinitely, and search fluctuates between high and low rather often and indefinitely. KataGo tries hard to find the better of PA and PB.
Flickering is not evidence that KataGo "tries hard". Flickering is simply a reflection of the fact that two moves are almost equal, so either one could be chosen. It implies nothing about the "effort" spent computing those moves; it merely shows that two floating point values in the move selection algorithm are almost identical, so that either one may be epsilon larger at a given moment. The individual flickers can be due to tiny value differences. PA and PB will each individually be exploring some other moves that do not transpose, and that tiny fraction of playouts acts as slight noise on top of the dominant shared value from Q.
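As a toy illustration with invented numbers, this is all the flickering amounts to: a comparison between two nearly identical floating point values that tiny noise can flip either way.

```python
# Toy illustration, numbers invented: two nearly equal values swap ranks
# under tiny perturbations, which a GUI displays as "flickering".
import random

value_PA = 0.5231
value_PB = 0.5230

for step in range(10):
    # Each move also gets a handful of its own non-shared playouts,
    # which perturb its value by a tiny amount.
    noisy_PA = value_PA + random.uniform(-0.0005, 0.0005)
    noisy_PB = value_PB + random.uniform(-0.0005, 0.0005)
    print(step, "PA" if noisy_PA >= noisy_PB else "PB")
```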
RobertJasiek wrote:
KataGo does so by blowing up the subtrees QA and QB (and maybe the branches of the paths from PA to QA and PB to QB).
No, the single subtree Q is being blown up, and PA's and PB's total visit counts are collectively increasing more rapidly because incrementing a visit to Q usually allows each of PA and PB to also increment a visit the next time each is updated.
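Here is a hypothetical bookkeeping sketch (not KataGo's implementation) of why the displayed visit counts on PA and PB can climb faster than the number of GPU evaluations once they share Q:

```python
# Hypothetical bookkeeping, not KataGo's implementation: one GPU evaluation
# grows the shared node Q by one visit, and each parent edge then records one
# more visit the next time it is traversed, because it sees Q's larger total.
q = {"visits": 0}
edge_PA = {"visits": 0}
edge_PB = {"visits": 0}

for gpu_eval in range(10):
    q["visits"] += 1        # real work: one neural-net evaluation in Q's subtree
    edge_PA["visits"] += 1  # bookkeeping only: PA observes Q's new total
    edge_PB["visits"] += 1  # bookkeeping only: PB observes Q's new total

# 10 GPU evaluations, but PA + PB together show 20 new visits,
# with no duplicated work on Q itself.
print(q["visits"], edge_PA["visits"] + edge_PB["visits"])
```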
RobertJasiek wrote:
Everything indicates that KataGo continues this fight for blue without ever understanding that reversion occurs.
Again, it sounds like you may be misreading fluctuations as evidence of a "fight" or of something inefficient or effortful happening in the algorithm. Instead, two moves are sharing almost all of their work and therefore have almost the same value. So at any given moment, if the GUI asks KataGo which one would be chosen, the answer can differ arbitrarily from one time to the next based on insignificant differences.
RobertJasiek wrote:
Even if the reversal sequences are short and, for a human, simple, KataGo also uses high analysis indefinitely.
What is "high" analysis? Do you mean high CPU usage? High CPU usage often indicates that the search is proceeding much more quickly, because GPU is normally the bottleneck. For example, if we magically were able to double the speed of the GPU code by finding some new optimizations, then KataGo would be much more efficient than before, but CPU usage would increase, because now more CPU would be required to feed positions faster to the GPU to keep the GPU at full or near-full usage. As long as GPU is the bottleneck, the CPU use only indicates the *relative* compute from CPU vs GPU, it does not definitely indicate the absolute efficiency or inefficiency of either one.
RobertJasiek wrote:
If KataGo understood reversion, it should detect reversion and prune the PB - QB subtree having 1 million visits by pointing to the PA - QA subtree having 5 million visits within a fraction of a second, but KataGo does not do it.
RobertJasiek wrote:
Do you think that KataGo implements reversion? If so, why does it take (even much!) more than a fraction of a second to increase the PB - QB subtree from, say, 1 million visits to 5 million visits? Without implemented reversion, KataGo can be improved by it. With allegedly implemented reversion but more than a fraction of a second to (almost-)equalise numbers of visits, KataGo must have a related bug.
We don't increase PB from 1 million visits to 5 million visits instantly because there is no urgency or importance in doing so. Q is still shared and both PA and PB are reading its latest value. When work is done on Q, both PA and PB benefit from the updated value. In such a situation, why does the numerical visit count of PA vs PB matter? It doesn't. Q is searched the same either way.
Maybe PB does catch up, but only gradually. Perhaps the way the PUCT exploration formula happens to behave in a particular situation is that every time we increment PB by 2-3 visits (which happens very fast because PB sees that Q is already at 5 million visits and doesn't need any GPU evaluations), we also visit PA again, and this takes some time because to increase PA further we need to increase Q further, which costs a GPU evaluation. Assuming there are no transpositions elsewhere and no other relevant effects, this would manifest to you as the total visits per second increasing by a factor of 3-4 while PB very gradually catches up to PA in visits, because per GPU evaluation we get 1 new visit on PA and 2-3 GPU-less catch-up visits on PB. If the overall search is still bottlenecked by GPU evaluations on Q, there is no inefficiency.
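As a rough sketch using the generic AlphaZero-style PUCT score (KataGo's actual formula differs in its details) with invented numbers matching the example: PA and PB report essentially the same value because they share Q, so the slightly larger exploration bonus on the less-visited PB keeps getting it selected until its visit count roughly catches up, and those selections are cheap because they mostly re-read the already-searched Q.

```python
# Generic AlphaZero-style PUCT score; KataGo's exact formula differs.
# All numbers below are invented to match the example in the text.
import math

c_puct = 1.0
parent_visits = 6_000_000
children = {
    "PA": {"visits": 5_000_000, "value": 0.523, "prior": 0.45},
    "PB": {"visits": 1_000_000, "value": 0.523, "prior": 0.40},
}

def puct_score(child):
    exploit = child["value"]  # both read the shared Q, so these are (nearly) equal
    explore = c_puct * child["prior"] * math.sqrt(parent_visits) / (1 + child["visits"])
    return exploit + explore

for name, child in children.items():
    print(name, puct_score(child))
# PB's exploration bonus is a few times larger than PA's, so PB wins the
# comparison until its visit count has largely caught up.
```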
Depending on the position, PB probably does also need to send some small fraction of a percent of its visits into other moves in order to catch up from 1 million to 5 million (again, soundness requires that in general, as visits -> infinity, every move gets explored infinitely often, just in case some deep hard-to-discover variation actually turns out to be better). So we can't simply do "PB.visits += 4000000" without spending a little bit of GPU time on these alternative moves. But again, there may be no urgency to do that even if it would be cheap. If most of the compute is focused on expanding the shared Q (via PA-Q) rather than on updating PB's visits, and therefore PB's visits only catch up gradually even though the work to catch PB up by searching its alternative moves would be small, that's fine. Spending more compute on Q is more important.
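As a rough illustration with an invented fraction: suppose around 0.5% of PB's catch-up visits had to go into non-transposing alternative moves to keep the search sound. That is cheap relative to millions of visits, but not literally free, which is another reason there is no point rushing it.

```python
# Invented fraction, for scale only.
catch_up_visits = 5_000_000 - 1_000_000
fraction_to_alternative_moves = 0.005
extra_gpu_evals = catch_up_visits * fraction_to_alternative_moves
print(extra_gpu_evals)  # 20000.0 -- small next to millions, but not zero
```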