KataGo self-play on a Macbook Pro

gcao · #1

Hi,

I would like to try KataGo self-play on my MacBook Pro. My goal is not to create a very strong NN, but one that is playable at amateur level. I noticed that the README recommends running on 4 machines, but I don't have the recommended setup. If I run everything on the same machine, what will happen? How do I tweak the config so the programs play nicely with each other?

Thank you for any suggestions.
Cao

lightvector · #2

There are two options.

* One is to still run them all simultaneously. In this case, you'll need to pass appropriate command line options to train.py to make it use less memory - TensorFlow will tend to hog the entire GPU's memory unless you specify otherwise. You will also probably want to slow down train.py by forcing it to wait, doing nothing, for extra time between training epochs, because otherwise training will outpace selfplay. Commonly, I think you want 5x-50x as much of your compute on selfplay as on training, at least on 19x19. You'll also probably need a ton of RAM.

* Perhaps a better option is to run everything sequentially. For selfplay, you can specify a fixed number of games to play in the .cfg file. Take a look at how the shuffle loop works in the script and run shuffle as a single invocation rather than in a loop, calling it once each time after you finish a new set of games. For train.py, as of the tip of the master branch on Github, you can specify a number of epochs to train before the script terminates. Similarly, run model exporting not in a loop, but rather only once after train.py finishes saving the next model and quits. If you choose to use gatekeeper, gatekeeper also has a command line option to terminate once it has nothing to do.
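The sequential option could be driven by a small wrapper along these lines. This is only a sketch: `run_cycle` and `STEPS` are hypothetical names, and every command is a do-nothing placeholder standing in for a real single-shot invocation (selfplay with a fixed number of games, one shuffle, train.py limited to a number of epochs, one export). Check each tool's "-help" output for the actual options before filling them in.

```python
import subprocess
import sys


def run_cycle(steps):
    """Run each (name, argv) step once, in order, stopping at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, argv in steps:
        result = subprocess.run(argv)
        if result.returncode != 0:
            print(f"step {name!r} failed with exit code {result.returncode}")
            break
        completed.append(name)
    return completed


# Placeholder command (an assumption, not a real KataGo invocation):
# it simply starts the Python interpreter and exits successfully.
PLACEHOLDER = [sys.executable, "-c", "pass"]

# One cycle of the pipeline; replace each PLACEHOLDER with your own command.
STEPS = [
    ("selfplay", PLACEHOLDER),
    ("shuffle", PLACEHOLDER),
    ("train", PLACEHOLDER),
    ("export", PLACEHOLDER),
]
```

Calling `run_cycle(STEPS)` repeatedly gives one selfplay/shuffle/train/export round per call, and stopping at the first failing step makes it easier to see which piece needs attention.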

Either way, run the various KataGo commands and python scripts with "-help" for more info about the relevant command line options, and don't be afraid to look at the implementation of any of the bash scripts. Be prepared to dig into some of these details and get your hands dirty - there will be parameters you'll want to tweak based on your setup: choose how many shuffle threads to use depending on how many CPU cores you have, make the shuffle and training batch sizes smaller if they take too much GPU memory, set selfplay to use fewer playouts if you want short-term learning speed and don't mind weaker long-term strength, and so on.

Start only one piece at a time. Before moving on to the next one, make sure it's working and producing data, check its log files for obvious errors, and watch a system monitor to make sure you aren't running your machine out of memory and that you're actually using the CPU/GPU as expected.
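One quick way to check that a step is producing data is to count the files under its output directory. The helper below is a hypothetical sketch, not part of KataGo; the ".npz" suffix is an assumption about what the training data files look like, so adjust it to whatever you actually see on disk.

```python
from pathlib import Path


def count_outputs(directory, suffix=".npz"):
    """Count files ending in `suffix` anywhere under `directory`.

    Returns 0 if the directory does not exist yet.
    """
    root = Path(directory)
    if not root.is_dir():
        return 0
    return sum(1 for path in root.rglob(f"*{suffix}") if path.is_file())
```

For example, `count_outputs("shared/selfplay")` returning 0 after selfplay has been running for a while would suggest looking at the selfplay logs before starting the shuffler.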

xela · #3

gcao wrote:

My goal is not to create a very strong NN, but to create a NN that is playable on amateur level.

By the way, there are some weaker KataGo networks at https://d3dndmfyhecmj0.cloudfront.net/g ... index.html . I think the first of the 10-block networks is already at least high dan level. You might want to try some of the 6-block networks.

lightvector · #4

Even 6 blocks 96 channels can reach high amateur dan level. However, 6 blocks is also few enough that the neural net will do very strange things with large dragons, simply because it is incapable of perceiving the whole group, which causes very non-human-like errors. (6 blocks = ~12 layers, so the max distance at which any stone can influence another is about 12-14, and less if some nontrivial computation needs to be done, and less if the group winds around a little.)

You might consider something like 10 blocks 64 channels or 10 blocks 48 channels or something like that, to get wider board perception but also still have few parameters and remain at weaker levels.
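The arithmetic behind the block counts can be sketched as follows. It assumes two 3x3 convolutions per block, each extending a stone's reach by one point, and it ignores any global pooling or similar layers; `extra_layers` is a hypothetical knob for additional convolutions outside the blocks.

```python
def max_influence_distance(blocks, convs_per_block=2, kernel_size=3, extra_layers=0):
    """Upper bound on how far one stone can influence another through
    stacked convolutions, measured in board points along a row or column."""
    reach_per_layer = kernel_size // 2  # a 3x3 kernel sees 1 point each way
    return (blocks * convs_per_block + extra_layers) * reach_per_layer


BOARD_SIZE = 19
for blocks in (6, 10):
    reach = max_influence_distance(blocks)
    spans_board = reach >= BOARD_SIZE - 1  # 18 points from edge to edge
    print(f"{blocks} blocks: reach {reach}, spans a 19x19 board: {spans_board}")
```

Under these assumptions 6 blocks reach about 12 points, short of the 18 needed to span the board, while 10 blocks reach 20. This is a best-case bound; as noted above, the usable distance is smaller when the net has real computation to do along the way.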

gcao · #5

Thanks a lot for the information. Will try it out and see how it goes.

gcao · #6

Thanks a lot for the information. It's good to know it's possible to run the selfplay+training on one machine. I'll give it a try and see how it goes.

BTW, my end goal is to adapt the program to handle Daoqi / Toroidal Go (https://senseis.xmp.net/?ToroidalGo). Before that, I want to get training to work on my computers.

gcao · #7

I'm getting this error:
Training data json file does not exist, waiting and trying again later: .../KataGo/shared/shuffleddata/20200222-143315/train.json

After searching the whole codebase, I didn't find where train.json is created. I've run self-play and shuffle/export before the training step. Did I miss anything?

Here are the commands I ran

Code:

    cpp/katago selfplay -output-dir shared/selfplay -models-dir shared/models -config-file cpp/configs/selfplay1.cfg

    ./selfplay/shuffle_and_export_loop.sh CAO ../shared/ ../shared/scratch 4 1

    ./selfplay/train.sh ../shared/ CAO b6c96 main -lr-scale 1.0

lightvector · #8

That's normal, it's just letting you know that the shuffler has not produced any shuffled data yet. Since there is no data, training will not proceed, instead it will wait until there is data. If all is working, and you're running in asynchronous everything-at-once mode, then eventually there should be enough data. Or there could be an error, and there will never be enough.
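The waiting behaviour described here is a plain polling loop. The sketch below is illustrative only and is not the code train.py uses; `wait_for_file` and its parameters are hypothetical.

```python
import os
import time


def wait_for_file(path, poll_seconds=30.0, timeout_seconds=None):
    """Poll until `path` exists.

    Returns True once the file appears, or False if `timeout_seconds`
    elapses first. With no timeout it waits indefinitely.
    """
    start = time.monotonic()
    while not os.path.exists(path):
        if timeout_seconds is not None and time.monotonic() - start >= timeout_seconds:
            return False
        print(f"{path} does not exist yet, waiting and trying again later")
        time.sleep(poll_seconds)
    return True
```

With no timeout, a loop like this never ends if the upstream step is broken, which is why the message alone cannot tell you whether data is still on its way or will never arrive.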

The *reason* there is no data - you can investigate. Did the self-play actually generate data? Take a look at the logs, and take a look at the directory where it should have output data.

And did it generate enough data? Take a look at the shuffler's logs (the shuffle loop script should write an "outshuffle.txt" file where you ran it) to see how much data the shuffler thinks it found. By default the shuffler is configured to start out with a training window of 250K rows. With fewer than that, it will not proceed. Training windows that are too small will lead to more overfitting, but you can still decrease the initial window size if you'd like to proceed anyway with less data than that.
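To get a feel for the 250K-row threshold, here is a back-of-the-envelope estimate. The figure of 200 rows per game is purely an assumption for illustration; the real number depends on game length and on how many positions selfplay records per game, so read it from your own shuffler logs.

```python
import math


def games_needed(min_rows=250_000, rows_per_game=200):
    """Estimate how many selfplay games are needed to reach `min_rows`."""
    return math.ceil(min_rows / rows_per_game)


def enough_rows(rows_found, min_rows=250_000):
    """True if the shuffler would have at least its initial window of data."""
    return rows_found >= min_rows


print(games_needed())          # games needed at the assumed 200 rows per game
print(enough_rows(100 * 200))  # 100 games at that rate
```

At the assumed rate this comes to 1250 games, so a run of 100 games falls well short unless the minimum row count is lowered.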

I'll make the message about the train.json slightly clearer. Sorry about the confusion. Generally when debugging it helps to be looking at the logs and/or output all the way along, not just at the step you think isn't working.

(edit: some clarifications, less stupid phrasing)

lightvector · #9

Pushed some improvements to the docs:
https://github.com/lightvector/KataGo/c ... 7f6f9e4d67

Hope that helps. Let me know if you have further questions!

gcao · #10

Thank you!

I'm able to run self-play/shuffle/export/train sequentially after I changed a few parameters (games=100, move-per-games=500, min-rows=1). I understand this won't create a playable model. I just want to get this process to run end-to-end first.

I noticed that even with only 100 games played, training takes a very long time (much longer than the other steps). One epoch took 5 hours, while the other steps took only several minutes. I have a very low-end GPU (Intel Iris Pro with 1.5GB memory), which may have slowed down the training. However, I wonder whether I missed anything else.

I inspected train.py as well. It looks like it runs an endless loop. Is that true? If I want to run it as part of a play-shuffle-train process, do I just remove the "while True" loop?

lightvector · #11

You can configure train.py to stop after a certain number of epochs. Run it with -help to see all the arguments.

Or look here, this is the argument you want:
https://github.com/lightvector/KataGo/b ... ain.py#L48

An epoch is defined as 1 million training steps by default; it does not depend on the amount of data. It really couldn't, since in normal operation more data is continuously being generated as you train, so there's no well-defined notion of when you're "done" passing over the data. So the 5 hours were probably spent making a very large number of passes over the tiny amount of data you have; it didn't matter that it was a small amount. You can also change how many steps are considered one epoch with another of the command line flags.
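The number of passes a fixed-size epoch makes over a small dataset is simple arithmetic. In the sketch below the dataset size of 20,000 rows is a made-up figure, and `rows_per_unit` is a hypothetical parameter: use 1 if the epoch size is counted in individual rows, or the batch size if it is counted in batches.

```python
def passes_per_epoch(epoch_size, dataset_rows, rows_per_unit=1):
    """How many times one epoch cycles through the whole dataset."""
    return epoch_size * rows_per_unit / dataset_rows


# Hypothetical dataset of 20,000 rows and an epoch size of 1 million.
print(passes_per_epoch(1_000_000, 20_000))
print(passes_per_epoch(1_000_000, 20_000, rows_per_unit=256))
```

Either way the same few positions are revisited many times per epoch, so shrinking the epoch size is the natural adjustment for a small end-to-end test run.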
