KataGo First Benchmarks

RobertJasiek · #1

RTX 4070 Ryzen 7700
katago benchmark -model kata1-b18c384nbt-s6386600960-d3368371862.bin.gz

KataGo 1_13_0 OpenCL

numSearchThreads = 20
visits/s = 1184.02
nnEvals/s = 1011.44
nnBatches/s = 103.52
avgBatchSize = 9.77
6.9 secs
EloDiff +171 (recommended)

KataGo 1_13_0 CUDA (some files copied)

numSearchThreads = 40
visits/s = 450.17
nnEvals/s = 413.07
nnBatches/s = 21.03
avgBatchSize = 19.64
18.6 secs
EloDiff +464 (recommended)

KataGo 1_13_0 CUDA Megapack/Lizzie (most files copied)

numSearchThreads = 32
visits/s = 419.52
nnEvals/s = 374.08
nnBatches/s = 24.14
avgBatchSize = 15.50
19.8 secs
EloDiff +461 (recommended)

KataGo 1_13_1 TensorRT (all missing files copied from LizzieYZY)

numSearchThreads = 40
visits/s = 2879.17
nnEvals/s = 2627.19
nnBatches/s = 132.87
avgBatchSize = 19.77
2.9 secs
EloDiff +343 (recommended)

Code:

Engine     OpenCL   CUDA     TensorRT
visits/s   1184.02  450.17   2879.17
Speed I    2.63     1        6.40
Speed II   1        0.38     2.43
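
The two speed rows can be reproduced from the visits/s row. This is only a sketch of the arithmetic: the function name `relative_speeds` is my own, and the CUDA figure is the one from the first CUDA run.

```python
# visits/s from the benchmark runs above (first CUDA run used for CUDA)
visits_per_s = {"OpenCL": 1184.02, "CUDA": 450.17, "TensorRT": 2879.17}


def relative_speeds(rates, baseline):
    """Divide each engine's visits/s by the baseline engine's visits/s."""
    base = rates[baseline]
    return {engine: round(rate / base, 2) for engine, rate in rates.items()}


speed_i = relative_speeds(visits_per_s, "CUDA")     # "Speed I": CUDA = 1
speed_ii = relative_speeds(visits_per_s, "OpenCL")  # "Speed II": OpenCL = 1

print("Speed I ", speed_i)
print("Speed II", speed_ii)
```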

1) Is visits/s a good measure of KataGo speed for a given position and model net, or what other value should I compare?

2) Why is CUDA much slower than OpenCL?

3) Does visits mean playouts?

4) How good or bad are these values compared to other desktop or laptop GPUs?

5) Do I interpret these results correctly that TensorRT gives me the fastest engine, except for any launch delays?

6) What do the other measured values tell me?
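
One relation among the other values can at least be checked from the numbers above: in all four runs, avgBatchSize matches nnEvals/s divided by nnBatches/s to within rounding. The sketch below only verifies that arithmetic on the posted figures; it does not claim anything about how KataGo computes the value internally.

```python
# (nnEvals/s, nnBatches/s, reported avgBatchSize) copied from the runs above
runs = {
    "OpenCL": (1011.44, 103.52, 9.77),
    "CUDA": (413.07, 21.03, 19.64),
    "CUDA Megapack/Lizzie": (374.08, 24.14, 15.50),
    "TensorRT": (2627.19, 132.87, 19.77),
}

# Derived batch size: evaluations per second divided by batches per second
derived = {name: evals / batches for name, (evals, batches, _) in runs.items()}

for name, (_, _, reported) in runs.items():
    print(f"{name:22s} derived {derived[name]:6.2f}  reported {reported:6.2f}")
```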

kvasir · #2

Did you specify a larger number of playouts (-v 10000 or so) when running these commands? The default appears to be -N 800 but larger batch sizes won't improve performance unless there is a larger number of playouts.

You can also use the 'genconfig' command to run the same search for a good configuration. It has the advantage that it can output a config file that you can then use.

In my experience the TensorRT and CUDA are fairly similar but the best batch size is not the same. I think I remember that TensorRT could perform well with smaller batches than CUDA (I might not remember correctly). I think you should make sure that the number of visits in the test (-v xxxxxx on the command line) is something representative of your preferred workload, not something much larger or much smaller than the number of playouts that you wish to reach quickly. You could alternatively specify the time for each test in seconds with a different flag (I think -i xx).

Btw you can use the '[hide][code]' tag combo to post the entire output from katago without filling the screen and forcing everyone to scan through it. It might help if it is not clear that everything is working as expected.

Edit: the flag is -v, not '-N' or '-n', and that explains why it took so long.

RobertJasiek · #3

I will play with parameters later, try your and other suggestions. Thus far, I have only run the basic "katago benchmark -model" command line. Since I do not really know what values to expect, I have expected none. However, if I naively use the TensorRT visits/s, one 4070 would be thrice as fast as two 2080TIs (with 2 instead of 6 hours to reach 20 million) - can't be, sounds way too good.
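
The "2 instead of 6 hours" estimate can be checked as follows. This is a rough sketch that assumes the benchmark's visits/s would hold constant over a long analysis, which a short benchmark does not guarantee; the function name is my own.

```python
TOTAL_VISITS = 20_000_000
TENSORRT_VISITS_PER_S = 2879.17  # TensorRT benchmark result from post #1


def hours_to_reach(total_visits, visits_per_s):
    """Hours needed to accumulate total_visits at a constant rate."""
    return total_visits / visits_per_s / 3600


hours_4070 = hours_to_reach(TOTAL_VISITS, TENSORRT_VISITS_PER_S)

# Sustained rate implied by needing 6 hours for the same 20 million visits
implied_rate_6h = TOTAL_VISITS / (6 * 3600)

print(f"4070 with TensorRT: {hours_4070:.2f} h")
print(f"Rate implied by 6 h: {implied_rate_6h:.2f} visits/s")
```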

RobertJasiek · #4

For the same engine and the genconfig command, there are, in particular, the two questions

max RAM cache up to ~3GB in addition to whatever current search is using =
number of visits to test/tune performance with =

For the cache, I have tried a) leaving it at the default and b) setting it to 30. In both cases, I have set visits = 10000. The results have been very similar. The prompt says RAM, but does it really mean RAM or maybe VRAM? I have 64GB RAM and 12GB VRAM, so I can assign any amount, but does it matter? During genconfig execution, HWiNFO showed about 5.4 to 9 GB RAM and 0.7 to 2.3 GB VRAM in use. So I guess the RAM cache parameter is almost immaterial.

I have played around a bit with the visits parameter but do not yet have a strategy for varying it across repeated genconfig tests. Which strategy should I use for tuning?

OpenCL and CUDA differ for small tested numbers of threads: OpenCL loads most of the GPU quickly, while CUDA starts at around 54% and needs larger numbers of threads to load most of the GPU. When CUDA eventually does, the GPU hotspot gets hotter, closer to the thermal limit (a modest 84°C) of my defensive graphics card model. (TUF cards are built for longevity rather than heat records.)

RobertJasiek · #5

In an attempt to develop some strategy for my tuning before experimenting with alternative libraries, I find it convenient to tune TensorRT before OpenCL and CUDA: so far TensorRT is the fastest, so it lets me run more tuning tests in the same time.

RobertJasiek · #6

KataGo OpenCL Tuning

I have tested whether using the OpenCL.dll library of KataGo's directory differs from using the OpenCL.dll library of the Windows system directory.

genconfig Query Parameters

Code:

GB   = default
visits   = 10000
seconds   = default

Using C:\katago_OpenCL\OpenCL.dll 2.2.2.0

Presumably, this file is optimised by Nvidia. ProcessExplorer confirms its use.
C:\Windows\System32\OpenCL.dll is not used in addition.

Code:

numSearchThreads  visits/s

05                0839.35
10                1251.00
12                1330.61
16                1458.56
20                1582.83
24                1662.11
32                1683.90
40                1808.78 (recommended)
48                1784.88
64                1824.11

Using C:\Windows\System32\OpenCL.dll 3.0.3.0 put in C:\katago_OpenCL

Alternatively (*), delete C:\katago_OpenCL\OpenCL.dll.
ProcessExplorer confirms use of C:\Windows\System32\OpenCL.dll.

I do not know if this file is optimised by Nvidia.

Code:

numSearchThreads  visits/s

05                0748.23                 0752.72 (*)
10                1178.34                 1263.46 (*)
12                1276.82
16                1449.62
20                1565.77
24                1631.13
32                1700.51
40                1793.91 (recommended)
48                1797.41
64                1871.14

Conclusion

The results are very similar and the recommendation is the same.
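
To put a number on "very similar", the two tables can be compared row by row. This is a sketch using the first column of the second table (the starred re-runs are left out); the variable names are my own.

```python
# visits/s per numSearchThreads, copied from the two tables above
katago_dir_dll = {5: 839.35, 10: 1251.00, 12: 1330.61, 16: 1458.56, 20: 1582.83,
                  24: 1662.11, 32: 1683.90, 40: 1808.78, 48: 1784.88, 64: 1824.11}
system32_dll = {5: 748.23, 10: 1178.34, 12: 1276.82, 16: 1449.62, 20: 1565.77,
                24: 1631.13, 32: 1700.51, 40: 1793.91, 48: 1797.41, 64: 1871.14}

# Percent change going from the KataGo-directory DLL to the System32 DLL
pct_diff = {
    threads: (system32_dll[threads] - rate) / rate * 100
    for threads, rate in katago_dir_dll.items()
}

for threads, diff in pct_diff.items():
    print(f"{threads:2d} threads: {diff:+6.2f} %")
```

On these figures the gap is within about 3 % from 16 threads upwards and is largest at 5 threads, which may be run-to-run noise rather than an effect of the DLL.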
