KataGo First Benchmarks

RobertJasiek
Judan
Posts: 6273
Joined: Tue Apr 27, 2010 8:54 pm
GD Posts: 0
Been thanked: 797 times

KataGo First Benchmarks

Post by RobertJasiek »

Hardware: RTX 4070, Ryzen 7700
katago benchmark -model kata1-b18c384nbt-s6386600960-d3368371862.bin.gz


KataGo 1_13_0 OpenCL

numSearchThreads = 20
visits/s = 1184.02
nnEvals/s = 1011.44
nnBatches/s = 103.52
avgBatchSize = 9.77
6.9 secs
EloDiff +171 (recommended)

KataGo 1_13_0 CUDA (some files copied)

numSearchThreads = 40
visits/s = 450.17
nnEvals/s = 413.07
nnBatches/s = 21.03
avgBatchSize = 19.64
18.6 secs
EloDiff +464 (recommended)

KataGo 1_13_0 CUDA Megapack/Lizzie (most files copied)

numSearchThreads = 32
visits/s = 419.52
nnEvals/s = 374.08
nnBatches/s = 24.14
avgBatchSize = 15.50
19.8 secs
EloDiff +461 (recommended)

KataGo 1_13_1 TensorRT (all missing files copied from LizzieYZY)

numSearchThreads = 40
visits/s = 2879.17
nnEvals/s = 2627.19
nnBatches/s = 132.87
avgBatchSize = 19.77
2.9 secs
EloDiff +343 (recommended)
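
One relation among the reported values can be checked directly from the four runs above: avgBatchSize appears to be nnEvals/s divided by nnBatches/s. A minimal Python sketch, with the numbers copied from the runs above; the dictionary layout and variable names are only for illustration:

```python
# (nnEvals/s, nnBatches/s, reported avgBatchSize), copied from the runs above
runs = {
    "OpenCL": (1011.44, 103.52, 9.77),
    "CUDA": (413.07, 21.03, 19.64),
    "CUDA Megapack": (374.08, 24.14, 15.50),
    "TensorRT": (2627.19, 132.87, 19.77),
}

for name, (evals_per_s, batches_per_s, reported) in runs.items():
    # Average batch size = evaluations per second / batches per second
    derived = evals_per_s / batches_per_s
    print(f"{name:14s} derived={derived:.2f} reported={reported:.2f}")
```

The derived and reported values agree to within rounding for all four engines, so avgBatchSize carries no information beyond the other two columns.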

Engine                 OpenCL   CUDA     TensorRT
visits/s               1184.02  450.17   2879.17
Speed I  (CUDA = 1)    2.63     1        6.40
Speed II (OpenCL = 1)  1        0.38     2.43
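
The two Speed rows are the visits/s values divided by a baseline engine's visits/s (CUDA for Speed I, OpenCL for Speed II). A small Python sketch that reproduces them; the function name `relative_speed` is made up for this example:

```python
# visits/s per engine, copied from the table above
visits_per_s = {"OpenCL": 1184.02, "CUDA": 450.17, "TensorRT": 2879.17}

def relative_speed(baseline: str) -> dict[str, float]:
    """Return each engine's visits/s as a multiple of the baseline engine."""
    base = visits_per_s[baseline]
    return {engine: round(v / base, 2) for engine, v in visits_per_s.items()}

speed_i = relative_speed("CUDA")     # Speed I row
speed_ii = relative_speed("OpenCL")  # Speed II row
print(speed_i)
print(speed_ii)
```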

1) Is visits/s a good measure of KataGo speed for a given position and model net, or what other value should I compare?

2) Why is CUDA much slower than OpenCL?

3) Does visits mean playouts?

4) How good or bad are these values compared to other desktop or laptop GPUs?

5) Do I interpret these results correctly that TensorRT gives me the fastest engine, except for any launch delays?

6) What do the other measured values tell me?
kvasir
Lives in sente
Posts: 1040
Joined: Sat Jul 28, 2012 12:29 am
Rank: panda 5 dan
GD Posts: 0
IGS: kvasir
Has thanked: 25 times
Been thanked: 187 times

Re: KataGo First Benchmarks

Post by kvasir »

Did you specify a larger number of playouts (-v 10000 or so) when running these commands? The default appears to be -N 800 but larger batch sizes won't improve performance unless there is a larger number of playouts.

You can also use the 'genconfig' command to run the same search for good configuration parameters; it has the advantage that it can output a config file that you can then use.

In my experience TensorRT and CUDA perform fairly similarly, but their best batch sizes differ. I think I remember that TensorRT could perform well with smaller batches than CUDA (I might not remember correctly). Make sure that the number of visits in the test (-v xxxxxx on the command line) is representative of your preferred workload, not much larger or much smaller than the number of playouts that you wish to reach quickly. Alternatively, you could specify the time for each test in seconds with a different flag (I think -i xx).
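
As a rough illustration of the -v suggestion, here is a Python sketch that builds the benchmark command line with an explicit visit count. It uses only the -model and -v flags mentioned in this thread; the helper name `benchmark_command` is hypothetical, and actually running the command assumes katago is on the PATH:

```python
import shlex

# Model file name taken from the first post
MODEL = "kata1-b18c384nbt-s6386600960-d3368371862.bin.gz"

def benchmark_command(model: str, visits: int) -> list[str]:
    """Build the argument list for 'katago benchmark' with a chosen visit count."""
    return ["katago", "benchmark", "-model", model, "-v", str(visits)]

cmd = benchmark_command(MODEL, 10000)
print(shlex.join(cmd))
# To run it for real (not done here): subprocess.run(cmd, check=True)
```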

Btw you can use the [code][/code] tag combo to post the entire output from katago without filling the screen and forcing everyone to scan through it. It might help if it is not clear that everything is working as expected.

Edit: it's -v, not -N or -n, and that explains why it took so long.
Last edited by kvasir on Sat Jun 10, 2023 8:46 am, edited 2 times in total.

Re: KataGo First Benchmarks

Post by RobertJasiek »

I will play with the parameters later and try your and other suggestions. Thus far, I have only run the basic "katago benchmark -model" command line. Since I do not really know what values to expect, I have expected none. However, if I naively use the TensorRT visits/s, one 4070 would be thrice as fast as two 2080 Tis (2 instead of 6 hours to reach 20 million visits). That can't be; it sounds way too good.
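
The naive estimate can be reproduced as follows. This Python sketch assumes the benchmark's TensorRT rate would be sustained for the whole run, which is exactly the assumption the post doubts; the 6-hour figure for the two 2080 Tis is taken from the post, and the constant names are only for illustration:

```python
TARGET_VISITS = 20_000_000
TENSORRT_VISITS_PER_S = 2879.17   # benchmark value from the first post
HOURS_ON_TWO_2080TI = 6           # figure quoted in the post above

# Time the 4070 would need at the benchmark rate
hours_4070 = TARGET_VISITS / TENSORRT_VISITS_PER_S / 3600

# Sustained rate implied by 6 hours for 20 million visits
implied_2080ti_rate = TARGET_VISITS / (HOURS_ON_TWO_2080TI * 3600)

speedup = TENSORRT_VISITS_PER_S / implied_2080ti_rate
print(f"4070: {hours_4070:.2f} h, implied 2080 Ti pair: {implied_2080ti_rate:.0f} visits/s, ratio {speedup:.2f}")
```

The result is a little under 2 hours and a ratio of roughly 3, matching the figures in the post; whether a short benchmark rate carries over to a long analysis is a separate question.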

Re: KataGo First Benchmarks

Post by RobertJasiek »

For the same engine, the genconfig command asks, in particular, these two questions:

max RAM cache up to ~3GB in addition to whatever current search is using =
number of visits to test/tune performance with =

I have either (a) left the cache at its default or (b) set it to 30. In both cases, I set visits = 10000. The results were very similar. The prompt says RAM, but does it really mean RAM, or maybe VRAM? I have 64 GB RAM and 12 GB VRAM, so I can assign any amount, but does it matter? During genconfig execution, HWiNFO showed about 5.4 to 9 GB of RAM and 0.7 to 2.3 GB of VRAM in use. So I guess the RAM cache parameter is almost immaterial.

I have played around a bit with the visits parameter but do not yet have a strategy for varying it across repeated genconfig tests. Which strategy should I use for tuning?

OpenCL and CUDA differ for small numbers of tested threads: OpenCL loads most of the GPU quickly, while CUDA starts at around 54% load and needs larger numbers of threads to load most of the GPU. When CUDA eventually does, the GPU hotspot gets hotter and comes closer to the thermal limit (a modest 84°C) of my defensive graphics card model. (TUF cards are built for longevity rather than heat records.)

Re: KataGo First Benchmarks

Post by RobertJasiek »

In an attempt to develop a tuning strategy before experimenting with alternative libraries, I find it convenient to tune TensorRT before OpenCL and CUDA: TensorRT has been the fastest so far, so it lets me run more tuning tests in the time I have.

Re: KataGo First Benchmarks

Post by RobertJasiek »

KataGo OpenCL Tuning


I have tested whether using the OpenCL.dll library of KataGo's directory differs from using the OpenCL.dll library of the Windows system directory.


genconfig Query Parameters

GB	= default
visits	= 10000
seconds	= default

Using C:\katago_OpenCL\OpenCL.dll 2.2.2.0

Presumably, this file is optimised by Nvidia. Process Explorer confirms that it is used and that C:\Windows\System32\OpenCL.dll is not loaded in addition.

numSearchThreads  visits/s

05                0839.35
10                1251.00
12                1330.61
16                1458.56
20                1582.83
24                1662.11
32                1683.90
40                1808.78 (recommended)
48                1784.88
64                1824.11

Using C:\Windows\System32\OpenCL.dll 3.0.3.0, copied into C:\katago_OpenCL

Alternatively (*), delete C:\katago_OpenCL\OpenCL.dll.
ProcessExplorer confirms use of C:\Windows\System32\OpenCL.dll.

I do not know if this file is optimised by Nvidia.

numSearchThreads  visits/s

05                0748.23                 0752.72 (*)
10                1178.34                 1263.46 (*)
12                1276.82
16                1449.62
20                1565.77
24                1631.13
32                1700.51
40                1793.91 (recommended)
48                1797.41
64                1871.14

Conclusion

The results are very similar and the recommendation is the same.
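
To put a number on "very similar", the two tables can be compared row by row. A short Python sketch using the first column of each table (the (*) repeats are left out); the variable names are only for illustration:

```python
threads = [5, 10, 12, 16, 20, 24, 32, 40, 48, 64]
# visits/s with the OpenCL.dll shipped in KataGo's directory
katago_dll = [839.35, 1251.00, 1330.61, 1458.56, 1582.83,
              1662.11, 1683.90, 1808.78, 1784.88, 1824.11]
# visits/s with the OpenCL.dll from the Windows system directory
system_dll = [748.23, 1178.34, 1276.82, 1449.62, 1565.77,
              1631.13, 1700.51, 1793.91, 1797.41, 1871.14]

# Percentage change of the system DLL relative to KataGo's DLL
diff_pct = {t: 100.0 * (s - k) / k
            for t, k, s in zip(threads, katago_dll, system_dll)}

for t, d in diff_pct.items():
    print(f"{t:2d} threads: {d:+.2f}%")
```

With these numbers the gap is largest at 5 threads (about 11%) and stays within about 3% from 16 threads upwards. The (*) repeats at 5 and 10 threads show that single runs vary by several percent on their own, so the small differences should not be over-read.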