All times are UTC - 8 hours [ DST ]




 Post subject: KataGo First Benchmarks
Post #1 Posted: Thu Jun 08, 2023 7:07 am 
Judan

Posts: 6139
Liked others: 0
Was liked: 786
Hardware: RTX 4070 (GPU), Ryzen 7700 (CPU)
Command: katago benchmark -model kata1-b18c384nbt-s6386600960-d3368371862.bin.gz


KataGo 1_13_0 OpenCL

numSearchThreads = 20
visits/s = 1184.02
nnEvals/s = 1011.44
nnBatches/s = 103.52
avgBatchSize = 9.77
6.9 secs
EloDiff +171 (recommended)

KataGo 1_13_0 CUDA (some files copied)

numSearchThreads = 40
visits/s = 450.17
nnEvals/s = 413.07
nnBatches/s = 21.03
avgBatchSize = 19.64
18.6 secs
EloDiff +464 (recommended)

KataGo 1_13_0 CUDA Megapack/Lizzie (most files copied)

numSearchThreads = 32
visits/s = 419.52
nnEvals/s = 374.08
nnBatches/s = 24.14
avgBatchSize = 15.50
19.8 secs
EloDiff +461 (recommended)

KataGo 1_13_1 TensorRT (all missing files copied from LizzieYZY)

numSearchThreads = 40
visits/s = 2879.17
nnEvals/s = 2627.19
nnBatches/s = 132.87
avgBatchSize = 19.77
2.9 secs
EloDiff +343 (recommended)


Code:
Engine                 OpenCL   CUDA     TensorRT
visits/s               1184.02  450.17   2879.17
Speed I (CUDA = 1)     2.63     1.00     6.40
Speed II (OpenCL = 1)  1.00     0.38     2.43
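
The two Speed rows are simply each engine's visits/s divided by a baseline engine's visits/s (CUDA for Speed I, OpenCL for Speed II). A minimal Python sketch of that arithmetic, using only the numbers from the table; the function name `relative_speeds` is made up for illustration:

```python
# visits/s from the benchmark runs above
visits_per_s = {"OpenCL": 1184.02, "CUDA": 450.17, "TensorRT": 2879.17}


def relative_speeds(rates, baseline):
    """Return each engine's visits/s divided by the baseline engine's visits/s."""
    base = rates[baseline]
    return {engine: round(rate / base, 2) for engine, rate in rates.items()}


speed_i = relative_speeds(visits_per_s, "CUDA")     # CUDA = 1
speed_ii = relative_speeds(visits_per_s, "OpenCL")  # OpenCL = 1

print(speed_i)
print(speed_ii)
```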



1) Is visits/s a good measure of KataGo speed for a given position and model net, or what other value should I compare?

2) Why is CUDA much slower than OpenCL?

3) Does visits mean playouts?

4) How good or bad are these values compared to other desktop or laptop GPUs?

5) Do I interpret these results correctly that TensorRT gives me the fastest engine, except for any launch delays?

6) What do the other measured values tell me?
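
On question 6, one relationship can be checked from the numbers alone: in all four runs nnEvals/s is almost exactly nnBatches/s × avgBatchSize, so the batch figures describe how the neural-net evaluations are grouped for the GPU. The sketch below only verifies that arithmetic. Reading visits/s ÷ nnEvals/s as the share of visits that needed no fresh evaluation (for example because of caching) is an assumption, not something the benchmark output states.

```python
# (visits/s, nnEvals/s, nnBatches/s, avgBatchSize) copied from the four runs above
runs = {
    "OpenCL": (1184.02, 1011.44, 103.52, 9.77),
    "CUDA": (450.17, 413.07, 21.03, 19.64),
    "CUDA Megapack": (419.52, 374.08, 24.14, 15.50),
    "TensorRT": (2879.17, 2627.19, 132.87, 19.77),
}

for name, (visits, evals, batches, batch_size) in runs.items():
    reconstructed = batches * batch_size  # should be close to nnEvals/s
    rel_error = abs(reconstructed - evals) / evals
    # Assumed interpretation: a ratio above 1 means some visits
    # did not require a new neural-net evaluation.
    visits_per_eval = visits / evals
    print(f"{name:14s} batches*size={reconstructed:8.2f}  "
          f"nnEvals/s={evals:8.2f}  rel.err={rel_error:.4%}  "
          f"visits/eval={visits_per_eval:.2f}")
```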

Top
 Profile  
 
Offline
 Post subject: Re: KataGo First Benchmarks
Post #2 Posted: Thu Jun 08, 2023 8:35 am 
Lives in sente

Posts: 905
Liked others: 22
Was liked: 168
Rank: panda 5 dan
IGS: kvasir
Did you specify a larger number of playouts (-v 10000 or so) when running these commands? The default appears to be -N 800 but larger batch sizes won't improve performance unless there is a larger number of playouts.

You can also use the 'genconfig' command to run the same search for good configuration parameters. It has the advantage that it can output a config file that you can then use.

In my experience TensorRT and CUDA are fairly similar, but the best batch size is not the same. I think I remember that TensorRT could perform well with smaller batches than CUDA (I might not remember correctly). I think you should make sure that the number of visits in the test (-v xxxxxx on the command line) is representative of your preferred workload, not much larger or much smaller than the number of playouts that you wish to reach quickly. Alternatively, you could specify the time for each test in seconds with a different flag (I think -i xx).

Btw you can use the '[hide][code]' tag combo to post the entire output from katago without filling the screen and forcing everyone to scan through it. It might help if it is not clear that everything is working as expected.

Edit: it's -v, not '-N' or '-n', and that explains why it took so long.
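
For reference, a benchmark invocation with an explicit visit count could be assembled as below. It is only a sketch: it uses the `-model` and `-v` options mentioned in this thread and nothing else, the helper name `build_benchmark_cmd` is hypothetical, and it assumes a `katago` executable is on the PATH. The command is built and printed, not run.

```python
import shlex


def build_benchmark_cmd(model_path, visits):
    """Build the argument list for 'katago benchmark' with a fixed visit count.

    Only the -model and -v options quoted in this thread are used.
    """
    if visits <= 0:
        raise ValueError("visits must be positive")
    return ["katago", "benchmark", "-model", model_path, "-v", str(visits)]


cmd = build_benchmark_cmd(
    "kata1-b18c384nbt-s6386600960-d3368371862.bin.gz", 10000
)
print(shlex.join(cmd))  # shell-style rendering of the command
# To actually run it (assumes katago is installed): subprocess.run(cmd, check=True)
```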


Last edited by kvasir on Sat Jun 10, 2023 8:46 am, edited 2 times in total.
 Post subject: Re: KataGo First Benchmarks
Post #3 Posted: Thu Jun 08, 2023 9:15 am 
Judan

Posts: 6139
Liked others: 0
Was liked: 786
I will play with the parameters later and try your and other suggestions. Thus far, I have only run the basic "katago benchmark -model" command line. Since I do not really know what values to expect, I have expected none. However, if I naively use the TensorRT visits/s, one 4070 would be three times as fast as two 2080 Tis (2 instead of 6 hours to reach 20 million) - can't be, sounds way too good.
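
The 2-hour figure follows directly from the TensorRT rate. Here is a quick check of the arithmetic, assuming visits/s stays constant over the whole run, which a short benchmark does not guarantee. The rate shown for the 6-hour case is derived from the 6 hours quoted above, not measured.

```python
TARGET_VISITS = 20_000_000
TENSORRT_VISITS_PER_S = 2879.17  # from the benchmark in post #1


def hours_to_reach(visits, visits_per_s):
    """Hours needed to accumulate `visits` at a constant rate."""
    return visits / visits_per_s / 3600


def implied_rate(visits, hours):
    """visits/s implied by reaching `visits` in `hours`."""
    return visits / (hours * 3600)


print(f"RTX 4070 TensorRT: {hours_to_reach(TARGET_VISITS, TENSORRT_VISITS_PER_S):.2f} h")
print(f"Implied rate for 6 h: {implied_rate(TARGET_VISITS, 6):.0f} visits/s")
```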

 Post subject: Re: KataGo First Benchmarks
Post #4 Posted: Sun Jun 11, 2023 10:02 am 
Judan

Posts: 6139
Liked others: 0
Was liked: 786
For the same engine, the genconfig command asks, in particular, these two questions:

max RAM cache up to ~3GB in addition to whatever current search is using =
number of visits to test/tune performance with =

I have either a) left the cache at its default or b) set it to 30. In both cases, I have set visits = 10000. The results have been very similar. The prompt says RAM, but does it really mean RAM or maybe VRAM? I have 64GB RAM and 12GB VRAM, so I can assign any amount, but does it matter? During genconfig execution, HWinfo showed ca. 5.4 to 9 GB RAM and 0.7 to 2.3 GB VRAM in use. So I guess the RAM cache parameter is almost immaterial.

I have played around a bit with the visits parameter but do not yet have a strategy for varying it across repeated genconfig tests. Which strategy should I use for tuning?

OpenCL and CUDA differ for small numbers of tested threads: OpenCL loads most of the GPU quickly, while CUDA starts at around 54% and needs larger numbers of threads to load most of the GPU. When CUDA eventually does, the GPU hotspot gets hotter and comes closer to the thermal limit (a modest 84°C) of my conservative graphics card model. (TUF cards are built for longevity rather than heat records.)

 Post subject: Re: KataGo First Benchmarks
Post #5 Posted: Mon Jun 12, 2023 11:14 pm 
Judan

Posts: 6139
Liked others: 0
Was liked: 786
To develop a tuning strategy before experimenting with alternative libraries, I find it convenient to tune TensorRT before OpenCL and CUDA: TensorRT is the fastest so far, so it lets me run more tuning tests in the same amount of time.

 Post subject: Re: KataGo First Benchmarks
Post #6 Posted: Tue Jun 13, 2023 12:41 am 
Judan

Posts: 6139
Liked others: 0
Was liked: 786
KataGo OpenCL Tuning


I have tested whether using the OpenCL.dll library of KataGo's directory differs from using the OpenCL.dll library of the Windows system directory.


genconfig Query Parameters

Code:
GB   = default
visits   = 10000
seconds   = default



Using C:\katago_OpenCL\OpenCL.dll 2.2.2.0

Presumably, this file is optimised by Nvidia. ProcessExplorer confirms its use.
C:\Windows\System32\OpenCL.dll is not loaded as well.

Code:
numSearchThreads  visits/s

05                0839.35
10                1251.00
12                1330.61
16                1458.56
20                1582.83
24                1662.11
32                1683.90
40                1808.78 (recommended)
48                1784.88
64                1824.11



Using C:\Windows\System32\OpenCL.dll 3.0.3.0 copied into C:\katago_OpenCL

Alternatively (*), delete C:\katago_OpenCL\OpenCL.dll.
ProcessExplorer confirms use of C:\Windows\System32\OpenCL.dll.

I do not know if this file is optimised by Nvidia.

Code:
numSearchThreads  visits/s

05                0748.23                 0752.72 (*)
10                1178.34                 1263.46 (*)
12                1276.82
16                1449.62
20                1565.77
24                1631.13
32                1700.51
40                1793.91 (recommended)
48                1797.41
64                1871.14



Conclusion

The results are very similar and the recommendation is the same.
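
To put a number on "very similar", the two tables can be compared row by row. The sketch below uses only the figures posted above. Treating the two (*) repeat measurements as a rough indication of run-to-run noise is an interpretation, not a controlled test.

```python
# visits/s per numSearchThreads, copied from the two tables above
katago_dll = {5: 839.35, 10: 1251.00, 12: 1330.61, 16: 1458.56, 20: 1582.83,
              24: 1662.11, 32: 1683.90, 40: 1808.78, 48: 1784.88, 64: 1824.11}
system_dll = {5: 748.23, 10: 1178.34, 12: 1276.82, 16: 1449.62, 20: 1565.77,
              24: 1631.13, 32: 1700.51, 40: 1793.91, 48: 1797.41, 64: 1871.14}
# repeat runs marked (*): bundled DLL deleted, system DLL used
system_dll_repeat = {5: 752.72, 10: 1263.46}


def percent_diff(new, old):
    """Signed difference of `new` relative to `old`, in percent."""
    return (new - old) / old * 100


diffs = {t: percent_diff(system_dll[t], katago_dll[t]) for t in katago_dll}
noise = {t: percent_diff(system_dll_repeat[t], system_dll[t])
         for t in system_dll_repeat}

for t, d in diffs.items():
    print(f"{t:2d} threads: system vs bundled DLL {d:+6.2f} %")
for t, n in noise.items():
    print(f"{t:2d} threads: repeat vs first system-DLL run {n:+6.2f} %")
```

At the recommended 40 threads the two DLLs differ by less than 1 %, and the gap of about 7 % between the two system-DLL runs at 10 threads is of the same order as the largest differences between the DLLs at low thread counts.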
