Nvidia RTX 30xx
-
lightvector
- Lives in sente
- Posts: 759
- Joined: Sat Jun 19, 2010 10:11 pm
- Rank: maybe 2d
- GD Posts: 0
- Has thanked: 114 times
- Been thanked: 916 times
Re: Nvidia RTX 30xx
In case this helps people make an informed decision:
* For KataGo and probably Leela Zero and almost any other Go bots, tons of GPU memory could be useful at training time if you are a developer, but it is not useful at runtime for users just using the bot. You only need enough to have the buffers to handle the largest batch you will ever handle at once, and anything more doesn't help. And the amount you need for the largest batch you'll ever handle isn't that big. For many practical use cases, often less than 1 GB. (Handwavey intuition: Go boards are *tiny*. 19x19 is really a very tiny "image", so while they use big fat nets with lots of rich channels in parallel on that "image" and want to do big batches in parallel... it's still not a heavy memory load on one of these top-of-the-line GPUs.)
* SLI has no value. The point of it in graphics I presume is to allow the GPUs to cooperate in splitting up the rendering of single scenes and sharing the work. But in Go, you aren't evaluating just one position with the net, you're evaluating millions, and the calculations can be done independently. If you have multiple GPUs, you just send them positions to evaluate in parallel. KataGo should do this if you configure it to use multiple GPUs.
The actual limiting factors are GPU memory bandwidth (especially internally - how much data you can quickly shuttle back and forth between the GPU's RAM and the GPU's calculation units like its tensor cores) and GPU compute throughput (how fast you can actually do the computations once the GPU memory is loaded). Depending on which of these two is more limiting in a given practical situation, sometimes benchmarks can show exciting huge improvements that actually only give mild gains because they improved the one that wasn't limiting. Or sometimes they can be true huge improvements, if they improved the limiting one.
And yes, CPU might also become limiting if the GPU is beefy enough. Having a lot of powerful CPU cores could be important to keep up with doing the MCTS and input feature calculation fast enough to keep the GPU fed.
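The "tiny image" intuition above can be sketched numerically. All the architecture figures here (256 channels, FP16 values, roughly 8 layers of activations resident at once) are illustrative assumptions for the sake of the estimate, not KataGo's actual internals:

```python
# Back-of-envelope sketch of why inference VRAM needs are small for Go.
# All architecture numbers here (256 channels, ~8 resident layers of
# activations, FP16 values) are illustrative assumptions, not KataGo's
# actual internals.

def batch_activation_bytes(batch_size, channels, board_size=19,
                           bytes_per_value=2):
    """Memory for one layer's activations over a whole batch (FP16)."""
    return batch_size * channels * board_size * board_size * bytes_per_value

per_layer = batch_activation_bytes(batch_size=256, channels=256)
total_gb = per_layer * 8 / 1e9  # well under 1 GB even for a 256-position batch
print(f"~{total_gb:.2f} GB of activations for a 256-position batch")
```

Even with generous assumptions, the activation memory for a large batch stays far below the VRAM of any of the cards discussed in this thread.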
Last edited by lightvector on Sat Sep 05, 2020 8:44 am, edited 1 time in total.
-
RobertJasiek
- Judan
- Posts: 6272
- Joined: Tue Apr 27, 2010 8:54 pm
- GD Posts: 0
- Been thanked: 797 times
- Contact:
Re: Nvidia RTX 30xx
With "usually stronger playing than 9p" I mean "always wins unless we have a constructed position with a ladder / semeai / mathematical endgame / non-standard ko strategy problem or the like".
Time settings: like in typical tournament, casual game or server game.
And no, time is not everything - storage is also a factor. :) 10-11 GB versus 22-24 GB VRAM might make the difference.
-
RobertJasiek
- Judan
- Posts: 6272
- Joined: Tue Apr 27, 2010 8:54 pm
- GD Posts: 0
- Been thanked: 797 times
- Contact:
Re: Nvidia RTX 30xx
lightvector, many thanks, very helpful!
The following questions are about using nets - not about training them.
RAM: So 8GB VRAM is more than enough. How much RAM of the mainboard do you recommend? More than 8GB, I suppose, but would already 16GB be enough for 8-12GB VRAM or 32GB enough for 22-24GB VRAM? Would more RAM of the mainboard be only useful for training?
"SLI has no value": Does this also mean that having two graphics cards without SLI is useless?
How many real CPU cores do you recommend together with RTX 3080 or 3090?
Memory bandwidth: For RTX 2080 Ti/3080/3090, according to Nvidia, we have 616/760/936 GB/s. So the better cards are indeed somewhat better. RTX 3080/3090 are better than RTX 3070 due to GDDR6X instead of GDDR6. IIRC, RTX 3090 is even better than the 3080 here due to a wider memory bus.
"GPU compute throughput": What parameters are relevant for this? Both Tensor TFlops and ALU (aka CUDA aka shader) TFlops? For RTX 2080 Ti/3080/3090, according to Nvidia, we have 114/238/285 Tensor TFlops and 0.42/0.93/1.11 ALU TFlops (64-bit). So the RTX 30xx series appears to be much better than the RTX 2080 Ti, and the RTX 3090 slightly better than the RTX 3080. Even if the raw figures promise more than 2x acceleration and in practice only 1.5x can be achieved, it would be a huge improvement, IMO.
"Depending on which of these two is more limiting in a given practical situation, sometimes benchmarks can show exciting huge improvements that actually only give mild gains because they improved the one that wasn't limiting. Or sometimes they can be true huge improvements, if they improved the limiting one.": So RTX 3080 and 3090 might be similar in practice or 3090 might sometimes be significantly better. A gamble game given the steep price increment.
-
explo
- Dies with sente
- Posts: 108
- Joined: Wed Apr 21, 2010 8:07 am
- Rank: FFG 1d
- GD Posts: 0
- Location: France
- Has thanked: 14 times
- Been thanked: 18 times
Re: Nvidia RTX 30xx
We (here) don't have access to top pros, but I share Uberdude's feeling. I have a GTX 1660 Ti at home and I would bet on it (paired with KataGo) against any professional on any time settings.
RobertJasiek wrote: With "usually stronger playing than 9p" I mean "always wins unless we have a constructed position with a ladder / semeai / mathematical endgame / non-standard ko strategy problem or the like".
Time settings: like in typical tournament, casual game or server game.
And no, time is not everything - storage is also a factor:) 10-11GB versus 22-24GB VRAM might make the difference.
If I remember correctly, ez4u has posted screenshots of positions with up to a million playouts, and he has a GTX 1650, so you can already get quite far with 6 GB of VRAM. Last weekend I followed the European team championship finals live; I didn't see any drop in performance after 2-3 hours. My CPU is a 6700K with 16 GB of RAM.
I understand that KataGo with 2x RTX 2080 Ti is much stronger than what I have and I'm tempted to upgrade, but deep down I know I don't need the extra power. Any extra refinement to the moves KataGo would pick is beyond my understanding after a few thousand playouts. I upgraded from a GTX 1050 last year, so performance-wise it was twice as fast, but I don't think it made a difference when reviewing my games.
-
RobertJasiek
- Judan
- Posts: 6272
- Joined: Tue Apr 27, 2010 8:54 pm
- GD Posts: 0
- Been thanked: 797 times
- Contact:
Re: Nvidia RTX 30xx
I don't want to compromise (more than a bit) and will probably choose the RTX 3080. Maybe I will wait for AMD's Zen 3, which is supposed to appear this year (Intel's top CPUs are too expensive and inefficient per watt).
-
lightvector
- Lives in sente
- Posts: 759
- Joined: Sat Jun 19, 2010 10:11 pm
- Rank: maybe 2d
- GD Posts: 0
- Has thanked: 114 times
- Been thanked: 916 times
Re: Nvidia RTX 30xx
Correct, 8GB VRAM should already be more than you'd ever need with any current Go programs, and more will not be useful. As for ordinary RAM on your computer, not on the GPU, it mostly only matters if you intend to do large numbers of playouts. There is some fixed overhead for various internal things, as well as some limited overhead per thread, plus the cost of storing the neural net weights itself - maybe we can handwave the total of these as around a GB. Beyond that, the only part of KataGo's memory usage that scales indefinitely is the cost of storing neural net evaluation results.
RobertJasiek wrote: lightvector, many thanks, very helpful!
The following questions are about using nets - not about training them.
RAM: So 8GB VRAM is more than enough. How much RAM of the mainboard do you recommend? More than 8GB, I suppose, but would already 16GB be enough for 8-12GB VRAM or 32GB enough for 22-24GB VRAM? Would more RAM of the mainboard be only useful for training?
KataGo uses about 1.5kb per neural net evaluation in the MCTS tree (3kb if you have ownership prediction turned on). It also stores a cache of evaluation results that by default is sized to about a million entries (specifically, 2^20), which you can change by editing your gtp.cfg ("nnCacheSizePowerOfTwo"), so using about 1.5 GB. This cache is used to speed up repeated calculations, such as if you have just analyzed some moves in a game and then interactively scroll backward in the game to re-analyze some earlier moves. It also speeds up a single search somewhat if there are many upcoming transpositions, but this is not usually a big effect, perhaps 20% speedup on average.
So, if you wanted to search 3 million playouts on a single move, you will need a minimum of 4.5 GB of RAM (9GB if ownership is on). If you want to spend tens to hundreds of thousands of playouts per move for each consecutive move in a game, and then return to earlier moves in an analysis program and benefit from a somewhat faster search the second time due to caching, you'll need a cache big enough to hold the sum of the number of playouts of all the moves in between (and ideally, a reasonable constant factor larger than that).
Beyond that, there's no use for extra RAM. You'll of course want enough for your operating system, your browser, any other things on your computer running at the same time as Go-related stuff. But for Go, having extra RAM beyond what you'll need for the fixed overheads plus the playouts you want plus your cache will not help, and hopefully based on the above you can estimate how much that will be.
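The sizing rule above can be condensed into a tiny calculator. This takes the per-evaluation sizes as kilobytes (1.5 kB per stored result, 3 kB with ownership prediction on, as the thread later clarifies) and is only a back-of-envelope sketch, not KataGo's exact accounting:

```python
# Quick check of the RAM figures above, taking the per-evaluation sizes
# as kilobytes: 1.5 kB per stored net evaluation, 3 kB with ownership
# prediction on. A back-of-envelope sketch only.

def eval_storage_gb(playouts, ownership=False):
    """RAM needed just to store `playouts` neural-net results."""
    per_eval_kb = 3.0 if ownership else 1.5
    return playouts * per_eval_kb / 1e6  # kB -> GB

print(f"{eval_storage_gb(3_000_000):.1f} GB")                   # 4.5 GB
print(f"{eval_storage_gb(3_000_000, ownership=True):.1f} GB")   # 9.0 GB
```

The two printed figures match the 4.5 GB / 9 GB numbers quoted for a 3-million-playout search; fixed overheads (net weights, per-thread buffers, OS) come on top.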
No, of course not, you just use both GPUs. As I understand it, SLI is some magic where, for certain specific tasks, the GPUs work together on a *single* task to do it faster. Probably not 2x faster, but somewhat faster. But whether you have SLI or not, obviously if you have multiple tasks instead of just one, you just give both GPUs different tasks in parallel and get 2x throughput. In an MCTS search, you always have multiple tasks: different GPUs can evaluate different nodes in the tree. So SLI is unnecessary and useless.
RobertJasiek wrote: "SLI has no value": Does this also mean that having two graphics cards without SLI is useless?
Not sure. The only sure way would be to experiment. You might be able to get some idea by running on a weaker GPU after tuning for optimal threads and performance in other ways, and monitoring the CPU usage while an MCTS search is ongoing. The GPU will probably be the bottleneck, but you can see how much load is on the CPU to achieve bottlenecking on the GPU, and then extrapolate based on how much faster the new GPU is expected to be from benchmarks, or on the reported experience of other users who bought it before you.
RobertJasiek wrote: How many real CPU cores do you recommend together with RTX 3080 or 3090?
-
Mike Novack
- Lives in sente
- Posts: 1045
- Joined: Mon Aug 09, 2010 9:36 am
- GD Posts: 0
- Been thanked: 182 times
Re: Nvidia RTX 30xx
It must be because I am old.
Sorry, but the amount of core storage is also a matter of time. I'm old enough to remember "paging" (only a portion of the entire memory space used by a program in core at one time, paged in and out of external storage as needed).
BTW, that applies to the infinite memory requirement of a Turing Machine (or a Wang Machine where it is two unbounded stacks). If you were emulating either of these your computer would not need much core. Neither of these imaginary machines jumps around in memory, so simply page in or out from external storage as either end of what is in core is approached.
-
RobertJasiek
- Judan
- Posts: 6272
- Joined: Tue Apr 27, 2010 8:54 pm
- GD Posts: 0
- Been thanked: 797 times
- Contact:
Re: Nvidia RTX 30xx
lightvector, thank you again!
It is crucial whether with 'b' you mean bit or Byte...! Which?
(Strict use should be b = bit, B = Byte, k = 1000, K = 1024, but for M = mega it is ambiguous because m = milli.)
Presumably, I would sometimes want to do, say, 20,000,000 playouts. With up to 3 kb (3000 bits) per neural net evaluation, or 375 B (bytes), this gives almost 8 GB. Plus 1.5 GB cache by default. Plus 6.5 GB for the operating system / other programs. So 16 GB would barely do.
"if you wanted to search 3 million playouts on a single move, you will need a minimum of 4.5 GB of RAM": IIUYC, of which 1.5 GB is cache, so 3 GB for the playouts themselves, or 1000 B (bytes) per playout. Now, I rather think that you meant 3 kB (3000 bytes) when writing 3 kb.
If you meant bytes, for circa 20,000,000 playouts, I would need 60 GB plus 1.5 GB plus, cheating, 2.5 GB. So 64 GB will do. That's also what 'goame' has suggested.
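Under the bytes reading, the arithmetic above checks out; a quick sketch using only the figures quoted in this thread:

```python
# Check of the 20,000,000-playout estimate above, under the reading that
# "3 kb" means 3 kB (3000 bytes) per evaluation with ownership on.
playouts = 20_000_000
per_eval_bytes = 3_000
tree_gb = playouts * per_eval_bytes / 1e9
print(f"{tree_gb:.0f} GB")  # 60 GB for the stored evaluations alone
```

60 GB for the evaluations, plus cache and fixed overheads, indeed lands in 64 GB territory.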
***
The journal c't gave a graph for CPU cores for deep learning: 4 were too few (above linear increment), 6 considered the minimum (roughly linear increment), 8 better, 10 to 16 still slightly better but slowly approaching constant for >>16 cores (curve becoming somewhat flat clearly below linear increment). Of course, it depends on GPU speed and the type of programs used for deep learning. So I presume that the sweet range is 6 to 16 real cores, where significant factors are money and cooling, e.g.:
Code: Select all
3700X / 8 / €260 / 65W
3900X / 12 / €400 / 105W
3950X / 16 / €700 / 105W
-
thirdfogie
- Lives with ko
- Posts: 131
- Joined: Tue May 15, 2012 10:08 am
- Rank: British 3 kyu
- GD Posts: 0
- KGS: thirdfogie
- Has thanked: 151 times
- Been thanked: 30 times
Re: Nvidia RTX 30xx
Robert,
Thanks for starting this thread. The information will be useful when I come to replace my current PC, possibly a year from now.
It seems likely that lightvector meant bytes not bits. Hopefully, he will clear that up.
My question is: why do you want a superhuman Go-playing set-up? Of course, there's nothing wrong with that ambition, but I'm wondering if you plan to check all the statements in your published books. I used Leela Zero in that way to validate my Catalogue of Calamities, but on much less text and at 9 stones weaker in human strength.
-
RobertJasiek
- Judan
- Posts: 6272
- Joined: Tue Apr 27, 2010 8:54 pm
- GD Posts: 0
- Been thanked: 797 times
- Contact:
Re: Nvidia RTX 30xx
Teaching by example input from professional players has taught me very little, also because they have differing or even contradicting opinions. Programs of superhuman strength should make more useful suggestions. Although I very much prefer to learn from theory, most theory for my learning does not exist (or I and others need to invent it), so for improving beyond 5d, learning by examples cannot be avoided.
thirdfogie wrote: why do you want a superhuman Go-playing set-up?
I expect superhuman programs to help me with opening, middle game fighting, advanced use of influence / potential and identifying more of my blunders than I can detect by myself. Furthermore, programs can be used for backtracking to deepen understanding of sources of mistakes. Programs should punish and therefore implicitly reveal knowledge gaps or insufficient skills.
I do not expect much for life+death reading (because book problems will do, reading requires practical effort and programs do not reveal reading and its decision-making well) and endgame (because programs do not teach values and may play suboptimally intentionally).
While my positional judgement is reasonable, programs will implicitly convey their own alternative, also reasonable positional judgement.
Programs are no panacea but will offer me new insights.
Mid-professional-level programs would hardly help me, so I will make sure to build hardware for superhuman level.
It is also possible to analyse programs' play to derive new go theory, especially since I am very strong at generalising something good occurring consistently.
Since I have not written specifically about the opening yet, many middle game examples are from pro games, I check everything as well as I can, and quite a lot even relies on proven or otherwise well-developed theory, I expect only occasional significant mistakes. However, I am curious about a few particularly sophisticated examples, such as the triple ladder position, for which I expect the programs to fail.
thirdfogie wrote: check all the statements in your published books.
-
lightvector
- Lives in sente
- Posts: 759
- Joined: Sat Jun 19, 2010 10:11 pm
- Rank: maybe 2d
- GD Posts: 0
- Has thanked: 114 times
- Been thanked: 916 times
Re: Nvidia RTX 30xx
Yes, I meant bytes, not bits in my posts above. But otherwise, yep, it sounds like you understand now roughly how to determine RAM usage.
Just one further detail - if a neural net evaluation is both in the cache and in the current MCTS search tree, it uses only 1 position's worth of RAM (1.5 kB or 3 kB depending on if you're tracking predicted ownership), not two positions' worth. The cache and MCTS tree both just point to same evaluation result in memory, rather than copying it. So generally, you can make the cache also big enough to fill most of the rough amount of RAM you plan to use, without worrying that it will significantly "compete" with the MCTS search tree for space. They will just share pointers to all the same however-many-millions of results.
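The pointer sharing described above can be illustrated with a minimal sketch. All names here are invented for illustration; they are not KataGo's actual classes:

```python
# Minimal sketch of the pointer sharing described above: the evaluation
# cache and the MCTS tree reference the same result object, so a
# position costs memory only once. All names here are invented for
# illustration; they are not KataGo's actual classes.

class NetResult:
    """One neural-net evaluation result."""
    def __init__(self, policy, value):
        self.policy = policy  # move probabilities
        self.value = value    # win-rate estimate

cache = {}  # position hash -> NetResult

class TreeNode:
    def __init__(self, pos_hash):
        # Reuse the cached result if present; otherwise evaluate (stubbed).
        if pos_hash not in cache:
            cache[pos_hash] = NetResult(policy=[0.5, 0.5], value=0.5)
        self.result = cache[pos_hash]  # shared reference, not a copy

a = TreeNode(0x1234)
b = TreeNode(0x1234)  # transposition: the same position reached again
assert a.result is b.result  # one object in memory, two pointers to it
```

Because both structures hold references rather than copies, sizing the cache generously does not double the memory cost of results that are also in the tree.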
-
RobertJasiek
- Judan
- Posts: 6272
- Joined: Tue Apr 27, 2010 8:54 pm
- GD Posts: 0
- Been thanked: 797 times
- Contact:
Re: Nvidia RTX 30xx
If a CPU SoC has integrated graphics and we also have a dedicated (Nvidia) graphics card, will the NN programs automatically use the latter? Is this the task of the operating system, or must we set the programs' configuration files correctly?
-
RobertJasiek
- Judan
- Posts: 6272
- Joined: Tue Apr 27, 2010 8:54 pm
- GD Posts: 0
- Been thanked: 797 times
- Contact:
Re: Nvidia RTX 30xx
This page has a lot of background information on deep learning:
https://timdettmers.com/2020/09/07/whic ... -learning/
In particular, it states relative performance for convolutional neural nets:
Code: Select all
2080 Ti (baseline)
3070    ~ +12%
3080    ~ +40%
3090    ~ +57%
Compared to the 3080, the 3090 is thus only ~ +12% faster (at +114% the price).
-
RobertJasiek
- Judan
- Posts: 6272
- Joined: Tue Apr 27, 2010 8:54 pm
- GD Posts: 0
- Been thanked: 797 times
- Contact:
Re: Nvidia RTX 30xx
In case you tried my link and met its server being down yesterday, try again now! That webpage is really worth reading!
-
quantumf
- Lives in sente
- Posts: 844
- Joined: Tue Apr 20, 2010 11:36 pm
- Rank: 3d
- GD Posts: 422
- KGS: komi
- Has thanked: 180 times
- Been thanked: 151 times
Re: Nvidia RTX 30xx
+1
RobertJasiek wrote: In case you tried my link and met its server being down yesterday, try again now! That webpage is really worth reading!