KataGo gatekeeper throws error CUBLAS_STATUS_NOT_SUPPORTED

gcao
Dies in gote
Posts: 25
Joined: Sat Feb 22, 2020 11:03 am
Rank: AGA 6D
GD Posts: 0
Been thanked: 2 times

KataGo gatekeeper throws error CUBLAS_STATUS_NOT_SUPPORTED

Post by gcao »

Hi @lightvector,

Hope this finds you well!

Not sure whether you remember me. Two years ago I spent a few months trying to set up KataGo on my laptop to train a model to play Go, and also worked on adapting KataGo to play one of Go's variants, Daoqi. However, I wasn't able to get very far because I didn't have a decent GPU and it was too expensive to get one.

Now, two years later, GPUs are more affordable, so I built a brand new machine with an AMD Ryzen 9 5900X + Nvidia GeForce RTX 3080 Ti (12GB) + 64GB RAM. I installed Ubuntu 20.04 with CUDA 11.7.1, cuDNN 8.4.0, Python 3.7, TensorFlow 1.15, etc. I was able to compile KataGo with the CUDA backend and run synchronous_loop.sh. The selfplay, shuffle, train, etc. steps worked fine. However, the gatekeeper is throwing the error below. I understand gatekeeper is optional, but I guess this error might also occur when I run the model. I wonder what I should do to fix it. Any help would be highly appreciated.

Code:

...
2022-05-24 10:57:03-0400: Game loop thread 127 starting game testing candidate: mbp-s656768-d204361
terminate called after throwing an instance of 'StringError'
  what():  CUBLAS Error, for ginputw file /home/gcao/KataGo2/cpp/neuralnet/cudabackend.cpp, func cublasHgemm( cudaHandles->cublas, CUBLAS_OP_N, CUBLAS_OP_N, outChannels, batchSize, inChannels, alpha, (const half*)matBuf,outChannels, (const half*)inputBuf,inChannels, beta, (half*)outputBuf,outChannels ), line 663, error CUBLAS_STATUS_NOT_SUPPORTED
Aborted (core dumped)
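
For anyone triaging a similar abort: the call named in the message, cublasHgemm, is cuBLAS's half-precision (FP16) matrix multiply, which hints that the failure is on the FP16 path. Below is a minimal sketch, assuming the message keeps the layout shown in the log above, that pulls the function name, source line, and status out of such a line. The helper name `parse_cublas_error` and the `fp16_call` heuristic are illustrative assumptions, not part of KataGo.

```python
import re

# Pattern assumes the layout seen in the log above:
# "CUBLAS Error, for <label> file <path>, func <name>( args ), line <n>, error <STATUS>"
_CUBLAS_ERR = re.compile(
    r"CUBLAS Error, for (?P<label>\S+) file (?P<file>\S+), "
    r"func (?P<func>\w+)\(.*\), line (?P<line>\d+), error (?P<status>\w+)"
)


def parse_cublas_error(message):
    """Return a dict describing the failing cuBLAS call, or None if no match."""
    match = _CUBLAS_ERR.search(message)
    if match is None:
        return None
    info = match.groupdict()
    info["line"] = int(info["line"])
    # Heuristic (an assumption of this sketch): the "H" in cublasHgemm marks
    # the half-precision variant, so flag it as an FP16 call.
    info["fp16_call"] = info["func"].startswith("cublasH")
    return info
```

If `fp16_call` comes back true for a NOT_SUPPORTED status, trying the run again with FP16 disabled is a cheap first experiment.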
lightvector
Lives in sente
Posts: 759
Joined: Sat Jun 19, 2010 10:11 pm
Rank: maybe 2d
GD Posts: 0
Has thanked: 114 times
Been thanked: 916 times

Re: KataGo gatekeeper throws error CUBLAS_STATUS_NOT_SUPPORT

Post by lightvector »

That's a little surprising. I don't know. Some thoughts:

* I have never tested KataGo with CUDA 11.7.1. You may notice the release is back at 11.1 or 11.2 (https://github.com/lightvector/KataGo/r ... ag/v1.11.0), but I've also successfully used CUDA 11.4 (along with cuDNN 8.2.4). Does installing a side-by-side downgraded CUDA 11.4 and cuDNN 8.2.4 and using that instead work for you?

(As a side note, if you're on Linux, https://www.iridescent.io/tech-blogs-in ... right-way/ is a good guide, although slightly out of date, to installing CUDA in a way that won't bork future attempts to upgrade/downgrade, easily allows having multiple side-by-side versions installed at once, etc. In general the secret is to use the runfile version. I've used the deb version in the past and it always leaves apt packages in a messy state when I try to change versions. Indeed, the runfile version is also the one you can install without sudo: https://stackoverflow.com/questions/674 ... thout-sudo, i.e. you can do it in an entirely local and self-contained way.)

* Does KataGo's OpenCL version work for you and use your GPU successfully? (this might distinguish a GPU/GPU-driver issue from a CUDA-library-level issue).

* Instead of running gatekeeper right away, how about just running plain old KataGo benchmark, or hooking up to any popular game analysis GUI and just doing plain game analysis?

* Does it work if you disable FP16 in the config? (e.g. cudaUseFP16 = false in the config)

* There is some chance that another user in the Discord (https://discord.gg/45EWcZu7) will have seen a similar error and can help you troubleshoot.

Re: KataGo gatekeeper throws error CUBLAS_STATUS_NOT_SUPPORT

Post by gcao »

Thanks a lot. I did try running the benchmark and got the same error. I'll try the downgrade and the other suggestions.

Re: KataGo gatekeeper throws error CUBLAS_STATUS_NOT_SUPPORT

Post by gcao »

I set cudaUseFP16 to false, and both gatekeeper and benchmark then worked fine.
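
For anyone who wants to script the same workaround, here is a minimal sketch that forces cudaUseFP16 = false in the text of a config file. It assumes the config uses plain `key = value` lines with `#` comments. The helper name `set_config_value` is hypothetical and not part of KataGo.

```python
import re


def set_config_value(config_text, key, value):
    """Return config_text with `key = value` set.

    Assumes 'key = value' lines and '#' comments. An active (uncommented)
    assignment is replaced in place; otherwise a new line is appended.
    """
    pattern = re.compile(r"^[ \t]*" + re.escape(key) + r"[ \t]*=.*$", re.MULTILINE)
    new_line = f"{key} = {value}"
    if pattern.search(config_text):
        # A function replacement avoids backslash escapes in `value` being interpreted.
        return pattern.sub(lambda _match: new_line, config_text)
    separator = "" if (not config_text or config_text.endswith("\n")) else "\n"
    return config_text + separator + new_line + "\n"
```

A typical use would be to read the config the gatekeeper or benchmark run points at, call `set_config_value(text, "cudaUseFP16", "false")`, and write the result back. Commented-out lines are left alone, so the original default stays visible next to the override.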