Thanks for sharing/testing.
You still haven't said if you're using the OpenCL or the CUDA version, but if you're using the CUDA version with a GPU that has tensor cores (such as an RTX 2080), you want to set cudaUseFP16 and cudaUseNHWC both to true - neither is currently set in your config.
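For the tensor-core case, the two lines would look roughly like this. This is a sketch assuming the same `key = value` format as the rest of your config, and it only applies to the CUDA version:

```
# CUDA version, GPU with tensor cores (e.g. RTX 2080)
cudaUseFP16 = true
cudaUseNHWC = true
```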
But if you're using the CUDA version on a modern but not quite as cutting-edge GPU that doesn't have tensor cores but still has some FP16 support (for example the GTX 10xx series, I think?), then setting them either won't work or won't help much, I think. And if you're using the OpenCL version, that version doesn't have FP16 support at all. It would be straightforward to implement, I've just never gotten around to it. So assuming you're running ELF on Leela Zero's engine, I would expect ELF to be a little better in these cases, particularly because Leela Zero's engine has code that takes advantage of limited FP16 support even when tensor cores are not available.
Your dynamicScoreUtilityFactor has been raised quite a bit from the default - I'm not entirely sure what effect that will have. The default GTP config should have come with 0.2 and 0.2 for static and dynamic, but you can also try 0.0 and 0.4, which is actually what is used in training. You have 0.2 and 0.5, which puts a lot of weight on score compared to winning/losing.
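Concretely, the two variants I mean would be something like the following. I'm assuming here that the static counterpart is named staticScoreUtilityFactor, so check that against the name in your own config file:

```
# What the default GTP config should have come with
staticScoreUtilityFactor = 0.2   # parameter name assumed, see above
dynamicScoreUtilityFactor = 0.2

# Alternative to try: the values used in training
# staticScoreUtilityFactor = 0.0
# dynamicScoreUtilityFactor = 0.4
```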
(Edit: Also numNNServerThreadsPerModel = 2 is interesting if you only have one GPU. If you've specifically benchmarked the difference between setting it to 2 instead of the default of 1, and found it better, great! If you haven't - then I'm not sure why you have a non-default value here).
Besides that, your config looks okay. It's hard to compare the numbers you gave because of the visits versus playouts difference, assuming you do mean "visits" vs "playouts" the way LZ people usually mean - tree reuse can cause the relationship between the two to vary wildly. But I'd guess ELF and KataGo should each be able to win a decent number of games against the other. At fixed playouts and smaller numbers of threads on each side, I know they are generally fairly similar. Which one is better at fixed time is then a matter of things like the hardware and implementation details above, which can make as much as a factor of 2 difference in performance one way or the other. That is not small: a factor of 2 is easily more than 100 Elo.
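To give a feel for what 100 Elo means in games won, here is a small sketch using the standard logistic Elo formula. The function name is just for illustration, and it ignores draws, which is a reasonable simplification for Go:

```python
def expected_score(elo_diff: float) -> float:
    """Expected win rate of the stronger side under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


if __name__ == "__main__":
    # Print the expected win rate for a few Elo gaps.
    for diff in (0, 50, 100, 200):
        print(f"{diff:>3} Elo -> {expected_score(diff):.1%} expected win rate")
```

Under this model a 100 Elo gap works out to winning roughly 64% of games, so the weaker side still wins a decent share of them.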
When bots are otherwise close, it's hard to make a blanket statement about what will be best or which bot "is stronger" - messy configuration and hardware details on both sides and simple statistical noise can have a pretty big effect case by case.
Hope that helps?
