Life In 19x19 :: Engine Tournament

Author:	q30 [ Sat Jun 19, 2021 3:45 am ]
Post subject:	Re: Engine Tournament
As a result of the now-finished LeelaZero distributed training project, the strongest weight file is leelaz-model-swa-4-32000_quantized.txt, not the one in best-network.gz (details). It won 55 games and lost 39 (59%:41%), whereas best-network won 43 games and lost 39 (52%:48%). This not-so-small sample shows that chasing large sample sizes at the expense of matching real conditions may lead to slightly wrong results.
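For readers who want to gauge how much the two scores above can be told apart, here is a minimal sketch that computes a 95% Wilson score interval for each win rate. It assumes the games are independent and that draws are not counted; the function name and labels are illustrative, not anything from the original tests. Under those assumptions both intervals contain 50% and overlap heavily.

```python
from math import sqrt

def wilson_interval(wins, losses, z=1.96):
    """Approximate 95% Wilson score interval for the true win rate.

    Assumes independent games and ignores draws (both are assumptions).
    """
    n = wins + losses
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Scores quoted in the post above.
for label, wins, losses in [("swa-4-32000", 55, 39), ("best-network", 43, 39)]:
    low, high = wilson_interval(wins, losses)
    print(f"{label}: {wins}-{losses}, win rate between {low:.1%} and {high:.1%}")
```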

Author:	q30 [ Sat Jul 10, 2021 1:11 am ]
Post subject:	Re: Engine Tournament
The ranking of SAI weight files as of the beginning of 2021 (details):
"bantamweight"       <|= 2 ^ 23 B (< 12 MiB)     - rw19x19.txt (5)
"featherweight"      2 ^ 24 B (12 - 24 MiB)      - I don't have one
"lightweight"        2 ^ 25 B (24 - 48 MiB)      - b12a30551826858ce24a21e48cf4c20fea3c25bacba00c01c9763d020908185e (4)
"welterweight"       2 ^ 26 B (48 - 96 MiB)      - 4433b9162a5ad473120d0731b951b649829e0c155e0590f9f1e51a808f5a3263 (3)
"middleweight"       2 ^ 27 B (96 - 192 MiB)     - 87022c79f36dc8dc81c7b7aa1f1f250858b44d137434304b910dd3a60612716a (2)
"light heavyweight"  2 ^ 28 B (192 - 384 MiB)    - af94bb0c79cc88d2ce8ca6459f5d1b603e9053182d5a47e457cc3af9180f66f1 (1)
"heavyweight"        2 ^ 29 B (384 - 768 MiB)    - I don't have one
"super heavyweight"  >|= 2 ^ 30 B (> 768 MiB)    - I don't have one

Author:	q30 [ Sat Jul 31, 2021 1:04 am ]
Post subject:	Re: Engine Tournament
SAI is weaker than LeelaZero (details).

Author:	q30 [ Sat Aug 14, 2021 1:11 am ]
Post subject:	Re: Engine Tournament
The ranking of GTP engines running without a GPU, as of the beginning of 2021 (details):
Top level
1) KataGo
2) LeelaZero
3) SAI
High level
4) Leela
5) Rayon
6) Zenith
7) Pachi_DCNN
8) Hiratuka
Middle level
9) Ray
10) Pachi
11) MoGo

Author:	q30 [ Sat Oct 02, 2021 4:11 am ]
Post subject:	Re: Engine Tournament
KataGo v.1.8.0 - v.1.9.1: 15 - 15 (details).

Life In 19x19 http://www.lifein19x19.com/

Engine Tournament http://www.lifein19x19.com/viewtopic.php?f=18&t=13322

Author:	q30 [ Sat Nov 13, 2021 3:31 am ]
Post subject:	Re: Engine Tournament
LeelaZero "next" branch v. 24.08.21 - release v. 0.17: 25 - 29 (details).

Author:	q30 [ Sat Dec 04, 2021 5:31 am ]
Post subject:	Re: Engine Tournament
KataGo v.1.10.0 - v.1.9.1: 19 - 17 (details).

Author:	q30 [ Sat Jan 15, 2022 3:51 am ]
Post subject:	Re: Engine Tournament
The strongest SAI weight file of 2021 is 527ae617c8a61caae4473a69b8eb1411175fc2c3bcd35d13ca42dbf5a98090fa (details).

Author:	q30 [ Sat Jan 29, 2022 2:48 am ]
Post subject:	Re: Engine Tournament
I don't understand what the purpose of the SAI project is: this LeelaZero fork keeps updating its weight file, yet it is weaker than LeelaZero's year-old ones (details); it updates its source files, but doesn't provide compiled binary releases for different CPU and GPU types (like KataGo does)...
The 2021 ranking of SAI "weight categories" is as follows:
"bantamweight"       <|= 2 ^ 23 B (< 12 MiB)     - rw19x19.txt (5)
"featherweight"      2 ^ 24 B (12 - 24 MiB)      - I don't have one
"lightweight"        2 ^ 25 B (24 - 48 MiB)      - b12a30551826858ce24a21e48cf4c20fea3c25bacba00c01c9763d020908185e (4)
"welterweight"       2 ^ 26 B (48 - 96 MiB)      - 4433b9162a5ad473120d0731b951b649829e0c155e0590f9f1e51a808f5a3263 (3)
"middleweight"       2 ^ 27 B (96 - 192 MiB)     - 87022c79f36dc8dc81c7b7aa1f1f250858b44d137434304b910dd3a60612716a (2)
"light heavyweight"  2 ^ 28 B (192 - 384 MiB)    - 527ae617c8a61caae4473a69b8eb1411175fc2c3bcd35d13ca42dbf5a98090fa (1)
"heavyweight"        2 ^ 29 B (384 - 768 MiB)    - I don't have one
"super heavyweight"  >|= 2 ^ 30 B (> 768 MiB)    - I don't have one

Author:	q30 [ Sat Feb 12, 2022 2:54 am ]
Post subject:	Re: Engine Tournament
The latest KataGo weight files of 2021 are stronger than those from the beginning of the year (details).
The 2021 ranking of KataGo weight files by "category" is as follows:
"bantamweight"       <|= 2 ^ 23 B (< 12 MiB)     - g170e-b10c128-s1141046784-d204142634.bin (6)
"featherweight"      2 ^ 24 B (12 - 24 MiB)      - I don't have one
"lightweight"        2 ^ 25 B (24 - 48 MiB)      - g170e-b15c192-s1672170752-d466197061.bin (5)
"welterweight"       2 ^ 26 B (48 - 96 MiB)      - g170e-b20c256x2-s5303129600-d1228401921.bin (4)
"middleweight"       2 ^ 27 B (96 - 192 MiB)     - kata1-b40c256-s10638505984-d2592890214.bin (1)
"light heavyweight"  2 ^ 28 B (192 - 384 MiB)    - g170-b30c320x2-s4824661760-d1229536699.bin (2)
"heavyweight"        2 ^ 29 B (384 - 768 MiB)    - kata1-b60c320-s5026470912-d2583431160.bin (3)
"super heavyweight"  >|= 2 ^ 30 B (> 768 MiB)    - I don't have one

Author:	lightvector [ Sat Feb 12, 2022 9:35 am ]
Post subject:	Re: Engine Tournament
Partly cross-posting from a GitHub thread where q30 has linked this result:

Quote:
I'm glad you're enthusiastic, but I still don't understand why you insist on using such a tiny number of games (only 4 per network!!) and justifying it on the basis of wanting to serve "end users".

If that is the only computation power you can afford, sure. I absolutely respect and appreciate doing the best one can with limited resources. No problem!

But if instead it's a deliberate choice to use fewer games to better match what end users would experience, then it's silly. Rather than deliberately using an error-prone measurement because you think most users will not notice, it's certainly at least no harm to use an accurate measurement (more games) and report the accurate difference. Then each user can decide for themselves if the accurately-reported difference is big enough to care about.

Four games per test is especially few. Consider a bot A that beats B 60% of the time. I would guess most people would consider that not a huge difference, but still a respectable one. However, with only 4 games, the chance that B beats A 3-1 or 4-0 is about 18%! So there is an 18% chance you'd come up with the entirely backwards conclusion.

You've argued many times in the past that "end users" will only use the bot for a few games themselves, therefore the way to make the best recommendation is to test using only a few games because it better matches the usage, rather than tests with a large number of games. We can see by the following example that such logic isn't very good:

Suppose we did do a 4 game test and we did get a 3-1 result in favor of B (getting a result that was only 18% likely is very possible!).
Suppose we also did a 1000 game test and this time, the result was that A won 613 games and B won 387 games.

Consider a user who plans to use either bot A or bot B in a tournament where it will play 4 games, and they want the bot with the best chance of doing well. Based on the above two tests, which bot should we recommend to them? Should we trust the 4 game test and recommend B because the tournament will also be 4 games, therefore a 4-game test is the most reliable? Or should we trust the 1000 game test and recommend A because the 1000 game test is overall the more accurate measurement?

Obviously we should recommend bot A to them!

We can see here a clear demonstration that the principle "if end users will only notice larger differences and will only be using the bot for a very few games, then the best way to make a good recommendation is to also run tests using only a very few games" is a bad principle. The way to make a good recommendation to an end user that will run few games is to test many times more games than they will use.
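The 18% figure in the quoted post can be checked with a short binomial calculation. This is only a sketch: it assumes the games are independent and that A wins each one with the same probability, and the function name is made up for illustration.

```python
from math import comb

def upset_probability(p_a_wins, n_games, b_min_wins):
    """Probability that B wins at least b_min_wins of n_games.

    Assumes independent games in which A wins each with probability p_a_wins.
    """
    p_b = 1.0 - p_a_wins
    return sum(
        comb(n_games, k) * p_b**k * p_a_wins ** (n_games - k)
        for k in range(b_min_wins, n_games + 1)
    )

# A beats B 60% of the time; B "wins the test" with a 3-1 or 4-0 score.
p = upset_probability(0.6, 4, 3)
print(f"P(B wins 3-1 or 4-0) = {p:.4f}")  # 0.1792, i.e. about 18%
```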

Author:	lightvector [ Sat Feb 12, 2022 3:32 pm ]
Post subject:	Re: Engine Tournament
Maybe a last try to explain it more intuitively: suppose you are trying to serve only users who only care about a large enough effect that they might notice it themselves in 3-5 games. If you run only 3-5 games yourself, you simulate what a single such user would notice. This is already of some usefulness.

But we know there will be significant variation in the results different users will experience, because the way we are measuring has some randomness. So whether or not that single user would notice some difference, due to this random variation and luck some other users may get a different result. If we want our result to be confidently useful to many users, not just one user, we should simulate many users, not just one user. Maybe we could simulate what 5-10 different users each would see. In other words, we might want to run 3-5 games, 5-10 times.

And there you go. As long as we can afford it, if we want to be reliable and responsible in our conclusion, we should run at least several times more games than the minimum a single user would need to have a chance to notice something.
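The "simulate many users" idea can be sketched with a small Monte Carlo run. Everything here is an illustrative assumption: a fixed 60% win rate for A, independent games, 4 games per user, 10 users, and made-up function names. Each simulated user sees their own short match, and pooling the matches gives the larger test.

```python
import random

def simulate_user(p_a_wins, n_games, rng):
    """Number of games A wins in one simulated user's short match."""
    return sum(rng.random() < p_a_wins for _ in range(n_games))

def simulate_users(p_a_wins, n_games, n_users, seed=0):
    """A's win count for each of n_users independent simulated users."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [simulate_user(p_a_wins, n_games, rng) for _ in range(n_users)]

games_per_user = 4
a_wins = simulate_users(p_a_wins=0.6, n_games=games_per_user, n_users=10)
for i, wins in enumerate(a_wins, 1):
    print(f"user {i}: A {wins} - {games_per_user - wins} B")

total_games = games_per_user * len(a_wins)
total_a = sum(a_wins)
print(f"pooled: A {total_a} - {total_games - total_a} B")
```

Individual users' scores will typically scatter widely around the true rate, while the pooled score is a steadier estimate of it.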

Author:	q30 [ Sat Feb 19, 2022 6:47 am ]
Post subject:	Re: Engine Tournament
So, if you distribute the test time as follows: number of games -> infinity and time settings -> 0, then you get the most accurate results. Did I understand your point of view correctly?

Author:	q30 [ Sat Apr 16, 2022 3:07 am ]
Post subject:	Re: Engine Tournament
KataGo v.1.10.0-v.1.11.0: 17-15 (details).

Author:	PC_Screen [ Sat Feb 18, 2023 5:50 pm ]
Post subject:	Re: Engine Tournament
Try using the new b18c384nbt-uec.bin.gz net, it's 95MB in size and currently the strongest available net. As of now it's only available through KataGo's github page, but in a couple days/weeks it should replace 60b as the main training net. You'll need KataGo 1.12 to run it https://github.com/lightvector/KataGo/releases/tag/v1.12.4

All times are UTC - 8 hours [ DST ]

Author:	q30 [ Sat Sep 10, 2022 5:49 am ]
Post subject:	Re: Engine Tournament
Ray has become even weaker; it is now weaker than MoGo (details)...

Author:	q30 [ Sat Jan 14, 2023 5:57 am ]
Post subject:	Re: Engine Tournament
SAI continues to regress (details). Thanks to all those who give their computing time for that...

Author:	q30 [ Sat Feb 18, 2023 4:53 am ]
Post subject:	Re: Engine Tournament
KataGo became stronger after a year of training (details). The "heavyweight" file is now a bit stronger (12-8) than the two-year-old "light heavyweight" one...
The 2022 ranking of KataGo "weight categories" is as follows:
"bantamweight"       <|= 2 ^ 23 B (< 12 MiB)     - g170e-b10c128-s1141046784-d204142634.bin (6)
"featherweight"      2 ^ 24 B (12 - 24 MiB)      - I don't have one
"lightweight"        2 ^ 25 B (24 - 48 MiB)      - g170e-b15c192-s1672170752-d466197061.bin (5)
"welterweight"       2 ^ 26 B (48 - 96 MiB)      - g170e-b20c256x2-s5303129600-d1228401921.bin (4)
"middleweight"       2 ^ 27 B (96 - 192 MiB)     - kata1-b40c256-s12350780416-d3055274313.bin (1)
"light heavyweight"  2 ^ 28 B (192 - 384 MiB)    - g170-b30c320x2-s4824661760-d1229536699.bin (3)
"heavyweight"        2 ^ 29 B (384 - 768 MiB)    - kata1-b60c320-s6782286336-d3070935549.bin (2)
"super heavyweight"  >|= 2 ^ 30 B (> 768 MiB)    - I don't have one

Author:	q30 [ Wed Feb 22, 2023 5:04 am ]
Post subject:	Re: Engine Tournament
I'm using sizes of unpacked files. It will replace the weight file of the same "weight category"...
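For anyone who wants to place a net in the "weight categories" used in the rankings in this thread, here is a minimal sketch that maps an unpacked file size to a category name. The MiB ranges are the ones listed in the tables; treating each boundary value as belonging to the heavier category, and the function and constant names, are my own assumptions.

```python
MIB = 1024 * 1024  # bytes per MiB

# (category name, exclusive upper bound in MiB), taken from the ranges in the thread.
CATEGORIES = [
    ("bantamweight", 12),
    ("featherweight", 24),
    ("lightweight", 48),
    ("welterweight", 96),
    ("middleweight", 192),
    ("light heavyweight", 384),
    ("heavyweight", 768),
]

def weight_category(size_bytes):
    """Category for an unpacked weight file of the given size in bytes.

    Assumption: a size exactly on a boundary counts as the heavier category.
    """
    size_mib = size_bytes / MIB
    for name, upper_mib in CATEGORIES:
        if size_mib < upper_mib:
            return name
    return "super heavyweight"

print(weight_category(2**27))  # 128 MiB -> middleweight
```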

Author:	q30 [ Sat Mar 04, 2023 6:46 am ]
Post subject:	Re: Engine Tournament
The newer (Eigen) version of KataGo is slightly (statistically insignificantly) weaker than the older one (again): v1.12.4-v1.11.0 15-17 (details)...
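One way to check the "statistically insignificant" remark is an exact two-sided binomial test against the hypothesis that both versions are equally strong. This is a sketch under assumptions: independent games, no draws, and an illustrative function name. The scores are the version-vs-version results posted in this thread.

```python
from math import comb

def binomial_two_sided_p(wins, losses):
    """Exact two-sided p-value for the hypothesis 'each side wins 50% of games'.

    Sums the probabilities of all scores that are no more likely than the
    observed one. Assumes independent games and ignores draws.
    """
    n = wins + losses
    pmf = [comb(n, k) / 2**n for k in range(n + 1)]
    observed = pmf[wins]
    # Small relative tolerance so symmetric scores compare as equal.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))

results = [
    ("KataGo v1.8.0 - v1.9.1", 15, 15),
    ("KataGo v1.10.0 - v1.9.1", 19, 17),
    ("KataGo v1.10.0 - v1.11.0", 17, 15),
    ("KataGo v1.12.4 - v1.11.0", 15, 17),
]
for label, wins, losses in results:
    p = binomial_two_sided_p(wins, losses)
    print(f"{label}: {wins}-{losses}, p = {p:.2f}")
```

A large p-value here only means the score is compatible with equal strength; it does not show that the two versions are equal.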