I believe this was caused by the switchover. I think we can solve it by making this a DeferredUpdate that is converted to a job if it fails the first time.
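To make the proposal concrete, here is a minimal, language-agnostic sketch of the "run as a deferred update, fall back to a job on failure" pattern, written in Python. It is not MediaWiki code: `JobQueue`, `run_deferred_with_job_fallback`, `write_log_entry` and the job name `hypotheticalLogJob` are all made-up names for illustration, and the failure is simulated.

```python
from collections import deque


class JobQueue:
    """Tiny in-memory stand-in for a job queue (hypothetical, illustration only)."""

    def __init__(self):
        self.jobs = deque()

    def push(self, name, params):
        self.jobs.append((name, params))

    def run_all(self, handlers):
        """Run every queued job with its handler; return how many were run."""
        count = 0
        while self.jobs:
            name, params = self.jobs.popleft()
            handlers[name](params)
            count += 1
        return count


def run_deferred_with_job_fallback(update, params, job_name, queue):
    """Try the update once; if it raises, enqueue it as a job for a later retry."""
    try:
        update(params)
        return True
    except Exception:
        queue.push(job_name, params)
        return False


# Simulated log write that fails on the first attempt only.
attempts = []


def write_log_entry(params):
    attempts.append(params)
    if len(attempts) == 1:
        raise RuntimeError("simulated write failure")


queue = JobQueue()
ok = run_deferred_with_job_fallback(
    write_log_entry, {"target": "example"}, "hypotheticalLogJob", queue
)
retried = queue.run_all({"hypotheticalLogJob": write_log_entry})
```

In this sketch the first attempt fails and the entry is written by the queued retry instead of being lost. Whether a retry would actually succeed for a duplicate-key error depends on the underlying cause, which this sketch does not address.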
Error
[17ecd3e7-7c79-4840-bf71-8d57dd7c9858] /wiki/Special:CheckUser Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry 'X' for key 'PRIMARY'
Function: MediaWiki\CheckUser\Services\CheckUserLogService::addLogEntry
Query: INSERT INTO `cu_log` (cul_timestamp,cul_actor,cul_type,cul_target_id,cul_target_text,cul_target_hex,cul_range_start,cul_range_end,cul_reason_id,cul_reason_plaintext_id) VALUES ( ... )
from /srv/mediawiki/php-1.43.0-wmf.23/includes/libs/rdbms/database/Database.php(1193)
#0 /srv/mediawiki/php-1.43.0-wmf.23/includes/libs/rdbms/database/Database.php(1177): Wikimedia\Rdbms\Database->getQueryException(string, int, string, string)
#1 /srv/mediawiki/php-1.43.0-wmf.23/includes/libs/rdbms/database/Database.php(1151): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, int, string, string)
#2 /srv/mediawiki/php-1.43.0-wmf.23/includes/libs/rdbms/database/Database.php(643): Wikimedia\Rdbms\Database->reportQueryError(string, int, string, string, bool)
#3 /srv/mediawiki/php-1.43.0-wmf.23/includes/libs/rdbms/database/Database.php(1471): Wikimedia\Rdbms\Database->query(Wikimedia\Rdbms\Query, string)
#4 /srv/mediawiki/php-1.43.0-wmf.23/includes/libs/rdbms/database/DBConnRef.php(127): Wikimedia\Rdbms\Database->insert(string, array, string, array)
#5 /srv/mediawiki/php-1.43.0-wmf.23/includes/libs/rdbms/database/DBConnRef.php(407): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#6 /srv/mediawiki/php-1.43.0-wmf.23/includes/libs/rdbms/querybuilder/InsertQueryBuilder.php(343): Wikimedia\Rdbms\DBConnRef->insert(string, array, string, array)
#7 /srv/mediawiki/php-1.43.0-wmf.23/extensions/CheckUser/src/Services/CheckUserLogService.php(118): Wikimedia\Rdbms\InsertQueryBuilder->execute()
#8 /srv/mediawiki/php-1.43.0-wmf.23/includes/deferred/MWCallableUpdate.php(52): MediaWiki\CheckUser\Services\CheckUserLogService::MediaWiki\CheckUser\Services\{closure}(string)
#9 /srv/mediawiki/php-1.43.0-wmf.23/includes/deferred/DeferredUpdates.php(460): MediaWiki\Deferred\MWCallableUpdate->doUpdate()
#10 /srv/mediawiki/php-1.43.0-wmf.23/includes/deferred/DeferredUpdates.php(204): MediaWiki\Deferred\DeferredUpdates::attemptUpdate(MediaWiki\Deferred\MWCallableUpdate)
#11 /srv/mediawiki/php-1.43.0-wmf.23/includes/deferred/DeferredUpdates.php(291): MediaWiki\Deferred\DeferredUpdates::run(MediaWiki\Deferred\MWCallableUpdate)
#12 /srv/mediawiki/php-1.43.0-wmf.23/includes/deferred/DeferredUpdatesScope.php(243): MediaWiki\Deferred\DeferredUpdates::MediaWiki\Deferred\{closure}(MediaWiki\Deferred\MWCallableUpdate, int)
#13 /srv/mediawiki/php-1.43.0-wmf.23/includes/deferred/DeferredUpdatesScope.php(172): MediaWiki\Deferred\DeferredUpdatesScope->processStageQueue(int, int, Closure)
#14 /srv/mediawiki/php-1.43.0-wmf.23/includes/deferred/DeferredUpdates.php(310): MediaWiki\Deferred\DeferredUpdatesScope->processUpdates(int, Closure)
#15 /srv/mediawiki/php-1.43.0-wmf.23/includes/MediaWikiEntryPoint.php(674): MediaWiki\Deferred\DeferredUpdates::doUpdates()
#16 /srv/mediawiki/php-1.43.0-wmf.23/includes/MediaWikiEntryPoint.php(496): MediaWiki\MediaWikiEntryPoint->restInPeace()
#17 /srv/mediawiki/php-1.43.0-wmf.23/includes/MediaWikiEntryPoint.php(454): MediaWiki\MediaWikiEntryPoint->doPostOutputShutdown()
#18 /srv/mediawiki/php-1.43.0-wmf.23/includes/MediaWikiEntryPoint.php(209): MediaWiki\MediaWikiEntryPoint->postOutputShutdown()
#19 /srv/mediawiki/php-1.43.0-wmf.23/index.php(58): MediaWiki\MediaWikiEntryPoint->run()
#20 /srv/mediawiki/w/index.php(3): require(string)
#21 {main}
Impact
This means that logs of checks performed are not always written, which makes it impossible to audit those checks.
Root cause deep dive
It was a rainy day.
A check of auto_increment values made it clear that some hosts in s3 had a different, lower auto_increment value than most other hosts. The discrepancy showed up in 250 different tables.
The output shows a clear split between replicas:
aawiki page_restrictions {7: ['db2205 (codfw master)', 'db2227', 'db1223', 'db1212', 'db1157'], 8: ['db1189 (eqiad master)', 'db2149', 'db2209', 'db2194', 'db2190', 'db2156', 'db2177', 'db1198', 'db1166', 'db1175']}
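The grouping above can be sketched as follows. This is a minimal illustration and not the script that was actually run: it assumes the per-host auto_increment values have already been collected (for example from `information_schema.TABLES`, which is an assumption), and the function names `group_hosts_by_auto_increment` and `has_split` are hypothetical. The host values are copied from the aawiki page_restrictions line above.

```python
from collections import defaultdict


def group_hosts_by_auto_increment(values):
    """Map each auto_increment value to the list of hosts reporting it."""
    groups = defaultdict(list)
    for host, auto_inc in values.items():
        groups[auto_inc].append(host)
    return dict(groups)


def has_split(groups):
    """A table is inconsistent if hosts report more than one distinct value."""
    return len(groups) > 1


# Per-host values for aawiki page_restrictions, taken from the report above.
observed = {
    "db2205 (codfw master)": 7,
    "db2227": 7,
    "db1223": 7,
    "db1212": 7,
    "db1157": 7,
    "db1189 (eqiad master)": 8,
    "db2149": 8,
    "db2209": 8,
    "db2194": 8,
    "db2190": 8,
    "db2156": 8,
    "db2177": 8,
    "db1198": 8,
    "db1166": 8,
    "db1175": 8,
}

groups = group_hosts_by_auto_increment(observed)
if has_split(groups):
    print("aawiki", "page_restrictions", groups)
```

Running the same grouping for every table in a section and counting the tables where `has_split` is true would give a figure comparable to the 250 tables mentioned above.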
Checks made it clear that the group with the higher auto_increment value had the correct data, so we recloned the lower replicas from the higher ones, except for four backup sources.
After the reclone, on hosts that were on 10.6.19, many tables (~thousands) ended up with an auto_increment value of 1, which was clearly wrong. We recloned them again with 10.6.17 and they now have matching ids. However, we can't reproduce the problem.