I believe this was caused by the switchover. I think we can solve this by making this a DeferredUpdate that is converted to a job if it fails the first time.
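The proposed behaviour (try the write once, and queue it for a later retry if it fails) can be sketched generically as below. This is only an illustration of the pattern, written in Python rather than against MediaWiki's deferred update or job queue classes; `job_queue`, `run_deferred_with_job_fallback` and `run_queued_jobs` are hypothetical names made up for this sketch.

```python
from collections import deque
from typing import Callable, Deque

# Hypothetical in-memory stand-in for a job queue; not a MediaWiki API.
job_queue: Deque[Callable[[], None]] = deque()


def run_deferred_with_job_fallback(update: Callable[[], None]) -> bool:
    """Run `update` once; if it raises, enqueue it for a later retry.

    Returns True if the update succeeded immediately, False if it was queued.
    """
    try:
        update()
        return True
    except Exception:
        # First attempt failed: keep the work instead of silently dropping it.
        job_queue.append(update)
        return False


def run_queued_jobs() -> int:
    """Retry every currently queued update once; return how many succeeded."""
    succeeded = 0
    for _ in range(len(job_queue)):
        job = job_queue.popleft()
        try:
            job()
            succeeded += 1
        except Exception:
            # Still failing: put it back so a later run can try again.
            job_queue.append(job)
    return succeeded
```

With this shape, a log write that hits a transient failure is retried later rather than lost. Whether a retry would succeed for the duplicate key error reported here is not established by this sketch.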
Error
[17ecd3e7-7c79-4840-bf71-8d57dd7c9858] /wiki/Special:CheckUser Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry 'X' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\CheckUserLogService::addLogEntry Query: INSERT INTO `cu_log` (cul_timestamp,cul_actor,cul_type,cul_target_id,cul_target_text,cul_target_hex,cul_range_start,cul_range_end,cul_reason_id,cul_reason_plaintext_id) VALUES ( ... )
from /srv/mediawiki/php-1.43.0-wmf.23/includes/libs/rdbms/database/Database.php(1193)
#0 /srv/mediawiki/php-1.43.0-wmf.23/includes/libs/rdbms/database/Database.php(1177): Wikimedia\Rdbms\Database->getQueryException(string, int, string, string)
#1 /srv/mediawiki/php-1.43.0-wmf.23/includes/libs/rdbms/database/Database.php(1151): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, int, string, string)
#2 /srv/mediawiki/php-1.43.0-wmf.23/includes/libs/rdbms/database/Database.php(643): Wikimedia\Rdbms\Database->reportQueryError(string, int, string, string, bool)
#3 /srv/mediawiki/php-1.43.0-wmf.23/includes/libs/rdbms/database/Database.php(1471): Wikimedia\Rdbms\Database->query(Wikimedia\Rdbms\Query, string)
#4 /srv/mediawiki/php-1.43.0-wmf.23/includes/libs/rdbms/database/DBConnRef.php(127): Wikimedia\Rdbms\Database->insert(string, array, string, array)
#5 /srv/mediawiki/php-1.43.0-wmf.23/includes/libs/rdbms/database/DBConnRef.php(407): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#6 /srv/mediawiki/php-1.43.0-wmf.23/includes/libs/rdbms/querybuilder/InsertQueryBuilder.php(343): Wikimedia\Rdbms\DBConnRef->insert(string, array, string, array)
#7 /srv/mediawiki/php-1.43.0-wmf.23/extensions/CheckUser/src/Services/CheckUserLogService.php(118): Wikimedia\Rdbms\InsertQueryBuilder->execute()
#8 /srv/mediawiki/php-1.43.0-wmf.23/includes/deferred/MWCallableUpdate.php(52): MediaWiki\CheckUser\Services\CheckUserLogService::MediaWiki\CheckUser\Services\{closure}(string)
#9 /srv/mediawiki/php-1.43.0-wmf.23/includes/deferred/DeferredUpdates.php(460): MediaWiki\Deferred\MWCallableUpdate->doUpdate()
#10 /srv/mediawiki/php-1.43.0-wmf.23/includes/deferred/DeferredUpdates.php(204): MediaWiki\Deferred\DeferredUpdates::attemptUpdate(MediaWiki\Deferred\MWCallableUpdate)
#11 /srv/mediawiki/php-1.43.0-wmf.23/includes/deferred/DeferredUpdates.php(291): MediaWiki\Deferred\DeferredUpdates::run(MediaWiki\Deferred\MWCallableUpdate)
#12 /srv/mediawiki/php-1.43.0-wmf.23/includes/deferred/DeferredUpdatesScope.php(243): MediaWiki\Deferred\DeferredUpdates::MediaWiki\Deferred\{closure}(MediaWiki\Deferred\MWCallableUpdate, int)
#13 /srv/mediawiki/php-1.43.0-wmf.23/includes/deferred/DeferredUpdatesScope.php(172): MediaWiki\Deferred\DeferredUpdatesScope->processStageQueue(int, int, Closure)
#14 /srv/mediawiki/php-1.43.0-wmf.23/includes/deferred/DeferredUpdates.php(310): MediaWiki\Deferred\DeferredUpdatesScope->processUpdates(int, Closure)
#15 /srv/mediawiki/php-1.43.0-wmf.23/includes/MediaWikiEntryPoint.php(674): MediaWiki\Deferred\DeferredUpdates::doUpdates()
#16 /srv/mediawiki/php-1.43.0-wmf.23/includes/MediaWikiEntryPoint.php(496): MediaWiki\MediaWikiEntryPoint->restInPeace()
#17 /srv/mediawiki/php-1.43.0-wmf.23/includes/MediaWikiEntryPoint.php(454): MediaWiki\MediaWikiEntryPoint->doPostOutputShutdown()
#18 /srv/mediawiki/php-1.43.0-wmf.23/includes/MediaWikiEntryPoint.php(209): MediaWiki\MediaWikiEntryPoint->postOutputShutdown()
#19 /srv/mediawiki/php-1.43.0-wmf.23/index.php(58): MediaWiki\MediaWikiEntryPoint->run()
#20 /srv/mediawiki/w/index.php(3): require(string)
#21 {main}
Impact
This means that logs of the checks performed are not always written, which makes it impossible to audit those checks.
Root cause deep dive
It was a rainy day.
A check of auto_increment values made it clear that some hosts in s3 had a different, lower auto_increment value than most other hosts. This showed up in 250 different tables.
For example, one table shows a clear split between replicas:
aawiki page_restrictions {7: ['db2205 (codfw master)', 'db2227', 'db1223', 'db1212', 'db1157'], 8: ['db1189 (eqiad master)', 'db2149', 'db2209', 'db2194', 'db2190', 'db2156', 'db2177', 'db1198', 'db1166', 'db1175']}
Checks made it clear that the group with the higher auto_increment value had the correct data, so we recloned the lower replicas from the higher ones, except for four backup sources.
After the reclone, on hosts that were running 10.6.19, many tables (thousands) ended up with an auto_increment value of 1, which was clearly wrong. We recloned them again with 10.6.17 and they now have matching IDs. However, we can't reproduce the problem.
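The per-table comparison described above amounts to grouping hosts by the auto_increment value they report and flagging the hosts that sit below the highest value. A minimal sketch follows, assuming the values have already been collected into a host-to-value mapping. The function names are hypothetical, the sample is a subset of the aawiki page_restrictions hosts listed above, and treating the highest value as the reference simply mirrors the conclusion reached in this incident rather than a general rule.

```python
from collections import defaultdict
from typing import Dict, List


def group_hosts_by_value(values_by_host: Dict[str, int]) -> Dict[int, List[str]]:
    """Group host names by the auto_increment value each one reports."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for host, value in values_by_host.items():
        groups[value].append(host)
    return dict(groups)


def lagging_hosts(values_by_host: Dict[str, int]) -> List[str]:
    """Return, sorted by name, the hosts whose value is below the highest seen."""
    if not values_by_host:
        return []
    highest = max(values_by_host.values())
    return sorted(h for h, v in values_by_host.items() if v < highest)


# Subset of the hosts and values reported for aawiki page_restrictions.
sample = {"db2205": 7, "db2227": 7, "db1189": 8, "db2149": 8}
groups = group_hosts_by_value(sample)
behind = lagging_hosts(sample)
```

A table is consistent when `groups` has a single key. Any table with more than one key is a candidate for the kind of split shown above.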
