"No server with index" and "Warning: Undefined index" from LoadBalancer::reconfigure
Closed, Resolved, Public

Description

So far this has been seen on bgwiki, enwikiquote, nowiki, wikidatawiki, cebwiki, shwiki and srwiki during the stub (metadata) dumps. A sample stack trace:

[20221101142049]: InvalidArgumentException from line 2322 of /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php: No server with index '4'
#0 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1191): Wikimedia\Rdbms\LoadBalancer->getServerInfoStrict(4)
#1 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1124): Wikimedia\Rdbms\LoadBalancer->reallyOpenConnection(4, Object(Wikimedia\Rdbms\DatabaseDomain), Array)
#2 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(957): Wikimedia\Rdbms\LoadBalancer->reuseOrOpenConnectionForNewRef(4, Object(Wikimedia\Rdbms\DatabaseDomain), 3)
#3 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/loadmonitor/LoadMonitor.php(237): Wikimedia\Rdbms\LoadBalancer->getServerConnection(4, '', 3)
#4 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/loadmonitor/LoadMonitor.php(172): Wikimedia\Rdbms\LoadMonitor->computeServerStates(Array, Array)
#5 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/objectcache/wancache/WANObjectCache.php(1688): Wikimedia\Rdbms\LoadMonitor->Wikimedia\Rdbms\{closure}(Array, 1, Array, 1667312380.4293, Array)
#6 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/objectcache/wancache/WANObjectCache.php(1521): WANObjectCache->fetchOrRegenerate('global:rdbms-se...', 1, Object(Closure), Array, Array)
#7 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/loadmonitor/LoadMonitor.php(181): WANObjectCache->getWithSetCallback('global:rdbms-se...', 1, Object(Closure), Array)
#8 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/loadmonitor/LoadMonitor.php(104): Wikimedia\Rdbms\LoadMonitor->getServerStates(Array)
#9 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(553): Wikimedia\Rdbms\LoadMonitor->scaleLoads(Array)
#10 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(509): Wikimedia\Rdbms\LoadBalancer->getReaderIndex('dump', 'bgwiki')
#11 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(930): Wikimedia\Rdbms\LoadBalancer->getConnectionIndex(-1, Array, 'bgwiki')
#12 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/database/DBConnRef.php(103): Wikimedia\Rdbms\LoadBalancer->getConnectionInternal(-1, Array, 'bgwiki', 0)
#13 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/database/DBConnRef.php(117): Wikimedia\Rdbms\DBConnRef->ensureConnection()
#14 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/database/DBConnRef.php(356): Wikimedia\Rdbms\DBConnRef->__call('selectRow', Array)
#15 /srv/mediawiki/php-1.40.0-wmf.7/includes/page/WikiPage.php(408): Wikimedia\Rdbms\DBConnRef->selectRow(Array, Array, Array, 'WikiPage::pageD...', Array, Array)
#16 /srv/mediawiki/php-1.40.0-wmf.7/includes/page/WikiPage.php(430): WikiPage->pageData(Object(Wikimedia\Rdbms\DBConnRef), Array, Array)
#17 /srv/mediawiki/php-1.40.0-wmf.7/includes/page/WikiPage.php(470): WikiPage->pageDataFromTitle(Object(Wikimedia\Rdbms\DBConnRef), Object(Title), Array)
#18 /srv/mediawiki/php-1.40.0-wmf.7/includes/page/WikiPage.php(577): WikiPage->loadPageData()
#19 /srv/mediawiki/php-1.40.0-wmf.7/includes/export/XmlDumpWriter.php(259): WikiPage->getId()
#20 /srv/mediawiki/php-1.40.0-wmf.7/includes/export/WikiExporter.php(549): XmlDumpWriter->openPage(Object(stdClass))
#21 /srv/mediawiki/php-1.40.0-wmf.7/includes/export/WikiExporter.php(491): WikiExporter->outputPageStreamBatch(Object(Wikimedia\Rdbms\MysqliResultWrapper), Object(stdClass))
#22 /srv/mediawiki/php-1.40.0-wmf.7/includes/export/WikiExporter.php(315): WikiExporter->dumpPages('page_id >= 5726...', false)
#23 /srv/mediawiki/php-1.40.0-wmf.7/includes/export/WikiExporter.php(198): WikiExporter->dumpFrom('page_id >= 5726...', false)
#24 /srv/mediawiki/php-1.40.0-wmf.7/maintenance/includes/BackupDumper.php(359): WikiExporter->pagesByRange(57260, 59115, false)
#25 /srv/mediawiki/php-1.40.0-wmf.7/maintenance/dumpBackup.php(82): BackupDumper->dump(1, 1)
#26 /srv/mediawiki/php-1.40.0-wmf.7/maintenance/includes/MaintenanceRunner.php(309): DumpBackup->execute()
#27 /srv/mediawiki/php-1.40.0-wmf.7/maintenance/doMaintenance.php(85): MediaWiki\Maintenance\MaintenanceRunner->run()
#28 /srv/mediawiki/php-1.40.0-wmf.7/maintenance/dumpBackup.php(144): require_once('/srv/mediawiki/...')
#29 /srv/mediawiki/multiversion/MWScript.php(120): require_once('/srv/mediawiki/...')
#30 {main}
Wikimedia\Rdbms\DBConnectionError from line 1359 of /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Cannot access the database: No working replica DB server: Unknown error
#0 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(522): Wikimedia\Rdbms\LoadBalancer->reportConnectionError()
#1 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/loadbalancer/LoadBalancer.php(930): Wikimedia\Rdbms\LoadBalancer->getConnectionIndex(-1, Array, 'bgwiki')
#2 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/database/DBConnRef.php(103): Wikimedia\Rdbms\LoadBalancer->getConnectionInternal(-1, Array, 'bgwiki', 0)
#3 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/database/DBConnRef.php(117): Wikimedia\Rdbms\DBConnRef->ensureConnection()
#4 /srv/mediawiki/php-1.40.0-wmf.7/includes/libs/rdbms/database/DBConnRef.php(356): Wikimedia\Rdbms\DBConnRef->__call('selectRow', Array)
#5 /srv/mediawiki/php-1.40.0-wmf.7/includes/page/WikiPage.php(408): Wikimedia\Rdbms\DBConnRef->selectRow(Array, Array, Array, 'WikiPage::pageD...', Array, Array)
#6 /srv/mediawiki/php-1.40.0-wmf.7/includes/page/WikiPage.php(430): WikiPage->pageData(Object(Wikimedia\Rdbms\DBConnRef), Array, Array)
#7 /srv/mediawiki/php-1.40.0-wmf.7/includes/page/WikiPage.php(470): WikiPage->pageDataFromTitle(Object(Wikimedia\Rdbms\DBConnRef), Object(Title), Array)
#8 /srv/mediawiki/php-1.40.0-wmf.7/includes/page/WikiPage.php(577): WikiPage->loadPageData()
#9 /srv/mediawiki/php-1.40.0-wmf.7/includes/export/XmlDumpWriter.php(259): WikiPage->getId()
#10 /srv/mediawiki/php-1.40.0-wmf.7/includes/export/WikiExporter.php(549): XmlDumpWriter->openPage(Object(stdClass))
#11 /srv/mediawiki/php-1.40.0-wmf.7/includes/export/WikiExporter.php(491): WikiExporter->outputPageStreamBatch(Object(Wikimedia\Rdbms\MysqliResultWrapper), Object(stdClass))
#12 /srv/mediawiki/php-1.40.0-wmf.7/includes/export/WikiExporter.php(315): WikiExporter->dumpPages('page_id >= 6446...', false)
#13 /srv/mediawiki/php-1.40.0-wmf.7/includes/export/WikiExporter.php(198): WikiExporter->dumpFrom('page_id >= 6446...', false)
#14 /srv/mediawiki/php-1.40.0-wmf.7/maintenance/includes/BackupDumper.php(359): WikiExporter->pagesByRange(64468, 66244, false)
#15 /srv/mediawiki/php-1.40.0-wmf.7/maintenance/dumpBackup.php(82): BackupDumper->dump(1, 1)
#16 /srv/mediawiki/php-1.40.0-wmf.7/maintenance/includes/MaintenanceRunner.php(309): DumpBackup->execute()
#17 /srv/mediawiki/php-1.40.0-wmf.7/maintenance/doMaintenance.php(85): MediaWiki\Maintenance\MaintenanceRunner->run()
#18 /srv/mediawiki/php-1.40.0-wmf.7/maintenance/dumpBackup.php(144): require_once('/srv/mediawiki/...')
#19 /srv/mediawiki/multiversion/MWScript.php(120): require_once('/srv/mediawiki/...')
#20 {main}

I guess this might be the result of the patch merge from T298485, so I'm adding the DBA project preemptively. Feel free to move it if that's wrong.

Event Timeline

ArielGlenn created this task.
Krinkle updated the task description.
Krinkle moved this task from Untriaged to Rdbms library on the MediaWiki-libs-Rdbms board.
Krinkle edited subscribers, added: daniel, Krinkle; removed: ArielGlenn.

It's not actually an issue with LB or LBF. It's an issue with LoadMonitor. I still don't understand why this class exists or why it consumes the resources it does (e.g. opening a connection to all replicas), given that it has not yet prevented a single outage. It is also weird in that it goes to great lengths (e.g. adding jitter to the TTL) while forgetting that a depool completely changes the server indexes, meaning it pushes weights around to the wrong replicas at random. (That part has nothing to do with the reload config change, because it's about the WAN cache value.)

I can make it not error out anymore, because that's probably caused by my reload change, but LoadMonitor is already utterly broken and needs to be rewritten from scratch.
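To make the index problem concrete, here is a minimal Python sketch. It is not MediaWiki code: the function names, the server names and the shape of the cached value are all hypothetical. It only assumes what the comment above describes, namely that the cached states are keyed by server index and that a depool shifts the indexes of the remaining servers.

```python
# Hypothetical model; none of these names come from MediaWiki.

def misattributed(servers, cached):
    """Cached entries whose index now points at a different server."""
    return {
        index: (measured, servers[index])
        for index, measured in cached.items()
        if index < len(servers) and servers[index] != measured
    }


def dangling(servers, cached):
    """Cached indexes that no longer exist in the current server list."""
    return sorted(index for index in cached if index >= len(servers))


# Server list when the states were cached; the cache is keyed by position
# and records which server was actually measured at that position.
before = ["db1", "db2", "db3", "db4", "db5"]
cached = dict(enumerate(before))

# db2 is depooled, so every later server shifts down by one index.
after = [name for name in before if name != "db2"]

print(misattributed(after, cached))
print(dangling(after, cached))
```

In this toy setup, indexes 1 to 3 now describe the wrong server, and index 4 no longer exists at all, which is the analogue of the "No server with index '4'" exception in the trace.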

For now I would say, let's keep it from breaking things, so the stub dumps or other jobs can run to completion, and I'm interested to follow along on the discussion as to LoadMonitor rewrites.

Change 852220 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Avoid errors in case a depooled in load monitor

https://gerrit.wikimedia.org/r/852220

A lot more wikis in the latest exception email: arzwiki, bnwiki, cebwiki, cewiki, cswiki, cywiki, dawiki, dewiki, dewiktionary, elwiki, enwikisource, eowiki, eswiktionary, etwiki, euwiki, fiwiki, frwikisource, glwiki, hiwiki, hrwiki, hywiki, idwiki, ltwiki, mediawikiwiki, nowiki, plwiktionary, ruwikinews, ruwikisource, ruwiktionary, simplewiki, skwiki, specieswiki, thwiki, trwiki, warwiki, zhwiktionary

I made the patch that fixes it, but I need someone to review it. I can't do that myself.

If it happens again, I suggest completely disabling load monitor, in CLI at least. It complicates the infra for no apparent benefit.

@ArielGlenn Do you know whether this issue has happened again? If not, it means it was fixed by the other patch: it now happens less often and outlasts the cache's TTL, so it gets the correct value.

No more of these errors since then. I'd like to get through the entire run before closing out this task, though.

We saw some new errors of this sort during the abstracts dump job. Sample stacktrace:

[20221121172444]: InvalidArgumentException from line 2319 of /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/loadbalancer/LoadBalancer.php: No server with index '4'
#0 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1170): Wikimedia\Rdbms\LoadBalancer->getServerInfoStrict(4)
#1 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1102): Wikimedia\Rdbms\LoadBalancer->reallyOpenConnection(4, Object(Wikimedia\Rdbms\DatabaseDomain), Array)
#2 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/loadbalancer/LoadBalancer.php(945): Wikimedia\Rdbms\LoadBalancer->reuseOrOpenConnectionForNewRef(4, Object(Wikimedia\Rdbms\DatabaseDomain), 3)
#3 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/loadmonitor/LoadMonitor.php(237): Wikimedia\Rdbms\LoadBalancer->getServerConnection(4, '', 3)
#4 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/loadmonitor/LoadMonitor.php(172): Wikimedia\Rdbms\LoadMonitor->computeServerStates(Array, Array)
#5 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/objectcache/wancache/WANObjectCache.php(1755): Wikimedia\Rdbms\LoadMonitor->Wikimedia\Rdbms\{closure}(Array, 1, Array, 1669051415.1476, Array)
#6 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/objectcache/wancache/WANObjectCache.php(1585): WANObjectCache->fetchOrRegenerate('global:rdbms-se...', 1, Object(Closure), Array, Array)
#7 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/loadmonitor/LoadMonitor.php(181): WANObjectCache->getWithSetCallback('global:rdbms-se...', 1, Object(Closure), Array)
#8 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/loadmonitor/LoadMonitor.php(104): Wikimedia\Rdbms\LoadMonitor->getServerStates(Array)
#9 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/loadbalancer/LoadBalancer.php(547): Wikimedia\Rdbms\LoadMonitor->scaleLoads(Array)
#10 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/loadbalancer/LoadBalancer.php(507): Wikimedia\Rdbms\LoadBalancer->getReaderIndex('dump')
#11 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/loadbalancer/LoadBalancer.php(920): Wikimedia\Rdbms\LoadBalancer->getConnectionIndex(-1, Array, 'cebwiki')
#12 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(103): Wikimedia\Rdbms\LoadBalancer->getConnectionInternal(-1, Array, 'cebwiki', 0)
#13 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(117): Wikimedia\Rdbms\DBConnRef->ensureConnection()
#14 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(356): Wikimedia\Rdbms\DBConnRef->__call('selectRow', Array)
#15 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/querybuilder/SelectQueryBuilder.php(689): Wikimedia\Rdbms\DBConnRef->selectRow(Array, Array, Array, 'MediaWiki\\Page\\...', Array, Array)
#16 /srv/mediawiki/php-1.40.0-wmf.10/includes/page/PageStore.php(199): Wikimedia\Rdbms\SelectQueryBuilder->fetchRow()
#17 /srv/mediawiki/php-1.40.0-wmf.10/includes/cache/LinkCache.php(461): MediaWiki\Page\PageStore->MediaWiki\Page\{closure}(Object(Wikimedia\Rdbms\DBConnRef), 0, 'Oil_Manikin', Array)
#18 /srv/mediawiki/php-1.40.0-wmf.10/includes/cache/LinkCache.php(494): LinkCache->getGoodLinkRowInternal(Object(TitleValue), Object(Closure), 0)
#19 /srv/mediawiki/php-1.40.0-wmf.10/includes/page/PageStore.php(188): LinkCache->getGoodLinkRow(0, 'Oil_Manikin', Object(Closure), 0)
#20 /srv/mediawiki/php-1.40.0-wmf.10/includes/page/PageStore.php(154): MediaWiki\Page\PageStore->getPageByNameViaLinkCache(0, 'Oil_Manikin', 0)
#21 /srv/mediawiki/php-1.40.0-wmf.10/includes/page/PageStore.php(326): MediaWiki\Page\PageStore->getPageByName(0, 'Oil_Manikin', 0)
#22 /srv/mediawiki/php-1.40.0-wmf.10/includes/title/Title.php(4108): MediaWiki\Page\PageStore->getPageByReference(Object(Title), 0)
#23 /srv/mediawiki/php-1.40.0-wmf.10/includes/title/Title.php(1100): Title->getFieldFromPageStore('page_content_mo...', 0)
#24 /srv/mediawiki/php-1.40.0-wmf.10/extensions/ActiveAbstract/includes/AbstractFilter.php(131): Title->getContentModel()
#25 /srv/mediawiki/php-1.40.0-wmf.10/includes/export/DumpFilter.php(81): MediaWiki\Extension\ActiveAbstract\AbstractFilter->writeClosePage('  </page>\n')
#26 /srv/mediawiki/php-1.40.0-wmf.10/includes/export/ExportProgressFilter.php(44): DumpFilter->writeClosePage('  </page>\n')
#27 /srv/mediawiki/php-1.40.0-wmf.10/includes/export/WikiExporter.php(613): ExportProgressFilter->writeClosePage('  </page>\n')
#28 /srv/mediawiki/php-1.40.0-wmf.10/includes/export/WikiExporter.php(501): WikiExporter->finishPageStreamOutput(Object(stdClass))
#29 /srv/mediawiki/php-1.40.0-wmf.10/includes/export/WikiExporter.php(315): WikiExporter->dumpPages('page_id >= 3220...', false)
#30 /srv/mediawiki/php-1.40.0-wmf.10/includes/export/WikiExporter.php(198): WikiExporter->dumpFrom('page_id >= 3220...', false)
#31 /srv/mediawiki/php-1.40.0-wmf.10/maintenance/includes/BackupDumper.php(359): WikiExporter->pagesByRange(3220001, 3230001, false)
#32 /srv/mediawiki/php-1.40.0-wmf.10/maintenance/dumpBackup.php(84): BackupDumper->dump(2, 0)
#33 /srv/mediawiki/php-1.40.0-wmf.10/maintenance/includes/MaintenanceRunner.php(309): DumpBackup->execute()
#34 /srv/mediawiki/php-1.40.0-wmf.10/maintenance/doMaintenance.php(85): MediaWiki\Maintenance\MaintenanceRunner->run()
#35 /srv/mediawiki/php-1.40.0-wmf.10/maintenance/dumpBackup.php(144): require_once('/srv/mediawiki/...')
#36 /srv/mediawiki/multiversion/MWScript.php(120): require_once('/srv/mediawiki/...')
#37 {main}
Wikimedia\Rdbms\DBConnectionError from line 1341 of /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Cannot access the database: could not connect to any replica DB server
#0 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/loadbalancer/LoadBalancer.php(514): Wikimedia\Rdbms\LoadBalancer->reportConnectionError('could not conne...')
#1 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/loadbalancer/LoadBalancer.php(920): Wikimedia\Rdbms\LoadBalancer->getConnectionIndex(-1, Array, 'cebwiki')
#2 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(103): Wikimedia\Rdbms\LoadBalancer->getConnectionInternal(-1, Array, 'cebwiki', 0)
#3 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(117): Wikimedia\Rdbms\DBConnRef->ensureConnection()
#4 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(356): Wikimedia\Rdbms\DBConnRef->__call('selectRow', Array)
#5 /srv/mediawiki/php-1.40.0-wmf.10/includes/libs/rdbms/querybuilder/SelectQueryBuilder.php(689): Wikimedia\Rdbms\DBConnRef->selectRow(Array, Array, Array, 'MediaWiki\\Page\\...', Array, Array)
#6 /srv/mediawiki/php-1.40.0-wmf.10/includes/page/PageStore.php(199): Wikimedia\Rdbms\SelectQueryBuilder->fetchRow()
#7 /srv/mediawiki/php-1.40.0-wmf.10/includes/cache/LinkCache.php(461): MediaWiki\Page\PageStore->MediaWiki\Page\{closure}(Object(Wikimedia\Rdbms\DBConnRef), 0, 'Tangudla_Gutta', Array)
#8 /srv/mediawiki/php-1.40.0-wmf.10/includes/cache/LinkCache.php(494): LinkCache->getGoodLinkRowInternal(Object(TitleValue), Object(Closure), 0)
#9 /srv/mediawiki/php-1.40.0-wmf.10/includes/page/PageStore.php(188): LinkCache->getGoodLinkRow(0, 'Tangudla_Gutta', Object(Closure), 0)
#10 /srv/mediawiki/php-1.40.0-wmf.10/includes/page/PageStore.php(154): MediaWiki\Page\PageStore->getPageByNameViaLinkCache(0, 'Tangudla_Gutta', 0)
#11 /srv/mediawiki/php-1.40.0-wmf.10/includes/page/PageStore.php(326): MediaWiki\Page\PageStore->getPageByName(0, 'Tangudla_Gutta', 0)
#12 /srv/mediawiki/php-1.40.0-wmf.10/includes/title/Title.php(4108): MediaWiki\Page\PageStore->getPageByReference(Object(Title), 0)
#13 /srv/mediawiki/php-1.40.0-wmf.10/includes/title/Title.php(1100): Title->getFieldFromPageStore('page_content_mo...', 0)
#14 /srv/mediawiki/php-1.40.0-wmf.10/extensions/ActiveAbstract/includes/AbstractFilter.php(131): Title->getContentModel()
#15 /srv/mediawiki/php-1.40.0-wmf.10/includes/export/DumpFilter.php(81): MediaWiki\Extension\ActiveAbstract\AbstractFilter->writeClosePage('  </page>\n')
#16 /srv/mediawiki/php-1.40.0-wmf.10/includes/export/ExportProgressFilter.php(44): DumpFilter->writeClosePage('  </page>\n')
#17 /srv/mediawiki/php-1.40.0-wmf.10/includes/export/WikiExporter.php(613): ExportProgressFilter->writeClosePage('  </page>\n')
#18 /srv/mediawiki/php-1.40.0-wmf.10/includes/export/WikiExporter.php(501): WikiExporter->finishPageStreamOutput(Object(stdClass))
#19 /srv/mediawiki/php-1.40.0-wmf.10/includes/export/WikiExporter.php(315): WikiExporter->dumpPages('page_id >= 5020...', false)
#20 /srv/mediawiki/php-1.40.0-wmf.10/includes/export/WikiExporter.php(198): WikiExporter->dumpFrom('page_id >= 5020...', false)
#21 /srv/mediawiki/php-1.40.0-wmf.10/maintenance/includes/BackupDumper.php(359): WikiExporter->pagesByRange(5020001, 5030001, false)
#22 /srv/mediawiki/php-1.40.0-wmf.10/maintenance/dumpBackup.php(84): BackupDumper->dump(2, 0)
#23 /srv/mediawiki/php-1.40.0-wmf.10/maintenance/includes/MaintenanceRunner.php(309): DumpBackup->execute()
#24 /srv/mediawiki/php-1.40.0-wmf.10/maintenance/doMaintenance.php(85): MediaWiki\Maintenance\MaintenanceRunner->run()
#25 /srv/mediawiki/php-1.40.0-wmf.10/maintenance/dumpBackup.php(144): require_once('/srv/mediawiki/...')
#26 /srv/mediawiki/multiversion/MWScript.php(120): require_once('/srv/mediawiki/...')
#27 {main}

This one is from cebwiki; it was also seen on a handful of other wikis. Note that the abstracts jobs did eventually retry and complete.

Sample job that was running and failed (for enwiktionary this time, similar stack trace):

[20221121211226]: nonzero return 1 from command '/usr/bin/php7.4 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=enwiktionary /srv/mediawiki/php-1.40.0-wmf.10 --plugin=AbstractFilter:/srv/mediawiki/php-1.40.0-wmf.10/extensions/ActiveAbstract/includes/AbstractFilter.php --dbgroupdefault=dump --current --report=1000 --namespaces=0 --output=file:/mnt/dumpsdata/xmldatadumps/temp/e/enwiktionary/enwiktionary-20221120-abstract.xml.gz.inprog_tmp --filter=namespace:NS_MAIN --filter=noredirect --filter=abstract --skip-header --start=6920001 --skip-footer --end 6930001'
nonzero return 1 from command '/usr/bin/php7.4 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=enwiktionary /srv/mediawiki/php-1.40.0-wmf.10 --plugin=AbstractFilter:/srv/mediawiki/php-1.40.0-wmf.10/extensions/ActiveAbstract/includes/AbstractFilter.php --dbgroupdefault=dump --current --report=1000 --namespaces=0 --output=file:/mnt/dumpsdata/xmldatadumps/temp/e/enwiktionary/enwiktionary-20221120-abstract.xml.gz.inprog_tmp --filter=namespace:NS_MAIN --filter=noredirect --filter=abstract --skip-header --start=8510001 --skip-footer --end 8520001'

I noticed some things off about reconfigure:

  • It uses the new group load for server #X when X in the new config might refer to a different server (due to reindexing). This breaks LoadMonitor when invoked from LoadBalancer for all non-zero load DBs (there could be a phantom server Y with load but no "servers" entry). It also breaks getConnection() if the group reader index is not already set and the phantom server is picked (getServerInfoStrict exception). The latter is less likely, though.
  • It doesn't notice when a DB was removed and replaced with another (the server count could remain unchanged).
  • It updates load weights, but only if it thinks a server was removed (inconsistent).
  • On switch-over, it removes the primary server index, breaking LB methods, but only if no replacement replica server was added (inconsistent).
  • The rebuild loop should really only happen once, even if two servers were found to be depooled (this is mostly stylistic, though).
  • IMO, it should handle servers that have their DB_REPLICA load dropped to 0 (including the primary, since it might have non-zero load on non-WMF sites).
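As an illustration of several points in the list above, here is a small Python sketch that compares the old and new config by server name instead of by index or by server count. It is only a sketch under assumptions: `plan_reconfigure` and its input shape (a name-to-load mapping) are hypothetical, and it does not claim to show what LoadBalancer::reconfigure or any patch on this task actually does.

```python
# Hypothetical sketch; not the LoadBalancer::reconfigure implementation.

def plan_reconfigure(old_loads, new_loads):
    """Compare two name -> load mappings and report what changed.

    Keying on names means a remove-and-replace is detected even when
    the server count stays the same, and the whole plan is computed in
    a single pass no matter how many servers were depooled.
    """
    old_names, new_names = set(old_loads), set(new_loads)
    kept = old_names & new_names
    return {
        "removed": sorted(old_names - new_names),
        "added": sorted(new_names - old_names),
        # Weight changes are reported whether or not anything was removed.
        "reweighted": sorted(n for n in kept if old_loads[n] != new_loads[n]),
        # Servers still listed but with load 0 (may include the primary).
        "zero_load": sorted(n for n, load in new_loads.items() if load == 0),
    }


# Made-up example: db2 is swapped for db4 and db3 is reweighted,
# while the number of servers stays at three.
old = {"db1": 0, "db2": 300, "db3": 400}
new = {"db1": 0, "db4": 300, "db3": 200}
plan = plan_reconfigure(old, new)
print(plan)
```

A count-based check would see three servers before and after and conclude nothing changed; the name-based comparison reports db2 as removed and db4 as added.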

Well, "off" doesn't mean it's wrong. It's subjective. For example:

It updates load weights, but only if it thinks a server was removed (inconsistent)

That is intentional. The whole problem reconfigure is meant to solve is draining a replica once it's depooled. It's not about updating load weights and so on. In fact, the reason it updates the load weights is that all of the implicit logic in the code is entangled, making it impossible to remove a replica without breaking half of LB.

The solution I have is much simpler: just disable load monitor where config reload is active (CLI). Besides the fact that load monitor has been useless in preventing any outages so far, CLI scripts don't have any real impact on the actual load.
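The idea can be sketched in a few lines of Python. This is an illustration only: the class names, the `make_load_monitor` factory, the `max_lag` threshold and the rule of dropping a lagged server's load to 0 are all assumptions made for the example, not the actual MediaWiki classes or the contents of the config patch.

```python
# Hypothetical sketch; class and function names are not from MediaWiki.

class NullLoadMonitor:
    """Returns the configured loads unchanged and opens no connections."""

    def scale_loads(self, loads):
        return dict(loads)


class LagAwareLoadMonitor:
    """Drops the load of any server whose lag exceeds max_lag (assumed rule)."""

    def __init__(self, lag_by_server, max_lag):
        self.lag_by_server = lag_by_server
        self.max_lag = max_lag

    def scale_loads(self, loads):
        return {
            name: 0 if self.lag_by_server.get(name, 0) > self.max_lag else load
            for name, load in loads.items()
        }


def make_load_monitor(is_cli, lag_by_server, max_lag=5):
    """Pick the no-op monitor for CLI runs, where config reload is active."""
    if is_cli:
        return NullLoadMonitor()
    return LagAwareLoadMonitor(lag_by_server, max_lag)


loads = {"db1": 100, "db2": 100}
lag = {"db1": 0, "db2": 30}
print(make_load_monitor(True, lag).scale_loads(loads))
print(make_load_monitor(False, lag).scale_loads(loads))
```

With the no-op monitor there is no cached, index-keyed state left to go stale when the config is reloaded, which is why this sidesteps the error rather than fixing the monitor itself.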

Change 871137 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: various fixes to LoadBalancer::reconfigure

https://gerrit.wikimedia.org/r/871137

Change 874899 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Disable LoadMonitor in CLI

https://gerrit.wikimedia.org/r/874899

Change 874899 merged by jenkins-bot:

[operations/mediawiki-config@master] Disable LoadMonitor in CLI

https://gerrit.wikimedia.org/r/874899

Mentioned in SAL (#wikimedia-operations) [2023-01-04T15:23:31Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:874899|Disable LoadMonitor in CLI (T322156)]]

Mentioned in SAL (#wikimedia-operations) [2023-01-04T15:25:12Z] <ladsgroup@deploy1002> ladsgroup and ladsgroup: Backport for [[gerrit:874899|Disable LoadMonitor in CLI (T322156)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-01-04T15:33:20Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:874899|Disable LoadMonitor in CLI (T322156)]] (duration: 09m 48s)

I'll wait until the next dump run. Let's see if this fixes the problem properly.

Is this still an issue?

The last patch went out after the Jan 1 full dumps run had already started, so we'll need to wait for the end of the Feb 1 run to be sure. It is looking good so far though.

So far, we haven’t had any errors; we'll be certain when the full run completes in a day or two, and we can update the task by then.

The dumps run is done and we didn't get any errors. Thank you @Ladsgroup

It's back from the same code path (though not from dump servers, but it was never specific to that in the first place):

https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-test-1-1.11.0-6-2023.10?id=kvlAxYYBtuN2AbPYIXeH

PHP Notice: Undefined index: DEFAULT

timestamp: Mar 9, 2023 @ 07:22:38.755

exception.trace:
from /srv/mediawiki/php-1.40.0-wmf.26/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(350)
#2 /srv/mediawiki/php-1.40.0-wmf.26/maintenance/includes/Maintenance.php(1208): Wikimedia\Rdbms\LBFactory->autoReconfigure()
#3 /srv/mediawiki/php-1.40.0-wmf.26/maintenance/migrateRevisionCommentTemp.php(101): Maintenance->waitForReplication()
…
#7 /srv/mediawiki/php-1.40.0-wmf.26/maintenance/migrateRevisionCommentTemp.php(115): require_once(string)

Correlates perfectly to pooling changes for T329203 around the same time frame:

--- codfw/groupLoadsBySection/DEFAULT live
+++ codfw/groupLoadsBySection/DEFAULT generated
@@ -1,8 +1 @@
-{
- "dump": {
- "db2109": 100
- },
- "vslow": {
- "db2109": 100
- }
-}
+{}
--- codfw/sectionLoads/DEFAULT live
+++ codfw/sectionLoads/DEFAULT generated
@@ -3,7 +3,6 @@
 "db2105": 0
 },
 {
- "db2109": 300,
 "db2127": 400,
 "db2149": 400,
 "db2156": 300,

--- codfw/groupLoadsBySection/DEFAULT live
+++ codfw/groupLoadsBySection/DEFAULT generated
@@ -1 +1,8 @@
-{}
+{
+ "dump": {
+ "db2109": 10
+ },
+ "vslow": {
+ "db2109": 10
+ }
+}
--- codfw/sectionLoads/DEFAULT live
+++ codfw/sectionLoads/DEFAULT generated
@@ -3,6 +3,7 @@
 "db2105": 0
 },
 {
+ "db2109": 30,
 "db2127": 400,
 "db2149": 400,
 "db2156": 300,

The good news is that these are recoverable runtime warnings, and these undefined values are (luckily) treated like empty arrays where an array is expected, so there is no observable impact right now besides log noise.
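For illustration, a lookup that makes the "treat missing as empty" behaviour explicit could look like the Python sketch below. It is a sketch under assumptions: `group_loads_for` is a hypothetical helper, and it assumes the `DEFAULT` key is simply absent from the group-loads mapping while db2109 is depooled, as the notice and the diffs above suggest.

```python
# Hypothetical helper; not the LBFactoryMulti implementation.

def group_loads_for(group_loads_by_section, section, group):
    """Return the name -> load mapping for a query group in a section.

    A missing section or group yields an empty mapping instead of an
    undefined-index notice.
    """
    return group_loads_by_section.get(section, {}).get(group, {})


# Shape taken from the second diff above, after db2109 is repooled.
pooled = {"DEFAULT": {"dump": {"db2109": 10}, "vslow": {"db2109": 10}}}
# Assumed shape while db2109 is depooled: no DEFAULT entry at all.
depooled = {}

print(group_loads_for(pooled, "DEFAULT", "dump"))
print(group_loads_for(depooled, "DEFAULT", "dump"))
```

The caller then sees an empty mapping in both the "section missing" and "group missing" cases, so the behaviour no longer depends on how the runtime happens to coerce an undefined value.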

Krinkle renamed this task from New errors during this month's full dump run: LoadBalancer.php: No server with index '4' to "No server with index" and "Warning: Undefined index" from LoadBalancer::reconfigure. Mar 9 2023, 4:33 PM
Krinkle reassigned this task from Ladsgroup to aaron.
Krinkle added a subscriber: Ladsgroup.

Change 871137 merged by jenkins-bot:

[mediawiki/core@master] rdbms: various fixes to LoadBalancer::reconfigure

https://gerrit.wikimedia.org/r/871137

Change 852220 abandoned by Krinkle:

[mediawiki/core@master] rdbms: Avoid errors in case a depooled in load monitor

Reason:

Superseded by https://gerrit.wikimedia.org/r/871137 (I think)

https://gerrit.wikimedia.org/r/852220