LBFactoryMulti: Unknown cluster 'cluster14'
Closed, Resolved | Public | PRODUCTION ERROR

Description

Error

MediaWiki version: 1.35.0-wmf.30

message
Unknown cluster 'cluster14'

Notes

I've seen exactly 1 of these.

From IRC:

09:40 <brennen> hrm - .30 i/l/r/l/LBFactoryMulti:177  Unknown cluster 'cluster14' - odd.
09:40 <James_F> ES issue?
09:41 <@Reedy> There's no cluster14 in db-eqiad...
09:42 <James_F> That'd not help.

Note "[username]" placeholder in request URL in place of actual username.

Details

Request ID
7c2ded07-61de-4ccc-bcb7-599f3640a014
Request URL
https://test.wikipedia.org/w/api.php?action=feedcontributions&user=[username]&feedformat=atom
Stack Trace
exception.trace
#0 /srv/mediawiki/php-1.35.0-wmf.30/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(194): Wikimedia\Rdbms\LBFactoryMulti->newExternalLB(string, integer)
#1 /srv/mediawiki/php-1.35.0-wmf.30/includes/externalstore/ExternalStoreDB.php(151): Wikimedia\Rdbms\LBFactoryMulti->getExternalLB(string)
#2 /srv/mediawiki/php-1.35.0-wmf.30/includes/externalstore/ExternalStoreDB.php(162): ExternalStoreDB->getLoadBalancer(string)
#3 /srv/mediawiki/php-1.35.0-wmf.30/includes/externalstore/ExternalStoreDB.php(311): ExternalStoreDB->getReplica(string)
#4 /srv/mediawiki/php-1.35.0-wmf.30/includes/externalstore/ExternalStoreDB.php(66): ExternalStoreDB->fetchBlob(string, string, string)
#5 /srv/mediawiki/php-1.35.0-wmf.30/includes/externalstore/ExternalStoreAccess.php(52): ExternalStoreDB->fetchFromURL(string)
#6 /srv/mediawiki/php-1.35.0-wmf.30/includes/Storage/SqlBlobStore.php(501): ExternalStoreAccess->fetchFromURL(string, array)
#7 /srv/mediawiki/php-1.35.0-wmf.30/includes/libs/objectcache/wancache/WANObjectCache.php(1504): MediaWiki\Storage\SqlBlobStore->MediaWiki\Storage\{closure}(boolean, integer, array, NULL, array)
#8 /srv/mediawiki/php-1.35.0-wmf.30/includes/libs/objectcache/wancache/WANObjectCache.php(1347): WANObjectCache->fetchOrRegenerate(string, integer, Closure, array, array)
#9 /srv/mediawiki/php-1.35.0-wmf.30/includes/Storage/SqlBlobStore.php(505): WANObjectCache->getWithSetCallback(string, integer, Closure, array)
#10 /srv/mediawiki/php-1.35.0-wmf.30/includes/Storage/SqlBlobStore.php(424): MediaWiki\Storage\SqlBlobStore->expandBlob(string, array, string)
#11 /srv/mediawiki/php-1.35.0-wmf.30/includes/Storage/SqlBlobStore.php(286): MediaWiki\Storage\SqlBlobStore->fetchBlobs(array, integer)
#12 /srv/mediawiki/php-1.35.0-wmf.30/includes/libs/objectcache/wancache/WANObjectCache.php(1504): MediaWiki\Storage\SqlBlobStore->MediaWiki\Storage\{closure}(boolean, integer, array, NULL, array)
#13 /srv/mediawiki/php-1.35.0-wmf.30/includes/libs/objectcache/wancache/WANObjectCache.php(1347): WANObjectCache->fetchOrRegenerate(string, integer, Closure, array, array)
#14 /srv/mediawiki/php-1.35.0-wmf.30/includes/Storage/SqlBlobStore.php(291): WANObjectCache->getWithSetCallback(string, integer, Closure, array)
#15 /srv/mediawiki/php-1.35.0-wmf.30/includes/Revision/RevisionStore.php(1014): MediaWiki\Storage\SqlBlobStore->getBlob(string, integer)
#16 /srv/mediawiki/php-1.35.0-wmf.30/includes/Revision/RevisionStore.php(1244): MediaWiki\Revision\RevisionStore->loadSlotContent(MediaWiki\Revision\SlotRecord, NULL, NULL, NULL, integer)
#17 [internal function]: MediaWiki\Revision\RevisionStore->MediaWiki\Revision\{closure}(MediaWiki\Revision\SlotRecord)
#18 /srv/mediawiki/php-1.35.0-wmf.30/includes/Revision/SlotRecord.php(307): call_user_func(Closure, MediaWiki\Revision\SlotRecord)
#19 /srv/mediawiki/php-1.35.0-wmf.30/includes/Revision/RevisionRecord.php(175): MediaWiki\Revision\SlotRecord->getContent()
#20 /srv/mediawiki/php-1.35.0-wmf.30/includes/api/ApiFeedContributions.php(187): MediaWiki\Revision\RevisionRecord->getContent(string)
#21 /srv/mediawiki/php-1.35.0-wmf.30/includes/api/ApiFeedContributions.php(158): ApiFeedContributions->feedItemDesc(MediaWiki\Revision\RevisionStoreRecord)
#22 /srv/mediawiki/php-1.35.0-wmf.30/includes/api/ApiFeedContributions.php(120): ApiFeedContributions->feedItem(stdClass)
#23 /srv/mediawiki/php-1.35.0-wmf.30/includes/api/ApiMain.php(1580): ApiFeedContributions->execute()
#24 /srv/mediawiki/php-1.35.0-wmf.30/includes/api/ApiMain.php(523): ApiMain->executeAction()
#25 /srv/mediawiki/php-1.35.0-wmf.30/includes/api/ApiMain.php(494): ApiMain->executeActionWithErrorHandling()
#26 /srv/mediawiki/php-1.35.0-wmf.30/api.php(84): ApiMain->execute()
#27 /srv/mediawiki/w/api.php(3): require(string)
#28 {main}

Event Timeline

Reedy triaged this task as Lowest priority. May 4 2020, 6:20 PM
Reedy added a project: DBA.
Reedy added a subscriber: Reedy.

There's no cluster14 in db-eqiad.php... Tagging DBA but imagine they might not be aware of history from 2007/2008

And noting this is only a testwiki

MariaDB [testwiki]> select old_text, rev_timestamp, page_namespace, page_title from text INNER JOIN revision ON (old_id=rev_text_id) INNER JOIN page ON (rev_page=page_id) where old_text LIKE '%cluster14%';
+-------------------------+----------------+----------------+---------------------+
| old_text                | rev_timestamp  | page_namespace | page_title          |
+-------------------------+----------------+----------------+---------------------+
| DB://cluster14/15848/4  | 20070906230802 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/5  | 20070913223733 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/18 | 20080313231235 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/19 | 20080314073404 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/1  | 20070815004911 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/6  | 20071007042627 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/7  | 20071007042805 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/8  | 20071007043003 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/10 | 20080201140335 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/2  | 20070830020120 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/3  | 20070906032228 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/0  | 20070710120318 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/11 | 20080302171925 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/20 | 20080325153429 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/9  | 20080123002633 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/22 | 20080529153306 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/23 | 20080529153334 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/24 | 20080530055940 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/12 | 20080308231544 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/15 | 20080309161621 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/13 | 20080309015628 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/16 | 20080310024307 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/14 | 20080309140739 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/17 | 20080310160755 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/25 | 20080530154833 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/21 | 20080427054003 |              1 | Main_Page/Archive_3 |
+-------------------------+----------------+----------------+---------------------+
26 rows in set (1.57 sec)
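The old_text values above follow the pattern DB://<cluster>/<id>/<item>, and the stack trace shows the same pieces being passed on as fetchBlob('cluster14', '15848', '0'). A minimal Python sketch of that split and of the failing cluster lookup follows. It is an illustration only, not the MediaWiki code: the names `ExternalAddress`, `parse_external_url`, `check_cluster` and the `KNOWN_CLUSTERS` set are hypothetical, and treating the item part as optional is an assumption.

```python
from typing import NamedTuple, Optional


class ExternalAddress(NamedTuple):
    cluster: str
    blob_id: str
    item_id: Optional[str]


# Hypothetical stand-in for the clusters configured in db-eqiad.php.
KNOWN_CLUSTERS = {"cluster10", "cluster20"}


def parse_external_url(url: str) -> ExternalAddress:
    """Split 'DB://<cluster>/<id>[/<item>]' into its parts."""
    prefix = "DB://"
    if not url.startswith(prefix):
        raise ValueError(f"not a DB:// address: {url!r}")
    parts = url[len(prefix):].split("/")
    # Assumption: the item part may be absent; empty parts are rejected.
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"malformed DB:// address: {url!r}")
    item_id = parts[2] if len(parts) == 3 else None
    return ExternalAddress(parts[0], parts[1], item_id)


def check_cluster(addr: ExternalAddress) -> None:
    """Raise if the address points at a cluster that is not configured."""
    if addr.cluster not in KNOWN_CLUSTERS:
        raise LookupError(f"Unknown cluster '{addr.cluster}'")


if __name__ == "__main__":
    addr = parse_external_url("DB://cluster14/15848/4")
    print(addr)
    try:
        check_cluster(addr)
    except LookupError as exc:
        print(exc)
```

The address itself parses cleanly; only the lookup fails, which matches the picture of intact rows pointing at a cluster that is no longer configured.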

So the page in question is https://test.wikipedia.org/wiki/Talk:Main_Page/Archive_3

We could just delete the revisions from the page?

Marostegui added a subscriber: Marostegui.

I checked on es1 and es2 to see if there was some trace of any table file on anywiki with that name, but no luck. Maybe they were just part of a test?
It is interesting how references on db-eqiad.php just jump from cluster10 to cluster20.

If those are the only rows and only on testwiki I guess it is ok to delete them?

(Removing the tag for us, but staying subscribed on the task just in case).

https://wikitech-static.wikimedia.org/wiki/Server_admin_log/Archive_11

14:10 jeluf: External storage cluster 14 (srv139,138,137) enabled, replacing cluster 11 in $wgDefaultExternalStore

It definitely did exist at one point..

https://wikitech.wikimedia.org/wiki/Server_admin_log/2008-09

13:11 Tim: cluster13 and cluster14 both have only one server left in rotation. Shut down apache on srv129 and srv139 out of fear that it might hasten their doom.

https://wikitech.wikimedia.org/wiki/Server_admin_log/Archive_13

22:29 Tim: deleted cluster13 and cluster14 backups on storage2
08:05 Tim: removed cluster13 and cluster14 from db.php, will watch exception.log for attempted connections

I wonder if it was moved/folded into another cluster and then deleted... And maybe testwiki was just missed from the migration?

So the page in question is https://test.wikipedia.org/wiki/Talk:Main_Page/Archive_3

We could just delete the revisions from the page?

The cleaner way to solve this would be to mark the blobs as bad in the content table by running maintenance/findBadBlobs.php, see T205936.

Other entries in Archive 13 of the SAL show that cluster13 was a source cluster for the first run of recompressTracked.php. Notably, from December 2008:

December 30
09:23 TimStarling: testing recompressTracked on testwiki

So, I guess the test was partially successful, some pages were broken. No big deal. The simplest solution is to stop caring. Delete the rows if they are annoying you.

So the page in question is https://test.wikipedia.org/wiki/Talk:Main_Page/Archive_3

We could just delete the revisions from the page?

The cleaner way to solve this would be to mark the blobs as bad in the content table by running maintenance/findBadBlobs.php, see T205936.

Seems it's pretty broken

reedy@deploy1001:~$ mwscript findBadBlobs.php --wiki=testwiki --revisions=27195,27272,43652,43945,26023,27896,27897,27900,33086,26745,27174,25194,39680,49078,32765,59757,59758,59776,42405,42818,42517,43023,42778,43039,59784,58620
Scanning 26 ids
InvalidArgumentException from line 177 of /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/libs/rdbms/lbfactory/LBFactoryMulti.php: Unknown cluster 'cluster14'
#0 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(194): Wikimedia\Rdbms\LBFactoryMulti->newExternalLB('cluster14', 1970085862)
#1 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/externalstore/ExternalStoreDB.php(151): Wikimedia\Rdbms\LBFactoryMulti->getExternalLB('cluster14')
#2 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/externalstore/ExternalStoreDB.php(162): ExternalStoreDB->getLoadBalancer('cluster14')
#3 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/externalstore/ExternalStoreDB.php(311): ExternalStoreDB->getReplica('cluster14')
#4 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/externalstore/ExternalStoreDB.php(66): ExternalStoreDB->fetchBlob('cluster14', '15848', '0')
#5 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/externalstore/ExternalStoreAccess.php(52): ExternalStoreDB->fetchFromURL('DB://cluster14/...')
#6 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/Storage/SqlBlobStore.php(501): ExternalStoreAccess->fetchFromURL('DB://cluster14/...', Array)
#7 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/libs/objectcache/wancache/WANObjectCache.php(1504): MediaWiki\Storage\SqlBlobStore->MediaWiki\Storage\{closure}(false, 3600, Array, NULL, Array)
#8 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/libs/objectcache/wancache/WANObjectCache.php(1347): WANObjectCache->fetchOrRegenerate('global:SqlBlobS...', 3600, Object(Closure), Array, Array)
#9 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/Storage/SqlBlobStore.php(505): WANObjectCache->getWithSetCallback('global:SqlBlobS...', 3600, Object(Closure), Array)
#10 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/Storage/SqlBlobStore.php(424): MediaWiki\Storage\SqlBlobStore->expandBlob('DB://cluster14/...', Array, 'tt:24887')
#11 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/Storage/SqlBlobStore.php(286): MediaWiki\Storage\SqlBlobStore->fetchBlobs(Array, 0)
#12 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/libs/objectcache/wancache/WANObjectCache.php(1504): MediaWiki\Storage\SqlBlobStore->MediaWiki\Storage\{closure}(false, 3600, Array, NULL, Array)
#13 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/libs/objectcache/wancache/WANObjectCache.php(1347): WANObjectCache->fetchOrRegenerate('global:SqlBlobS...', 3600, Object(Closure), Array, Array)
#14 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/Storage/SqlBlobStore.php(291): WANObjectCache->getWithSetCallback('global:SqlBlobS...', 3600, Object(Closure), Array)
#15 /srv/mediawiki-staging/php-1.35.0-wmf.31/maintenance/findBadBlobs.php(439): MediaWiki\Storage\SqlBlobStore->getBlob('tt:24887')
#16 /srv/mediawiki-staging/php-1.35.0-wmf.31/maintenance/findBadBlobs.php(422): FindBadBlobs->checkSlot(Object(MediaWiki\Revision\RevisionStoreRecord), Object(MediaWiki\Revision\SlotRecord))
#17 /srv/mediawiki-staging/php-1.35.0-wmf.31/maintenance/findBadBlobs.php(342): FindBadBlobs->checkRevision(Object(MediaWiki\Revision\RevisionStoreRecord))
#18 /srv/mediawiki-staging/php-1.35.0-wmf.31/maintenance/findBadBlobs.php(149): FindBadBlobs->scanRevisionsById(Array)
#19 /srv/mediawiki-staging/php-1.35.0-wmf.31/maintenance/doMaintenance.php(105): FindBadBlobs->execute()
#20 /srv/mediawiki-staging/php-1.35.0-wmf.31/maintenance/findBadBlobs.php(503): require_once('/srv/mediawiki-...')
#21 /srv/mediawiki-staging/multiversion/MWScript.php(101): require_once('/srv/mediawiki-...')
#22 {main}
reedy@deploy1001:~$ mwscript findBadBlobs.php --wiki=testwiki --mark=T251778 --revisions=27195,27272,43652,43945,26023,27896,27897,27900,33086,26745,27174,25194,39680,49078,32765,59757,59758,59776,42405,42818,42517,43023,42778,43039,59784,58620
Scanning 26 ids
InvalidArgumentException from line 177 of /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/libs/rdbms/lbfactory/LBFactoryMulti.php: Unknown cluster 'cluster14'
#0 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(194): Wikimedia\Rdbms\LBFactoryMulti->newExternalLB('cluster14', 1219493076)
#1 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/externalstore/ExternalStoreDB.php(151): Wikimedia\Rdbms\LBFactoryMulti->getExternalLB('cluster14')
#2 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/externalstore/ExternalStoreDB.php(162): ExternalStoreDB->getLoadBalancer('cluster14')
#3 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/externalstore/ExternalStoreDB.php(311): ExternalStoreDB->getReplica('cluster14')
#4 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/externalstore/ExternalStoreDB.php(66): ExternalStoreDB->fetchBlob('cluster14', '15848', '0')
#5 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/externalstore/ExternalStoreAccess.php(52): ExternalStoreDB->fetchFromURL('DB://cluster14/...')
#6 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/Storage/SqlBlobStore.php(501): ExternalStoreAccess->fetchFromURL('DB://cluster14/...', Array)
#7 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/libs/objectcache/wancache/WANObjectCache.php(1504): MediaWiki\Storage\SqlBlobStore->MediaWiki\Storage\{closure}(false, 3600, Array, NULL, Array)
#8 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/libs/objectcache/wancache/WANObjectCache.php(1347): WANObjectCache->fetchOrRegenerate('global:SqlBlobS...', 3600, Object(Closure), Array, Array)
#9 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/Storage/SqlBlobStore.php(505): WANObjectCache->getWithSetCallback('global:SqlBlobS...', 3600, Object(Closure), Array)
#10 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/Storage/SqlBlobStore.php(424): MediaWiki\Storage\SqlBlobStore->expandBlob('DB://cluster14/...', Array, 'tt:24887')
#11 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/Storage/SqlBlobStore.php(286): MediaWiki\Storage\SqlBlobStore->fetchBlobs(Array, 0)
#12 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/libs/objectcache/wancache/WANObjectCache.php(1504): MediaWiki\Storage\SqlBlobStore->MediaWiki\Storage\{closure}(false, 3600, Array, NULL, Array)
#13 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/libs/objectcache/wancache/WANObjectCache.php(1347): WANObjectCache->fetchOrRegenerate('global:SqlBlobS...', 3600, Object(Closure), Array, Array)
#14 /srv/mediawiki-staging/php-1.35.0-wmf.31/includes/Storage/SqlBlobStore.php(291): WANObjectCache->getWithSetCallback('global:SqlBlobS...', 3600, Object(Closure), Array)
#15 /srv/mediawiki-staging/php-1.35.0-wmf.31/maintenance/findBadBlobs.php(439): MediaWiki\Storage\SqlBlobStore->getBlob('tt:24887')
#16 /srv/mediawiki-staging/php-1.35.0-wmf.31/maintenance/findBadBlobs.php(422): FindBadBlobs->checkSlot(Object(MediaWiki\Revision\RevisionStoreRecord), Object(MediaWiki\Revision\SlotRecord))
#17 /srv/mediawiki-staging/php-1.35.0-wmf.31/maintenance/findBadBlobs.php(342): FindBadBlobs->checkRevision(Object(MediaWiki\Revision\RevisionStoreRecord))
#18 /srv/mediawiki-staging/php-1.35.0-wmf.31/maintenance/findBadBlobs.php(149): FindBadBlobs->scanRevisionsById(Array)
#19 /srv/mediawiki-staging/php-1.35.0-wmf.31/maintenance/doMaintenance.php(105): FindBadBlobs->execute()
#20 /srv/mediawiki-staging/php-1.35.0-wmf.31/maintenance/findBadBlobs.php(503): require_once('/srv/mediawiki-...')
#21 /srv/mediawiki-staging/multiversion/MWScript.php(101): require_once('/srv/mediawiki-...')
#22 {main}

The cleaner way to solve this would be to mark the blobs as bad in the content table by running maintenance/findBadBlobs.php, see T205936.

Seems it's pretty broken

reedy@deploy1001:~$ mwscript findBadBlobs.php --wiki=testwiki --revisions=27195,27272,43652,43945,26023,27896,27897,27900,33086,26745,27174,25194,39680,49078,32765,59757,59758,59776,42405,42818,42517,43023,42778,43039,59784,58620

Has another task been opened for the error in maintenance/findBadBlobs.php? Is there anything else for Platform Engineering to do on this task?

The cleaner way to solve this would be to mark the blobs as bad in the content table by running maintenance/findBadBlobs.php, see T205936.

Seems it's pretty broken

Not really, it is triggering the expected error in the right place. It just fails to catch the resulting exception. The problem is to only catch the "right" exceptions, see discussion at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/584698/6/maintenance/markBadBlobs.php#197.

My proposal for fixing this is to make ExternalStoreDB::getLoadBalancer() catch any exceptions it gets from LoadBalancer and re-throw them wrapped in an ExternalStoreException. This would allow the script to catch the exception and mark the blob as bad.

The alternative would be to make findBadBlobs.php catch all exceptions and mark revisions as bad regardless of why they fail to load.
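The two options can be contrasted with a small Python sketch. It is not the PHP patch: `ExternalStoreError` and `UnknownClusterError` are hypothetical stand-ins for ExternalStoreException and the InvalidArgumentException seen in the trace, and the function names are invented for illustration.

```python
class ExternalStoreError(Exception):
    """Hypothetical stand-in for ExternalStoreException."""


class UnknownClusterError(ValueError):
    """Hypothetical stand-in for the InvalidArgumentException from LBFactoryMulti."""


def new_load_balancer(cluster: str, known: set) -> str:
    """Lower layer: fails with a generic argument error for unknown clusters."""
    if cluster not in known:
        raise UnknownClusterError(f"Unknown cluster '{cluster}'")
    return f"lb:{cluster}"


def get_load_balancer(cluster: str, known: set) -> str:
    """Option 1: wrap any lower-level failure in a storage-specific exception."""
    try:
        return new_load_balancer(cluster, known)
    except Exception as exc:
        # Keep the original exception as the cause for debugging.
        raise ExternalStoreError(str(exc)) from exc


def is_bad_narrow(cluster: str, known: set) -> bool:
    """The script keeps catching only the storage-specific type."""
    try:
        get_load_balancer(cluster, known)
    except ExternalStoreError:
        return True
    return False


def is_bad_broad(cluster: str, known: set) -> bool:
    """Option 2: the script catches everything, whatever the cause."""
    try:
        new_load_balancer(cluster, known)
    except Exception:
        return True
    return False
```

In this sketch both variants flag cluster14 as bad. The difference is where the decision lives: option 1 changes what the storage layer throws, option 2 changes what the script is willing to swallow.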

Change 597283 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] ExternalStoreDB: wrap database exceptions

https://gerrit.wikimedia.org/r/597283

MariaDB [testwiki]> select old_text, rev_timestamp, page_namespace, page_title from text INNER JOIN revision ON (old_id=rev_text_id) INNER JOIN page ON (rev_page=page_id) where old_text LIKE '%cluster14%';
+-------------------------+----------------+----------------+---------------------+
| old_text                | rev_timestamp  | page_namespace | page_title          |
+-------------------------+----------------+----------------+---------------------+
| DB://cluster14/15848/4  | 20070906230802 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/5  | 20070913223733 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/18 | 20080313231235 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/19 | 20080314073404 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/1  | 20070815004911 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/6  | 20071007042627 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/7  | 20071007042805 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/8  | 20071007043003 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/10 | 20080201140335 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/2  | 20070830020120 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/3  | 20070906032228 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/0  | 20070710120318 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/11 | 20080302171925 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/20 | 20080325153429 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/9  | 20080123002633 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/22 | 20080529153306 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/23 | 20080529153334 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/24 | 20080530055940 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/12 | 20080308231544 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/15 | 20080309161621 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/13 | 20080309015628 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/16 | 20080310024307 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/14 | 20080309140739 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/17 | 20080310160755 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/25 | 20080530154833 |              1 | Main_Page/Archive_3 |
| DB://cluster14/15848/21 | 20080427054003 |              1 | Main_Page/Archive_3 |
+-------------------------+----------------+----------------+---------------------+
26 rows in set (1.57 sec)

Updated query given rev_text_id was dropped:

(testwiki)> select CONCAT('tt:', old_id) as _tt, old_id, old_text FROM text where old_text LIKE 'DB://cluster14%';
+----------+--------+-------------------------+
| _tt      | old_id | old_text                |
+----------+--------+-------------------------+
| tt:24887 |  24887 | DB://cluster14/15848/0  |
| tt:25708 |  25708 | DB://cluster14/15848/1  |
| tt:26418 |  26418 | DB://cluster14/15848/2  |
| tt:26846 |  26846 | DB://cluster14/15848/3  |
| tt:26864 |  26864 | DB://cluster14/15848/4  |
| tt:26933 |  26933 | DB://cluster14/15848/5  |
| tt:27552 |  27552 | DB://cluster14/15848/6  |
| tt:27553 |  27553 | DB://cluster14/15848/7  |
| tt:27556 |  27556 | DB://cluster14/15848/8  |
| tt:32314 |  32314 | DB://cluster14/15848/9  |
| tt:32622 |  32622 | DB://cluster14/15848/10 |
| tt:38715 |  38715 | DB://cluster14/15848/11 |
| tt:41370 |  41370 | DB://cluster14/15848/12 |
| tt:41479 |  41479 | DB://cluster14/15848/13 |
| tt:41726 |  41726 | DB://cluster14/15848/14 |
| tt:41766 |  41766 | DB://cluster14/15848/15 |
| tt:41966 |  41966 | DB://cluster14/15848/16 |
| tt:41981 |  41981 | DB://cluster14/15848/17 |
| tt:42562 |  42562 | DB://cluster14/15848/18 |
| tt:42853 |  42853 | DB://cluster14/15848/19 |
| tt:47951 |  47951 | DB://cluster14/15848/20 |
| tt:57371 |  57371 | DB://cluster14/15848/21 |
| tt:58442 |  58442 | DB://cluster14/15848/22 |
| tt:58443 |  58443 | DB://cluster14/15848/23 |
| tt:58460 |  58460 | DB://cluster14/15848/24 |
| tt:58468 |  58468 | DB://cluster14/15848/25 |
+----------+--------+-------------------------+
26 rows in set (0.28 sec)

And the associated content table rows:

select content_id,content_address FROM content where content_address IN (select CONCAT('tt:', old_id) as _tt FROM text where old_text LIKE 'DB://cluster14%');
+------------+-----------------+
| content_id | content_address |
+------------+-----------------+
|      10139 | tt:24887        |
|      10416 | tt:25708        |
|      10644 | tt:26418        |
|      10983 | tt:26846        |
|      10999 | tt:26864        |
|      11047 | tt:26933        |
|      11461 | tt:27552        |
|      11462 | tt:27553        |
|      11465 | tt:27556        |
|      13654 | tt:32314        |
|      13830 | tt:32622        |
|      19164 | tt:38715        |
|      21498 | tt:41370        |
|      21597 | tt:41479        |
|      21836 | tt:41726        |
|      21872 | tt:41766        |
|      22025 | tt:41966        |
|      22036 | tt:41981        |
|      22510 | tt:42562        |
|      22792 | tt:42853        |
|      27261 | tt:47951        |
|      36203 | tt:57371        |
|      36931 | tt:58442        |
|      36932 | tt:58443        |
|      36938 | tt:58460        |
|      36944 | tt:58468        |
+------------+-----------------+
26 rows in set (11.11 sec)
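For anyone wanting to try the lookup without access to testwiki, the same join can be reproduced against a throwaway SQLite database. This is a sketch under assumptions: the two-column tables are simplified stand-ins for the real text and content tables, SQLite's `||` replaces CONCAT, and the cluster20 row is a made-up control row. The other ID pairs are taken from the output above.

```python
import sqlite3


def find_affected_content(conn: sqlite3.Connection) -> list:
    """Return (content_id, content_address) rows that point at cluster14 blobs."""
    return conn.execute(
        """
        SELECT content_id, content_address
        FROM content
        WHERE content_address IN (
            SELECT 'tt:' || old_id FROM text
            WHERE old_text LIKE 'DB://cluster14%'
        )
        ORDER BY content_id
        """
    ).fetchall()


def build_sample_db() -> sqlite3.Connection:
    """Create an in-memory database with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE text (old_id INTEGER PRIMARY KEY, old_text TEXT);
        CREATE TABLE content (content_id INTEGER PRIMARY KEY, content_address TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO text VALUES (?, ?)",
        [
            (24887, "DB://cluster14/15848/0"),
            (25708, "DB://cluster14/15848/1"),
            (99999, "DB://cluster20/1/0"),  # hypothetical control row
        ],
    )
    conn.executemany(
        "INSERT INTO content VALUES (?, ?)",
        [(10139, "tt:24887"), (10416, "tt:25708"), (50000, "tt:99999")],
    )
    return conn


if __name__ == "__main__":
    for row in find_affected_content(build_sample_db()):
        print(row)
```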

Per the logic in findBadBlobs.php#markBlob I suggest changing these as follows:

   content_address
-  tt:24887
+  bad:tt%3A24887?reason=T251778&error=Unknown+cluster14
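The shape of the suggested address (the original address URL-encoded after a bad: prefix, followed by a query string carrying reason and error) can be reproduced with a short Python sketch. This is not the markBlob implementation; `make_bad_address` is a hypothetical helper, and the assumption is that ordinary URL encoding is all that is involved.

```python
from urllib.parse import quote, urlencode


def make_bad_address(address: str, reason: str, error: str) -> str:
    """Build a 'bad:' address from the original address plus reason and error."""
    # Encode every reserved character in the address, including ':'.
    encoded_address = quote(address, safe="")
    # urlencode turns spaces into '+' and quotes into %27.
    query = urlencode({"reason": reason, "error": error})
    return f"bad:{encoded_address}?{query}"


if __name__ == "__main__":
    print(make_bad_address("tt:24887", "T251778", "Unknown cluster14"))
```

With the inputs shown, the helper yields the same string as the suggested replacement value above.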

Per my code review at https://gerrit.wikimedia.org/r/597283, I don't think we should tolerate undefined ES or LBFactory stores, this is an infrastructure issue, not an issue with the format of anything in the database, and hopefully a one-off. Ignoring these at the cost of a general bad-wipe in the future due to something intermittent seems like a high risk that we can avoid at little to no cost by fixing these by hand instead.

Or, to do it scripted, perhaps by adding a mode to findBadBlobs that takes explicit content_id numbers e.g. as comma-delimited input and then mark them with a given reason/error.

Per my code review at https://gerrit.wikimedia.org/r/597283, I don't think we should tolerate undefined ES or LBFactory stores, this is an infrastructure issue, not an issue with the format of anything in the database, and hopefully a one-off. Ignoring these at the cost of a general bad-wipe in the future due to something intermittent seems like a high risk that we can avoid at little to no cost by fixing these by hand instead.

I commented on the patch. The idea is not to just ignore the errors, just to use a more specific type of exception.

Or, to do it scripted, perhaps by adding a mode to findBadBlobs that takes explicit content_id numbers e.g. as comma-delimited input and then mark them with a given reason/error.

findBadBlobs already has a mode that takes a list of revision IDs, from a parameter or stdin. Even in scanning mode, it starts at a given timestamp and then looks at a limited number of revisions (1000 per default), and marks the inaccessible ones in that batch.

findBadBlobs already has a mode that takes a list of revision IDs, […], and marks the inaccessible ones in that batch.

Awesome, I didn't know that :)

I tried to use it just now for e.g. testwiki rev ID 27195. But it doesn't work right now because it requires a BlobAccessException/ExternalStoreException and for the one we are facing now, it fatals due to LBFactory InvalidArgumentException

I was thinking for the case where you give it rev IDs explicitly, perhaps it could blanket catch all possible Exception. This would make it suitable even for cases that we don't want to be treated as ExternalStoreException in production run-time. E.g. if an ES cluster is accidentally removed, or if we decide there are two blobs on test wiki using a certain serialized class that we no longer want to support that causes some other kind of strange fatal for an undefined PHP class that we removed or something like that, we'd have a way to mark those. Thoughts?

I tried to use it just now for e.g. testwiki rev ID 27195. But it doesn't work right now because it requires a BlobAccessException/ExternalStoreException and for the one we are facing now, it fatals due to LBFactory InvalidArgumentException

Indeed, that was the reason for my patch.

I was thinking for the case where you give it rev IDs explicitly, perhaps it could blanket catch all possible Exception.
This would make it suitable even for cases that we don't want to be treated as ExternalStoreException in production run-time. E.g. if an ES cluster is accidentally removed, or if we decide there are two blobs on test wiki using a certain serialized class that we no longer want to support that causes some other kind of strange fatal for an undefined PHP class that we removed or something like that, we'd have a way to mark those. Thoughts?

Blanket catching all exceptions was my original proposal, but you objected because it was too broad. I agree that it would be too broad if we ran findBadBlobs against all revisions, but that's not what it is designed for. It was designed for one-off cleanup for a known problem that occurred for a known period of time.

I think catching different sets of exceptions depending on whether a list of IDs is supplied explicitly would be confusing. I'd rather remove the "scanning" mode entirely, leaving it to the admin to manually find the bad entries. That's of course less convenient.

I was thinking for the case where you give it rev IDs explicitly, perhaps it could blanket catch all possible Exception. […]

Blanket catching all exceptions was my original proposal, but you objected because it was too broad. I agree that it would be too broad if we ran findBadBlobs against all revisions, but that's not what it is designed for. It was designed for one-off cleanup for a known problem that occurred for a known period of time.

I think catching different sets of exceptions depending on whether a list of IDs is supplied explicitly would be confusing. I'd rather remove the "scanning" mode entirely, leaving it to the admin to manually find the bad entries. That's of course less convenient.

I think that so long as the scanning and marking modes can be combined in any way, any form of automatic catching of errors that are indistinguishable from configuration problems or intermittent infra issues is problematic.

After having reviewed things and picked the ones with a specific error, that seems fine to do, given you know what their problem was and a human decided the error is okay to bake in and accept for that blob. How we do that exactly I don't mind. I suppose a really nice ideal way could be that the script is interactive: it tells you the exceptions it finds, you accept or deny each, similar ones are then accepted too, and once it has found them all, it tells you how many it found and accepted and then proceeds to mark the ones you accepted.

A similar way is what we have now, where you pass the rev IDs into a second invocation, except that it currently doesn't work because that mode shares the same safer catching. That is why I suggested differentiating the two.

I think the solution is to have one mode that lists IDs and problems, and another to mark IDs. The latter should still only mark if there are indeed problems with the given IDs, and warn if there are none.
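A rough Python sketch of that split, with one function that only reports and one that marks but still verifies each ID first. All names here (`scan`, `mark`, `make_loader`, the `bad` dictionary) are hypothetical, the loader is a toy, and the assumption that a marked blob afterwards loads as empty content is built into the toy loader.

```python
from typing import Callable, Dict, Iterable, List, Set


def scan(rev_ids: Iterable[int], load: Callable[[int], str]) -> Dict[int, str]:
    """List mode: report which revisions fail to load, change nothing."""
    problems: Dict[int, str] = {}
    for rev_id in rev_ids:
        try:
            load(rev_id)
        except Exception as exc:
            problems[rev_id] = f"{type(exc).__name__}: {exc}"
    return problems


def mark(rev_ids: Iterable[int], load: Callable[[int], str],
         reason: str, bad: Dict[int, str]) -> List[str]:
    """Mark mode: mark only IDs that really fail, warn about the rest."""
    rev_ids = list(rev_ids)
    problems = scan(rev_ids, load)
    messages = []
    for rev_id in rev_ids:
        if rev_id in problems:
            bad[rev_id] = reason
            messages.append(f"marked {rev_id}: {problems[rev_id]}")
        else:
            messages.append(f"warning: no bad blob found on revision {rev_id}, skipped")
    return messages


def make_loader(broken: Set[int], bad: Dict[int, str]) -> Callable[[int], str]:
    """Toy loader: broken revisions raise until they have been marked."""
    def load(rev_id: int) -> str:
        if rev_id in bad:
            return ""  # assumption: marked blobs load as empty content
        if rev_id in broken:
            raise ValueError("Unknown cluster 'cluster14'")
        return "text"
    return load


if __name__ == "__main__":
    bad: Dict[int, str] = {}
    load = make_loader({33086}, bad)
    print(scan([33086, 33087], load))
    print(mark([33086], load, "T251778", bad))
    print(mark([33086], load, "T251778", bad))
```

Running mark a second time on the same ID produces only the warning, because the revision no longer fails to load once it has been marked.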

Change 607002 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] findBadBlobs: better separate scan and mark modes.

https://gerrit.wikimedia.org/r/607002

Change 597283 abandoned by Daniel Kinzler:
ExternalStoreDB: wrap database exceptions

Reason:
I47c11190b665c1dac88db32ee2bf683728cb3dc6

https://gerrit.wikimedia.org/r/597283

Change 608667 had a related patch set uploaded (by Krinkle; owner: Daniel Kinzler):
[mediawiki/core@wmf/1.35.0-wmf.39] findBadBlobs: better separate scan and mark modes.

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/608667

Change 607002 merged by jenkins-bot:
[mediawiki/core@master] findBadBlobs: better separate scan and mark modes.

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/607002

Change 608667 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.39] findBadBlobs: better separate scan and mark modes.

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/608667

I marked one rev as example/try out on testwiki:

krinkle@maint1002
$ mwscript findBadBlobs.php --wiki testwiki --scan-from '20080201140335' --limit 10
Scanning revisions table, 1000 rows starting at rev_timestamp 20080201140335
        ! Found bad blob on revision 33086 (main slot): content_id=13830, address=<tt:32622>, error='Unknown cluster 'cluster14'', type='InvalidArgumentException'. ID: 33086


$ mwscript findBadBlobs.php --wiki testwiki --mark 'T251778' 
The --mark must be used together with --revisions


$ mwscript findBadBlobs.php --wiki testwiki --mark 'T251778' --revisions 33086
Scanning 1 ids
        ! Found bad blob on revision 33086 (main slot): content_id=13830, address=<tt:32622>, error='Unknown cluster 'cluster14'', type='InvalidArgumentException'. ID: 33086
        Changed address to <bad:tt%3A32622?reason=T251778&error=Unknown+cluster+%27cluster14%27>
        - Scanned a batch of 1 revisions
Marked 1 bad revisions.

$ mwscript findBadBlobs.php --wiki testwiki --mark 'T251778' --revisions 33086
Scanning 1 ids
        # No bad blob found on revision 33086, skipped!
        - Scanned a batch of 1 revisions
Marked 0 bad revisions.

$  mwscript findBadBlobs.php --wiki testwiki --scan-from '20080201140335' --limit 10
Scanning revisions table, 10 rows starting at rev_timestamp 20080201140335
        - Scanned a batch of 10 revisions, up to revision 410431 (20080202013246)
Scanning archive table by ar_rev_id, 33085 to 410432
        - Scanned a batch of 10 archived revisions, up to revision 33097 (20080202013246)
The range of archive rows scanned is based on the range of revision IDs scanned in the revision table.
Found 0 bad revisions.

Live link:
https://test.wikipedia.org/w/index.php?title=Talk:Main_Page/Archive_3&oldid=33086

This shows an empty page now with no error message (no server error, but no user-facing/200-OK'ed error either). Is that intentional? Anyway, LGTM.

This shows an empty page now with no error message (no server error, but no user-facing/200-OK'ed error either). Is that intentional? Anyway, LGTM.

Yes. It would be trivial for BlobStore to return a string containing an error message instead of an empty string. But that error message would be treated as actual content by the rest of the system, including in diffs, as if the user had typed it out. That seemed odd.

We could think about having SlotRecord know about bad blobs. That way, higher level code could provide special case handling when desired. If you think we should work in that direction, please file a ticket.

@daniel Yeah, the content would need to remain empty, I agree. It would be for the display/rendering layer only. I won't file a ticket for it now though. I think that would by default be too low priority to get worked on, and even if someone volunteers it would add significant complexity to core. I think we're good to go as-is, and if these become prominent in some cases after a future incident (hopefully never) we can revisit it then.

Left for this ticket is to mark the remaining blobs relating to this issue.

We have maintenance/findBadBlobs.php now. We can just run that to mark these revisions as known bad.

daniel raised the priority of this task from Lowest to Medium. Sep 1 2020, 8:50 PM

Mentioned in SAL (#wikimedia-operations) [2020-09-02T11:52:03Z] <duesen__> daniel@mwmaint2001:/srv/mediawiki/php-1.36.0-wmf.6$ mwscript findBadBlobs.php testwiki --mark T251778