Page MenuHomePhabricator

Unable to view some pages due to fatal RevisionAccessException: "Failed to load data blob from tt"
Closed, ResolvedPublicPRODUCTION ERROR

Description

Examples at which this is consistently reproducible:

  • https://os.wikipedia.org/wiki/1888
  • https://de.wikipedia.org/w/index.php?oldid=5780804

Error

Request ID: W7KoHArAIEAAALCQVJQAAABU

message
Failed to load data blob from tt:106822: Failed to load blob from address tt:106822
exception
MediaWiki\Storage\RevisionAccessException:
 Failed to load data blob from tt:106822

#0 /srv/mediawiki/php-1.32.0-wmf.23/includes/Storage/RevisionStore.php(1350): MediaWiki\Storage\RevisionStore->loadSlotContent(MediaWiki\Storage\SlotRecord, NULL, NULL, NULL, integer)
#1 [internal function]: Closure$MediaWiki\Storage\RevisionStore::emulateMainSlot_1_29#2(MediaWiki\Storage\SlotRecord)
#2 /srv/mediawiki/php-1.32.0-wmf.23/includes/Storage/SlotRecord.php(306): call_user_func(Closure$MediaWiki\Storage\RevisionStore::emulateMainSlot_1_29#2;519, MediaWiki\Storage\SlotRecord)
#3 /srv/mediawiki/php-1.32.0-wmf.23/includes/Storage/RevisionRecord.php(174): MediaWiki\Storage\SlotRecord->getContent()
#4 /srv/mediawiki/php-1.32.0-wmf.23/includes/Revision/RenderedRevision.php(205): MediaWiki\Storage\RevisionRecord->getContent(string, integer, NULL)
#5 /srv/mediawiki/php-1.32.0-wmf.23/includes/Revision/RevisionRenderer.php(169): MediaWiki\Revision\RenderedRevision->getSlotParserOutput(string)
#6 /srv/mediawiki/php-1.32.0-wmf.23/includes/Revision/RevisionRenderer.php(122): MediaWiki\Revision\RevisionRenderer->combineSlotOutput(MediaWiki\Revision\RenderedRevision, array)
#7 [internal function]: Closure$MediaWiki\Revision\RevisionRenderer::getRenderedRevision#2(MediaWiki\Revision\RenderedRevision, array)
#8 /srv/mediawiki/php-1.32.0-wmf.23/includes/Revision/RenderedRevision.php(176): call_user_func(Closure$MediaWiki\Revision\RevisionRenderer::getRenderedRevision#2;723, MediaWiki\Revision\RenderedRevision, array)
#9 /srv/mediawiki/php-1.32.0-wmf.23/includes/poolcounter/PoolWorkArticleView.php(188): MediaWiki\Revision\RenderedRevision->getRevisionParserOutput()
#10 /srv/mediawiki/php-1.32.0-wmf.23/includes/poolcounter/PoolCounterWork.php(123): PoolWorkArticleView->doWork()
#11 /srv/mediawiki/php-1.32.0-wmf.23/includes/page/Article.php(771): PoolCounterWork->execute()
#12 /srv/mediawiki/php-1.32.0-wmf.23/includes/actions/ViewAction.php(68): Article->view()
#13 /srv/mediawiki/php-1.32.0-wmf.23/includes/MediaWiki.php(501): ViewAction->show()
#14 /srv/mediawiki/php-1.32.0-wmf.23/includes/MediaWiki.php(294): MediaWiki->performAction(Article, Title)
#15 /srv/mediawiki/php-1.32.0-wmf.23/includes/MediaWiki.php(868): MediaWiki->performRequest()
#16 /srv/mediawiki/php-1.32.0-wmf.23/includes/MediaWiki.php(525): MediaWiki->main()
#17 /srv/mediawiki/php-1.32.0-wmf.23/index.php(42): MediaWiki->run()
previous.exception
MediaWiki\Storage\BlobAccessException: 
 Failed to load blob from address tt:106822

#0 /srv/mediawiki/php-1.32.0-wmf.23/includes/Storage/RevisionStore.php(1433): MediaWiki\Storage\SqlBlobStore->getBlob(string, integer)
#1 /srv/mediawiki/php-1.32.0-wmf.23/includes/Storage/RevisionStore.php(1350): MediaWiki\Storage\RevisionStore->loadSlotContent(MediaWiki\Storage\SlotRecord, NULL, NULL, NULL, integer)
#2 [internal function]: Closure$MediaWiki\Storage\RevisionStore::emulateMainSlot_1_29#2(MediaWiki\Storage\SlotRecord)
#3 /srv/mediawiki/php-1.32.0-wmf.23/includes/Storage/SlotRecord.php(306): call_user_func(Closure$MediaWiki\Storage\RevisionStore::emulateMainSlot_1_29#2;519, MediaWiki\Storage\SlotRecord)
#4 /srv/mediawiki/php-1.32.0-wmf.23/includes/Storage/RevisionRecord.php(174): MediaWiki\Storage\SlotRecord->getContent()
#5 /srv/mediawiki/php-1.32.0-wmf.23/includes/Revision/RenderedRevision.php(205): MediaWiki\Storage\RevisionRecord->getContent(string, integer, NULL)
#6 /srv/mediawiki/php-1.32.0-wmf.23/includes/Revision/RevisionRenderer.php(169): MediaWiki\Revision\RenderedRevision->getSlotParserOutput(string)
#7 /srv/mediawiki/php-1.32.0-wmf.23/includes/Revision/RevisionRenderer.php(122): MediaWiki\Revision\RevisionRenderer->combineSlotOutput(MediaWiki\Revision\RenderedRevision, array)
#8 [internal function]: Closure$MediaWiki\Revision\RevisionRenderer::getRenderedRevision#2(MediaWiki\Revision\RenderedRevision, array)
#9 /srv/mediawiki/php-1.32.0-wmf.23/includes/Revision/RenderedRevision.php(176): call_user_func(Closure$MediaWiki\Revision\RevisionRenderer::getRenderedRevision#2;723, MediaWiki\Revision\RenderedRevision, array)
#10 /srv/mediawiki/php-1.32.0-wmf.23/includes/poolcounter/PoolWorkArticleView.php(188): MediaWiki\Revision\RenderedRevision->getRevisionParserOutput()
#11 /srv/mediawiki/php-1.32.0-wmf.23/includes/poolcounter/PoolCounterWork.php(123): PoolWorkArticleView->doWork()
#12 /srv/mediawiki/php-1.32.0-wmf.23/includes/page/Article.php(771): PoolCounterWork->execute()
#13 /srv/mediawiki/php-1.32.0-wmf.23/includes/actions/ViewAction.php(68): Article->view()
#14 /srv/mediawiki/php-1.32.0-wmf.23/includes/MediaWiki.php(501): ViewAction->show()
#15 /srv/mediawiki/php-1.32.0-wmf.23/includes/MediaWiki.php(294): MediaWiki->performAction(Article, Title)
#16 /srv/mediawiki/php-1.32.0-wmf.23/includes/MediaWiki.php(868): MediaWiki->performRequest()
#17 /srv/mediawiki/php-1.32.0-wmf.23/includes/MediaWiki.php(525): MediaWiki->main()
#18 /srv/mediawiki/php-1.32.0-wmf.23/index.php(42): MediaWiki->run()

Notes

The fatal RevisionAccessException exception happens during the building of a page view in MediaWiki for both unregistered users and logged-in users. Triaging as at least High priority given it exposes a fatal error on a public GET url, which is a risk for false alarms and increased load.

The error seems to be nested behind a related problem that is handled, but then re-throws. The above previous.exception happens first, and then re-throws as RevisionAccessException, which is is currently unhandled by any UI layer and results in the top-level error handling reporting is as an application problem.

Seems to be a regression that started with 1.32.0-wmf.19. (I didn't find matches among messages from previous versions). Although it is much more frequent on wmf.22 than on wmf.20 and wmf.19.

Just speculating as to why it is growing, but:

  • It might mean that the same pages are receiving more traffic than before, or;
  • It could be that the same problem is slowly affecting a growing number of pages, or;
  • It could be that another bug was introduced last week that made it affect more pages.

Consistently reproducible at https://os.wikipedia.org/wiki/1888.

See also:

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Anomie added a subscriber: Anomie.Oct 4 2018, 4:26 PM

Regarding the specific example,

wikiadmin@10.64.16.191(oswiki)> select * from text where old_id=106822;
+--------+---------------+-----------+------------------+-------------+----------+---------------+---------------+----------------+---------------------+-------------------+
| old_id | old_namespace | old_title | old_text         | old_comment | old_user | old_user_text | old_timestamp | old_minor_edit | old_flags           | inverse_timestamp |
+--------+---------------+-----------+------------------+-------------+----------+---------------+---------------+----------------+---------------------+-------------------+
| 106822 |               |           | DB://cluster20/0 |             |          |               |               |                | utf-8,gzip,external |                   |
+--------+---------------+-----------+------------------+-------------+----------+---------------+---------------+----------------+---------------------+-------------------+

That "DB://cluster20/0" is definitely not right, the number after the slash is supposed to be a blob_id in the ExternalStore blobs table and should always be positive.

wikiadmin@10.64.32.185(oswiki)> select * from blobs_cluster20 where blob_id=0;
Empty set (0.00 sec)

I spot checked several other instances of the error, on multiple wikis, and interestingly they all have the same old_text and all are on revisions dated 2009-03-09, around 21:00-23:00. https://wikitech.wikimedia.org/wiki/Server_admin_log/Archive_13#March%209 mentions some issues with ExternalStore at about that time (and in particular cluster20 is mentioned on March 10 at 00:48).

I suspect it's generally another issue where MCR code throws/logs exceptions on database inconsistencies where pre-MCR code just displayed an error page to the user without logging anything.

CCicalese_WMF lowered the priority of this task from High to Medium.Nov 26 2018, 4:21 PM
CCicalese_WMF added a subscriber: CCicalese_WMF.

Will need to add better user level error handling. The database corruption should also be addressed.

Krinkle updated the task description. (Show Details)

https://de.wikipedia.org/w/index.php?oldid=5780804 reported last year:

Interner Fehler [W9n5NApAMEoAAJqHzzYAAABM] 2018-10-31 18:49:24: Fataler Ausnahmefehler des Typs „MediaWiki\Revision\RevisionAccessException“

Caught on BETA, but might help here.

  • I understood that External Storage is used to store text blobs of productive WMF wikis.
  • Perhaps some entries in mapping tables are corrupt.
  • The following error report would result from a problem in the other direction.
[XTHCuKwQBHcAADFkv-IAAAAK] /w/index.php?title=Wikipedia:Spielwiese&action=submit MediaWiki\Storage\BlobAccessException from line 261 of /srv/mediawiki/php-master/includes/Storage/SqlBlobStore.php: Unable to store text to external storage

Backtrace:

  • 0 /srv/mediawiki/php-master/includes/Revision/RevisionStore.php(950): MediaWiki\Storage\SqlBlobStore->storeBlob(string, array)
  • 1 /srv/mediawiki/php-master/includes/Revision/RevisionStore.php(686): MediaWiki\Revision\RevisionStore->storeContentBlob(MediaWiki\Revision\SlotRecord, Title, array)
  • 2 /srv/mediawiki/php-master/includes/Revision/RevisionStore.php(621): MediaWiki\Revision\RevisionStore->insertSlotOn(Wikimedia\Rdbms\DatabaseMysqli, integer, MediaWiki\Revision\SlotRecord, Title, array)
  • 3 /srv/mediawiki/php-master/includes/Revision/RevisionStore.php(519): MediaWiki\Revision\RevisionStore->insertRevisionInternal(MediaWiki\Revision\MutableRevisionRecord, Wikimedia\Rdbms\DatabaseMysqli, User, CommentStoreComment, Title, integer, integer)
  • 4 /srv/mediawiki/php-master/includes/libs/rdbms/database/Database.php(3952): Closure$MediaWiki\Revision\RevisionStore::insertRevisionOn(Wikimedia\Rdbms\DatabaseMysqli, string)
  • 5 /srv/mediawiki/php-master/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->doAtomicSection(string, Closure$MediaWiki\Revision\RevisionStore::insertRevisionOn;3293)
  • 6 /srv/mediawiki/php-master/includes/libs/rdbms/database/DBConnRef.php(645): Wikimedia\Rdbms\DBConnRef->__call(string, array)
  • 7 /srv/mediawiki/php-master/includes/Revision/RevisionStore.php(521): Wikimedia\Rdbms\DBConnRef->doAtomicSection(string, Closure$MediaWiki\Revision\RevisionStore::insertRevisionOn;3293)
  • 8 /srv/mediawiki/php-master/includes/Storage/PageUpdater.php(985): MediaWiki\Revision\RevisionStore->insertRevisionOn(MediaWiki\Revision\MutableRevisionRecord, Wikimedia\Rdbms\DBConnRef)
  • 9 /srv/mediawiki/php-master/includes/Storage/PageUpdater.php(756): MediaWiki\Storage\PageUpdater->doModify(CommentStoreComment, User, integer)
  • 10 /srv/mediawiki/php-master/includes/page/WikiPage.php(1921): MediaWiki\Storage\PageUpdater->saveRevision(CommentStoreComment, integer)
  • 11 /srv/mediawiki/php-master/includes/EditPage.php(2316): WikiPage->doEditContent(WikitextContent, CommentStoreComment, integer, boolean, User, string, array, integer)
  • 12 /srv/mediawiki/php-master/includes/EditPage.php(1596): EditPage->internalAttemptSave(array, boolean)
  • 13 /srv/mediawiki/php-master/includes/EditPage.php(681): EditPage->attemptSave(array)
  • 14 /srv/mediawiki/php-master/includes/actions/EditAction.php(55): EditPage->edit()
  • 15 /srv/mediawiki/php-master/includes/actions/SubmitAction.php(38): EditAction->show()
  • 16 /srv/mediawiki/php-master/includes/MediaWiki.php(499): SubmitAction->show()
  • 17 /srv/mediawiki/php-master/includes/MediaWiki.php(294): MediaWiki->performAction(Article, Title)
  • 18 /srv/mediawiki/php-master/includes/MediaWiki.php(884): MediaWiki->performRequest()
  • 19 /srv/mediawiki/php-master/includes/MediaWiki.php(515): MediaWiki->main()
  • 20 /srv/mediawiki/php-master/index.php(42): MediaWiki->run()
  • 21 /srv/mediawiki/w/index.php(3): include(string)
  • 22 {main}
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:08 PM
daniel added a subscriber: daniel.Dec 13 2019, 9:57 PM

I made a patch for T240686: Allow content blobs to be marked as broken in the content table. Once that is merged, we can set content_address for the affected content to something like bad:T205936?orig=DB://cluster20/0 or some such. The revisions would behave as if they were empty, nothing would be logged in production, and we'd still have a clear marker in the database if we want to do cleanup later.

Caught on BETA, but might help here.

  • I understood that External Storage is used to store text blobs of productive WMF wikis.
  • Perhaps some entries in mapping tables are corrupt.
  • The following error report would result from a problem in the other direction.

Hm... I didn't even know we had external storage set up on beta. Do we?

WDoranWMF raised the priority of this task from Medium to High.Mar 13 2020, 3:59 PM
AMooney assigned this task to daniel.Mar 30 2020, 3:04 PM
AMooney moved this task from Later to Ready on the Platform Team Workboards (Clinic Duty Team) board.

Change 584698 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Add markBadBlobs script.

https://gerrit.wikimedia.org/r/584698

Change 584918 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Allow specific revision IDs to be passed to markBadBlobs.php

https://gerrit.wikimedia.org/r/584918

Thanks. BTW, you might check the heading comment of rMW/ maintenance/markBadBlobs.php – it looks like copied from rMW/maintenance/cleanupImages.php and is bearing wrong purpose, several brion, and 2005-2006.

Change 584698 merged by jenkins-bot:
[mediawiki/core@master] Add findBadBlobs script.

https://gerrit.wikimedia.org/r/584698

Change 584918 merged by jenkins-bot:
[mediawiki/core@master] Allow specific revision IDs to be passed to markBadBlobs.php

https://gerrit.wikimedia.org/r/584918

Pchelolo added a subscriber: Pchelolo.

The reviews here has been done, so moving this to waiting for deployment for the maintenance scripts to be deployed. After that it can be moved back to ready, since we'd need to run the scripts I assume

FYI I observed this error while dry-running this script in translatewiki.net:

[6cfa253cff0d86624825b84d] [no req]   TypeError from line 1205 of /srv/mediawiki/workdir/includes/Revision/RevisionStore.php: Argument 4 passed to MediaWiki\Revision\RevisionStore::constructSlotRecords() must be an instance of Title, null given, called in /srv/mediawiki/workdir/includes/Revision/RevisionStore.php on line 1687
Backtrace:
#0 /srv/mediawiki/workdir/includes/Revision/RevisionStore.php(1687): MediaWiki\Revision\RevisionStore->constructSlotRecords(string, array, integer, NULL)
#1 [internal function]: MediaWiki\Revision\RevisionStore->MediaWiki\Revision\{closure}(stdClass)
#2 /srv/mediawiki/workdir/includes/Revision/RevisionStore.php(1668): array_map(Closure, array)
#3 /srv/mediawiki/workdir/maintenance/findBadBlobs.php(372): MediaWiki\Revision\RevisionStore->newRevisionsFromBatch(Wikimedia\Rdbms\ResultWrapper, array)
#4 /srv/mediawiki/workdir/maintenance/findBadBlobs.php(334): FindBadBlobs->loadRevisionsById(array)
#5 /srv/mediawiki/workdir/maintenance/findBadBlobs.php(149): FindBadBlobs->scanRevisionsById(array)
#6 /srv/mediawiki/workdir/maintenance/doMaintenance.php(113): FindBadBlobs->execute()
#7 /srv/mediawiki/workdir/maintenance/findBadBlobs.php(503): require_once(string)
#8 {main}

FYI I observed this error while dry-running this script in translatewiki.net:

That would imply that some rev_page refers to a page ID that doesn't exist in the page table. Of ar_title contains a string that is not a valid page title. We'd need to catch this error earlier to provide more information about the bad row...

Change 593041 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] RevisionStore: improve error handling in newRevisionsFromBatch

https://gerrit.wikimedia.org/r/593041

@Nikerabbit can you try the above patch?

Doesn't fail now, but formatting is slightly odd (PS3):

seq 1 100000 | php maintenance/findBadBlobs.php --revisions=-                                         
Scanning 100001 ids
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 996 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 996 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        ! Found bad blob on revision 31074 (main slot): content_id=88136, address=<tt:30393>, error='Unable to fetch blob at tt:30393'
        ! Found bad blob on revision 31075 (main slot): content_id=88137, address=<tt:30394>, error='Unable to fetch blob at tt:30394'
        ! Found bad blob on revision 31076 (main slot): content_id=88138, address=<tt:30395>, error='Unable to fetch blob at tt:30395'
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 997 revisions
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 999 revisions
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
* Internal error: Couldn&#39;t find title for rev 74860 (page key 46868)
* Internal error: Couldn&#39;t find title for rev 74913 (page key 46921)
        - Scanned a batch of 998 revisions
Internal error: Couldn&#39;t find title for rev 75411 (page key 47381)
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
Internal error: Couldn&#39;t find title for rev 82963 (page key 318654)
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
Internal error: Couldn&#39;t find title for rev 85766 (page key 55099)
        - Scanned a batch of 999 revisions
Internal error: Couldn&#39;t find title for rev 86068 (page key 55356)
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 999 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
        - Scanned a batch of 1000 revisions
Internal error: Couldn&#39;t find title for rev 94726 (page key 62315)
        - Scanned a batch of 999 revisions
^C

I indeed could not find anything from page or archive table for the first one I checked. Seems legitimate corruption dating back to 2007.

Also the --from-date option queries are so slow as to be unusable:

EXPLAIN SELECT /* FindBadBlobs::loadRevisionsByTimestamp nike@web2.trans... */ rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,actor_rev_user.actor_user AS `rev_user`,actor_rev_user.actor_name AS `rev_user_text`,temp_rev_user.revactor_actor AS `rev_actor` FROM `bw_revision` JOIN `bw_revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `bw_comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN `bw_revision_actor_temp` `temp_rev_user` ON ((temp_rev_user.revactor_rev = rev_id)) JOIN `bw_actor` `actor_rev_user` ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor)) WHERE rev_timestamp > '20150101000000' OR (rev_timestamp = '20150101000000' AND rev_id > 0 ) ORDER BY rev_timestamp, rev_id LIMIT 1000:

*** row 1 ***
          table:  actor_rev_user
           type:  ALL
  possible_keys:  PRIMARY
            key:  NULL
        key_len:  NULL
            ref:  NULL
           rows:  21289
          Extra:  Using temporary; Using filesort
*** row 2 ***
          table:  temp_rev_user
           type:  ref
  possible_keys:  PRIMARY,revactor_rev,actor_timestamp
            key:  actor_timestamp
        key_len:  8
            ref:  translatewiki_net.actor_rev_user.actor_id
           rows:  101
          Extra:  Using index
*** row 3 ***
          table:  bw_revision
           type:  eq_ref
  possible_keys:  rev_id,rev_timestamp
            key:  rev_id
        key_len:  4
            ref:  translatewiki_net.temp_rev_user.revactor_rev
           rows:  1
          Extra:  Using where
*** row 4 ***
          table:  temp_rev_comment
           type:  ref
  possible_keys:  PRIMARY,revcomment_rev
            key:  PRIMARY
        key_len:  4
            ref:  translatewiki_net.temp_rev_user.revactor_rev
           rows:  1
          Extra:  Using index
*** row 5 ***
          table:  comment_rev_comment
           type:  eq_ref
  possible_keys:  PRIMARY
            key:  PRIMARY
        key_len:  8
            ref:  translatewiki_net.temp_rev_comment.revcomment_comment_id
           rows:  1
          Extra:  NULL
daniel added a comment.EditedMay 1 2020, 7:39 PM

Using temporary; Using filesort

Huh, that's a problem. That shouldn't happen.
It says "possible_keys: PRIMARY", so why doesn't it use the primary key?

Change 593961 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] findBadBLobs: Fore rev_timestamp index

https://gerrit.wikimedia.org/r/593961

daniel added a comment.May 3 2020, 7:38 PM

Also the --from-date option queries are so slow as to be unusable:

@Nikerabbit please try https://gerrit.wikimedia.org/r/c/mediawiki/core/+/593961

Change 593041 merged by jenkins-bot:
[mediawiki/core@master] RevisionStore: improve error handling in newRevisionsFromBatch

https://gerrit.wikimedia.org/r/593041

Change 593961 merged by jenkins-bot:
[mediawiki/core@master] findBadBlobs: Force rev_timestamp index

https://gerrit.wikimedia.org/r/593961

Ok, all patches merged. Moving this back to "ready" on the clinic duty board, since we now need to run this script in production. We will first need to identify time periods that produced bad content rows on different wikis. Searching for "Failed to load data blob" in phab and in the logs should give us a good start.

brennen moved this task from Backlog to Logs/Train on the User-brennen board.
daniel removed daniel as the assignee of this task.Jul 15 2020, 4:31 PM
daniel added a comment.EditedAug 26 2020, 11:47 AM

For oswiki, I found twelve bad revisions around 2009-03-09T20:00:00Z:

daniel@mwdebug1001:/srv/mediawiki/php-1.36.0-wmf.5$ mwscript maintenance/findBadBlobs.php oswiki --scan-from 2009-03-09T20:00:00Z --limit 1000
Scanning revisions table, 1000 rows starting at rev_timestamp 20090309200000
        ! Found bad blob on revision 107224 (main slot): content_id=101850, address=<tt:106781>, error='Bad data in text row 106781.', type='MediaWiki\Storage\BlobAccessException'. ID: 107224
        ! Found bad blob on revision 107227 (main slot): content_id=101853, address=<tt:106783>, error='Bad data in text row 106783.', type='MediaWiki\Storage\BlobAccessException'. ID: 107227
        ! Found bad blob on revision 107240 (main slot): content_id=101865, address=<tt:106790>, error='Bad data in text row 106790.', type='MediaWiki\Storage\BlobAccessException'. ID: 107240
        ! Found bad blob on revision 107263 (main slot): content_id=101884, address=<tt:106802>, error='Bad data in text row 106802.', type='MediaWiki\Storage\BlobAccessException'. ID: 107263
        ! Found bad blob on revision 107265 (main slot): content_id=101886, address=<tt:106803>, error='Bad data in text row 106803.', type='MediaWiki\Storage\BlobAccessException'. ID: 107265
        ! Found bad blob on revision 107271 (main slot): content_id=101889, address=<tt:106806>, error='Bad data in text row 106806.', type='MediaWiki\Storage\BlobAccessException'. ID: 107271
        ! Found bad blob on revision 107277 (main slot): content_id=101893, address=<tt:106809>, error='Bad data in text row 106809.', type='MediaWiki\Storage\BlobAccessException'. ID: 107277
        ! Found bad blob on revision 107284 (main slot): content_id=101898, address=<tt:106813>, error='Bad data in text row 106813.', type='MediaWiki\Storage\BlobAccessException'. ID: 107284
        ! Found bad blob on revision 107301 (main slot): content_id=101909, address=<tt:106822>, error='Bad data in text row 106822.', type='MediaWiki\Storage\BlobAccessException'. ID: 107301
        ! Found bad blob on revision 107305 (main slot): content_id=101911, address=<tt:106824>, error='Bad data in text row 106824.', type='MediaWiki\Storage\BlobAccessException'. ID: 107305
        ! Found bad blob on revision 107318 (main slot): content_id=101918, address=<tt:106831>, error='Bad data in text row 106831.', type='MediaWiki\Storage\BlobAccessException'. ID: 107318
        - Scanned a batch of 1000 revisions, up to revision 108245 (20090310195313)
Scanning archive table by ar_rev_id, 107200 to 108246
        ! Found bad blob on revision 107325 (main slot): content_id=458798, address=<tt:106835>, error='Bad data in text row 106835.', type='MediaWiki\Storage\BlobAccessException'. ID: 107325
        - Scanned a batch of 44 archived revisions, up to revision 107745 (20090310195313)
The range of archive rows scanned is based on the range of revision IDs scanned in the revision table.
Found 12 bad revisions.

Per Anomie's analysis, this is from a misconfiguration around that time. They all have the same broken external store addess: DB://cluster20/0. This points to a cluster that doesn't exist, with blob ID 0.

For dewiki, I'm finding 21 bad revisions in a range of about ten hours on 2005-05-12:

Scanning revisions table, 10000 rows starting at rev_timestamp 20050512070000
	! Found bad blob on revision 5780804 (main slot): content_id=5395022, address=<tt:5780804>, error='Bad data in text row 5780804.', type='MediaWiki\Storage\BlobAccessException'. ID:	5780804
	! Found bad blob on revision 5766848 (main slot): content_id=5382566, address=<tt:5766848>, error='Bad data in text row 5766848.', type='MediaWiki\Storage\BlobAccessException'. ID:	5766848
	! Found bad blob on revision 5765691 (main slot): content_id=5381580, address=<tt:5765691>, error='Bad data in text row 5765691.', type='MediaWiki\Storage\BlobAccessException'. ID:	5765691
	! Found bad blob on revision 5765768 (main slot): content_id=5381652, address=<tt:5765768>, error='Bad data in text row 5765768.', type='MediaWiki\Storage\BlobAccessException'. ID:	5765768
	- Scanned a batch of 1000 revisions, up to revision 182337302 (20050512080456)
	! Found bad blob on revision 5765796 (main slot): content_id=5381676, address=<tt:5765796>, error='Bad data in text row 5765796.', type='MediaWiki\Storage\BlobAccessException'. ID:	5765796
	! Found bad blob on revision 5767268 (main slot): content_id=5382923, address=<tt:5767268>, error='Bad data in text row 5767268.', type='MediaWiki\Storage\BlobAccessException'. ID:	5767268
	- Scanned a batch of 999 revisions, up to revision 182337302 (20050512090622)
	! Found bad blob on revision 5767280 (main slot): content_id=5382934, address=<tt:5767280>, error='Bad data in text row 5767280.', type='MediaWiki\Storage\BlobAccessException'. ID:	5767280
	! Found bad blob on revision 5769279 (main slot): content_id=5384710, address=<tt:5769279>, error='Bad data in text row 5769279.', type='MediaWiki\Storage\BlobAccessException'. ID:	5769279
	- Scanned a batch of 1000 revisions, up to revision 196159736 (20050512100014)
	! Found bad blob on revision 5768160 (main slot): content_id=5383714, address=<tt:5768160>, error='Bad data in text row 5768160.', type='MediaWiki\Storage\BlobAccessException'. ID:	5768160
	! Found bad blob on revision 5768696 (main slot): content_id=5384172, address=<tt:5768696>, error='Bad data in text row 5768696.', type='MediaWiki\Storage\BlobAccessException'. ID:	5768696
	- Scanned a batch of 1000 revisions, up to revision 196159736 (20050512110330)
	- Scanned a batch of 1000 revisions, up to revision 196303412 (20050512121015)
	! Found bad blob on revision 5770364 (main slot): content_id=5385685, address=<tt:5770364>, error='Bad data in text row 5770364.', type='MediaWiki\Storage\BlobAccessException'. ID:	5770364
	! Found bad blob on revision 5770219 (main slot): content_id=5385557, address=<tt:5770219>, error='Bad data in text row 5770219.', type='MediaWiki\Storage\BlobAccessException'. ID:	5770219
	! Found bad blob on revision 5770490 (main slot): content_id=5385798, address=<tt:5770490>, error='Bad data in text row 5770490.', type='MediaWiki\Storage\BlobAccessException'. ID:	5770490
	- Scanned a batch of 999 revisions, up to revision 196303412 (20050512131059)
	- Scanned a batch of 999 revisions, up to revision 196303413 (20050512135731)
	- Scanned a batch of 1000 revisions, up to revision 200428340 (20050512145024)
	! Found bad blob on revision 5773152 (main slot): content_id=5388247, address=<tt:5773152>, error='Bad data in text row 5773152.', type='MediaWiki\Storage\BlobAccessException'. ID:	5773152
	- Scanned a batch of 999 revisions, up to revision 200428340 (20050512155526)
	! Found bad blob on revision 5794978 (main slot): content_id=5407846, address=<tt:5794978>, error='Bad data in text row 5794978.', type='MediaWiki\Storage\BlobAccessException'. ID:	5794978
	- Scanned a batch of 1000 revisions, up to revision 200428340 (20050512171007)
	- Scanned a batch of 4 revisions, up to revision 200428340 (20050512171037)
Scanning archive table by ar_rev_id, 5764784 to 200428341
	! Found bad blob on revision 5776174 (main slot): content_id=164648354, address=<tt:5776174>, error='Bad data in text row 5776174.', type='MediaWiki\Storage\BlobAccessException'. ID:	5776174
	- Scanned a batch of 1000 archived revisions, up to revision 5780851 (20050512171037)
	! Found bad blob on revision 5792079 (main slot): content_id=164649452, address=<tt:5792079>, error='Bad data in text row 5792079.', type='MediaWiki\Storage\BlobAccessException'. ID:	5792079
	- Scanned a batch of 1000 archived revisions, up to revision 5798219 (20050512171037)
	! Found bad blob on revision 5800477 (main slot): content_id=164649909, address=<tt:5800477>, error='Bad data in text row 5800477.', type='MediaWiki\Storage\BlobAccessException'. ID:	5800477
	! Found bad blob on revision 5801455 (main slot): content_id=164649945, address=<tt:5801455>, error='Bad data in text row 5801455.', type='MediaWiki\Storage\BlobAccessException'. ID:	5801455
	! Found bad blob on revision 5801456 (main slot): content_id=164649946, address=<tt:5801456>, error='Bad data in text row 5801456.', type='MediaWiki\Storage\BlobAccessException'. ID:	5801456
	! Found bad blob on revision 5801457 (main slot): content_id=164649947, address=<tt:5801457>, error='Bad data in text row 5801457.', type='MediaWiki\Storage\BlobAccessException'. ID:	5801457
	- Scanned a batch of 1000 archived revisions, up to revision 5816785 (20050512171037)
	- Scanned a batch of 1000 archived revisions, up to revision 5835481 (20050512171037)
	- Scanned a batch of 1000 archived revisions, up to revision 5854532 (20050512171037)
	- Scanned a batch of 1000 archived revisions, up to revision 5864938 (20050512171037)
	- Scanned a batch of 1000 archived revisions, up to revision 5880805 (20050512171037)
	- Scanned a batch of 1000 archived revisions, up to revision 5898941 (20050512171037)
	- Scanned a batch of 1000 archived revisions, up to revision 5919773 (20050512171037)
	- Scanned a batch of 1000 archived revisions, up to revision 5938311 (20050512171037)
The range of archive rows scanned is based on the range of revision IDs scanned in the revision table.
Found 21 bad revisions.

The root cause here seems to be different from the problem on oswiki. The entries in the text table look like this:

wikiadmin@10.64.16.11(dewiki)> select * from text where old_id in ( 5780804, 5770364, 5800477, 5801457  );
+---------+------------------------------------------------------------------------------------------------------------+-----------+
| old_id  | old_text                                                                                                   | old_flags |
+---------+------------------------------------------------------------------------------------------------------------+-----------+
| 5770364 | O:15:"historyblobstub":2:{s:6:"mOldId";s:7:"5627981";s:5:"mHash";s:32:"d41d8cd98f00b204e9800998ecf8427e";} | object    |
| 5780804 | O:15:"historyblobstub":2:{s:6:"mOldId";s:7:"5783995";s:5:"mHash";s:32:"fd983b822cf054a4bf57fbd817db41cf";} | object    |
| 5800477 | O:15:"historyblobstub":2:{s:6:"mOldId";s:7:"5800476";s:5:"mHash";s:32:"d41d8cd98f00b204e9800998ecf8427e";} | object    |
| 5801457 | O:15:"historyblobstub":2:{s:6:"mOldId";s:7:"5801461";s:5:"mHash";s:32:"25ca7dd74cc8b8e0dc2e4753c9e7f925";} | object    |
+---------+------------------------------------------------------------------------------------------------------------+-----------+
4 rows in set (0.00 sec)

So these are serialized HistoryBlob objects that point to a blob that contains multiple revisions. If I understand correctly, the mOldId field in the serialized objects points to another row in the text table. For the sample above, these rows are:

wikiadmin@10.64.16.11(dewiki)> select * from text where old_id in ( 5627981, 5783995, 5800476, 5801461  );
+---------+-------------------------------------------------------+----------------+
| old_id  | old_text                                              | old_flags      |
+---------+-------------------------------------------------------+----------------+
| 5627981 | DB://cluster5/70906/8ec74838a0123a78eef62b00b32148dd  | external,utf-8 |
| 5783995 | DB://cluster4/290079/43ee849ba13f4c917c88f5eff0731cab | external,utf-8 |
| 5800476 | DB://cluster4/293326/e000bf53775f2fce5a463b5456184741 | external,utf-8 |
| 5801461 | DB://cluster4/598411/576892943c819e04a76636540732f82b | external,utf-8 |
+---------+-------------------------------------------------------+----------------+
4 rows in set (0.00 sec)

Investigating further, it seems like these blobs to exist in ExternalStore, but they appear to be corrupted. At least, I'm unable to decode them manually with gzip:

daniel@mwmaint1002:/srv/mediawiki/php-1.36.0-wmf.5$ mwscript maintenance/eval.php dewiki 
> $mws = MediaWiki\MediaWikiServices::getInstance();

>  $es = $mws->getExternalStoreAccess();

> $blob = $es->fetchFromURL( "DB://cluster5/70906/8ec74838a0123a78eef62b00b32148dd" );

> print gettype( $blob );
string
> print strlen( $blob );
24412
> $data = gzinflate( $blob );

> print gettype( $data );
boolean
> var_dump( $data );
bool(false)

> $blob = $es->fetchFromURL( "DB://cluster4/598411/576892943c819e04a76636540732f82b" );

> $data = gzinflate( $blob );

> var_dump( $data );
bool(false)

> $data = gzdecode( $blob );

> var_dump( $data );
bool(false)

> $data = gzuncompress( $blob );

> var_dump( $data );
bool(false)

I will mark the following revisions as bad on oswiki:

107224
107227
107240
107263
107265
107271
107277
107284
107301
107305
107318
107325

I will mark the following revisions as bad on dewiki:

5780804
5766848
5765691
5765768
5765796
5767268
5767280
5769279
5768160
5768696
5770364
5770219
5770490
5773152
5794978
5776174
5792079
5800477
5801455
5801456
5801457

Mentioned in SAL (#wikimedia-operations) [2020-08-26T13:16:47Z] <duesen> daniel@mwmaint1002:/srv/mediawiki/php-1.36.0-wmf.5$ mwscript maintenance/findBadBlobs.php oswiki --mark T205936 --revisions - < ~/T205936-oswiki-20090309200000.ids # marking known bad revisions for T205936

Mentioned in SAL (#wikimedia-operations) [2020-08-26T13:18:00Z] <duesen> daniel@mwmaint1002:/srv/mediawiki/php-1.36.0-wmf.5$ mwscript maintenance/findBadBlobs.php dewiki --mark T205936 --revisions - < ~/T205936-dewiki-20050512070000.ids # marking known bad revisions for T205936

daniel closed this task as Resolved.Aug 26 2020, 1:21 PM
daniel claimed this task.
daniel moved this task from Ready to Done on the Platform Team Workboards (Clinic Duty Team) board.

Accessing the broken revisions no longer triggers an error, they just appear empty. Marking resolved.

Aklapper removed a subscriber: Anomie.Fri, Oct 16, 5:40 PM