Page MenuHomePhabricator

[Bug] CentralNotice: "Failed to load data blob" error when editing translatable messages
Closed, ResolvedPublic

Description

Currently happens on production when creating or adding a translatable message in the banner. Scroll down a bit in this logstash to see it.

Event Timeline

Here's the list of all the logstash things from that request:
https://logstash.wikimedia.org/goto/9687318b49e9772f118b992696b5cacf

The other one that might be interesting is this:
MediaWiki\Storage\RevisionStore::getTitle fell back to READ_LATEST and got a Title.
<li>RevisionStore.php line 1533 calls MediaWiki\Storage\RevisionStore->getTitle()</li>
<li>RevisionStore.php line 1892 calls MediaWiki\Storage\RevisionStore->newRevisionFromRow()</li>
<li>RevisionStore.php line 1743 calls MediaWiki\Storage\RevisionStore->loadRevisionFromConds()</li>
<li>RevisionStore.php line 898 calls MediaWiki\Storage\RevisionStore->loadRevisionFromId()</li>
<li>Revision.php line 1154 calls MediaWiki\Storage\RevisionStore->newNullRevision()</li>
<li>WikiPage.php line 2277 calls Revision::newNullRevision()</li>
<li>WikiPage.php line 2119 calls WikiPage->insertProtectNullRevision()</li>
<li>BannerMessage.php line 152 calls WikiPage->doUpdateRestrictions()</li>
<li>BannerMessage.php line 125 calls BannerMessage->protectMessageInCnNamespaces()</li>
<li>SpecialCentralNoticeBanners.php line 972 calls BannerMessage->update()</li>
<li>SpecialCentralNoticeBanners.php line 941 calls SpecialCentralNoticeBanners->processSaveBannerAction()</li>
<li>HTMLForm.php line 662 calls SpecialCentralNoticeBanners->processEditBanner()</li>
<li>HTMLForm.php line 554 calls HTMLForm->trySubmit()</li>
<li>SpecialCentralNoticeBanners.php line 437 calls HTMLForm->tryAuthorizedSubmit()</li>
<li>SpecialCentralNoticeBanners.php line 85 calls SpecialCentralNoticeBanners->showBannerEditor()</li>
<li>SpecialPage.php line 565 calls SpecialCentralNoticeBanners->execute()</li>

Hmmm yeah...!! Comparing the stack traces from the warning vs. the actual error... The warning happens in the call to Revision::newNullRevision(), on WikiPage.php line 2277. Then two lines later, on WikiPage.php line 2279, we use the result from that call in the call to Revision::insertOn(), and in that call the error happens...

@daniel We added you because we're thinking this may have something to do with the MCR patches - it showed up shortly after them landing on production. It's reproducable on beta, but not locally so far. That may be because neither of us has a multi-db setup, so we don't get to the fall back in getTitle.

Also adding @Addshore for the same reason, thx in advance!! :)

Friendly ping also for @Tgr @MarkTraceur @Anomie @Fjalapeno ... (thinking you might have some input!) Thx again!!!!

@Ejegg quick guess, without any investigation: the null revision is created immediately after an actual revision. when creating the null revision, we are trying to load the previous revision's content, but it can't be found due to replication lag.

We had a similar error before, @Addshore may remember the details.

One thing to try would be to not use newNullRevision at all, and instead use a PageUpdater directly, as proposed in this change: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/441595

Fjalapeno added a subscriber: CCicalese_WMF.

@AndyRussG working on getting you some help here from one of the MCR engineers

cc @CCicalese_WMF

As to the getTitle fallback: this is probably a symptom aft the same problem: the new row isn't yet in the page tabel on the replica, so getTitle() falls back to master. If the same doesn't happen when trying to load the content blob, things will fail.

...looks like loadRevisionFromId() needs a $queryFlag parameter which it can then pass to loadRevisionFromConds(). newNullRevision() can then call loadRevisionFromId with READ_LATEST. That should fix the issue, but always hitting master when creating a null revision may not be desirable.

...looks like loadRevisionFromId() needs a $queryFlag parameter which it can then pass to loadRevisionFromConds(). newNullRevision() can then call loadRevisionFromId with READ_LATEST. That should fix the issue, but always hitting master when creating a null revision may not be desirable.

OTOH, if you're creating a null revision you're probably planning on inserting it, which is going to hit master anyway.

Looking at the pre-MCR code, Revision::newNullRevision actually did a select on DB_MASTER with a FOR UPDATE lock on the page and revision tables. And the new code is still doing that for the page table, but nor revision (and friends).

I'll try to force a master connection to be used consistently. Sadly, this kind of think is hard to test.

@daniel if you have a patch, we are able to reproduce it on the beta cluster :)

Change 444926 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Use master when fetching revision to base a null-revison on.

https://gerrit.wikimedia.org/r/444926

Change 444926 merged by jenkins-bot:
[mediawiki/core@master] Use master when fetching revision to base a null-revison on.

https://gerrit.wikimedia.org/r/444926

Change 444939 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] In RevisionStore, use DB_MASTER when READ_LATEST is set.

https://gerrit.wikimedia.org/r/444939

The first change (444926) fixed this issue on the beta cluster! yay!!! Thanks @daniel, @Anomie

@AndyRussG Krinkle spotted another place where a similar thing may happen, depending on timing and the migration stage of the CommentStore. I made another patch, see above.

@daniel cool! I don't know an easy way to test that one...

@AndyRussG well, as long as it doesn't break anything ;)

Change 444939 merged by jenkins-bot:
[mediawiki/core@master] In RevisionStore, use DB_MASTER when READ_LATEST is set.

https://gerrit.wikimedia.org/r/444939

Ejegg triaged this task as High priority.