Page MenuHomePhabricator

[wmf.16-regression] Fatal exception of type "Flow\Exception\InvalidDataException" for opting out from "Structured Discussions on user talk"
Closed, ResolvedPublic

Description

  1. On testwiki (or cawiki) go Preferences-Beta features and un-check "Structured Discussions on user talk' option.
  2. Go to your User talk page (e.g. https://ca.wikipedia.org/wiki/Usuari_Discussi%C3%B3:Etonkovidova). The following error is displayed:

[WlayGQpAAEEAACDqrNgAAABJ] 2018-01-11 00:38:49: Fatal exception of type "Flow\Exception\InvalidDataException"

From https://logstash.wikimedia.org:

#0 /srv/mediawiki/php-1.31.0-wmf.16/extensions/Flow/includes/WorkflowLoaderFactory.php(103): Flow\WorkflowLoaderFactory->loadWorkflowById(Title, Flow\Model\UUID)
#1 /srv/mediawiki/php-1.31.0-wmf.16/extensions/Flow/includes/Actions/Action.php(105): Flow\WorkflowLoaderFactory->createWorkflowLoader(Title)
#2 /srv/mediawiki/php-1.31.0-wmf.16/extensions/Flow/includes/Actions/ViewAction.php(20): Flow\Actions\FlowAction->showForAction(string, OutputPage)
#3 /srv/mediawiki/php-1.31.0-wmf.16/extensions/Flow/includes/Actions/Action.php(50): Flow\Actions\ViewAction->showForAction(string)
#4 /srv/mediawiki/php-1.31.0-wmf.16/includes/MediaWiki.php(500): Flow\Actions\FlowAction->show()
#5 /srv/mediawiki/php-1.31.0-wmf.16/includes/MediaWiki.php(294): MediaWiki->performAction(Article, Title)
#6 /srv/mediawiki/php-1.31.0-wmf.16/includes/MediaWiki.php(854): MediaWiki->performRequest()
#7 /srv/mediawiki/php-1.31.0-wmf.16/includes/MediaWiki.php(524): MediaWiki->main()
#8 /srv/mediawiki/php-1.31.0-wmf.16/index.php(42): MediaWiki->run()
#9 /srv/mediawiki/w/index.php(3): include(string)
#10 {main}

There is no way to bring a User talk page back into a working state - opting-in SD beta feature (and 'Restore all default settings...') has no effect.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Hiding the preference until the fix may make sense.

wgHiddenPrefs should work.

There is a gotcha that hidden preferences are coerced to their default when reading... by default.

In this case, that's false. However, we only check the value on-save, and if it thinks both before and after are false, it should theoretically work bogus-ly (nothing to change).

I acknowledge that is kind of meh. In general, this can be solved with User->getOption $ignoreHidden. It would also have to be done in BetaFeatureUtil.php.

Hiding the preference until the fix may make sense.
wgHiddenPrefs should work.
There is a gotcha that hidden preferences are coerced to their default when reading... by default.
In this case, that's false. However, we only check the value on-save, and if it thinks both before and after are false, it should theoretically work bogus-ly (nothing to change).
I acknowledge that is kind of meh. In general, this can be solved with User->getOption $ignoreHidden. It would also have to be done in BetaFeatureUtil.php.

I was thinking $wgFlowEnableOptInBetaFeature = false; to remove it from the beta feature page and make sure the hook that applies the changes returns early.

It looks like opt-out breaks in the middle and leaves the talk page in an inconsistent state. The problem might be this exception:

PHP Error: Argument 1 passed to WikiPage::prepareContentForEdit() must implement interface Content, null given
#0 /srv/mediawiki/php-1.31.0-wmf.16/includes/page/WikiPage.php(2145): MWExceptionHandler::handleError(integer, string, string, integer, array, array)
#1 /srv/mediawiki/php-1.31.0-wmf.16/includes/page/WikiPage.php(2204): WikiPage->prepareContentForEdit(NULL, Revision, User, NULL, boolean)
#2 /srv/mediawiki/php-1.31.0-wmf.16/includes/MovePage.php(564): WikiPage->doEditUpdates(Revision, User, array)
#3 /srv/mediawiki/php-1.31.0-wmf.16/includes/MovePage.php(271): MovePage->moveToInternal(User, Title, string, boolean, array)
#4 /srv/mediawiki/php-1.31.0-wmf.16/extensions/Flow/includes/Import/OptInController.php(142): MovePage->move(User, string, boolean)
#5 /srv/mediawiki/php-1.31.0-wmf.16/extensions/Flow/includes/Import/OptInController.php(113): Flow\Import\OptInController->movePage(Title, Title, string)
#6 /srv/mediawiki/php-1.31.0-wmf.16/extensions/Flow/includes/Import/OptInUpdate.php(50): Flow\Import\OptInController->disable(Title)
#7 /srv/mediawiki/php-1.31.0-wmf.16/includes/deferred/DeferredUpdates.php(259): Flow\Import\OptInUpdate->doUpdate()
#8 /srv/mediawiki/php-1.31.0-wmf.16/includes/deferred/DeferredUpdates.php(210): DeferredUpdates::runUpdate(Flow\Import\OptInUpdate, Wikimedia\Rdbms\LBFactoryMulti, string, integer)
#9 /srv/mediawiki/php-1.31.0-wmf.16/includes/deferred/DeferredUpdates.php(131): DeferredUpdates::execute(array, string, integer)
#10 /srv/mediawiki/php-1.31.0-wmf.16/includes/MediaWiki.php(900): DeferredUpdates::doUpdates(string)
#11 /srv/mediawiki/php-1.31.0-wmf.16/includes/MediaWiki.php(720): MediaWiki->restInPeace(string, boolean)
#12 [internal function]: Closure$MediaWiki::doPostOutputShutdown()
#13 {main}

Change 404988 had a related patch set uploaded (by Sbisson; owner: Sbisson):
[mediawiki/extensions/Flow@master] Trying to make Opt in/out atomic

https://gerrit.wikimedia.org/r/404988

SBisson added a subscriber: aaron.Jan 18 2018, 2:25 PM

[mediawiki/extensions/Flow@master] Trying to make Opt in/out atomic
https://gerrit.wikimedia.org/r/404988

Question for @aaron, I'm told you are the expert on this.

This patch tries to make the whole process transactional by using AtomicSectionUpdate. It works locally but many things work better against a single server.

The AtomicSectionUpdate constructor is given wfGetDB( DB_MASTER ) with the assumption that the move and edit WikiPage code will also request DB_MASTER and will receive the same connection in which the transaction is open. Does it make sense? Is there any problem you can spot in this patch?

Thanks!

Change 404988 abandoned by Sbisson:
Trying to make Opt in/out atomic

Reason:
I think I074586aa1a718bfb6905e33ce999ad8832fb6aaa is more promising since Flow and Core are not in the same db in production.

https://gerrit.wikimedia.org/r/404988

[mediawiki/extensions/Flow@master] Trying to make Opt in/out atomic
https://gerrit.wikimedia.org/r/404988

Question for @aaron, I'm told you are the expert on this.
This patch tries to make the whole process transactional by using AtomicSectionUpdate. It works locally but many things work better against a single server.
The AtomicSectionUpdate constructor is given wfGetDB( DB_MASTER ) with the assumption that the move and edit WikiPage code will also request DB_MASTER and will receive the same connection in which the transaction is open. Does it make sense? Is there any problem you can spot in this patch?
Thanks!

Scratch that.

I think https://gerrit.wikimedia.org/r/#/c/314480/ (which you've reviewed a while back) is more promising as it handles rollback for both the Core and Flow dbs since they are not on the same server in production.

I have rebased it and addressed some of the comments from back then.

Change 405034 had a related patch set uploaded (by Sbisson; owner: Sbisson):
[operations/mediawiki-config@master] Hide Flow beta feature everywhere but testwiki

https://gerrit.wikimedia.org/r/405034

Change 405034 merged by jenkins-bot:
[operations/mediawiki-config@master] Hide Flow beta feature everywhere but testwiki

https://gerrit.wikimedia.org/r/405034

Mentioned in SAL (#wikimedia-operations) [2018-01-18T18:58:29Z] <mattflaschen@tin> Synchronized wmf-config/InitialiseSettings.php: T184670: Hide Flow beta feature everywhere but testwiki (duration: 01m 10s)

The Flow beta feature is now disabled everywhere but on testwiki.

Next steps:

  1. Repair inconsistent user talk pages in production
  2. review/merge/deploy https://gerrit.wikimedia.org/r/#/c/314480/ to prevent user talk page corruption and re-enable the beta feature
  3. Figure out why WikiPage::prepareContentForEdit() must implement interface Content, null given

Change 405054 had a related patch set uploaded (by Sbisson; owner: Sbisson):
[mediawiki/core@master] Log the reason why revision->getContent() returns null

https://gerrit.wikimedia.org/r/405054

  1. Repair inconsistent user talk pages in production

Dry run: P6618

There are not very many:

grep 'INCONSISTENT' FlowFixInconsistentBoards_all_2018-01-19_dry_run_force.txt
frwiki:  INCONSISTENT: Core title for 'tgno0jiutbg6zukj' is 'Discussion utilisateur:Zilant17', but Flow title is 'Discussion utilisateur:Zilant17/Structured Discussions Archive 1'
frwiki:  INCONSISTENT: Core title for 'u1uw91cj53ekrbvk' is 'Discussion utilisateur:Stephanecbisson', but Flow title is 'Discussion utilisateur:Stephanecbisson/Structured Discussions Archive 1'
testwiki:  INCONSISTENT: Core title for 'spfeq7tyt191zvvx' is 'User talk:Etonkovidova', but Flow title is 'User talk:Etonkovidova/Structured Discussions Archive 1'
testwiki:  INCONSISTENT: Core title for 'tgeds5mlkb0we4rv' is 'User talk:Zilant63', but Flow title is 'User talk:Zilant63/Structured Discussions Archive 1'
testwiki:  INCONSISTENT: Core title for 'u5tzvm8lv1hw6ve1' is 'User talk:Stephanecbisson', but Flow title is 'User talk:Stephanecbisson/Structured Discussions Archive 1'
zhwiki:  INCONSISTENT: Core title for 'u5uxvde8yqfxmywd' is 'User talk:DreamGuan', but Flow title is 'User talk:DreamGuan/结构式讨论 存档 1'

[...]
There are not very many:

grep 'INCONSISTENT' FlowFixInconsistentBoards_all_2018-01-19_dry_run_force.txt
frwiki:  INCONSISTENT: Core title for 'tgno0jiutbg6zukj' is 'Discussion utilisateur:Zilant17', but Flow title is 'Discussion utilisateur:Zilant17/Structured Discussions Archive 1'
frwiki:  INCONSISTENT: Core title for 'u1uw91cj53ekrbvk' is 'Discussion utilisateur:Stephanecbisson', but Flow title is 'Discussion utilisateur:Stephanecbisson/Structured Discussions Archive 1'
testwiki:  INCONSISTENT: Core title for 'spfeq7tyt191zvvx' is 'User talk:Etonkovidova', but Flow title is 'User talk:Etonkovidova/Structured Discussions Archive 1'
testwiki:  INCONSISTENT: Core title for 'tgeds5mlkb0we4rv' is 'User talk:Zilant63', but Flow title is 'User talk:Zilant63/Structured Discussions Archive 1'
testwiki:  INCONSISTENT: Core title for 'u5tzvm8lv1hw6ve1' is 'User talk:Stephanecbisson', but Flow title is 'User talk:Stephanecbisson/Structured Discussions Archive 1'
zhwiki:  INCONSISTENT: Core title for 'u5uxvde8yqfxmywd' is 'User talk:DreamGuan', but Flow title is 'User talk:DreamGuan/结构式讨论 存档 1'

They're all Elena and I but the last one is a zhwiki user whose report is merged into the current ticket. Please go ahead and fix them all when you have a chance.

@Wong128hk We apologize for this issue. It is fixed now, but temporarily you will not be able to adjust the Beta Feature. We will post here when you can.

Full log is at P6624. Summary below (the grep excludes gomwiki, due to unrelated T154623: Workflow ID null for some user talk pages on gomwiki). The other two errors are also unrelated, going back to P4715.

DreamGuan's now displays right; they all should fully work now.

grep -E 'FIXED|INCONSISTENT|ERROR' FlowFixInconsistentBoards_all_2018-01-19_actual_force.txt|grep -v 'परप'                        
frwiki:  INCONSISTENT: Core title for 'tgno0jiutbg6zukj' is 'Discussion utilisateur:Zilant17', but Flow title is 'Discussion utilisateur:Zilant17/Structured Discussions Archive 1'
frwiki:  FIXED: Updated 'tgno0jiutbg6zukj' to match core title, 'Discussion utilisateur:Zilant17'
frwiki:  INCONSISTENT: Core title for 'u1uw91cj53ekrbvk' is 'Discussion utilisateur:Stephanecbisson', but Flow title is 'Discussion utilisateur:Stephanecbisson/Structured Discussions Archive 1'
frwiki:  FIXED: Updated 'u1uw91cj53ekrbvk' to match core title, 'Discussion utilisateur:Stephanecbisson'
ERROR: 'User talk:Eimanahmed80' refers to workflow ID 'suhadmgqu7o4iwgg', which could not be found.
ERROR: '<REDACTED>' refers to workflow ID 'sgkoegybywkraxl3', which could not be found.
testwiki:  INCONSISTENT: Core title for 'spfeq7tyt191zvvx' is 'User talk:Etonkovidova', but Flow title is 'User talk:Etonkovidova/Structured Discussions Archive 1'
testwiki:  FIXED: Updated 'spfeq7tyt191zvvx' to match core title, 'User talk:Etonkovidova'
testwiki:  INCONSISTENT: Core title for 'tgeds5mlkb0we4rv' is 'User talk:Zilant63', but Flow title is 'User talk:Zilant63/Structured Discussions Archive 1'
testwiki:  FIXED: Updated 'tgeds5mlkb0we4rv' to match core title, 'User talk:Zilant63'
testwiki:  INCONSISTENT: Core title for 'u5tzvm8lv1hw6ve1' is 'User talk:Stephanecbisson', but Flow title is 'User talk:Stephanecbisson/Structured Discussions Archive 1'
testwiki:  FIXED: Updated 'u5tzvm8lv1hw6ve1' to match core title, 'User talk:Stephanecbisson'
zhwiki:  INCONSISTENT: Core title for 'u5uxvde8yqfxmywd' is 'User talk:DreamGuan', but Flow title is 'User talk:DreamGuan/结构式讨论 存档 1'
zhwiki:  FIXED: Updated 'u5uxvde8yqfxmywd' to match core title, 'User talk:DreamGuan'

@greg The pages are fixed, but the beta feature is disabled in production. The ticket was about having errors when option-out, so I am inclined to have keep this ticket open to see the fixes for opt-out (instead of creating a separate ticket for opt-out failures).

greg lowered the priority of this task from Unbreak Now! to High.Jan 22 2018, 11:29 PM

Lowering to High then, thanks!

Change 406051 had a related patch set uploaded (by Sbisson; owner: Sbisson):
[mediawiki/extensions/Flow@master] OptInController catch both errors and exception

https://gerrit.wikimedia.org/r/406051

Change 406051 merged by jenkins-bot:
[mediawiki/extensions/Flow@master] OptInController catch both errors and exception

https://gerrit.wikimedia.org/r/406051

Re-ran on testwiki:

Checked a total of 300 Flow boards.  Of those, 0 boards had an inconsistent title; 0 were fixed.
INCONSISTENT: Core title for 'u5tzvm8lv1hw6ve1' is 'User talk:Stephanecbisson', but Flow title is 'User talk:Stephanecbisson/Structured Discussions Archive 1'
FIXED: Updated 'u5tzvm8lv1hw6ve1' to match core title, 'User talk:Stephanecbisson'

Checked a total of 354 Flow boards.  Of those, 1 boards had an inconsistent title; 1 were fixed.

real    0m8.573s
user    0m2.724s
sys     0m0.748s

Change 406631 had a related patch set uploaded (by Sbisson; owner: Sbisson):
[mediawiki/extensions/Flow@wmf/1.31.0-wmf.17] OptInController catch both errors and exception

https://gerrit.wikimedia.org/r/406631

Change 406631 merged by jenkins-bot:
[mediawiki/extensions/Flow@wmf/1.31.0-wmf.17] OptInController catch both errors and exception

https://gerrit.wikimedia.org/r/406631

Mentioned in SAL (#wikimedia-operations) [2018-02-05T14:24:24Z] <hashar@tin> Synchronized php-1.31.0-wmf.17/extensions/Flow/includes/Import/OptInController.php: OptInController catch both errors and exception - T184670 (duration: 00m 55s)

Mattflaschen-WMF added a comment.EditedFeb 9 2018, 7:10 AM

Re-ran to fix testwiki errors (fixed).

I can provide the log, but there's nothing really notable (two test wiki fixes (Stephanecbisson and Etonkovidova), otherwise same as before, except there was a database error ("Database is read-only: The database has been automatically locked while the replica database servers catch up to the master.") on fawiki, so I re-ran on that.

We should discuss how to do this going forward (leave as is, convert to special page, do automatically, etc.).

Also, we should probably try debugging through (with the HHVM debugger) in production. Is it reproducible 100%?

Change 410193 had a related patch set uploaded (by Sbisson; owner: Sbisson):
[mediawiki/extensions/Flow@master] Simplify nested deferred in OptInController

https://gerrit.wikimedia.org/r/410193

Change 410193 merged by jenkins-bot:
[mediawiki/extensions/Flow@master] Simplify nested deferred in OptInController

https://gerrit.wikimedia.org/r/410193

Change 405054 merged by jenkins-bot:
[mediawiki/core@master] Log the reason why revision->getContent() returns null

https://gerrit.wikimedia.org/r/405054

Change 410522 had a related patch set uploaded (by Sbisson; owner: Sbisson):
[mediawiki/core@wmf/1.31.0-wmf.21] Log the reason why revision->getContent() returns null

https://gerrit.wikimedia.org/r/410522

Change 410546 had a related patch set uploaded (by Sbisson; owner: Sbisson):
[operations/mediawiki-config@master] Enable log channel T184670

https://gerrit.wikimedia.org/r/410546

Change 410546 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable log channel T184670

https://gerrit.wikimedia.org/r/410546

Change 410522 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.21] Log the reason why revision->getContent() returns null

https://gerrit.wikimedia.org/r/410522

Mentioned in SAL (#wikimedia-operations) [2018-02-15T14:35:42Z] <zfilipin@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:410546|Enable log channel T184670 (T184670)]] (duration: 01m 12s)

Mentioned in SAL (#wikimedia-operations) [2018-02-15T14:35:42Z] <zfilipin@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:410546|Enable log channel T184670 (T184670)]] (duration: 01m 12s)

Mentioned in SAL (#wikimedia-operations) [2018-02-15T14:37:12Z] <zfilipin@tin> Synchronized php-1.31.0-wmf.21/includes/Revision.php: SWAT: [[gerrit:410522|Log the reason why revision->getContent() returns null (T184670)]] (duration: 01m 12s)

From the logging that has been added:

2018-02-15 14:31:09 [WoWZnwpAAC4AACz4mHkAAAAB] mwdebug1002 testwiki 1.31.0-wmf.21 T184670 INFO: Revision::getContent: Cannot get content: Failed to load data blob from tt:375056: Failed to load blob from address tt:375056
#0 /srv/mediawiki/php-1.31.0-wmf.21/includes/Storage/RevisionStore.php(830): MediaWiki\Storage\RevisionStore->loadSlotContent()
#1 (): Closure$MediaWiki\Storage\RevisionStore::emulateMainSlot_1_29#2()
#2 /srv/mediawiki/php-1.31.0-wmf.21/includes/Storage/SlotRecord.php(201): call_user_func()
#3 /srv/mediawiki/php-1.31.0-wmf.21/includes/Storage/RevisionRecord.php(173): MediaWiki\Storage\SlotRecord->getContent()
#4 /srv/mediawiki/php-1.31.0-wmf.21/includes/Revision.php(906): MediaWiki\Storage\RevisionRecord->getContent()
#5 /srv/mediawiki/php-1.31.0-wmf.21/includes/page/WikiPage.php(2173): Revision->getContent()
#6 /srv/mediawiki/php-1.31.0-wmf.21/includes/MovePage.php(564): WikiPage->doEditUpdates()
#7 /srv/mediawiki/php-1.31.0-wmf.21/includes/MovePage.php(271): MovePage->moveToInternal()
#8 /srv/mediawiki/php-1.31.0-wmf.21/extensions/Flow/includes/Import/OptInController.php(226): MovePage->move()
#9 /srv/mediawiki/php-1.31.0-wmf.21/extensions/Flow/includes/Import/OptInController.php(197): Flow\Import\OptInController->movePage()
#10 /srv/mediawiki/php-1.31.0-wmf.21/extensions/Flow/includes/Import/OptInController.php(122): Flow\Import\OptInController->disable()
#11 /srv/mediawiki/php-1.31.0-wmf.21/includes/deferred/MWCallableUpdate.php(34): Closure$Flow\Import\OptInController::initiateChange()
#12 /srv/mediawiki/php-1.31.0-wmf.21/includes/deferred/DeferredUpdates.php(259): MWCallableUpdate->doUpdate()
#13 /srv/mediawiki/php-1.31.0-wmf.21/includes/deferred/DeferredUpdates.php(222): DeferredUpdates::runUpdate()
#14 /srv/mediawiki/php-1.31.0-wmf.21/includes/deferred/DeferredUpdates.php(131): DeferredUpdates::execute()
#15 /srv/mediawiki/php-1.31.0-wmf.21/includes/MediaWiki.php(907): DeferredUpdates::doUpdates()
#16 /srv/mediawiki/php-1.31.0-wmf.21/includes/MediaWiki.php(727): MediaWiki->restInPeace()
#17 (): Closure$MediaWiki::doPostOutputShutdown()
#18 {main}

I just found a very interesting log entry while playing around with this:

Revision::getContent: Cannot get content: Failed to load data blob from tt:375559: Failed to load blob from address tt:375559
#0 /srv/mediawiki/php-1.31.0-wmf.23/includes/Storage/RevisionStore.php(852): MediaWiki\Storage\RevisionStore->loadSlotContent()
#1 (): Closure$MediaWiki\Storage\RevisionStore::emulateMainSlot_1_29#2()
#2 /srv/mediawiki/php-1.31.0-wmf.23/includes/Storage/SlotRecord.php(201): call_user_func()
#3 /srv/mediawiki/php-1.31.0-wmf.23/includes/Storage/RevisionRecord.php(173): MediaWiki\Storage\SlotRecord->getContent()
#4 /srv/mediawiki/php-1.31.0-wmf.23/includes/Revision.php(938): MediaWiki\Storage\RevisionRecord->getContent()
#5 /srv/mediawiki/php-1.31.0-wmf.23/extensions/EventBus/includes/EventBus.php(245): Revision->getContent()
#6 /srv/mediawiki/php-1.31.0-wmf.23/extensions/EventBus/includes/EventBusHooks.php(113): EventBus::createRevisionAttrs()
#7 /srv/mediawiki/php-1.31.0-wmf.23/includes/Hooks.php(177): EventBusHooks::onRevisionInsertComplete()
#8 /srv/mediawiki/php-1.31.0-wmf.23/includes/Hooks.php(205): Hooks::callHook()
#9 /srv/mediawiki/php-1.31.0-wmf.23/includes/Revision.php(1136): Hooks::run()
#10 /srv/mediawiki/php-1.31.0-wmf.23/includes/MovePage.php(536): Revision->insertOn()
#11 /srv/mediawiki/php-1.31.0-wmf.23/includes/MovePage.php(271): MovePage->moveToInternal()
#12 /srv/mediawiki/php-1.31.0-wmf.23/extensions/Flow/includes/Import/OptInController.php(220): MovePage->move()
#13 /srv/mediawiki/php-1.31.0-wmf.23/extensions/Flow/includes/Import/OptInController.php(191): Flow\Import\OptInController->movePage()
#14 /srv/mediawiki/php-1.31.0-wmf.23/extensions/Flow/includes/Import/OptInController.php(119): Flow\Import\OptInController->disable()
#15 /srv/mediawiki/php-1.31.0-wmf.23/includes/deferred/MWCallableUpdate.php(34): Closure$Flow\Import\OptInController::initiateChange()
#16 /srv/mediawiki/php-1.31.0-wmf.23/includes/deferred/DeferredUpdates.php(259): MWCallableUpdate->doUpdate()
#17 /srv/mediawiki/php-1.31.0-wmf.23/includes/deferred/DeferredUpdates.php(210): DeferredUpdates::runUpdate()
#18 /srv/mediawiki/php-1.31.0-wmf.23/includes/deferred/DeferredUpdates.php(131): DeferredUpdates::execute()
#19 /srv/mediawiki/php-1.31.0-wmf.23/includes/MediaWiki.php(907): DeferredUpdates::doUpdates()
#20 /srv/mediawiki/php-1.31.0-wmf.23/includes/MediaWiki.php(727): MediaWiki->restInPeace()
#21 (): Closure$MediaWiki::doPostOutputShutdown()
#22 {main}

I wondered if the fact that EventBus is an extension that's not installed on most of our test installs could explain why we can't reproduce it. I'll try installing EventBus locally, but I'm not yet convinced that it could be related. Instead what this proves is that calling ->getContent() on the null revision fails for the first time immediately after the null revision is inserted (this stack trace is for the RevisionInsertComplete hook). There are other log entries too, but those happen later in the process (because ->getContent() keeps failing and returning null every time it's called, and we only die when one of those nulls violates a type hint).

I theorized there might be some sort of master/replica or external store race condition, but I simulated the creation of a null revision on my (Flow-enabled) testwiki user talk page in eval.php (all as one statement), and I ended up with non-null content. Then I thought it might be related to the $dbw->startAtomic() call, so I wrapped my test in that, but that also didn't break. And it wouldn't make sense for any of those things to be the culprit anyway, because normal page moves work. Perhaps the reason for the failure is specific to it being inside a DeferredUpdate? Or to doing two moves (A->B and C->A) in the same request?

The above error is preceded by MediaWiki\Storage\SqlBlobStore::fetchBlob: No text row with ID 375559. [Called from MediaWiki\Storage\SqlBlobStore::fetchBlob in /srv/mediawiki/php-1.31.0-wmf.23/includes/Storage/SqlBlobStore.php at line 345], so somehow the text table row either wasn't inserted, or wasn't there yet when SqlBlobStore queried the text table. I checked afterwards, and there was no text row with that ID, but that may have been due to a rollback. Before that, there is a log entry that says Rejected set() for testwiki:revisiontext:textid:tt%3A375558 due to pending writes., but I think that might be a red herring.

What is very strange though is that looking through the detailed debug logs, where I can see every SQL query, there's never an INSERT INTO text query for the null revision, while there is one for the revision changing the archive template.

aaron added a comment.Mar 3 2018, 3:13 AM

Don't null revisions just reuse the same rev_text_id and insert no new blob? At least that's how it used to work.

Hmm:

INSERT  INTO `revision` (rev_page,rev_parent_id,rev_text_id,rev_minor_edit,rev_timestamp,rev_deleted,rev_len,rev_sha1,rev_comment,rev_user,rev_user_text,rev_content_model,rev_content_format) VALUES ('98405','347046','375559','0','20180303013112','0','175','6m1xf2l9k0xzru0kd88561a0t7qrziy','Adding current template','26836','Flow talk page manager',NULL,NULL)

INSERT  INTO `revision` (rev_page,rev_parent_id,rev_text_id,rev_minor_edit,rev_timestamp,rev_deleted,rev_len,rev_sha1,rev_comment,rev_user,rev_user_text,rev_content_model,rev_content_format) VALUES ('98405','347047','375559','1','20180303013112','0','175','6m1xf2l9k0xzru0kd88561a0t7qrziy','Flow talk page manager moved page [[User talk:Roan Kattouw (WMF)/Archive 1]] to [[User talk:Roan Kattouw (WMF)]] without leaving a redirect: Restore archived wikitext page.','26836','Flow talk page manager',NULL,NULL)

Two different revisions to the same page ID with the same text ID. I guess that's expected because the second one is a null revision. Page ID 98405 is User talk:Roan Kattouw (WMF)/Archive 1 so that checks out, this is behaving as expected. But then, that text ID can't be found:

getMulti(WANCache:v:testwiki:revisiontext:textid:tt%3A375559)
SELECT  old_text,old_flags  FROM `text`    WHERE old_id = '375559'  LIMIT 1  
MediaWiki\Storage\SqlBlobStore::fetchBlob: No text row with ID 375559. [Called from MediaWiki\Storage\SqlBlobStore::fetchBlob in /srv/mediawiki/php-1.31.0-wmf.23/includes/Storage/SqlBlobStore.php at line 345]

Even though I just saw something being inserted:

INSERT  INTO `text` (old_id,old_text,old_flags) VALUES (NULL,'DB://cluster25/141901','utf-8,gzip,external')

And I've confirmed with additional logging (in SqlBlobStore::storeBlob()) that that really is the ID of the text row that was inserted.

Don't null revisions just reuse the same rev_text_id and insert no new blob? At least that's how it used to work.

Yes, that's right. It looks like what's happening here is that it's editing the page (inserting a "real" new revision), then inserting a null revision on top of that in the same request. It's also moving the page in the same request, so there's a lot of moving parts here.

I think I've found something suspicious:

  • RevisionStore::newNullRevision() uses a master connection to get the latest rev_text_id among other things
  • It then calls emulateMainSlot_1_29() to put that text ID into the main slot. However, it passes $queryFlags=0
  • emulateMainSlot_1_29() calls loadSlotContent(), which calls SqlBlobStore::getBlob(), which calls SqlBlobStore::fetchBlob() propagating $queryFlags along the way
  • Because $queryFlags=0, this results in a read from the replica

I'll test on mwdebug1001 to see if hacking in READ_LATEST (which SqlBLobStore will then map to READ_LATEST_IMMUTABLE) changes the behavior at all.

That worked! Patch inbound.

Change 416205 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/core@master] Fix master/replica race condition when creating null revisions

https://gerrit.wikimedia.org/r/416205

Change 416205 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/core@master] Fix master/replica race condition when creating null revisions
https://gerrit.wikimedia.org/r/416205

Note that to test this, you also need https://gerrit.wikimedia.org/r/#/c/416199/ , otherwise the beta feature does nothing

Change 416205 merged by jenkins-bot:
[mediawiki/core@master] Fix master/replica race condition when creating null revisions

https://gerrit.wikimedia.org/r/416205

Change 416417 had a related patch set uploaded (by Sbisson; owner: Catrope):
[mediawiki/core@wmf/1.31.0-wmf.23] Fix master/replica race condition when creating null revisions

https://gerrit.wikimedia.org/r/416417

Change 416417 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.23] Fix master/replica race condition when creating null revisions

https://gerrit.wikimedia.org/r/416417

Mentioned in SAL (#wikimedia-operations) [2018-03-05T14:36:40Z] <hashar@tin> Started scap: core + Flow, master/replicate race condition - T182358 T184670

Mentioned in SAL (#wikimedia-operations) [2018-03-05T14:41:05Z] <hashar@tin> Finished scap: core + Flow, master/replicate race condition - T182358 T184670 (duration: 04m 24s)

Etonkovidova added a comment.EditedMar 12 2018, 4:58 PM

Checked in betalabs and cawiki (wmf.24) - the beta feature "Structured Discussions on user talk" is present and works as expected (disable/enable works as expected).

The error is still present in testwiki (wmf.25) - https://test.wikipedia.org/wiki/User_talk:Zilant17

The Structured Discussions workflow is not associated with this page.

Return to Main Page.

[WqkAHwpAADgAADZn9pYAAABF] /wiki/User_talk:Zilant17 Flow\Exception\InvalidDataException from line 130 of /srv/mediawiki/php-1.31.0-wmf.25/extensions/Flow/includes/WorkflowLoaderFactory.php: Flow workflow is for different page

Backtrace:

#0 /srv/mediawiki/php-1.31.0-wmf.25/extensions/Flow/includes/WorkflowLoaderFactory.php(103): Flow\WorkflowLoaderFactory->loadWorkflowById(Title, Flow\Model\UUID)
#1 /srv/mediawiki/php-1.31.0-wmf.25/extensions/Flow/includes/Actions/Action.php(105): Flow\WorkflowLoaderFactory->createWorkflowLoader(Title)
#2 /srv/mediawiki/php-1.31.0-wmf.25/extensions/Flow/includes/Actions/ViewAction.php(20): Flow\Actions\FlowAction->showForAction(string, OutputPage)
#3 /srv/mediawiki/php-1.31.0-wmf.25/extensions/Flow/includes/Actions/Action.php(50): Flow\Actions\ViewAction->showForAction(string)
#4 /srv/mediawiki/php-1.31.0-wmf.25/includes/MediaWiki.php(500): Flow\Actions\FlowAction->show()
#5 /srv/mediawiki/php-1.31.0-wmf.25/includes/MediaWiki.php(294): MediaWiki->performAction(Article, Title)
#6 /srv/mediawiki/php-1.31.0-wmf.25/includes/MediaWiki.php(861): MediaWiki->performRequest()
#7 /srv/mediawiki/php-1.31.0-wmf.25/includes/MediaWiki.php(524): MediaWiki->main()
#8 /srv/mediawiki/php-1.31.0-wmf.25/index.php(42): MediaWiki->run()
#9 /srv/mediawiki/w/index.php(3): include(string)
#10 {main}

The error is still present in testwiki (wmf.25) - https://test.wikipedia.org/wiki/User_talk:Zilant17

If you broke it just now by opting out... it's bad. We'll have to go back and see why it isn't fixed.

If it has been broken for some time, the page has to be manually repaired by a script (fixInconsistentBoards.php). @Catrope has the power to run it in production.

I ran the script but it said it fixed nothing. https://test.wikipedia.org/wiki/User_talk:Zilant17 doesn't look broken to me, nor does https://test.wikipedia.org/wiki/User_talk:Zilant17/Structured_Discussions_Archive_1 . @Etonkovidova what happens if you opt in and back out? Does that break anything?

Ha, looks like this is a temporary state: if you visit your talk page too quickly after opting out, you get that error, but then if you refresh it's no longer broken.

It seems to be ok now - both in testwiki and cawiki (wmf.25)

Etonkovidova closed this task as Resolved.Mar 15 2018, 7:39 AM
238482n375 removed SBisson as the assignee of this task.Jun 15 2018, 8:03 AM
238482n375 lowered the priority of this task from High to Lowest.
238482n375 moved this task from Next Up to In Code Review on the Analytics-Kanban board.
238482n375 edited subscribers, added: 238482n375; removed: Aklapper.

SG9tZVBoYWJyaWNhdG9yCk5vIG1lc3NhZ2VzLiBObyBub3RpZmljYXRpb25zLgoKICAgIFNlYXJjaAoKQ3JlYXRlIFRhc2sKTWFuaXBoZXN0ClQxOTcyODEKRml4IGZhaWxpbmcgd2VicmVxdWVzdCBob3VycyAodXBsb2FkIGFuZCB0ZXh0IDIwMTgtMDYtMTQtMTEpCk9wZW4sIE5lZWRzIFRyaWFnZVB1YmxpYwoKICAgIEVkaXQgVGFzawogICAgRWRpdCBSZWxhdGVkIFRhc2tzLi4uCiAgICBFZGl0IFJlbGF0ZWQgT2JqZWN0cy4uLgogICAgUHJvdGVjdCBhcyBzZWN1cml0eSBpc3N1ZQoKICAgIE11dGUgTm90aWZpY2F0aW9ucwogICAgQXdhcmQgVG9rZW4KICAgIEZsYWcgRm9yIExhdGVyCgpUYWdzCgogICAgQW5hbHl0aWNzLUthbmJhbiAoSW4gUHJvZ3Jlc3MpCgpTdWJzY3JpYmVycwpBa2xhcHBlciwgSkFsbGVtYW5kb3UKQXNzaWduZWQgVG8KSkFsbGVtYW5kb3UKQXV0aG9yZWQgQnkKSkFsbGVtYW5kb3UsIEZyaSwgSnVuIDE1CkRlc2NyaXB0aW9uCgpPb3ppZSBqb2JzIGhhdmUgYmVlbiBmYWlsaW5nIGF0IGxlYXN0IGEgZmV3IHRpbWVzIGVhY2guIE1vcmUgaW52ZXN0aWdhdGlvbiBuZWVkZWQuCkpBbGxlbWFuZG91IGNyZWF0ZWQgdGhpcyB0YXNrLkZyaSwgSnVuIDE1LCA3OjIxIEFNCkhlcmFsZCBhZGRlZCBhIHN1YnNjcmliZXI6IEFrbGFwcGVyLiC3IFZpZXcgSGVyYWxkIFRyYW5zY3JpcHRGcmksIEp1biAxNSwgNzoyMSBBTQpKQWxsZW1hbmRvdSBjbGFpbWVkIHRoaXMgdGFzay5GcmksIEp1biAxNSwgNzoyMiBBTQpKQWxsZW1hbmRvdSB1cGRhdGVkIHRoZSB0YXNrIGRlc2NyaXB0aW9uLiAoU2hvdyBEZXRhaWxzKQpKQWxsZW1hbmRvdSBhZGRlZCBhIHByb2plY3Q6IEFuYWx5dGljcy1LYW5iYW4uCkpBbGxlbWFuZG91IG1vdmVkIHRoaXMgdGFzayBmcm9tIE5leHQgVXAgdG8gSW4gUHJvZ3Jlc3Mgb24gdGhlIEFuYWx5dGljcy1LYW5iYW4gYm9hcmQuCkNoYW5nZSBTdWJzY3JpYmVycwpDaGFuZ2UgUHJpb3JpdHkKQXNzaWduIC8gQ2xhaW0KTW92ZSBvbiBXb3JrYm9hcmQKQ2hhbmdlIFByb2plY3QgVGFncwpBbmFseXRpY3MtS2FuYmFuCtcKU2VjdXJpdHkK1wpXaWtpbWVkaWEtVkUtQ2FtcGFpZ25zIChTMi0yMDE4KQrXClNjYXAK1wpTY2FwIChTY2FwMy1BZG9wdGlvbi1QaGFzZTIpCtcKQWJ1c2VGaWx0ZXIK1wpEYXRhLXJlbGVhc2UK1wpIYXNodGFncwrXCkxhYnNEQi1BdWRpdG9yCtcKTGFkaWVzLVRoYXQtRk9TUy1NZWRpYVdpa2kK1wpMYW5ndWFnZS0yMDE4LUFwci1KdW5lCtcKTGFuZ3VhZ2UtMjAxOC1KYW4tTWFyCtcKSEhWTQrXCkhBV2VsY29tZQrXCkJvbGQKSXRhbGljcwpNb25vc3BhY2VkCkxpbmsKQnVsbGV0ZWQgTGlzdApOdW1iZXJlZCBMaXN0CkNvZGUgQmxvY2sKUXVvdGUKVGFibGUKVXBsb2FkIEZpbGUKTWVtZQpQcmV2aWV3CkhlbHAKRnVsbHNjcmVlbiBNb2RlClBpbiBGb3JtIE9uIFNjcmVlbgoyMzg0ODJuMzc1IGFkZGVkIHByb2plY3RzOiBTZWN1cml0eSwgV2lraW1lZGlhLVZFLUNhbXBhaWducyAoUzItMjAxOCksIFNjYXAgKFNjYXAzLUFkb3B0aW9uLVBoYXNlMiksIEFidXNlRmlsdGVyLCBEYXRhLXJlbGVhc2UsIEhhc2h0YWdzLCBMYWJzREItQXVkaXRvciwgTGFkaWVzLVRoYXQtRk9TUy1NZWRpYVdpa2ksIExhbmd1YWdlLTIwMTgtQXByLUp1bmUsIExhbmd1YWdlLTIwMTgtSmFuLU1hciwgSEhWTSwgSEFXZWxjb21lLlBSRVZJRVcKMjM4NDgybjM3NSBtb3ZlZCB0aGlzIHRhc2sgZnJvbSBJbiBQcm9ncmVzcyB0byBJbiBDb2RlIFJldmlldyBvbiB0aGUgQW5hbHl0aWNzLUthbmJhbiBib2FyZC4KMjM4NDgybjM3NSByZW1vdmVkIEpBbGxlbWFuZG91IGFzIHRoZSBhc3NpZ25lZSBvZiB0aGlzIHRhc2suCjIzODQ4Mm4zNzUgdHJpYWdlZCB0aGlzIHRhc2sgYXMgTG93ZXN0IHByaW9yaXR5LgoyMzg0ODJuMzc1IHJlbW92ZWQgc3Vic2NyaWJlcnM6IEFrbGFwcGVyLCBKQWxsZW1hbmRvdS4KQ29udGVudCBsaWNlbnNlZCB1bmRlciBDcmVhdGl2ZSBDb21tb25zIEF0dHJpYnV0aW9uLVNoYXJlQWxpa2UgMy4wIChDQy1CWS1TQSkgdW5sZXNzIG90aGVyd2lzZSBub3RlZDsgY29kZSBsaWNlbnNlZCB1bmRlciBHTlUgR2VuZXJhbCBQdWJsaWMgTGljZW5zZSAoR1BMKSBvciBvdGhlciBvcGVuIHNvdXJjZSBsaWNlbnNlcy4gQnkgdXNpbmcgdGhpcyBzaXRlLCB5b3UgYWdyZWUgdG8gdGhlIFRlcm1zIG9mIFVzZSwgUHJpdmFjeSBQb2xpY3ksIGFuZCBDb2RlIG9mIENvbmR1Y3QuILcgV2lraW1lZGlhIEZvdW5kYXRpb24gtyBQcml2YWN5IFBvbGljeSC3IENvZGUgb2YgQ29uZHVjdCC3IFRlcm1zIG9mIFVzZSC3IERpc2NsYWltZXIgtyBDQy1CWS1TQSC3IEdQTApZb3VyIGJyb3dzZXIgdGltZXpvbmUgc2V0dGluZyBkaWZmZXJzIGZyb20gdGhlIHRpbWV6b25lIHNldHRpbmcgaW4geW91ciBwcm9maWxlLCBjbGljayB0byByZWNvbmNpbGUu

Aklapper assigned this task to SBisson.Jun 15 2018, 1:06 PM