Page MenuHomePhabricator

Target deletion during page move fails
Closed, ResolvedPublicPRODUCTION ERROR

Assigned To
Authored By
hashar
Nov 29 2018, 3:25 PM
Referenced Files
None
Tokens
"Pirate Logo" token, awarded by Raphael."Barnstar" token, awarded by Huji."The World Burns" token, awarded by Ladsgroup."The World Burns" token, awarded by Dalba."The World Burns" token, awarded by Binabik."The World Burns" token, awarded by Framawiki."The World Burns" token, awarded by OldUser02."The World Burns" token, awarded by Thibaut120094."Burninate" token, awarded by Xaosflux.

Description

Failed to delete page-move revision: <OK, collected 1 error(s) on the way, no value set>

Error

Request ID: W--3DgpAMFAAAIRg54sAAAAD

message
Failed to delete page-move revision: <OK, collected 1 error(s) on the way, no value set>
+------+---------------------------+------------------------------------------+
|    1 | cannotdelete              | Kosice                                   |
+------+---------------------------+------------------------------------------+
trace
#0 /srv/mediawiki/php-1.33.0-wmf.6/includes/MovePage.php(276): MovePage->moveToInternal(User, Title, string, boolean, array)
#1 /srv/mediawiki/php-1.33.0-wmf.6/includes/specials/SpecialMovepage.php(608): MovePage->move(User, string, boolean)
#2 /srv/mediawiki/php-1.33.0-wmf.6/includes/specials/SpecialMovepage.php(128): MovePageForm->doSubmit()
#3 /srv/mediawiki/php-1.33.0-wmf.6/includes/specialpage/SpecialPage.php(569): MovePageForm->execute(NULL)
#4 /srv/mediawiki/php-1.33.0-wmf.6/includes/specialpage/SpecialPageFactory.php(558): SpecialPage->run(NULL)
#5 /srv/mediawiki/php-1.33.0-wmf.6/includes/MediaWiki.php(288): MediaWiki\Special\SpecialPageFactory->executePath(Title, RequestContext)
#6 /srv/mediawiki/php-1.33.0-wmf.6/includes/MediaWiki.php(862): MediaWiki->performRequest()
#7 /srv/mediawiki/php-1.33.0-wmf.6/includes/MediaWiki.php(517): MediaWiki->main()
#8 /srv/mediawiki/php-1.33.0-wmf.6/index.php(42): MediaWiki->run()
#9 /srv/mediawiki/w/index.php(3): include(string)
#10 {main}

Impact

Notes

Found 9 occurences when deploying 1.33.0-wmf.6 to all wikis.

Happens at least on page renaming.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Ukwiki is affected too, so I would support the mention in the Tech news for other wikis.

The whole MediaWiki sofware is affected... Unfortunately T/N/04 is already translated and we'll probably have to announce it in /05 (which will be published on Jan 28th). CC'ing @Johan to be sure. The message could be:

Because of a known bug, moving a page over a target which already exists (thus overwriting it) can sometimes fail with an error message. We're actively working on understanding and trying to resolve the issue. You can check the state of the problem [[:phab:T210739|on Phabricator]].

I would expand it as:

Because of a known bug, moving a page over a target which already exists (thus overwriting it) can sometimes fail with an error message. We're actively working on understanding and trying to resolve the issue. You can check the state of the problem [[:phab:T210739|on Phabricator]]. In the meantime, sysops can work around this issue, whenever it occurs, by deleting the target page first, and then proceeding with the move as usual.

I create a gadget on zhwiki to bypass the problem. It add a button in Special:MovePage to delete target page (via API). If you want, you can copy it to your wiki. If you have questions about this gadget, leave a message on my talk page.
Definition: * T210739[ResourceLoader|targets=desktop,mobile|dependencies=oojs-ui,mediawiki.api,mediawiki.notify|rights=delete|default]|T210739.js
Source code: https://zh.wikipedia.org/wiki/MediaWiki:Gadget-T210739.js

After 16 duplicate bugs I'm thinking that maybe we should insert some sort of message in the interface to inform people that we are aware of the issue and point them to this bug.

@MarcoAurelio This will be announced in Tech/News... On Jan 28th... The announcement also suggests to add a message in MediaWiki:Delete and move confirm. I see that Johan has also added the gadget above as a suggestion. However, I must note that the gadget isn't really localizable, and chinese writings are probably fine only for zhwiki. We could ask intadmins to manually localize it (but this would create code duplication), or make it retrieve messages from a customizable variable. Or, probably easier & quicker, just change the messages to be in english.

Change 485822 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@wmf/1.33.0-wmf.13] Add temporary logging for T210739

https://gerrit.wikimedia.org/r/485822

Change 485822 merged by jenkins-bot:
[mediawiki/core@wmf/1.33.0-wmf.13] Add temporary logging for T210739

https://gerrit.wikimedia.org/r/485822

Mentioned in SAL (#wikimedia-operations) [2019-01-22T14:56:14Z] <anomie@deploy1001> Synchronized php-1.33.0-wmf.13/includes/page/WikiPage.php: Add more temporary logging for T210739 (duration: 00m 47s)

So far with the new logging, all the examples indicate that the page row really is deleted on master while the Title loaded earlier from a replica isn't deleted. I suppose the next thing to look at is whether deletions are getting rolled back somehow, versus some unknown cause for that.

Also, I noticed an interesting correlation that may or may not be related. As mentioned earlier the error mostly stopped on December 3, then came back around January 10. I took a closer look at that and noticed the following:

That suggests that something in that Title::checkUserBlock() change is somehow causing this, although I don't see any obvious way that would happen. And it's not impossible that could just be coincidence.

@Anomie Thanks for finding that correlation. Our team, Anti-Harassment, is looking into this right now.

In T210739#4896481, @A2093064 wrote:

I create a gadget on zhwiki to bypass the problem. It add a button in Special:MovePage to delete target page (via API). If you want, you can copy it to your wiki. If you have questions about this gadget, leave a message on my talk page.
Definition: * T210739[ResourceLoader|targets=desktop,mobile|dependencies=oojs-ui,mediawiki.api,mediawiki.notify|rights=delete|default]|T210739.js
Source code: https://zh.wikipedia.org/wiki/MediaWiki:Gadget-T210739.js

I appreciate your work on this, and your desire to share. However, I think it is best not to make it too easy for people to delete the target page. This error happens rarely, and using the gadget may make it happen even more rarely, making the analysis of the issue more difficult. The task is getting a lot of attention now, so I am hopeful we can find a fix soon.

This error happens rarely

I appreciate the gadget. I have been getting the error every time I try to move over existing pages, so I mostly gave up trying and just delete them first.

dbarratt subscribed.

all the examples indicate that the page row really is deleted on master while the Title loaded earlier from a replica isn't deleted.

I don't understand this ... how is the page deleted already if it's in the middle of the delete operation?

Also, I'm wondering if this is a race condition? We did increase the time it takes to perform the permission checks slightly by calling User::isBlockedFrom() on every action, but loading the block (and the restrictions) is cached in memory (during execution), so this query would only execute once during the execution. But perhaps this is the straw that broke the camel's back so to speak?

I've done some investigation and I cannot find any of the code in the "revert" patch in the code path where the exception is thrown. The dates are quite compelling.

That said, I looked at @Anomie's logging patch and it seems to be indicating that the $id in WikiPage::doDeleteArticleBatched is equal to 0.

In that case, I looked at the call that populates the $id which is to loadPageData.

That function eventually calls loadFromRow. I see a large patch which does a lot of work around managing some DB connections between master and replica. That patch is here: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/9cc425ff958bbdd35420de114e30867b01b22554%5E%21/#F0

If I apply Occam's Razor and my intuition that this is a replication-type issue, I would take a close look at this patch.

I recognize this doesn't answer the clarion call of the coincidence with the dates.

all the examples indicate that the page row really is deleted on master while the Title loaded earlier from a replica isn't deleted.

I don't understand this ... how is the page deleted already if it's in the middle of the delete operation?

That's a very good question, if I knew that then I'd probably know how to fix this. As it is, I'm using https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/484516 to propose additional temporary logging to try to track it down.

Also, I'm wondering if this is a race condition?

It's possible. I don't think it's a straight timing thing though.

That function eventually calls loadFromRow. I see a large patch which does a lot of work around managing some DB connections between master and replica. That patch is here: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/9cc425ff958bbdd35420de114e30867b01b22554%5E%21/#F0

If I apply Occam's Razor and my intuition that this is a replication-type issue, I would take a close look at this patch.

On the other hand, that patch wasn't in 1.33.0-wmf.6 while the error did occur in that version (until rMW505845ec917a: Revert "Title::checkUserBlock should call User::isBlockedFrom for every action" was deployed).

On the other hand, that patch wasn't in 1.33.0-wmf.6 while the error did occur in that version (until rMW505845ec917a: Revert "Title::checkUserBlock should call User::isBlockedFrom for every action" was deployed).

Yeah, that's a problem for sure. Maybe there was an earlier version of that code that was around at the time and exhibited a similar issue but which is lost/overwritten by the current code?

I don't see any cannotdelete errors pre 01-10. Did those somehow not make it into Logstash?

Looks like everything in logstash before 2018-12-23 00:00:00 is missing (30-day expiry maybe?). The data is there on mwlog1001, that has a 90-day expiry.

Could we revert the beta cluster to the versions @Anomie pointed out, test the issue, and then move the beta cluster forward one revision at a time until the issue appears (or disappears)? After all, the beta cluster has the closest setup to the WMF production.

That's a good idea @Huji, if it's not too much work.

If we do that and we can't reproduce with any version, that would be a good indicator that this is a replication delay issue or something along those lines. If we can reproduce, it would help narrow down the breaking change.

@Huji is anyone able to reproduce it consistently in beta in it's current state? Or is it purely a production-only issue?

Change 486122 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@wmf/1.33.0-wmf.13] Add temporary logging for T210739

https://gerrit.wikimedia.org/r/486122

Change 486122 merged by Anomie:
[mediawiki/core@wmf/1.33.0-wmf.13] Add temporary logging for T210739

https://gerrit.wikimedia.org/r/486122

Mentioned in SAL (#wikimedia-operations) [2019-01-23T18:35:01Z] <anomie@deploy1001> Synchronized php-1.33.0-wmf.13/includes/page/WikiPage.php: Add even more temporary logging for T210739 (duration: 00m 54s)

Change 484516 abandoned by Anomie:
Add temporary logging for T210739

Reason:
Done with this.

https://gerrit.wikimedia.org/r/484516

Ok. I've finally managed to reproduce it on enwiki, and I've figured out what's going on.

The page is being deleted as expected by SpecialMovepage line 576. That updates the Title object to indicate that the page doesn't exist.

After the deletion, SpecialMovepage goes on to create a MovePage, call the permissions checks, and then call the move() method to perform the move. At some point in there, the Title object is getting reloaded from a replica (where the page isn't deleted yet) so by the time it gets to the move() call it incorrectly thinks the title now exists.

Unsurprisingly based on the timing coincidence identified in T210739#4899646, that reset happens when MovePage->checkPermissions() calls $this->newTitle->getUserPermissionsErrors( 'edit', $user ) and eventually gets into Title->checkUserBlock(). Specifically, at line 2712, it creates a WikiPage wrapping the Title and passes it to Action::factory(), which at some point triggers WikiPage->loadPageData() which loads the page data from a replica and then (in WikiPage->loadFromRow()) reloads the Title object from that loaded data.

Change 486157 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Title: Load page from master in checkUserBlock() when appropriate

https://gerrit.wikimedia.org/r/486157

which at some point triggers WikiPage->loadPageData() which loads the page data from a replica and then (in WikiPage->loadFromRow()) reloads the Title object from that loaded data.

Where? I'm not seeing where this would happen.

@Anomie Also, even if this is being loaded at that point, I'm not seeing where it would be statically cached?

which at some point triggers WikiPage->loadPageData() which loads the page data from a replica and then (in WikiPage->loadFromRow()) reloads the Title object from that loaded data.

Where? I'm not seeing where this would happen.

Nevermind, I found it... Action::factory() calls WikiPage::getActionOverrides() which calls WikPage::getContentModel() which calls WikiPage::exists() which calls WikiPage::loadFromRow()

@Anomie ok, my only other question is... where is this being cached that it's being loaded again?

@Anomie ok, my only other question is... where is this being cached that it's being loaded again?

AHHHHH it's Title::loadFromRow()

Change 486178 had a related patch set uploaded (by Dbarratt; owner: Dbarratt):
[mediawiki/core@master] Clone the Title object to prevent mutations.

https://gerrit.wikimedia.org/r/486178

Do you have a rough idea when the fix might be in production?

(Asking for Tech News purposes.)

First one of the two patches has to be reviewed and merged. Then the fix would be backported to wmf.14. That could happen within the next hour, or it could take two weeks (since most WMF staff will be at All Hands). Or anything in between, or beyond.

Change 486178 merged by jenkins-bot:
[mediawiki/core@master] Clone the Title object to prevent mutations.

https://gerrit.wikimedia.org/r/486178

Change 486369 had a related patch set uploaded (by Gergő Tisza; owner: Dbarratt):
[mediawiki/core@wmf/1.33.0-wmf.14] Clone the Title object to prevent mutations.

https://gerrit.wikimedia.org/r/486369

Change 486157 abandoned by Anomie:
Title: Load page from master in checkUserBlock() when appropriate

Reason:
Gergő decided to merge I9ad973e9 instead.

https://gerrit.wikimedia.org/r/486157

Change 486369 merged by jenkins-bot:
[mediawiki/core@wmf/1.33.0-wmf.14] Clone the Title object to prevent mutations.

https://gerrit.wikimedia.org/r/486369

Mentioned in SAL (#wikimedia-operations) [2019-01-25T00:29:24Z] <ebernhardson@deploy1001> Synchronized php-1.33.0-wmf.14/includes/Title.php: SWAT T210739 gerrit:486369: Clone the Title object to prevent mutation (duration: 00m 47s)

Anomie claimed this task.

@Johan and everyone else: The fix should be deployed to all Wikimedia wikis as of T210739#4907814.

We need a barnstar for solving the most complicated bugs in MediaWiki. @Anomie would surely be the first recipient! Thank you!

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:07 PM