Page MenuHomePhabricator

Pageimage property (and possibly other page properties) not updated reliably after reverts
Open, MediumPublic

Description

After a revert on en:Barack_Obama, the pageimage property as returned by the action API disappeared. A subsequent minor (not pageimage related) edit brought it back within seconds, and also quickly updated the derived page summary response in the REST API.

Given the quick recovery, it seems unlikely that the pageimage property update was delayed for over an hour by job queue backlogs. The more likely explanation seems to be that the pageimage update was completely missed.

I have not yet tried to reproduce this issue by reverting a page multiple times. It might or might not be specific to reverts. It could also be caused by a random job being lost in the job queue.


Related incident: https://wikitech.wikimedia.org/wiki/Incident_documentation/2017-09-22_REST-API-alert-pageimage-update

Event Timeline

@Tgr added this on a related mail thread:

The API just does a straight DB lookup for a single row, hard to imagine anything going wrong with that (unless an API query errored out completely and MCS somehow interpreted it as an empty response, which as far as I see canont happen).
PageImages itself is not very robust though - it stored the data in a non-revisioned table which opens it up to all kinds of race conditions. Plus it runs on LinksUpdate so it can lag or maybe get killed by the DB transaction timeout or something similar.
Specifically in this case there was a vandalism two hours ago, which was reverted almost immediately. Probably the link update for the revert failed failed somehow.

As far as I can tell the job code should not be involved here. Editing invokes WikiPage::doModify which calls WikiPage::doEditUpdates (deferred until after the edit transaction is committed and the HTTP response is closed, but still in the same request) which calls AbstractContent::getSecondaryDataUpdates which (absent some unusually hacky hook magic) returns a single LinksUpdate, which will again be deferred into a separate transaction but still the same request, then it will be called via its doUpdate method and that calls the LinksUpdate hook which PageImages implements. So PageImages will run at the end of the edit request, in a separate transaction which it shares with all other LinksUpdate hook handlers and with the updates for all the link tables (for the edited page only; updates for transcluding pages are scheduled as jobs).

It's possible for PageImages to be invoked by the job queue, when someone edits a template and the page image on a transcluding page needs to be recalculated, but that couldn't possibly result in no page image for something like Barack Obama, which has plenty of non-transcluded images.

So my guess is some kind of DB error caused by another link update task, taking the changes of PagesImages with it. (There are plenty of DB timeout and deadlock errors in fatalmonitor, will check if any of those happen in the link update.) Since it's not called from the job queue, there is no retry on failure.

An easy improvement to make is to store the revision used for selecting the page image in page_props. That way clients could at least detect that they are given potentially wrong information, and maybe even trigger some kind of refresh.

Also, it does not seem like there is much benefit in the LinksUpdate hook handlers sharing a transaction - they all depend on the page/revision/text tables for consistency, but not on each other.

Also, it does not seem like there is much benefit in the LinksUpdate hook handlers sharing a transaction - they all depend on the page/revision/text tables for consistency, but not on each other.

I suspect the main motivation for the shared transaction is consistency. This way, both during the save (race condition) as well as afterwards, the page props for a given page ID will always be of the same revision.

Krinkle moved this task from Untriaged to Usage problem on the MediaWiki-libs-Rdbms board.
Krinkle removed a subscriber: Krinkle.

Taking patrol status into account when determining cache expiry could be one way to reduce the impact of this. See T167400 for a similar proposal for images.

Jdlrobson triaged this task as Medium priority.Jun 18 2020, 5:15 PM

PageImages loads the text of the page from the database during LinksUpdate. Presumably this bug would be fixed if it stopped doing that, as I proposed in T296895.

Change 743061 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Add a ParserModifyImageHTML hook for PageImages

https://gerrit.wikimedia.org/r/743061

Change 743062 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/PageImages@master] Identify lead images using a new parser hook instead of during LinksUpdate

https://gerrit.wikimedia.org/r/743062

Change 743061 merged by jenkins-bot:

[mediawiki/core@master] Add a ParserModifyImageHTML hook for PageImages

https://gerrit.wikimedia.org/r/743061

Change 743062 merged by jenkins-bot:

[mediawiki/extensions/PageImages@master] Identify lead images using a new parser hook instead of during LinksUpdate

https://gerrit.wikimedia.org/r/743062