Page MenuHomePhabricator

Pageimage property (and possibly other page properties) not updated reliably after reverts
Open, MediumPublic


After a revert on en:Barack_Obama, the pageimage property as returned by the action API disappeared. A subsequent minor (not pageimage related) edit brought it back within seconds, and also quickly updated the derived page summary response in the REST API.

Given the quick recovery, it seems unlikely that the pageimage property update was delayed for over an hour by job queue backlogs. The more likely explanation seems to be that the pageimage update was completely missed.

I have not yet tried to reproduce this issue by reverting a page multiple times. It might or might not be specific to reverts. It could also be caused by a random job being lost in the job queue.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 22 2017, 8:55 PM
GWicke updated the task description. (Show Details)Sep 22 2017, 9:13 PM
GWicke added a subscriber: Tgr.Sep 22 2017, 9:16 PM

@Tgr added this on a related mail thread:

The API just does a straight DB lookup for a single row, hard to imagine anything going wrong with that (unless an API query errored out completely and MCS somehow interpreted it as an empty response, which as far as I see canont happen).
PageImages itself is not very robust though - it stored the data in a non-revisioned table which opens it up to all kinds of race conditions. Plus it runs on LinksUpdate so it can lag or maybe get killed by the DB transaction timeout or something similar.
Specifically in this case there was a vandalism two hours ago, which was reverted almost immediately. Probably the link update for the revert failed failed somehow.

Tgr added a comment.Sep 22 2017, 9:46 PM

As far as I can tell the job code should not be involved here. Editing invokes WikiPage::doModify which calls WikiPage::doEditUpdates (deferred until after the edit transaction is committed and the HTTP response is closed, but still in the same request) which calls AbstractContent::getSecondaryDataUpdates which (absent some unusually hacky hook magic) returns a single LinksUpdate, which will again be deferred into a separate transaction but still the same request, then it will be called via its doUpdate method and that calls the LinksUpdate hook which PageImages implements. So PageImages will run at the end of the edit request, in a separate transaction which it shares with all other LinksUpdate hook handlers and with the updates for all the link tables (for the edited page only; updates for transcluding pages are scheduled as jobs).

It's possible for PageImages to be invoked by the job queue, when someone edits a template and the page image on a transcluding page needs to be recalculated, but that couldn't possibly result in no page image for something like Barack Obama, which has plenty of non-transcluded images.

So my guess is some kind of DB error caused by another link update task, taking the changes of PagesImages with it. (There are plenty of DB timeout and deadlock errors in fatalmonitor, will check if any of those happen in the link update.) Since it's not called from the job queue, there is no retry on failure.

An easy improvement to make is to store the revision used for selecting the page image in page_props. That way clients could at least detect that they are given potentially wrong information, and maybe even trigger some kind of refresh.

Also, it does not seem like there is much benefit in the LinksUpdate hook handlers sharing a transaction - they all depend on the page/revision/text tables for consistency, but not on each other.

Krinkle added a subscriber: aaron.

Also, it does not seem like there is much benefit in the LinksUpdate hook handlers sharing a transaction - they all depend on the page/revision/text tables for consistency, but not on each other.

I suspect the main motivation for the shared transaction is consistency. This way, both during the save (race condition) as well as afterwards, the page props for a given page ID will always be of the same revision.

Krinkle moved this task from Untriaged to Usage problem on the Wikimedia-Rdbms board.
Krinkle removed a subscriber: Krinkle.
Tgr added a comment.Nov 13 2017, 9:39 AM

Taking patrol status into account when determining cache expiry could be one way to reduce the impact of this. See T167400 for a similar proposal for images.

Jdlrobson triaged this task as Medium priority.Jun 18 2020, 5:15 PM