
Pages that were edited very recently may show the old revision content, but with the new revision ID in permalinks/wgRevisionId etc.
Closed, Resolved · Public

Description

(@ROdonnell-WMF has let me know that the folks who complained about T339164 are still having trouble. Unfortunately they did not provide reproduction steps, so I am guessing what their problem is. I found a bug, can't tell if this is the one. I'm not planning to work on this further.)

Pages that were edited very recently may show the old revision content, but with the new revision ID in permalinks/wgRevisionId etc.

This only occurs for a few seconds after an edit, and possibly only when viewing the page while logged out, so to reproduce it you need a script that watches recent changes and scrapes the article HTML. Here's such a script I wrote:
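In outline, such a script amounts to the following (a minimal sketch rather than the original script; it assumes Python 3 with the requests library, and the regexes and polling interval are illustrative):

# Sketch: poll recent changes, fetch the rendered article HTML while logged out,
# and compare all the revision IDs the HTML claims to be showing.
import re
import time
import requests

API = "https://en.wikipedia.org/w/api.php"
session = requests.Session()

def recent_changes():
    # Latest edits with their page titles and new revision IDs.
    resp = session.get(API, params={
        "action": "query",
        "list": "recentchanges",
        "rctype": "edit",
        "rcprop": "title|ids",
        "rclimit": 10,
        "format": "json",
    })
    return resp.json()["query"]["recentchanges"]

def revision_ids_in_html(html):
    # Collect every revision ID the page claims to be showing:
    # mw.config values, permalink oldid=..., and the parser cache comment.
    ids = set()
    ids.update(re.findall(r'"wgRevisionId":\s*(\d+)', html))
    ids.update(re.findall(r'"wgCurRevisionId":\s*(\d+)', html))
    ids.update(re.findall(r'oldid=(\d+)', html))
    ids.update(re.findall(r'and revision id (\d+)', html))
    return ids

while True:
    for change in recent_changes():
        title = change["title"].replace(" ", "_")
        html = session.get("https://en.wikipedia.org/wiki/" + title).text
        ids = revision_ids_in_html(html)
        if len(ids) > 1:
            print("Multiple revision IDs found!", change["title"], ids)
            with open("dump-%d.html" % change["revid"], "w") as f:
                f.write(html)
    time.sleep(5)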

After running this for a few seconds or minutes, the script will print "Multiple revision IDs found!" and dump the HTML of the affected page to a file. Here's an example:

All of the "metadata" including oldid values in permalinks and mw.config variables like wgRevisionId agree that the revision of the page being shown (and the latest revision) is 1163264388, but the content shown (and the "Saved in parser cache…" comment) is actually of revision 1163061819.

Event Timeline

We can see <!-- parser cache is expired, sending anyway due to pool contention--> at the top of the mw-body-content element; this HTML comment comes from Article.php#852.
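For anyone scraping the HTML, that comment is also a way to detect the stale fallback. A minimal check (a sketch in Python, assuming the comment text stays exactly as quoted above):

# Assumption: the marker text matches the comment emitted by Article.php verbatim.
STALE_MARKER = "parser cache is expired, sending anyway due to pool contention"

def served_stale_due_to_pool_contention(html):
    # True if this page view was served from a stale parser cache entry
    # because of pool contention.
    return STALE_MARKER in html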

Can this ticket be moved to a higher priority? I've set it to High so that it gets noticed in backlog planning; I hope this is OK.

A large Enterprise client is asking for this fix because they rely on the Parsoid oldid and wgCurRevisionId being consistent with the requested revision ID in the URL. In particular, they are looking at the printfooter section of the page.

There was a previous attempt to fix the issue, but it didn't resolve the intermittent problem of inconsistent revision IDs.

Tagging @daniel, as he worked on refactoring ParserOutputAccess and ParserCache in recent months. A significant change from this work, which I still doubt makes sense or was intentional (as per T329842#8816557), is that basically all ParserOutput access now goes through the PoolWorkArticleViewCurrent class.

That class was designed for page views and page views only: it enforces PoolCounter restrictions and provides silent stale-revision fallbacks. These now apply even in contexts where they are not needed and likely not allowed within the caller's contract (e.g. the caller might not be shortening and self-correcting its caches the way we do for page views/CDN; this isn't part of the ParserOutputAccess contract).

Afaik caching for old revisions was also added fairly recently by @tstarling (maybe 2 years ago?), however I'm not sure that's implicated, because that caching takes place after the code has already decided which revision to show. The general stale-revision concept in PoolWorkArticleViewCurrent goes back ~10 years to the Michael Jackson effect.

The fact that logged out users (and other low-priority accesses) can possibly see "old" revisions is a deliberate decision for performance reasons. There is a flag to insist that a request go to the primary DB (not a secondary read replica) which we use in some very specific places to ensure that we don't run into this problem in editing workflows, but in general APIs are allowed to return stale revisions to protect against the Michael Jackson effect, as @Krinkle notes.

Parsoid also used to have some edge cases where a page would have been saved to the master DB but not yet be present in RESTBase, but the script provided here is hitting the action API, not one of the REST APIs -- which also probably means that @daniel's refactoring of ParsoidOutputAccess is not to blame.

What's happening is that the second query is a page view ("page views and page views only", as @Krinkle put it), and so it is deliberately allowed to fall back to a secondary DB and/or a slightly stale cache; again, because if we insisted that all page views hit the primary database for reads, the entire DB cluster would be moot. This is the system operating "as designed".

If our "large enterprise client" wants a special API for a page view which is guaranteed to hit the master DB and bypass the stale-revision fallbacks, I'm sure that could be provided -- but it's not something that we could expose in a public API without DoS'ing ourselves.

Similarly, nerfing the recent changes API to only report a change when *all* of the secondary DBs have had the revision propagated to them adds a very expensive step that loads *all* of the cluster DBs.

So I don't see an easy answer here, other than documenting that our APIs are intended to be "eventually consistent" and that not all servers in the cluster will necessarily have all the revisions at the exact same time. What we *can* guarantee is that a new revision which *any* server reports *will* eventually show up on all servers. So the correct solution here is just to back off a few seconds and retry -- or, if this is a periodic poll, ignore the new revision until the next polling time, at which point the new revision will almost certainly have propagated to all the secondary DBs.
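On the consumer side, that back-off-and-retry could look something like this sketch (assuming Python with the requests library; the wgRevisionId regex, attempt count, and delays are illustrative):

import re
import time
import requests

def fetch_html_for_revision(title, expected_revid, attempts=5, delay=5):
    # Retry the page view until its wgRevisionId has caught up with the
    # revision ID learned from the recent changes feed, or give up and let
    # the next polling cycle pick it up.
    url = "https://en.wikipedia.org/wiki/" + title.replace(" ", "_")
    for _ in range(attempts):
        html = requests.get(url).text
        m = re.search(r'"wgRevisionId":\s*(\d+)', html)
        if m and int(m.group(1)) >= expected_revid:
            return html
        time.sleep(delay)  # back off a few seconds and retry
        delay *= 2         # simple exponential back-off
    return None            # still stale; retry on the next poll

(The >= comparison allows for an even newer revision having been saved in the meantime.)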

(To put this in another frame, multiple independent requests to our APIs are not transactional; that is, there's no guarantee that independent requests always see exactly the same state of the world, especially if you're following a real-time event stream like recent changes. For edits, the read-modify-write nature of an edit does require transactional consistency, so there are various mechanisms used to ensure it in that specific case. But that's the exception rather than the rule -- and writes ("edits") are also the exception rather than the rule ("reads"), which is why we can afford to spend extra resources on that path.)

Thank you @cscott and @Krinkle.

The client asked if we could present a list of HTML sections that could be out of sync with the revision they requested in the URL. Their main worry is that more important sections of the HTML could be eventually consistent.

Is it possible to list which parts of the HTML are coming from the master vs secondary database instances?

@cscott I think you may have misunderstood the report. Or rather, I think the report describes two facts, and we're focussing on one but not the other.

The report describes first and foremost that in the seconds after an edit, we sometimes serve HTML responses in which we simultaneously claim to have served revision ID N+1 (which we will have queried from a cache or replica), whilst the content on that pageview is actually from revision ID N-1 (which we also queried from a cache or replica). This is to me a hard bug that we need to fix. It does not require serving anything newer than we do today; it only requires us to know what we serve and to send the correct metadata for it.

The report then also describes that this self-corrects after a few seconds or minutes.

I choose to believe that the Enterprise customer in question already has logic in place that checks wgRevisionId in the pageview HTML, and if they know about a newer revision for this article, they'll simply try again in a few seconds. The problem is that we sometimes respond with a pageview that has wgRevisionId set to a newer revision ID than the revision that the body content is actually for.

My interpretation is thus that we are being asked to make it easy to detect when the new content is available, and when it is not yet available. Today this is virtually impossible, because we mix two sources of data in a single response, which makes it very difficult to detect indeed. The customer only realized because they compared the HTML by hand against that of the stated revision ID within that same stale HTML response.

Generally, I think the above is indeed a bug. Generally, we design MediaWiki such that requests share a single replica snapshot for their data, and we also often design cache interactions to use cache keys that are verifiable/deterministic, to allow for natural top-down propagation once a request knows the data exists.

The exception is when PoolCounter reports an excessively high number of workers parsing the same article, in which case it will serve stale content, older even than the revision ID reported by the replica DB. I would expect that to also carry some kind of error message, HTTP header, or mw.config key to indicate that this has happened. Plus, I expect that to be rare enough that you can't trigger it by simply crawling recentchanges for 5 minutes. This too is imho a reasonable behaviour that I don't think the customer would mind, as long as they can detect it.

Suggested outcomes:

  • Generally speaking, content matches the wgRevisionId, even in the seconds after an edit. It's okay if both are for the (same) non-current revision.
  • In the rare cases where they differ due to PoolCounter stale protection kicking in, we probably can't reasonably reverse-engineer the old metadata, but we can and should indicate in some other way that the content is old.
  • PoolCounter stale protection should not be kicking in under normal workloads. If we see this even once every 5 min while crawling RC, there is probably something wrong? I.e. something being overprotective, or the worker count being overreported?
@cscott wrote:

[…] the script provided here is hitting the action API, not one of the REST APIs -- which also probably means that @daniel's refactoring of ParsoidOutputAccess is not to blame.

I was not referring to "Parsoid"OutputAccess, but to "ParserOutput"Access. This service used to check ParserCache, fall back to invoking the parser, and (only if coming from PoolWorkArticleViewCurrent on page views) save the result to ParserCache. Nowadays, ParserOutputAccess goes through PoolWorkArticleViewCurrent for everything.

That includes CLI scripts, Jobs, and Action API requests.

  • PoolCounter stale protection should not be kicking in under normal workloads. […] or the worker count being overreported?

"worker count being overreported" seems rather likely if all JobQueue and Action API demand is now counted against PoolCounter protection. And as stated in T329842#8816557, it means that various workloads that (unlike pageviews) have no means of dealing with stale content are now sometimes getting stale content with no way to detect or recover.

I'll add one observation from the customer:

In some scenarios, after they see a newer revision ID in the page footer, it can take over 1 hour for the newer page revision to be seen in the RecentChanges API. Sometimes it's faster, within a minute or two. If it's a replication problem, it's strange that it can take an hour for the updates to propagate to the replica databases.

Any reason why the RecentChanges feed can get delayed by up to an hour?

Are you sure the problem reported is a delay in an edit appearing in the RC feed? I thought the feed was fine, but that the problem is that the customer is experiencing a delay in finding the content for an edit. E.g. the page view request returns response HTML containing stale content from a previous edit, and we're waiting for the content to propagate.

If we're seeing the opposite, where the page view is already propagated, but the feed is lagging behind in announcing the edit, then that would be a very different kind of problem. That kind of problem could also involve mistakes on the consumer side, and lots of intermediary layers such as closed Enterprise code, Kafka, and more. It might not be a MediaWiki-related problem.

Change 946568 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@master] page: Set rev id in OutputPage from dirty ParserOutput on action=view

https://gerrit.wikimedia.org/r/946568

Change 946568 merged by jenkins-bot:

[mediawiki/core@master] page: Set rev id in OutputPage from dirty ParserOutput on action=view

https://gerrit.wikimedia.org/r/946568

matmarex claimed this task.

The patch is now live on English Wikipedia and it seems to have resolved the issue. I ran the script I shared for about 30 minutes today, and did not encounter any pages with this problem.

I found a few cases where the script gives false positives, so take these into account if you try to reproduce:

Hi @matmarex, @Krinkle and @cscott

We have a follow-up question from the customer: in terms of consistency with the data, is there a revision ID field (that is part of the HTML) that is guaranteed to be consistent with the HTML content, or is the only way to check that all revision IDs, along with the HTML content, are identical and therefore consistent with the data?

We explained that there is a scalability trade-off between performance and eventual consistency.

However, they want to clarify which revision ID, in HTML attributes, is guaranteed to be consistent with the article content revision.

Thanks for your help with this one.
-Ruairi

wgRevisionId should be consistent with the article content revision now, but we make no guarantees for HTML scraping.

Thanks!