Page MenuHomePhabricator

Cached outdated revisions served to logged-out users
Closed, ResolvedPublic

Event Timeline

Boshomi created this task.Jul 30 2016, 9:46 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 30 2016, 9:46 AM
Restricted Application added a project: Security. · View Herald TranscriptJul 30 2016, 11:20 AM
Aklapper renamed this task from Changes not visible for not logged in people, Cache is fare behinde to Changes not visible for non logged in people, Cache is far behind.Jul 30 2016, 4:17 PM
Aklapper triaged this task as High priority.
Aklapper added a project: Operations.
Aklapper updated the task description. (Show Details)
Aklapper renamed this task from Changes not visible for non logged in people, Cache is far behind to Cached outdated revisions served to logged-out users.Jul 30 2016, 4:49 PM
Aklapper added projects: Traffic, MediaWiki-Cache.
Aklapper changed the visibility from "Custom Policy" to "Public (No Login Required)".
Aklapper changed Security from Software security bug to None.
Aklapper added subscribers: Glaisher, BBlack.

Quoting Glaisher from T141693:

Multiple reports at enwiki and OTRS.

I checked some random pages on RC while logged out and it's sometimes showing revisions which are outdated by several revisions. For example: https://en.wikipedia.org/wiki/List_of_equestrian_statues has three edits so far today but the last modified at the footer shows 29 June 2016, at 18:20 currently. Manual purging seems to fix these. I don't think this is a frequent occurrence so this is probably worth further investigation as it may be related to a recent change.

(Wondering if belated syncing of ProofRead status on Wikisource reported in T141692 might be related to caching issues.)

Boshomi updated the task description. (Show Details)Jul 30 2016, 4:54 PM

(Wondering if belated syncing of ProofRead status on Wikisource reported in T141692 might be related to caching issues.)

It showed me page 211 was yellow, but after purging it turned to green. Since this was while logged-in, it is probably not related to this bug as logged-in users are not supposed to be served from this cache.

This comment was removed by Boshomi.
Boshomi added a subscriber: Gestrid.

in T141695 @Gestrid wrote:

I work in the English Wikipedia's Teahouse (a place for new users to ask questions and get answers), where we have recently received several questions about why pages they've (the new users) edited aren't showing their edits. The issue seems to be that logged-out users (or IP users, as some call them) aren't seeing the most recent version of a page unless they manually purge the page. I personally have not tried to reproduce the issue, as I think the questions we've received at the Teahouse (some as recently as an hour ago) suffice as evidence of the issue.

As I said, the only workaround seems to be to purge the page manually.

I'm also not sure if this issue is only with English Wikipedia or if it is more widespread than that.

Several reports have come in at the English Wikipedia's Teahouse regarding this issue. Perhaps some of their questions there can help narrow down the cause.

As my copied comment above says, the only workaround seems to be purging the page.

Boshomi updated the task description. (Show Details)Jul 30 2016, 7:34 PM
greg added a subscriber: greg.Jul 30 2016, 10:27 PM

There haven't been any related changes recently on the cache side of things (e.g. changes in relevant VCL or how purging works, etc). More likely it's a change on the other side of the purge queue (related to... jobqueue? eventbus?). Do we have an idea when the problem started up, to correlate against changes?

Gestrid added a comment.EditedJul 30 2016, 10:52 PM

There haven't been any related changes recently on the cache side of things (e.g. changes in relevant VCL or how purging works, etc). More likely it's a change on the other side of the purge queue (related to... jobqueue? eventbus?). Do we have an idea when the problem started up, to correlate against changes?

The first question in the Teahouse related to this problem came in yesterday (July 29th) at about 2:16 pm (UTC-4), according to the autosigned-signature. This means that whatever happened was before that time.

The question can be found here. (If it gets archived, search for "Revised Page Not Truly Saved" in the Teahouse archives.)

Also, I was talking to someone in IRC (specifically the wikipedia-en channel), and they seemed to remember something similar from a couple of years ago. Specifically, they said:

This happened a couple years ago too I think
If I remember right, there was something that was causing IP's to go to an archived copy of the pedia that was used as a backup in case of crashes

Perhaps it's a repeat of that issue? If anyone can find the relevant ticket number for that bug, that could be helpful.

I'm not sure it is related but on Commons are also some cache problems (since 2 days) reported (unrelated logged in or not).

I'm not sure it is related but on Commons are also some cache problems (since 2 days) reported (unrelated logged in or not).

Are you saying that the caching problem on Commons is happening whether you're logged in or not?
Also, specifically, when did you start noticing the problem?

I just briefly looked through the 1.28.0-wmf.12 change log, only thing obviously from this area of core functionality was https://gerrit.wikimedia.org/r/#/c/299859/1 but it seems correct...

Glaisher added a subscriber: aaron.Jul 31 2016, 8:09 AM

Probably related: I have twice (now and one or two days ago) seen version histories that lacked the most recent revision(s) when viewed as logged-out user, but were up to date when viewed while logged in. Example: https://archive.is/RLxjO . After purging the page's cache, the missing revisions showed up in the logged-out view, too.

Glaisher raised the priority of this task from High to Unbreak Now!.Jul 31 2016, 10:12 AM

This issue seems more widespread than initially thought and apparently it's not just enwiki so raising priority to UBN. A user also just now reported on IRC that they were getting shown a revision containing vandalism which has been removed since then. This bug is likely causing already-reverted vandalism to be shown more frequently to the public.

Restricted Application added subscribers: Luke081515, TerraCodes. · View Herald TranscriptJul 31 2016, 10:12 AM

I noticed this two days ago while visiting a user talk page on the Italian Wikipedia, where a message posted over 10 minutes before was not visible.

Nemo_bis updated the task description. (Show Details)
crh23 added a subscriber: crh23.Jul 31 2016, 12:21 PM

This issue seems more widespread than initially thought and apparently it's not just enwiki so raising priority to UBN. A user also just now reported on IRC that they were getting shown a revision containing vandalism which has been removed since then. This bug is likely causing already-reverted vandalism to be shown more frequently to the public.

I think this is a dubious argument if it's the basis of raising it to UBN. It's just as likely it's suppressing a vandalous update from public visibility as well.

I've dug through live PURGE logs to see if I can notice any odd patterns or malformed requests, but so far everything looks relatively-normal. I see the usual standard pattern where article purges come in sets of 4x (content page + ?action=history, desktop and mobile for each), and then are repeated ~12-ish seconds later to cover races. The overall rate incoming to varnish hasn't changed in any remarkable way recently, either, but I'm pretty sure there's one or more forms of ratelimiting at the jobqueue which has been hitting its cap for a long time now. Is it possible the purge jobqueue is backlogging and these edits will eventually be purged?

One pattern change (vs long term history) is there are larger bursts of REST purges happening more-frequently, but I don't know if that's new enough to be at issue here (I haven't looked at the raw log in quite some time). If that were the culprit, it would just be contributing to backlog or contending with traditional purges when it comes to packet loss in UDP network buffers, perhaps.

In general, we know current PURGE levels have been excessive to the point of dysfunction for months ( T133821 ), but things have been tending to mostly-work in spite of that, at least up until this report.

Also, can someone better explain what's going on in https://gerrit.wikimedia.org/r/#/c/299859/1/includes/page/WikiPage.php (same commit referenced earlier)? Is it possible this changed the relative timing of parser and varnish caches, such that varnish purge might pick up not-yet-invalidated parsercache output when it refreshes from its purge?

varnish purge might pick up not-yet-invalidated parsercache output when it refreshes from its purge?

T140558 is one example where the varnish cache was purged after an edit, and the underlying parser cache was also more recent of the edit, but it still had stale content; that was before the changeset you mention.

AlexMonk-WMF added a comment.EditedJul 31 2016, 4:06 PM

Also, can someone better explain what's going on in https://gerrit.wikimedia.org/r/#/c/299859/1/includes/page/WikiPage.php (same commit referenced earlier)? Is it possible this changed the relative timing of parser and varnish caches, such that varnish purge might pick up not-yet-invalidated parsercache output when it refreshes from its purge?

So, my understanding is: It used to be that WikiPage::doPurge would have Title::invalidateCache called upon the DB master connection transaction idling... But if you look at what that function does, it actually sets up code to be run on the same event (DB master connection transaction idling). So WikiPage::doPurge was changed to just call Title::invalidateCache directly (no extra TrxIdleCallback).
At a brief glance, the commit looks okay (don't assign much value to this though - I'm not confident enough in that area of the code to be properly reviewing it, it's not something I know the details of), it's just the only thing I could find on the most recently deployed wmf branch change log obviously in the area of cache purging. There are things I haven't checked in detail, for example there could be some strange issue with the state of the Title object, there could be some problem involving timing of when TrxIdleCallbacks get run that may have been bypassed before by waiting twice (?) before.
These are things someone else should review for, the problem may turn out to be something completely different and unrelated.

I think this is a dubious argument if it's the basis of raising it to UBN. It's just as likely it's suppressing a vandalous update from public visibility as well.

Edits that are suppressed later on are generally reverted quite fast so they are not very public from then onwards so I don't think that can be compared. Feel free to change the priority if you think it's not a UBN though.

Also, can someone better explain what's going on in https://gerrit.wikimedia.org/r/#/c/299859/1/includes/page/WikiPage.php (same commit referenced earlier)? Is it possible this changed the relative timing of parser and varnish caches, such that varnish purge might pick up not-yet-invalidated parsercache output when it refreshes from its purge?

I don't think that change be related as doPurge() is called when ?action=purge URLs are hit, not during an edit. However, there is a https://gerrit.wikimedia.org/r/#/c/299857/ which changes the commit order to be slightly different but I don't know whether that can cause this huge of a change?

I think https://gerrit.wikimedia.org/r/#/c/300338/3/includes/page/WikiPage.php is what happened here. If that is true, purges done after an edit is not executed. That means if we are seeing latest revision after an edit for logged-users, it is by another purge, not the regular purge after the edit.

Change 302117 had a related patch set uploaded (by Glaisher):
Make WikiPage::doEditUpdates and related functions be run PRESEND again

https://gerrit.wikimedia.org/r/302117

Change 302118 had a related patch set uploaded (by Aaron Schulz):
Make WikiPage::doEditUpdates and related functions be run PRESEND again

https://gerrit.wikimedia.org/r/302118

Change 302117 merged by jenkins-bot:
Make WikiPage::doEditUpdates and related functions be run PRESEND again

https://gerrit.wikimedia.org/r/302117

Change 302118 merged by jenkins-bot:
Make WikiPage::doEditUpdates and related functions be run PRESEND again

https://gerrit.wikimedia.org/r/302118

Change 302122 had a related patch set uploaded (by Aaron Schulz):
Use PRESEND for WikiPage::doCreate updates

https://gerrit.wikimedia.org/r/302122

Change 302123 had a related patch set uploaded (by Aaron Schulz):
Use PRESEND for WikiPage::doCreate updates

https://gerrit.wikimedia.org/r/302123

Change 302123 merged by jenkins-bot:
Use PRESEND for WikiPage::doCreate updates

https://gerrit.wikimedia.org/r/302123

aaron awarded a token.Jul 31 2016, 6:09 PM

I suspected missing PRESEND looking at this before, bug @Glaiser beat me to it.

Change 302122 merged by jenkins-bot:
Use PRESEND for WikiPage::doCreate updates

https://gerrit.wikimedia.org/r/302122

aaron closed this task as Resolved.Jul 31 2016, 6:51 PM
aaron claimed this task.

Closing unless new edits have this happen again.

Tgr added a subscriber: Tgr.Jul 31 2016, 6:59 PM

I'm not sure it is related but on Commons are also some cache problems (since 2 days) reported (unrelated logged in or not).

Are you saying that the caching problem on Commons is happening whether you're logged in or not?
Also, specifically, when did you start noticing the problem?

Not sure if that is the same issue but I heard claims of thumbnails not updating without manual purge when a new image version is uploaded. That started 2 days ago. IIRC thumbnails are cached the same way for logged-in and anonymous users (no going around varnish even if you are logged in) so it would make sense that login status does not matter.

Change 302204 had a related patch set uploaded (by Aaron Schulz):
Make recordUpload2() deferred updates PRESEND again

https://gerrit.wikimedia.org/r/302204

Change 302205 had a related patch set uploaded (by Aaron Schulz):
Make recordUpload2() deferred updates PRESEND again

https://gerrit.wikimedia.org/r/302205

Change 302205 merged by jenkins-bot:
Make recordUpload2() deferred updates PRESEND again

https://gerrit.wikimedia.org/r/302205

aaron added a comment.Aug 1 2016, 3:21 AM

Thumbail deferred updates were also post-send for the last few days by oversight, which was fixed/deployed in the above patches.

Change 302204 merged by jenkins-bot:
Make recordUpload2() deferred updates PRESEND again

https://gerrit.wikimedia.org/r/302204