Mobile REST API delivers year old+ content for very select pages
Open, Needs Triage, Public

Description

Mobile REST API call:

$ curl -s https://en.wiktionary.org/api/rest_v1/page/mobile-sections/rocker | jq . | grep revision
    "revision": "52542455",

which is a revision from April 2019.

Standard HTML REST API call:

$ curl -s https://en.wiktionary.org/api/rest_v1/page/html/rocker | grep revision
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="https://en.wiktionary.org/wiki/Special:Redirect/revision/61774146"><head prefix="mwr: https://en.wiktionary.org/wiki/Special:Redirect/"><meta property="mw:TimeUuid" content="f74ed480-6954-11eb-b395-7bbda64f2d61"/><meta charset="utf-8"/><meta property="mw:pageId" content="223755"/><meta property="mw:pageNamespace" content="0"/><link rel="dc:replaces" resource="mwr:revision/61659965"/><meta property="mw:revisionSHA1" content="26eab9753486198f17b6169d978f39bd3ed19506"/><meta property="dc:modified" content="2021-02-07T14:58:36.000Z"/><meta property="mw:html:version" content="2.2.0"/><link rel="dc:isVersionOf" href="//en.wiktionary.org/wiki/rocker"/><title>rocker</title><base href="//en.wiktionary.org/wiki/"/><link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=mediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Csite.styles&amp;only=styles&amp;skin=vector"/><meta http-equiv="content-language" content="en"/><meta http-equiv="vary" content="Accept"/></head><body id="mwAA" lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr"><section data-mw-section-id="0" id="mwAQ"><div class="disambig-see-also" about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"also","href":"./Template:also"},"params":{"1":{"wt":"Rocker"}},"i":0}}]}' id="mwAg"><i>See also:</i> <b class="Latn"><a rel="mw:WikiLink" href="./Rocker" title="Rocker">Rocker</a></b></div>

Which is the proper latest revision, from February 2021 (see dc:modified above).

Both calls should return the content corresponding to the same latest revision, here 61774146.

Bug first reported for MWoffliner at https://github.com/openzim/mwoffliner/issues/1397
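The comparison in the description can be scripted. This is only a rough sketch: the function names are made up for illustration, and it pulls the revision out of each response body with the same kind of text match the grep calls above use, rather than relying on the exact JSON layout.

```python
import re
from typing import Optional


def revision_from_mobile_sections(body: str) -> Optional[int]:
    """Return the first "revision" value found in a mobile-sections JSON body."""
    match = re.search(r'"revision"\s*:\s*"(\d+)"', body)
    return int(match.group(1)) if match else None


def revision_from_parsoid_html(body: str) -> Optional[int]:
    """Return the revision ID from the Special:Redirect/revision/<id> URL in the HTML."""
    match = re.search(r"Special:Redirect/revision/(\d+)", body)
    return int(match.group(1)) if match else None


def same_revision(mobile_body: str, html_body: str) -> bool:
    """True only if both bodies carry a revision ID and the IDs match."""
    mobile_rev = revision_from_mobile_sections(mobile_body)
    html_rev = revision_from_parsoid_html(html_body)
    return mobile_rev is not None and mobile_rev == html_rev


# Fragments copied from the two responses quoted above (HTML abbreviated).
mobile_sample = '    "revision": "52542455",'
html_sample = '<html about="https://en.wiktionary.org/wiki/Special:Redirect/revision/61774146">'

print(revision_from_mobile_sections(mobile_sample))  # 52542455
print(revision_from_parsoid_html(html_sample))       # 61774146
print(same_revision(mobile_sample, html_sample))     # False
```

Feeding it the full bodies returned by the two curl calls should give the same answer as the samples.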

Event Timeline

Getting the same thing on enwiki, try https://en.wikipedia.org/api/rest_v1/page/mobile-html/Wikipedia:Administrators'_noticeboard%2FIncidents

As of this writing, I get a revision from September 2020.

This prevented me (and I assume, other users) from commenting at that page using the official Wikipedia Android app.

Just (re-) surfacing this - I've likely duplicated this report at T292330.

Could I be a pain and ask if there's any movement on this issue?

TheresNoTime renamed this task from (Wiktionary) Mobile REST API does not (always) deliver HTML for latest revid to Mobile REST API does not (always) deliver HTML for latest revid. Oct 3 2021, 12:04 AM

I've spent a bit of time debugging this (initially from the Android app side, but this does just seem to be upstream of it) - directly calling the latest revision ID correctly returns the expected data:

https://en.wikipedia.org/api/rest_v1/page/mobile-html/Wikipedia:Administrators'_noticeboard%2FIncidents/1048089227
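For anyone wanting to repeat this for other pages, here is a small sketch of building both forms of the URL (with and without the revision ID). The helper name is hypothetical; the encoding simply mirrors the URL above, where spaces become underscores and the "/" in the title is sent as %2F.

```python
from typing import Optional
from urllib.parse import quote

REST_BASE = "https://en.wikipedia.org/api/rest_v1/page/mobile-html"


def mobile_html_url(title: str, revision: Optional[int] = None, base: str = REST_BASE) -> str:
    """Build a mobile-html URL, optionally pinned to an explicit revision ID."""
    # Spaces become underscores; "/" is percent-encoded so it is not read as a path separator.
    encoded = quote(title.replace(" ", "_"), safe=":'")
    url = f"{base}/{encoded}"
    if revision is not None:
        url += f"/{revision}"
    return url


title = "Wikipedia:Administrators' noticeboard/Incidents"
print(mobile_html_url(title))              # the "latest" form, which was returning stale content
print(mobile_html_url(title, 1048089227))  # the pinned form, which returned the expected data
```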

@Samtar Not that I'm aware of... and this is a pretty annoying one. We regularly have users (at Kiwix) complaining because of an old revision of an article because the REST API cache is not refreshed properly.

TheDJ added subscribers: Legoktm, TheDJ.

https://en.wikipedia.org/api/rest_v1/page/mobile-html/Wikipedia:Administrators'_noticeboard%2FIncidents

This is weird. The network-level cache for this is only 14 days, but this page seems to have been returning an old version for months. It's a highly trafficked page, so it gets active purges all the time. This really means that it's a logic error, or there is some cache layer which simply NEVER expires (permissions?), or it's writing to one cache key form but retrieving from another (encoding issue)???

It would be really helpful if someone from WMF could help find out which system is even failing here. restbase, pcs, varnish ?

@Legoktm sorry to bother, but can you perhaps help pinpoint so we can get this assigned ?

TheDJ renamed this task from Mobile REST API does not (always) deliver HTML for latest revid to Mobile REST API delivers year old+ content for very select pages. Nov 26 2021, 9:35 PM
taavi added a subscriber: taavi.

It would be really helpful if someone from WMF could help find out which system is even failing here. restbase, pcs, varnish ?

I don't work for the WMF, but it's not Varnish, which can be verified by adding a meaningless query parameter to the URL and checking the response headers. Since PCS does not use persistent cache storage (this can be verified from the egress rules in the deployment-charts repo) and has certainly been restarted in the last year (for example for the Helm 3 redeploy this week), I think Restbase is the most likely cause of this issue.
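A sketch of the query-parameter check, in case it helps others reproduce it. The idea is that a URL the edge cache has never seen cannot be served from there, so if the stale content still comes back, the edge is not where it is stored. The function names and the "nocache" parameter are made up; "age" and "cache-control" are standard HTTP headers, while "x-cache" and "x-cache-status" are my assumption about what the edge sends.

```python
import uuid
from typing import Dict, Mapping
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Header names worth looking at. The x-cache* names are assumptions, not confirmed.
CACHE_HEADERS = ("age", "cache-control", "x-cache", "x-cache-status")


def with_cache_buster(url: str, param: str = "nocache", value: str = "") -> str:
    """Append a meaningless query parameter so the URL is new to any URL-keyed cache."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, value or uuid.uuid4().hex))
    return urlunsplit(parts._replace(query=urlencode(query)))


def cache_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Keep only the cache-related headers, with lower-cased names."""
    return {k.lower(): v for k, v in headers.items() if k.lower() in CACHE_HEADERS}


url = "https://en.wikipedia.org/api/rest_v1/page/mobile-html/Wikipedia:Administrators'_noticeboard%2FIncidents"
print(with_cache_buster(url, value="12345"))
print(cache_headers({"Age": "0", "Server": "example", "Cache-Control": "max-age=1209600"}))
```

To run it against the live API, fetch the busted URL with any HTTP client and pass the response headers to cache_headers().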

Tagging PET per https://www.mediawiki.org/wiki/Developers/Maintainers.

I don't really understand Restbase's architecture, but if I'm reading the code correctly, this class is serving all mobile-html requests (per this). On line 29, it seems to always use a cached value when the revision is not set in the URL.

AIUI one of the Event* services is supposed to tell Restbase to invalidate the cache, and Restbase will invoke PCS to generate the new mobile-html and it'll store it. It's not really a cache, rather a persistent storage database.

So it could be:

  • Event* is not telling Restbase to fetch new mobile-html for this page
  • Restbase is failing to listen to Event* for this page or it's failing to get new mobile-html so it keeps the old one around
  • PCS is failing to generate new mobile-html - given that visiting the revid directly does give mobile-html, we can probably rule this out

So I agree with Majavah that Restbase is probably at fault or at least seems like a good place to start. Note that PCS/mobileapps is now maintained by the new https://www.mediawiki.org/wiki/Content_Transform_Team in case someone from there needs pinging.
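To make the "persistent storage, not a cache" point concrete, here is a toy model of the flow as I understand it. It is not Restbase's code: the class, method names and revision numbers are invented. The stored render for a title is only replaced when an update event arrives, while a request for an explicit revision is rendered on demand.

```python
from typing import Dict, Optional


class RenderStore:
    """Toy model: one stored render per title, replaced only by update events."""

    def __init__(self) -> None:
        self.latest: Dict[str, int] = {}  # title -> revision of the stored render

    def render(self, title: str, revision: int) -> str:
        # Stand-in for asking PCS to generate mobile-html for one revision.
        return f"mobile-html for {title} @ {revision}"

    def on_update_event(self, title: str, revision: int) -> None:
        # Called when an edit event reaches the store; nothing else refreshes it.
        self.latest[title] = revision

    def get(self, title: str, revision: Optional[int] = None) -> str:
        if revision is not None:
            return self.render(title, revision)       # explicit revid: rendered on demand
        return self.render(title, self.latest[title])  # no revid: whatever was stored last


store = RenderStore()
store.on_update_event("ANI", 100)  # the event for revision 100 arrives
# Revisions 101 and 102 are saved on the wiki, but their events never reach the store.
print(store.get("ANI"))            # mobile-html for ANI @ 100  (stale, and stays stale)
print(store.get("ANI", 102))       # mobile-html for ANI @ 102  (explicit revid works)
```

This reproduces both observations in this task: the plain URL stays on an old revision indefinitely, and the URL with the revision ID is fine, which points away from PCS.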

AIUI one of the Event* services is supposed to tell Restbase to invalidate the cache, and Restbase will invoke PCS to generate the new mobile-html and it'll store it. It's not really a cache, rather a persistent storage database.

Good point, I totally forgot about that! Although I think the component responsible for that is Changeprop and not EventSomething. The Changeprop config in the charts repo has a blacklist key that contains AN/I (but not rocker @ enwiktionary).

Where does changeprop log these days? Logstash front page links to https://logstash.wikimedia.org/app/dashboards#/view/change-prop (which is empty) and the generic k8s service dashboard was mostly filled with logs from the monitoring sidecars.

The corresponding class for mobile-sections seems to be this one ?

On line 29, it seems to always use a cached value when the revision is not set in the URL.

Right, so if an active purge doesn't reach Restbase, you will forever get an old version... the blacklist idea is interesting but...

https://en.wikipedia.org/api/rest_v1/page/mobile-html/Wikipedia:WikiProject_Deletion_sorting%2FUnited_States_of_America is also on that list and is only outdated for about 10 days (still too much, but whatever)

Then again, the humanities reference desk is also on the blacklist and also outdated by a year.
https://en.wikipedia.org/api/rest_v1/page/mobile-html/Wikipedia:Reference_desk%2FHumanities
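A rough way to put a number on "outdated by N days" for pages like these. It assumes the returned HTML carries the same dc:modified meta tag as the Parsoid HTML in the task description, which I have not checked for mobile-html. The function names and the "now" value are only for illustration.

```python
import re
from datetime import datetime, timezone
from typing import Optional


def modified_time(html: str) -> Optional[datetime]:
    """Parse the dc:modified meta tag, e.g. content="2021-02-07T14:58:36.000Z"."""
    match = re.search(r'property="dc:modified"\s+content="([^"]+)"', html)
    if not match:
        return None
    parsed = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def days_stale(html: str, now: datetime) -> Optional[int]:
    """Whole days between the revision timestamp in the HTML and `now`."""
    modified = modified_time(html)
    return None if modified is None else (now - modified).days


# Tag copied from the task description; `now` is an arbitrary example date.
sample = '<meta property="dc:modified" content="2021-02-07T14:58:36.000Z"/>'
now = datetime(2021, 2, 17, 14, 58, 36, tzinfo=timezone.utc)
print(days_stale(sample, now))  # 10
```

Note this measures the age of the returned revision, so a page that simply has not been edited recently will also show a large number.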

I see we don't even know what the repository is for ChangeProp... T274558: Where is the ChangeProp codebase? mediawiki/services/change-propagation in Gerrit or wikimedia/change-propagation on Github? That's hopeful...

I tried looking for why we have a ChangeProp blacklist (I'm assuming repeated failures or something) but it isn't documented. It was merged by @Pchelolo and written by @mobrovac, maybe they remember.

I tried looking for why we have a ChangeProp blacklist (I'm assuming repeated failures or something) but it isn't documented. It was merged by @Pchelolo and written by @mobrovac, maybe they remember.

See T120971: Blacklist automatic updates for especially expensive pages and T94121: Understand and solve wide row issues for frequently edited and re-rendered pages. I assume at some point it was copied from Restbase to ChangeProp.

My understanding from the above is that this blacklist (which is the reason why ANI etc. shows old revisions) was implemented as a "stop-gap measure" in December 2015 per

As a stop-gap, we have now blacklisted the most problematic pages from job queue updates. More thorough work to support large pages more efficiently will be happening in T120171.

but the "thorough work to support [these pages]" in T120171 never took place.

If I may be permitted a few questions:

  1. Is my above understanding correct?
  2. If so, would it be possible to invoke a "one-time refresh" of these pages to confirm this resolves the symptom in the current production Restbase API, and thus the current production Wikipedia mobile app?
  3. And if this does indeed resolve the symptom, would placing these expensive pages on a much more restricted update schedule be a viable solution?
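To illustrate what I mean by a "restricted update schedule" in question 3: a minimal sketch of a per-page throttle that allows at most one re-render per interval. This is purely hypothetical, nothing like it exists in ChangeProp as far as I know, and the class name and interval are made up.

```python
from typing import Dict


class RerenderThrottle:
    """Allow at most one re-render per title every `min_interval_seconds`."""

    def __init__(self, min_interval_seconds: float) -> None:
        self.min_interval = min_interval_seconds
        self.last_render: Dict[str, float] = {}  # title -> time of last allowed re-render

    def should_rerender(self, title: str, now: float) -> bool:
        last = self.last_render.get(title)
        if last is not None and now - last < self.min_interval:
            return False  # too soon since the last re-render; skip this edit event
        self.last_render[title] = now
        return True


throttle = RerenderThrottle(min_interval_seconds=3600)
print(throttle.should_rerender("ANI", now=0))     # True
print(throttle.should_rerender("ANI", now=600))   # False
print(throttle.should_rerender("ANI", now=3600))  # True
```

A real version would also need a delayed "catch-up" render, otherwise the last edit inside an interval is never picked up.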

As an aside, many thanks to all involved here - I know this is likely low on y'alls list and I appreciate your attention to it ✨💖

User reported the following:

Please fix the issue of the latest revision of Administrators' Noticeboard/Incidents not being visible on mobile app. We're stuck with seeing an old revision of the page from September 2020. We use AN/I to address problematic behaviour by users (I trust you already know how important the page is); if those users can't access AN/I, how can their behaviour be discussed? I understand the issue is being discussed on Phabricator at https://phabricator.wikimedia.org/T274359 but given the great importance of AN/I, I request you to do whatever possible to get the latest revision of the page online, even if it means making the page open with the desktop skin (Minerva?) or something. Once the phab ticket is resolved the page can be shown with the regular app skin again.

User reported the following:

Please fix the issue of Administrators Noticeboard/Incidents not showing up on the mobile app.

Pages were added to the list for multiple reasons:

  1. They were too big and caused persistent timeouts in Parsoid - this might have been mitigated by now with Parsoid progress over the years.
  2. They included too many templates or were changed too frequently, creating extremely wide rows in Cassandra and causing Cassandra to OOM. This problem has for sure been mitigated by changing the storage schema.

So, I'd propose removing the blocklist and seeing what happens. To remove the list, one would drop this. If things go south, there's another layer of circuit breakers there which stop re-rendering pages if re-rendering always errors out.

However, this is an experiment and I'm leaving for vacation very soon, so it's too risky to do it right now. Can do it when I get back.
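For readers unfamiliar with the term, the circuit-breaker behaviour mentioned above can be sketched like this. It is a generic illustration of the pattern, not the actual ChangeProp implementation; the class name and the threshold of 3 are invented.

```python
from typing import Callable, Dict, List


class RerenderBreaker:
    """Stop re-rendering a title after `max_failures` consecutive errors."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.failures: Dict[str, int] = {}  # title -> consecutive failure count

    def is_open(self, title: str) -> bool:
        return self.failures.get(title, 0) >= self.max_failures

    def attempt(self, title: str, render: Callable[[str], str]) -> bool:
        """Try to re-render; return True on success, False if skipped or failed."""
        if self.is_open(title):
            return False  # breaker is open: do not even call the renderer
        try:
            render(title)
        except Exception:
            self.failures[title] = self.failures.get(title, 0) + 1
            return False
        self.failures[title] = 0  # a success resets the count
        return True


calls: List[str] = []


def always_times_out(title: str) -> str:
    # Stand-in for a page that is too expensive to render.
    calls.append(title)
    raise TimeoutError(title)


breaker = RerenderBreaker(max_failures=3)
results = [breaker.attempt("Huge page", always_times_out) for _ in range(5)]
print(results)                        # [False, False, False, False, False]
print(len(calls))                     # 3, the last two attempts were skipped
print(breaker.is_open("Huge page"))   # True
```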

I am a big fan of the "unplug it and see who screams" method, but I'm glad there's "another layer of circuit breakers" 😅

There's no rush, and this can definitely wait until you're back from vacation, but I will ask if there is anyone else who could oversee this? This task has picked up momentum and it would be awesome to keep things moving if possible!

From a very outsider/ignorant point of view, and please correct me if I am wrong, the "experiment" would go:

  1. Remove this key (+ sub-keys etc.) from _config.yaml and push live
  2. See if, for example, Wikipedia:Administrators' noticeboard/Incidents returns the latest revision when queried & confirm in the Wikipedia mobile app
  3. If anything starts to break, and we start relying on the other "layer of circuit breakers", revert the change in step 1 and wait for @Pchelolo to return. If everything looks okay, instead ask @Pchelolo's manager to extend their vacation time for being awesome ✨
In T274359#7538263, @Samtar wrote:

There's no rush, and this can definitely wait until you're back from vacation, but I will ask if there is anyone else who could oversee this? This task has picked up momentum and it would be awesome to keep things moving if possible!

This is a good example of our bad bus factor in this component. I suspect no one else is going to volunteer to do so, especially given the timing around the end of the year holidays, fundraiser and upcoming deployment freeze.

Hey @Pchelolo—are you back from your vacation? 🙂

Change 767878 had a related patch set uploaded (by Samtar; author: Samtar):

[operations/deployment-charts@master] changeprop: Remove RESTBase page blacklist

https://gerrit.wikimedia.org/r/767878

I've gone ahead and submitted a patch (767878) which does as described in T274359#7537856 — am I correct that https://grafana.wikimedia.org/goto/mwPxTFYnk?orgId=1 (namely "RESTBase html revision request storage latencies") would show any regression of behaviour and indicate that we should revert?

I've requested review from @Pchelolo, and will note this is entirely untested from my end.

@Aklapper I thought you were in charge of triage. What would be the proper thing to do?

@Kelson: Seeing previous activity here and T274359#7399247 it seems to have been triaged. Not all teams utilize the Priority setting though (if that's the point)

@Aklapper I wondered whether having no prioritisation and keeping "Open, Needs Triage" while a team or someone is working on a solution might at some point mislead people or impair progress on the ticket.

I ran into this again while trying to scrape/process Parsoid HTML for Commons:Featured_picture_candidates/ pages... :'(

I think this might be related to T305407.

I don't think these are related — this behaviour is being caused by a specific blacklist of pages. Good thought though! :)

From -releng (unrelated messages removed):

<TheresNoTime> legoktm: ref T274359, h/nowlan was going to get it deployed a little while back, but there's a few concerns about what it could break (:
<stashbot> T274359: Mobile REST API delivers year old+ content for very select pages - https://phabricator.wikimedia.org/T274359
<legoktm> I can imagine
<TheresNoTime> If my name wasn't on the patch I'd be all for the "just go for it and see who screams" approach!
<legoktm> TheresNoTime: maybe it would be less daunting if the blacklist were removed gradually instead of all in one go?
<TheresNoTime> I suppose we could just remove ANI and test that?
<legoktm> like, I doubt "Talk:United_States_presidential_election,_2016" is still as big a problem as it used to be. And I do know that "Cyberbot is creating 90% of null edits" was fixed a while back
<TheresNoTime> That's... yeah that's much smarter than me removing it all in one go....

API output for enwiki's ANI seems to be now semi-updated; showing contents from May 17 on May 22, a 5-day delay. Still something that needs to be addressed, but a large improvement.

Change 797354 had a related patch set uploaded (by Samtar; author: Samtar):

[operations/deployment-charts@master] changeprop: Remove WP:ANI from page blacklist

https://gerrit.wikimedia.org/r/797354

API output for enwiki's ANI seems to be now semi-updated; showing contents from May 17 on May 22, a 5-day delay. Still something that needs to be addressed, but a large improvement.

!?

@hnowlan just a polite nudge on this new patch — do you think there's any chance this would get deployed a little easier?

Change 797354 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: Remove WP:ANI from page blacklist

https://gerrit.wikimedia.org/r/797354

rDEPLOYCHARTS9f315cad923d: changeprop: Remove WP:ANI from page blacklist is deployed (thank you @hnowlan) — the mobile-html for Wikipedia:Administrators'_noticeboard/Incidents is now returning the latest revision's content 😄

Having just reinstalled it I can confirm ANI now outputs the current revision on the Android app

In the just-merged T309506 I mention https://de.wikipedia.org/wiki/Wikipedia:Café, which is also affected.

The strange thing here is that at the end of the page it says (in German) that the page was edited 704 days ago, which is June 25th, 2020. But the content has signatures dated Sept. 11th, 2020, which is 626 days ago. So there is not just one cache involved, but (at least) two.
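The day counts can be double-checked with a few lines. Only the two dates and the two offsets come from the comment above; the "today" they imply is derived, not stated.

```python
from datetime import date, timedelta

footer_date = date(2020, 6, 25)     # page footer: "edited 704 days ago"
signature_date = date(2020, 9, 11)  # newest signatures: "626 days ago"

# Both offsets should land on the same "today" if the figures are consistent.
print(footer_date + timedelta(days=704))     # 2022-05-30
print(signature_date + timedelta(days=626))  # 2022-05-30

# Gap between the two stale snapshots.
print((signature_date - footer_date).days)   # 78
```

Both figures agree with each other, so the footer and the content really do come from snapshots taken 78 days apart.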

I was wondering if we could continue to remove a large subset of the pages? I feel that the user-space and miscellaneous pages can stay for testing reasons, but the rest, like the dewiki cafe, can probably be removed.

As far as I can tell, the ticket is not fixed. Here is an example:

As far as I can check the ticket is not fixed. [...]

Agreed entirely 🙂 my above mention of a deploy (rDEPLOYCHARTS9f315cad923d: changeprop: Remove WP:ANI from page blacklist) was only a test to confirm our current understanding of the (possible) cause

[...] Here an example:

This however is a separate (but very similar) issue I think (and is better covered in T226931: Parsoid cache invalidation for mobile-sections seems not reliable), as no wikivoyage.org pages are present in the denylist

Change 803877 had a related patch set uploaded (by Samtar; author: Samtar):

[operations/deployment-charts@master] changeprop: Modify page denylist

https://gerrit.wikimedia.org/r/803877

My change above removes the majority of pages (all bar those in en.wikipedia's user space) and adds a page to test.wikipedia to help with further debugging

Change 767878 abandoned by Samtar:

[operations/deployment-charts@master] changeprop: Remove RESTBase page blacklist

Reason:

See T274359, implementing this incrementally instead

https://gerrit.wikimedia.org/r/767878

Change 803877 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: Modify page denylist

https://gerrit.wikimedia.org/r/803877