Mobile REST API delivers year old+ content for very select pages
Closed, DeclinedPublic

Description

Mobile REST API call:

$ curl -s https://en.wiktionary.org/api/rest_v1/page/mobile-sections/rocker | jq . | grep revision
    "revision": "52542455",

which is a revision of April 2019

Standard HTML REST API call:

$ curl -s https://en.wiktionary.org/api/rest_v1/page/html/rocker | grep revision
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="https://en.wiktionary.org/wiki/Special:Redirect/revision/61774146"><head prefix="mwr: https://en.wiktionary.org/wiki/Special:Redirect/"><meta property="mw:TimeUuid" content="f74ed480-6954-11eb-b395-7bbda64f2d61"/><meta charset="utf-8"/><meta property="mw:pageId" content="223755"/><meta property="mw:pageNamespace" content="0"/><link rel="dc:replaces" resource="mwr:revision/61659965"/><meta property="mw:revisionSHA1" content="26eab9753486198f17b6169d978f39bd3ed19506"/><meta property="dc:modified" content="2021-02-07T14:58:36.000Z"/><meta property="mw:html:version" content="2.2.0"/><link rel="dc:isVersionOf" href="//en.wiktionary.org/wiki/rocker"/><title>rocker</title><base href="//en.wiktionary.org/wiki/"/><link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=mediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Csite.styles&amp;only=styles&amp;skin=vector"/><meta http-equiv="content-language" content="en"/><meta http-equiv="vary" content="Accept"/></head><body id="mwAA" lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr"><section data-mw-section-id="0" id="mwAQ"><div class="disambig-see-also" about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"also","href":"./Template:also"},"params":{"1":{"wt":"Rocker"}},"i":0}}]}' id="mwAg"><i>See also:</i> <b class="Latn"><a rel="mw:WikiLink" href="./Rocker" title="Rocker">Rocker</a></b></div>

Which is the proper latest revision, from February 2021 (see dc:modified in the output above).

Both calls should return the content corresponding to the same latest revision, here 61774146.

Bug first reported for MWoffliner at https://github.com/openzim/mwoffliner/issues/1397
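The comparison above can also be scripted. The following is a minimal sketch, not the way anyone in this task actually checked it: it parses the two response bodies and reports whether the mobile one lags behind. The function names are made up for illustration, and the sample JSON nests "revision" under a "lead" key as an assumption (the grep output only shows that a "revision" field exists somewhere in the payload), so the lookup walks the whole document.

```python
import json
import re

# The Parsoid HTML carries the revision in the <html about="..."> attribute.
REVISION_IN_ABOUT = re.compile(r"Special:Redirect/revision/(\d+)")


def revision_from_mobile_sections(body: str) -> int:
    """Return the first "revision" value found anywhere in the JSON body."""

    def walk(node):
        if isinstance(node, dict):
            if "revision" in node:
                return int(node["revision"])
            children = node.values()
        elif isinstance(node, list):
            children = node
        else:
            return None
        for child in children:
            found = walk(child)
            if found is not None:
                return found
        return None

    found = walk(json.loads(body))
    if found is None:
        raise ValueError("no revision field in mobile-sections response")
    return found


def revision_from_parsoid_html(body: str) -> int:
    """Return the revision ID from the first Special:Redirect/revision/ URL."""
    match = REVISION_IN_ABOUT.search(body)
    if match is None:
        raise ValueError("no revision URL in HTML response")
    return int(match.group(1))


def is_stale(mobile_body: str, html_body: str) -> bool:
    """True when the mobile response is older than the HTML response."""
    return revision_from_mobile_sections(mobile_body) < revision_from_parsoid_html(html_body)


# Trimmed-down samples based on the two responses quoted in this task.
# The "lead" wrapper is an assumption about the payload layout.
MOBILE_SAMPLE = '{"lead": {"revision": "52542455"}}'
HTML_SAMPLE = (
    '<html about="https://en.wiktionary.org/wiki/Special:Redirect/revision/61774146">'
    '<head><link rel="dc:replaces" resource="mwr:revision/61659965"/></head></html>'
)

if __name__ == "__main__":
    print(is_stale(MOBILE_SAMPLE, HTML_SAMPLE))
```

Revision IDs grow over time on a given wiki, so comparing them numerically is enough to tell which response is older.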

Related Objects

Status: Resolved, Assigned: hnowlan
Status: Declined, Assigned: None

Event Timeline

I tried looking for why we have a ChangeProp blacklist (I'm assuming repeated failures or something) but it isn't documented. It was merged by @Pchelolo and written by @mobrovac, maybe they remember.

See T120971: Blacklist automatic updates for especially expensive pages and T94121: Understand and solve wide row issues for frequently edited and re-rendered pages. I assume at some point it was copied from Restbase to ChangeProp.

My understanding from the above is that this blacklist (which is the reason why ANI etc. shows old revisions) was implemented as a "stop-gap measure" in December 2015 per

As a stop-gap, we have now blacklisted the most problematic pages from job queue updates. More thorough work to support large pages more efficiently will be happening in T120171.

but the "thorough work to support [these pages]" in T120171 never took place.

If I may be permitted a few questions:

  1. Is my above understanding correct?
  2. If so, would it be possible to invoke a "one-time refresh" of these pages to confirm this resolves the symptom in the current production Restbase API, and thus the current production Wikipedia mobile app?
  3. And if this does indeed resolve the symptom, would placing these expensive pages on a much more restricted update schedule be a viable solution?

As an aside, many thanks to all involved here - I know this is likely low on y'alls list and I appreciate your attention to it ✨💖

User reported the following:

Please fix the issue of the latest revision of Administrators' Noticeboard/Incidents not being visible on mobile app. We're stuck with seeing an old revision of the page from September 2020. We use AN/I to address problematic behaviour by users (I trust you already know how important the page is); if those users can't access AN/I, how can their behaviour be discussed? I understand the issue is being discussed on Phabricator at https://phabricator.wikimedia.org/T274359 but given the great importance of AN/I, I request you to do whatever possible to get the latest revision of the page online, even if it means making the page open with the desktop skin (Minerva?) or something. Once the phab ticket is resolved the page can be shown with the regular app skin again.

User reported the following:

Please fix the issue of Administrators Noticeboard/Incidents not showing up on the mobile app.

Pages were added to the list for multiple reasons:

  1. They were too big and caused persistent timeouts in Parsoid - this might have been mitigated by now with Parsoid progress over the years.
  2. They included too many templates or were changed too frequently, creating extremely wide rows in Cassandra and causing Cassandra to OOM. This problem has for sure been mitigated by changing the storage schema.

So, I'd propose removing the blocklist and seeing what happens. To remove the list, one would drop this. If things go south, there's another layer of circuit breakers there which stops re-rendering pages if re-rendering always errors out.

However, this is an experiment and I'm leaving for vacation very soon, so it's too risky to do it right now. Can do it when I get back.

I am a big fan of the "unplug it and see who screams" method, but I'm glad there's "another layer of circuit breakers" 😅

There's no rush, and this can definitely wait until you're back from vacation, but I will ask if there is anyone else who could oversee this? This task has picked up momentum and it would be awesome to keep things moving if possible!

From a very outsider/ignorant point of view, and please correct me if I am wrong, the "experiment" would go:

  1. Remove this key (+ sub-keys etc.) from _config.yaml and push live
  2. See if, for example, Wikipedia:Administrators' noticeboard/Incidents returns the latest revision when queried & confirm in the Wikipedia mobile app
  3. If anything starts to break, and we start relying on the other "layer of circuit breakers", revert the change in step 1 and wait for @Pchelolo to return. If everything looks okay, instead ask @Pchelolo's manager to extend their vacation time for being awesome ✨

In T274359#7538263, @Samtar wrote:

There's no rush, and this can definitely wait until you're back from vacation, but I will ask if there is anyone else who could oversee this? This task has picked up momentum and it would be awesome to keep things moving if possible!

This is a good example of our bad bus factor in this component. I suspect no one else is going to volunteer to do so, especially given the timing around the end of the year holidays, fundraiser and upcoming deployment freeze.

Hey @Pchelolo—are you back from your vacation? 🙂

Change 767878 had a related patch set uploaded (by Samtar; author: Samtar):

[operations/deployment-charts@master] changeprop: Remove RESTBase page blacklist

https://gerrit.wikimedia.org/r/767878

I've gone ahead and submitted a patch (767878) which does as described in T274359#7537856 — am I correct that https://grafana.wikimedia.org/goto/mwPxTFYnk?orgId=1 (namely "RESTBase html revision request storage latencies") would show any regression of behaviour and indicate that we should revert?

I've requested review from @Pchelolo, and will note this is entirely untested from my end.

@Aklapper I thought you are in charge of the triage. What would be the proper thing to do?

@Kelson: Seeing previous activity here and T274359#7399247 it seems to have been triaged. Not all teams utilize the Priority setting though (if that's the point)

@Aklapper I wondered whether the lack of prioritisation, and keeping the task at "Open, needs triage" even though a team/someone is working on a solution, might at some point mislead people or impair the progress of the ticket.

I ran into this again while trying to scrape/process Parsoid HTML for Commons:Featured_picture_candidates/ pages... :'(

I think this might be related to T305407.

I don't think these are related — this behaviour is being caused by a specific blacklist of pages. Good thought though! :)

From -releng (unrelated messages removed):

<TheresNoTime> legoktm: ref T274359, h/nowlan was going to get it deployed a little while back, but there's a few concerns about what it could break (:
<stashbot> T274359: Mobile REST API delivers year old+ content for very select pages - https://phabricator.wikimedia.org/T274359
<legoktm> I can imagine
<TheresNoTime> If my name wasn't on the patch I'd be all for the "just go for it and see who screams" approach!
<legoktm> TheresNoTime: maybe it would be less daunting if the blacklist were removed gradually instead of all in one go?
<TheresNoTime> I suppose we could just remove ANI and test that?
<legoktm> like, I doubt "Talk:United_States_presidential_election,_2016" is still as big a problem as it used to be. And I do know that "Cyberbot is creating 90% of null edits" was fixed a while back
<TheresNoTime> That's... yeah that's much smarter than me removing it all in one go....

API output for enwiki's ANI seems to be now semi-updated; showing contents from May 17 on May 22, a 5-day delay. Still something that needs to be addressed, but a large improvement.

Change 797354 had a related patch set uploaded (by Samtar; author: Samtar):

[operations/deployment-charts@master] changeprop: Remove WP:ANI from page blacklist

https://gerrit.wikimedia.org/r/797354

In T274359#7948614, @EpicPupper wrote:

API output for enwiki's ANI seems to be now semi-updated; showing contents from May 17 on May 22, a 5-day delay. Still something that needs to be addressed, but a large improvement.

!?

@hnowlan just a polite nudge on this new patch — do you think there's any chance this would get deployed a little easier?

Change 797354 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: Remove WP:ANI from page blacklist

https://gerrit.wikimedia.org/r/797354

{9f315ca} is deployed (thank you @hnowlan) — the mobile-html for Wikipedia:Administrators'_noticeboard/Incidents is now returning the latest revision's content 😄

Having just reinstalled it I can confirm ANI now outputs the current revision on the Android app

In just merged T309506 I mention https://de.wikipedia.org/wiki/Wikipedia:Café which is also affected.

The strange thing here is that at the end of the page it says (in German) that the page was last edited 704 days ago, which is June 25th, 2020. But the content has signatures dated Sept. 11th, 2020, which is 626 days ago. So not just one cache is involved, but (at least) two.
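The two offsets in the comment above can be cross-checked with a few lines of date arithmetic. This is only a sketch: the observation date is not stated in the comment, so it is derived here from the two (date, days-ago) pairs, and the variable and function names are made up for illustration.

```python
from datetime import date, timedelta

# Values quoted in the comment above.
FOOTER_LAST_EDIT = date(2020, 6, 25)   # "last edited" date shown in the page footer
FOOTER_DAYS_AGO = 704
NEWEST_SIGNATURE = date(2020, 9, 11)   # newest signature visible in the content
SIGNATURE_DAYS_AGO = 626


def observed_on(event: date, days_ago: int) -> date:
    """Day on which `event` was exactly `days_ago` days in the past."""
    return event + timedelta(days=days_ago)


footer_reference = observed_on(FOOTER_LAST_EDIT, FOOTER_DAYS_AGO)
signature_reference = observed_on(NEWEST_SIGNATURE, SIGNATURE_DAYS_AGO)

# Distance between the two cached states, in days.
cache_gap_days = (NEWEST_SIGNATURE - FOOTER_LAST_EDIT).days

if __name__ == "__main__":
    print(footer_reference, signature_reference, cache_gap_days)
```

Both pairs point at the same observation day, so the two figures are consistent with each other, and the footer metadata is 78 days older than the content it is attached to. That supports the reading that two separately cached pieces are being combined.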

I was wondering if we could continue to remove a large subset of the pages? I feel that the user-space and miscellaneous pages can stay for testing reasons, but the rest, like the dewiki cafe, can probably be removed.

As far as I can check the ticket is not fixed. Here an example:

As far as I can check the ticket is not fixed. [...]

Agreed entirely 🙂 my above mention of a deploy (rDEPLOYCHARTS9f315cad923d: changeprop: Remove WP:ANI from page blacklist) was only a test to confirm our current understanding of the (possible) cause

[...] Here an example:

This however is a separate (but very similar) issue I think (and is better covered in T226931: Parsoid cache invalidation for mobile-sections seems not reliable), as no wikivoyage.org pages are present in the denylist

Change 803877 had a related patch set uploaded (by Samtar; author: Samtar):

[operations/deployment-charts@master] changeprop: Modify page denylist

https://gerrit.wikimedia.org/r/803877

My change above removes the majority of pages (all bar those in en.wikipedia's user space) and adds a page to test.wikipedia to help with further debugging
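To make the shape of that change concrete, here is a small sketch of the filtering it describes. It is not the actual patch: the real list lives in the changeprop configuration in operations/deployment-charts, and the dictionary layout (domain mapped to a list of page titles), the function name, and the user-space and test page titles below are all assumptions made for illustration.

```python
def trim_denylist(denylist, keep_domain, keep_prefix, test_domain, test_page):
    """Keep only entries on keep_domain whose title starts with keep_prefix,
    then add a single test page on test_domain."""
    trimmed = {}
    for domain, titles in denylist.items():
        if domain != keep_domain:
            continue  # every other wiki is dropped from the list entirely
        kept = [title for title in titles if title.startswith(keep_prefix)]
        if kept:
            trimmed[domain] = kept
    # Append the debugging page without disturbing any entries already kept.
    trimmed.setdefault(test_domain, []).append(test_page)
    return trimmed


# Hypothetical sample data; only the first and third titles are named in this task.
SAMPLE = {
    "en.wikipedia.org": [
        "Talk:United_States_presidential_election,_2016",
        "User:ExampleBot/Log",  # made-up user-space page
    ],
    "de.wikipedia.org": ["Wikipedia:Café"],
}

RESULT = trim_denylist(
    SAMPLE,
    keep_domain="en.wikipedia.org",
    keep_prefix="User:",
    test_domain="test.wikipedia.org",
    test_page="Denylist_test_page",  # made-up test page title
)

if __name__ == "__main__":
    print(RESULT)
```

Shrinking the list in steps like this keeps each deploy small enough to revert if re-rendering the newly released pages turns out to be expensive.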

Change 767878 abandoned by Samtar:

[operations/deployment-charts@master] changeprop: Remove RESTBase page blacklist

Reason:

See T274359, implementing this incrementally instead

https://gerrit.wikimedia.org/r/767878

Change 803877 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: Modify page denylist

https://gerrit.wikimedia.org/r/803877

I found this task while tracking down a problem caused by the other copy of this blacklist, in services/restbase (T316914). It looks like you removed most of the entries without causing any issues. Any reason not to finish the job?

Also, there is another copy of the list in mediawiki/services/change-propagation repo: https://gerrit.wikimedia.org/g/mediawiki/services/change-propagation/deploy/+/16bf19f64074589a54db3711c0a4c2c1ea365f29/scap/templates/config.yaml.j2 – is this one unused? (would be nice to also remove it to avoid confusion)

Hi @Kelson Has this problem been solved?

No.

$ curl -s https://en.wiktionary.org/api/rest_v1/page/mobile-sections/rocker | jq . | grep revision
    "revision": "52542455",

This is still returning the same outdated revision. However, the page is not on the blacklist. I guess this task has been taken over and is now only about the blacklist, while this problem is the same as T226931: Parsoid cache invalidation for mobile-sections seems not reliable.

Change 838762 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop: remove remaining blocklist entries

https://gerrit.wikimedia.org/r/838762

Also, there is another copy of the list in mediawiki/services/change-propagation repo: https://gerrit.wikimedia.org/g/mediawiki/services/change-propagation/deploy/+/16bf19f64074589a54db3711c0a4c2c1ea365f29/scap/templates/config.yaml.j2 – is this one unused? (would be nice to also remove it to avoid confusion)

That repository is no longer used and the Kubernetes configuration takes precedence - I've added a note that this repository is deprecated for now and I'll remove it in future.

Change 838762 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: remove remaining blocklist entries

https://gerrit.wikimedia.org/r/838762

I think this is done now? Thanks @TheresNoTime @hnowlan!

Note that the original test case given in this bug report is still returning outdated data, but this seems to be due to T226931: Parsoid cache invalidation for mobile-sections seems not reliable. It was never on the list of pages that was emptied in this task.

@matmarex Strange to close this ticket considering the bug described in the ticket is not fixed!

It is strange, but this whole task is strange. I think it should have been closed as a duplicate of T226931 at the start, but no one determined that they were really the same issue. Then it was taken over by the reports about WP:ANI being outdated (and was renamed to reflect that in T274359#7531571), and no one noticed that the blacklist causing that doesn't include the page in your original report, so it couldn't be the same problem. In my eyes this task is now about the WP:ANI issue, even though that is not what you complained about originally. Sorry.

If you'd like to reopen it, that's fine by me, but then we'll have two tasks about the same problem, which doesn't seem helpful.

Mentioned in WP:THEYCANTHEARYOU
https://en.wikipedia.org/wiki/Wikipedia:Mobile_communication_bugs#Other_communication-related_issues which is itself mentioned in https://en.m.wikipedia.org/wiki/User:Novem_Linguae/Essays/Community_tension_with_the_WMF#Areas_of_tension

Looking through the backlog here, it seems like this fell through the cracks during the creation of the Content-Transform-Team and @Pchelolo's departure. RESTBase is being actively deprecated, which should resolve these issues, although without knowing the root cause I can't be certain of this.

Change 838762 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop: remove remaining blocklist entries

https://gerrit.wikimedia.org/r/838762

This commit was merged in October 2022 but reverted in February? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/886086

Benoit74 subscribed.

Reducing priority for Kiwix, we do not really use this backend anymore

This backend no longer exists, accessing the URL from the task description gives: "Error: 403, Mobile Content Service is decommissioned. See https://phabricator.wikimedia.org/T328036"

(worth noting that the issue of Page Content Service sometimes returning outdated revisions still occurs; although it doesn't seem to have a main task that's being used to track it - there are a few listed at T398243#10964593)