Parsoid cache invalidation for mobile-sections seems not reliable
Closed, ResolvedPublic

Description

Here is an example:
https://en.wikivoyage.org/w/index.php?title=Andros_%28Bahamas%29&type=revision&diff=3795101&oldid=3769898

File:Andros Island, Bahamas.jpg was removed from the article Andros (Bahamas) on English Wikivoyage on 12 June 2019.

Today, 30 June 2019, if I ask Parsoid:

$ curl -I 'https://en.wikivoyage.org/api/rest_v1/page/mobile-sections/Andros_(Bahamas)'
HTTP/2 200 
date: Sun, 30 Jun 2019 11:15:21 GMT
content-type: application/json; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/mobile-sections/0.14.5"
content-language: en
cache-control: s-maxage=1209600, max-age=0, must-revalidate
content-location: https://en.wikivoyage.org/api/rest_v1/page/mobile-sections/Andros_(Bahamas)
access-control-allow-origin: *
access-control-allow-methods: GET,HEAD
access-control-allow-headers: accept, content-type, content-length, cache-control, accept-language, api-user-agent, if-match, if-modified-since, if-none-match, dnt, accept-encoding
access-control-expose-headers: etag
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
referrer-policy: origin-when-cross-origin
x-xss-protection: 1; mode=block
content-security-policy: default-src 'none'; frame-ancestors 'none'
x-content-security-policy: default-src 'none'; frame-ancestors 'none'
x-webkit-csp: default-src 'none'; frame-ancestors 'none'
x-request-id: 06334f50-9a83-11e9-9747-c9e95e895697
server: restbase1024
vary: Accept-Encoding,X-Seven
etag: W/"3769898/5a9fcc70-6c7b-11e9-942f-f658c0e8f0a2"
x-varnish: 204806208, 682062014 101908520, 38372033 11769974
via: 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1)
age: 71006
x-cache: cp1079 pass, cp3030 hit/2, cp3040 hit/5
x-cache-status: hit-front
server-timing: cache;desc="hit-front"
strict-transport-security: max-age=106384710; includeSubDomains; preload
set-cookie: WMF-Last-Access=30-Jun-2019;Path=/;HttpOnly;secure;Expires=Thu, 01 Aug 2019 00:00:00 GMT
set-cookie: WMF-Last-Access-Global=30-Jun-2019;Path=/;Domain=.wikivoyage.org;HttpOnly;secure;Expires=Thu, 01 Aug 2019 00:00:00 GMT
x-analytics: https=1;nocookies=1
x-client-ip: ..............
set-cookie: GeoIP=............; Path=/; secure; Domain=.wikivoyage.org
accept-ranges: bytes

$ curl 'https://en.wikivoyage.org/api/rest_v1/page/mobile-sections/Andros_(Bahamas)' | grep 'Andros_Island,_Bahamas.jpg' | wc
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 59694  100 59694    0     0   459k      0 --:--:-- --:--:-- --:--:--  455k
      1    1170   59695

It delivers a 20-day-old revision instead of the latest one.
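The relevant facts can be read directly from the response headers above. The following is a minimal Python sketch, not part of the original report, that pulls them out. It assumes the ETag has the form `W/"<revision>/<time-uuid>"`, which matches the header shown (its leading number, 3769898, is the `oldid` in the diff URL above). The function name `parse_cache_headers` is made up for illustration.

```python
import re


def parse_cache_headers(headers):
    """Extract revision id, cache age and shared-cache lifetime from headers."""
    # Assumption: the ETag looks like W/"<revision>/<time-uuid>".
    match = re.match(r'(?:W/)?"(\d+)/', headers.get("etag", ""))
    revision = int(match.group(1)) if match else None

    # "age" is the number of seconds the object has spent in a cache.
    age_seconds = int(headers.get("age", 0))

    # "s-maxage" is how long shared caches may keep the object.
    match = re.search(r"s-maxage=(\d+)", headers.get("cache-control", ""))
    s_maxage_days = int(match.group(1)) / 86400 if match else None

    return {
        "revision": revision,
        "age_hours": age_seconds / 3600,
        "s_maxage_days": s_maxage_days,
    }


# Header values copied from the curl -I output above.
headers = {
    "etag": 'W/"3769898/5a9fcc70-6c7b-11e9-942f-f658c0e8f0a2"',
    "age": "71006",
    "cache-control": "s-maxage=1209600, max-age=0, must-revalidate",
}
info = parse_cache_headers(headers)
print(info)
```

With these headers the cached object is under a day old and may be kept for 14 days, yet it still carries the revision from before the image was removed.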

Event Timeline

It seems the bug reported in T274359 is the same as this one, so I am reposting the bug description here:

Mobile REST API call:

$ curl -s https://en.wiktionary.org/api/rest_v1/page/mobile-sections/rocker | jq . | grep revision
    "revision": "52542455",

which is a revision of April 2019

Standard HTML REST API call:

$ curl -s https://en.wiktionary.org/api/rest_v1/page/html/rocker | grep revision
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="https://en.wiktionary.org/wiki/Special:Redirect/revision/61774146"><head prefix="mwr: https://en.wiktionary.org/wiki/Special:Redirect/"><meta property="mw:TimeUuid" content="f74ed480-6954-11eb-b395-7bbda64f2d61"/><meta charset="utf-8"/><meta property="mw:pageId" content="223755"/><meta property="mw:pageNamespace" content="0"/><link rel="dc:replaces" resource="mwr:revision/61659965"/><meta property="mw:revisionSHA1" content="26eab9753486198f17b6169d978f39bd3ed19506"/><meta property="dc:modified" content="2021-02-07T14:58:36.000Z"/><meta property="mw:html:version" content="2.2.0"/><link rel="dc:isVersionOf" href="//en.wiktionary.org/wiki/rocker"/><title>rocker</title><base href="//en.wiktionary.org/wiki/"/><link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=mediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Csite.styles&amp;only=styles&amp;skin=vector"/><meta http-equiv="content-language" content="en"/><meta http-equiv="vary" content="Accept"/></head><body id="mwAA" lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr"><section data-mw-section-id="0" id="mwAQ"><div class="disambig-see-also" about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"also","href":"./Template:also"},"params":{"1":{"wt":"Rocker"}},"i":0}}]}' id="mwAg"><i>See also:</i> <b class="Latn"><a rel="mw:WikiLink" href="./Rocker" title="Rocker">Rocker</a></b></div>

Which is the proper latest revision, from February 2021 (see the dc:modified value above).

Both calls should return the content corresponding to the same latest revision, here 61774146.
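The same comparison can be scripted. Below is a minimal Python sketch that mirrors the two `grep revision` checks: it extracts the revision id from a mobile-sections JSON body and from the `about` attribute of the Parsoid HTML, then flags a mismatch. The function names are hypothetical, and the sample strings are shortened copies of the output above rather than live responses.

```python
import re


def revision_from_mobile_sections(body):
    """Find the first "revision": "<id>" pair in a mobile-sections JSON body."""
    match = re.search(r'"revision"\s*:\s*"(\d+)"', body)
    return int(match.group(1)) if match else None


def revision_from_parsoid_html(body):
    """Find the revision id in the Special:Redirect/revision/<id> URL."""
    match = re.search(r"Special:Redirect/revision/(\d+)", body)
    return int(match.group(1)) if match else None


def is_stale(mobile_body, html_body):
    """Return True when mobile-sections serves an older revision than page/html."""
    mobile = revision_from_mobile_sections(mobile_body)
    html = revision_from_parsoid_html(html_body)
    if mobile is None or html is None:
        raise ValueError("could not find a revision id in one of the bodies")
    return mobile < html


# Shortened samples taken from the two curl outputs above.
mobile_sample = '{"lead": {"revision": "52542455"}}'
html_sample = (
    '<html about="https://en.wiktionary.org/wiki/'
    'Special:Redirect/revision/61774146">'
)
print(is_stale(mobile_sample, html_sample))
```

To run it against the live API, the two bodies would come from the same URLs used in the curl commands.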

This was first reported for MWoffliner at https://github.com/openzim/mwoffliner/issues/1397

Possibly related to T274359; never got triaged to Content-Transform-Team in @Pchelolo transition.

From a quick look, asking mobileapps directly for the mobile-sections, it renders the right revision:

"revision":"4150656"

It looks like an issue on the RESTBase level

After following a maintenance procedure to refresh RESTBase cache similar to what has been documented for Wikifeeds, the bug is gone.

This has been documented as an "expected behavior" at T104963#1441047. So, if the issue arises again, we should purge it using the maintenance procedure.

I'm going to close this as resolved, but please reopen if needed.

@MSantos I see that https://en.wikivoyage.org/api/rest_v1/page/mobile-html/Uzbekistan is still delivering an ancient version of the page. The latest version of the page (as of this writing) can be seen by using the latest revision id: https://en.wikivoyage.org/api/rest_v1/page/mobile-html/Uzbekistan/4555712 . So it does not seem to be resolved in this case. Or do you think this is a separate issue worth a separate ticket?

@Brycehughes it's the same case. I've purged the cache manually since this is a known issue.

Well, https://en.wikivoyage.org/api/rest_v1/page/mobile-html/Santa_Cruz_de_Mompox is delivering an ancient version too. It was extensively edited (by me!) in April 2022, but old version still showing. Suggests the issues are not resolved...

Also, @MSantos, https://en.wikivoyage.org/api/rest_v1/page/mobile-sections/Cambridge continues to deliver a disambiguation page as I reported above, from May 2018. It should now be showing the Cambridge England article. This causes the Cambridge article to be inaccessible to Kiwix scrapes of Wikivoyage, and it's been this way for a number of years now. Clearly not yet fixed...

@MSantos, thank you. Is there no way to fix this issue more generically, as those were just examples?

@Jaifroid unfortunately not in the immediate future. We are working to deprecate RESTBase (T314025) and, as a side effect, this should be fixed out of the box. For now, the cache purging procedure is restricted to the maintenance level, to avoid enabling bad actors to stress the pre-generation functionality that runs for some RESTBase endpoints.

@MSantos Thank you for your work on this, but I'm not sure this issue can/should be legitimately closed if it still occurs for seemingly many, many pages. That is, the bug does not seem to be resolved at all. I am tempted to re-open it.

This still fails for endpoints like https://en.wikivoyage.org/api/rest_v1/page/summary/Kaechon . And unfortunately in the summary case, there is no workaround by first supplying the latest revision id. I'm not sure what option there is other than re-opening this ticket, since the bug does not seem to be resolved. If its resolution is dependent on another ticket, then that's fine, but this one is by no means resolved and should at least be kept open until it is fixed.

@Brycehughes I've just triggered the cache invalidation, let me know if you find any other issue.

Take the example of virtually any Latin American country (Colombia, Argentina, Brazil, Peru....), and if you access, e.g. https://en.wikivoyage.org/api/rest_v1/page/mobile-sections/Colombia , you will notice that we are being served a seriously outdated version (in the case of Colombia, 11th Sept 2021). All of these versions contain seriously outdated information about not being able to travel to the country due to the COVID-19 pandemic.

It seems to me that something more radical than updating one article at a time is needed. Can the entire cache of Wikivoyage be wiped and re-started? This isn't an isolated issue, it's systemic and is causing our offline archives over at Kiwix to be increasingly useless for travellers.

@MSantos thanks. But like @Jaifroid said, I am 100% sure I will find other issues. And like they suggested, it seems like a full Wikivoyage cache wipe or something more serious is needed. This has been an issue for going on four years now.

From a quick look I can consistently reproduce the issue on wikivoyage, so I am leaning towards this not being a transient failure:

Parsoid renders the output fine so it shouldn't be related.

Something that looks suspicious is this config entry in changeprop:

I couldn't find any historic reference for that. Is changeprop configured like that on purpose?

Nice catch! And thanks for the reproduction case.

So, it appears it is on purpose, per https://gerrit.wikimedia.org/r/c/mediawiki/services/change-propagation/deploy/+/373336

But I see no reasoning or task so I am not sure as to why.

Looking at the same day for the repo, they were working on T169939: End of August milestone: Cassandra 3 cluster in production, which details that there were some problems with partition sizes and overall storage... ?

Adding @hnowlan and @Eevans in case they are able to shed some more light on this one. My uninformed take would be to try and add wikivoyage to the rule to at least resolve this issue and decide how to proceed from there.

Adding @hnowlan and @Eevans in case they are able to shed some more light on this one.

I am not able to, sorry :(

My uninformed take would be to try and add wikivoyage to the rule to at least resolve this issue and decide how to proceed from there.

I don't think my take is any more informed than yours but —for whatever it is worth— I agree.

Change 886828 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] Include wikivoyage in page/html rerenders

https://gerrit.wikimedia.org/r/886828

Thank you very much for this test patch. Just to say that if it works for Wikivoyage, it'll also be necessary to add Wiktionary, as it is also affected by the same bug (see https://github.com/openzim/mwoffliner/issues/1397).

Change 886828 merged by jenkins-bot:

[operations/deployment-charts@master] Include wikivoyage in page/html rerenders

https://gerrit.wikimedia.org/r/886828

Mentioned in SAL (#wikimedia-operations) [2023-02-06T11:03:28Z] <akosiaris> deploy changeprop 0.10.19, adding wikivoyage to list of domains the mobile-sections get rerendered for. T226931

Change deployed in all 3 environments (staging, eqiad, codfw).

And the problem is indeed fixed: I just added an edit to https://en.wikivoyage.org/wiki/User:JGiannelos_(WMF)/asdf and https://en.wikivoyage.org/api/rest_v1/page/mobile-html/User%3AJGiannelos_%28WMF%29%2Fasdf was updated pretty quickly (I checked within 15 seconds).

So, the next big question is whether we remove that allowlist entirely and support mobile-sections rerenders across all projects, or just add a few projects we want to support mobile-sections rerenders for. I don't feel like I have enough data however to answer that question.

If you decide to do this, make sure commons and wikidata don't go through this, since that could cause a significant increase in space usage in Cassandra and I am not sure how that impacts RESTBase.

Thanks for that info, it's already more than I had a few minutes earlier. So, removing the allowlist entirely isn't prudent.

With that in mind, the 2 paths I now see are either to continue adding projects to the allowlist per request, or to move to a blocklist having at least commons and wikidata in it. Personally, and given the efforts to deprecate RESTBase, I'd prefer moving very carefully and just adding projects to the allowlist per request.
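To make the difference between the two paths concrete, here is a small Python sketch. The patterns are invented for illustration and are not changeprop's actual configuration; the real allowlist contains other entries that are not shown in this thread.

```python
import re

# Hypothetical allowlist: only domains matching this pattern get rerendered.
ALLOWLIST = re.compile(r"\.wikivoyage\.org$")

# Hypothetical blocklist: every domain is rerendered except these two.
BLOCKLIST = re.compile(r"^(commons\.wikimedia|www\.wikidata)\.org$")


def rerender_with_allowlist(domain):
    """Opt-in: a project must be added explicitly before it is rerendered."""
    return bool(ALLOWLIST.search(domain))


def rerender_with_blocklist(domain):
    """Opt-out: everything is rerendered unless it is explicitly excluded."""
    return not BLOCKLIST.search(domain)


for domain in ("en.wikivoyage.org", "en.wiktionary.org", "commons.wikimedia.org"):
    print(domain, rerender_with_allowlist(domain), rerender_with_blocklist(domain))
```

Under the allowlist, a project such as wiktionary stays stale until someone asks for it to be added. Under the blocklist it is covered by default, at the cost of enabling rerenders for every project that nobody has evaluated.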

@akosiaris Thank you very much indeed for this fix! It'll make a big difference to us over at Kiwix.

If you've decided to proceed cautiously and only add projects on the basis of whitelisting, then the only other one I'm aware of currently which is suffering the same bug on mobile sections is wiktionary. It would be great to add that one. I'm pinging @Kelson in case he's aware of any other projects affected by this bug on the Kiwix end.

MSantos added a subscriber: MSantos.

Changing assignee to reflect reality.

Change 887082 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] Include wiktionary in page/html rerenders

https://gerrit.wikimedia.org/r/887082

Change 887082 merged by jenkins-bot:

[operations/deployment-charts@master] Include wiktionary in page/html rerenders

https://gerrit.wikimedia.org/r/887082

Mentioned in SAL (#wikimedia-operations) [2023-02-07T09:20:22Z] <akosiaris> add wiktionary to mobile-sections rerenders. T226931

I've added wiktionary to the list too, per the request above. This is as far as I am willing to go for now, given:

  1. Lack of proper knowledge as to why the limit existed in the first place.
  2. Inability to guesstimate the risk that would arise from enabling mobile-sections rerenders across all the projects.
  3. The RESTBase deprecation plans which, if they fully pan out, will severely, if not entirely, negate the reason for changeprop's existence, possibly making all this work moot.

I am not against adding more projects in the future, but let's assess for a few weeks that we haven't added some major regression to the platform.

Note that I've taken no action to purge anything from RESTBase's caches. I expect that organic editing will largely fix this over the next few weeks/months. For the long tail of the distribution, null/small edits can be issued to trigger rerenders.

I am going to tentatively resolve this task: it's large, old, difficult to follow, and the originally reported issue has apparently been resolved. Feel free to reopen, but I'd suggest that if the intent is to ask for new projects to be added to the list, a new task is opened.

@akosiaris How painful would a full cache purge be? It would be great to have the problem solved once and for all. And some Wikivoyage pages, for example, get very little attention. Yet we're still getting served content from like 2020.

Thank you, @akosiaris. It's good to be cautious and watch out for regressions. If there are none over the next few weeks, perhaps a general cache purge can be thought about (if that's even possible). In any case, the work you've done on this looks like it will make a real and tangible difference. I'll report back from the Kiwix side once we have new scrapes of Wikivoyage and Wiktionary.

@akosiaris How painful would a full cache purge be?

It is unfeasible. The software doesn't have such a functionality built-in. As @MSantos pointed out in T226931#8509935 implementing any kind of functionality related to RESTBase, only to ditch it later when RESTBase is decommissioned (per T314025 and relevant tasks), would be a waste of resources.

It would be great to have the problem solved once and for all.

It would be great indeed to have the problem solved once and for all, but in any case a full cache purge wouldn't solve the problem. There is a running joke in the industry that there are only 2 hard problems in Computer Science, cache invalidation and naming things (and off-by-1 errors). What you are witnessing here is a really bad case of the first one but it is far from being the only one. See T328880 and T271184 for another 2 possible bugs causing similar (and more difficult to solve) issues.

And some Wikivoyage pages, for example, get very little attention. Yet we're still getting served content from like 2020.

Understood, please issue null edits/dummy edits for these?

Thank you, @akosiaris. It's good to be cautious and watch out for regressions. If there are none over the next few weeks, perhaps a general cache purge can be thought about (if that's even possible).

As pointed out above, the functionality to do a general cache purge doesn't currently exist. We would have to create that somehow. Even if implementing that somehow isn't that substantial (code wise) in the end, the main issue here is that we are talking about a soon to be decommissioned platform (and software) that we shouldn't really be investing in, especially if that investment is around getting more knowledge around it (lack of that knowledge, alongside many other issues, is also a contributing factor into why we are moving away from it).

In any case, the work you've done on this looks like it will make a real and tangible difference. I'll report back from the Kiwix side once we have new scrapes of Wikivoyage and Wiktionary.

Thank you for your kind words. Happy to be able to help. Please do report back, but may I also suggest that if you can get a list of problematic pages when scraping that you first issue null/small edits for all these pages? Let us know then if that doesn't work, we may need to adapt some of those rules.

@akosiaris yeah, been in the industry for 15 years. Get both jokes. And understand the problem with cache invalidation. It's just going to be an f'ing pain in the ass to have to do dummy edits to like 15,000 barely looked-at Wikivoyage pages. But hey, we're a lot better off than we were 12 or 24 or 36 months ago so I'll take what I can get. Thanks for fixing this. (I haven't seen an improvement yet but assume this is moving through the deploy process.)

@akosiaris I'd also argue that naming things is the hardest of all problems. But hey, a topic for a different day.

And you'd find me in agreement ;-)

Thanks for fixing this. (I haven't seen an improvement yet but assume this is moving through the deploy process.)

You are welcome. It's been deployed already, so if you see old revisions in pages, it's probably that they haven't seen an edit yet. Also, if you can come up with that 15k page list and put it in a parseable file, I think we could have a bot do all the null edits for you. At a rate of 1 per sec, it should be done within 1/4 of a day.
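The estimate above is easy to check. A minimal Python sketch of the arithmetic, with an illustrative function name:

```python
def run_time_hours(pages, seconds_per_page):
    """Total wall-clock time, in hours, for a bot doing one edit at a time."""
    return pages * seconds_per_page / 3600


# 15,000 pages at one null edit per second, as discussed above.
hours = run_time_hours(15_000, 1)
print(round(hours, 2), "hours, or", round(hours / 24, 2), "of a day")
```

This comes to a little over four hours, which is indeed within a quarter of a day.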

@akosiaris We'll see how it goes over the next few weeks. Really appreciate the offer re bot. I'll talk to the Wikivoyage folks about building a list. (I would write the bot myself but I like to write my scripts in Scala because I'm crazy.) In any case, might contact you back (here?) if need be, and thanks again for your work on this. If you don't hear back, you can assume things are working well enough now.

@akosiaris what if we just did a dummy-edit bot for every page on Wikivoyage? It's Wikivoyage, not Wikipedia, so it's not huge. One page a second, let it run for a few days. Any thoughts on this? (Like I said I'd write the script myself but I'm Scala, Typescript... and Python when someone puts a gun to my head... so I'm probably not the best person for the job.) I can easily get you a list of Wikivoyage pages via db dumps.

@akosiaris Actually, the Wikivoyage folks noted that this might flood people's watchlists if we do it for every page. I'll work on getting you a list of rarely-edited pages.

@akosiaris sorry about all the messages, but I notice that this still doesn't seem to be working for REST summary calls, although it does seem to work for mobile calls. For example, https://en.wikivoyage.org/api/rest_v1/page/summary/Khustain_Nuruu_National_Park . I recently updated this, noticed the summary does not update, did a dummy edit, and the summary still hasn't updated. Any idea why this might be? Is it possible to fix for summary calls as well? There's not much point in running a bot if the summary calls won't be updated.

I spent a couple of hours but did not manage to figure this one out. RESTBase in this case was fine; it had the new content. It was parts of the global CDN cache that still had the old version and were apparently not purged correctly, but despite looking at the configuration for hours I haven't figured out why. In the end I just purged the CDN caches and it appears ok now. Since I have no smoking gun, I am not sure what to call it yet. If you find more of these, collect a few and please paste them in a new task.

This comment was removed by Brycehughes.

@akosiaris so far as I can tell, even that link isn't working still. Compare the extract field in https://en.wikivoyage.org/api/rest_v1/page/summary/Khustain_Nuruu_National_Park to the lead section of https://en.wikivoyage.org/wiki/Khustain_Nuruu_National_Park . It's still missing "also known as Hustai National Park". Should I create a new ticket for this?

@akosiaris Aaaaand now it seems to be working... so maybe the CDN cache clear just takes some time? Okay, I'll work on getting a parseable list of pages to bot over. By the way, apparently there is a distinction between a "null edit" and a "dummy edit". I know what the latter is but I don't know what the former is. And apparently null edits won't show up on people's watchlists. Assuming you know what a null edit is, is that what the bot would run?

@akosiaris Aaaaand now it seems to be working... so maybe the CDN cache clear just takes some time?

Not really, no. In fact, I double checked all parts of the CDN when I purged them to make sure it was fine. If I had to guess, my money would be in downstream caches from my end. From your perspective that would mean either browser cache or some forward proxy you use to access the internet (and thus wikipedia as well).

Okay, I'll work on getting a parseable list of pages to bot over. By the way, apparently there is a distinction between a "null edit" and a "dummy edit". I know what the latter is but I don't know what the former is. And apparently null edits won't show up on people's watchlists. Assuming you know what a null edit is, is that what the bot would run?

https://en.wikipedia.org/wiki/Help:Dummy_edit and https://en.wikipedia.org/wiki/Wikipedia:Purge#Null_edit will probably help disambiguate between the two. And yes, null edits won't show up on watchlists, RecentChanges, etc. That's what a bot would run indeed. They have been somewhat controversial in the distant past and we would still need to get approval from the enwikivoyage community for running it, but depending on the number of pages you come up with and the proposed bot's rate, it's an argument that we can make somewhat convincingly, I think.

From your perspective that would mean either browser cache or some forward proxy you use to access the internet (and thus wikipedia as well).

@akosiaris Bizarre. I checked it with both curl and the browser. I don't think there's a forward proxy anywhere in the middle. However, I am in Vietnam at the moment... something fishy might have been going on at the hotel (or even country wide). It's a head scratcher. If I notice it again, I'll try it with my VPN turned on and see if that makes a difference.

And yes, null edits won't show up on watchlists, RecentChanges, etc. That's what a bot would run indeed. They have been somewhat controversial in the distant past and we would still need to get approval from the enwikivoyage community for running it, but depending on the number of pages you come up with and the proposed bot's rate, it's an argument that we can make somewhat convincingly, I think.

I wonder if I could make the argument about running a null bot – slowly – on all pages, since they won't show up on people's watchlists. I'll propose it and gauge the degree of disgust and horror I receive in the feedback.

Also, I notice that there is this bot that was written precisely for this purpose. There is a note at the top saying the bot is down due to a server regression, yet the task (T210307) it complains about is marked as closed and resolved, so not sure what's going on there. I'm also not sure what sort of input it takes (I'm not familiar with how WM bots work.)

Otherwise, I suppose I'll have to write a parser for enwikivoyage-20230201-stub-meta-current.xml.gz, throw out the redirects, sort by last revision date, truncate the list, and then give it to you in your preferred format (XML? CSV?)
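A rough Python sketch of such a parser follows. It assumes the usual MediaWiki export layout (`<page>` elements with `<title>`, `<ns>`, an optional `<redirect/>`, and one `<revision>` holding a `<timestamp>`), and that a "current" stub dump carries exactly one revision per page. The sample XML is invented test data, and `oldest_articles` is a made-up name.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def oldest_articles(xml_file, limit=None):
    """Return ns0, non-redirect titles sorted by last revision date, oldest first."""
    rows = []
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {local(child.tag): child for child in elem}
        is_article = fields["ns"].text == "0"
        is_redirect = "redirect" in fields
        if is_article and not is_redirect:
            timestamp = next(
                c.text for c in fields["revision"] if local(c.tag) == "timestamp"
            )
            rows.append((timestamp, fields["title"].text))
        elem.clear()  # release the page's children once it has been processed
    rows.sort()  # ISO 8601 UTC timestamps sort correctly as plain strings
    return [title for _, title in rows[:limit]]


# Invented sample data in the export layout described above.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Page A</title><ns>0</ns><id>1</id>
<revision><id>10</id><timestamp>2020-05-01T00:00:00Z</timestamp></revision></page>
<page><title>Page B</title><ns>0</ns><id>2</id><redirect title="Page A" />
<revision><id>11</id><timestamp>2019-01-01T00:00:00Z</timestamp></revision></page>
<page><title>Page C</title><ns>0</ns><id>3</id>
<revision><id>12</id><timestamp>2023-02-10T00:00:00Z</timestamp></revision></page>
<page><title>User:D</title><ns>2</ns><id>4</id>
<revision><id>13</id><timestamp>2018-01-01T00:00:00Z</timestamp></revision></page>
<page><title>Page E</title><ns>0</ns><id>5</id>
<revision><id>14</id><timestamp>2021-03-01T00:00:00Z</timestamp></revision></page>
</mediawiki>"""

titles = oldest_articles(io.BytesIO(SAMPLE), limit=2)
print("\n".join(titles))
```

For the real dump, the file object could come from `gzip.open(path, "rb")` in place of `io.BytesIO`. The output is one title per line, which is about the simplest format to hand to a bot.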

@akosiaris would a simple text file of page names by line that need to be updated work for you? Working on getting approval for all articles (not pages) from the enwikivoyage community.

@akosiaris someone suggested that just dummy/null editing really commonly used templates (e.g. {{geo}} and {{pagebanner}}) on enwikivoyage might clear the caches for every page that uses that template via transclusion. Do you think this would work? Sure sounds a hell of a lot easier :)

I'm happy to report back that the latest Kiwix Wikivoyage offline ZIM archive is now reflecting the fix implemented here 😊. Articles updated since approx 8th Feb are now yielding the latest page revision in the ZIM. Articles that have not been updated since 8th Feb are still showing outdated page revisions (as expected from the discussion above). We are in the process of testing the Wiktionary ZIMs.

@akosiaris The enwikivoyage community seemed rather nonchalant about my null-edit all pages proposal, and suggested rather that I just defer to the WM dev team (i.e. you). So, here is the list of all pages: https://dumps.wikimedia.org/enwikivoyage/20230201/enwikivoyage-20230201-all-titles-in-ns0.gz . The only problem: it includes redirects. But do we care? It's still only about 62,000 pages (a drop in the bucket compared to enwikipedia), so what are your thoughts on just using this list and then running a null-edit bot on it at a reasonable rate?

Hey everyone,

Sorry for taking so long to respond; other stuff took priority.

I'm happy to report back that the latest Kiwix Wikivoyage offline ZIM archive is now reflecting the fix implemented here 😊. Articles updated since approx 8th Feb are now yielding the latest page revision in the ZIM. Articles that have not been updated since 8th Feb are still showing outdated page revisions (as expected from the discussion above). We are in the process of testing the Wiktionary ZIMs.

Happy to hear that! Thanks for reporting back.

@akosiaris someone suggested that just dummy/null editing really commonly used templates (e.g. {{geo}} and {{pagebanner}}) on enwikivoyage might clear the caches for every page that uses that template via transclusion. Do you think this would work? Sure sounds a hell of a lot easier :)

It would work for quite a few pages. Unless I got something wrong in my SQL, {{geo}} and {{pagebanner}} would update 31784 articles (pages in ns0). That certainly doesn't sound like too much; I think you can go ahead and null edit those just fine.

@akosiaris The enwikivoyage community seemed rather nonchalant about my null-edit all pages proposal, and suggested rather that I just defer to the WM dev team (i.e. you). So, here is the list of all pages: https://dumps.wikimedia.org/enwikivoyage/20230201/enwikivoyage-20230201-all-titles-in-ns0.gz . The only problem: it includes redirects. But do we care? It's still only about 62,000 pages (a drop in the bucket compared to enwikipedia), so what are your thoughts on just using this list and then running a null-edit bot on it at a reasonable rate?

I think I can easily solve your redirect problem, and in fact even ignore all pages that have been rerendered after the 7th of February, which is when we merged the patches fixing this. See https://quarry.wmcloud.org/query/71703. The result set is 18273 rows, which is less than 30% of the dumps dataset pasted above.

@akosiaris absolutely no worries on response time. I know how it goes.

I think I can easily solve your redirect problem, and in fact even ignore all pages that have been rerendered after the 7th of February, which is when we merged the patches fixing this. See https://quarry.wmcloud.org/query/71703. The result set is 18273 rows, which is less than 30% of the dumps dataset pasted above.

That would be fantastic. Shall we just use that with a null edit bot? Do you need anything from me? Thanks for all this.

Yeah, easy enough. For future reference, I've just set up Pywikibot's touch.py to run in a PAWS shell. It's pretty well documented, and it took me about 10-15 minutes of prep work before I was ready to start it.

1 page per 10 seconds, 16k pages, give it a couple of days and you should be all set.
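For anyone who wants to see the shape of such a run without reading touch.py, here is a hedged Python sketch of a throttled null-edit loop. It is not the script that was actually run. The helper names are invented, and `make_pywikibot_touch` assumes Pywikibot is installed and configured and relies on its `Page.touch()` method; that part is not exercised by the example at the bottom, which uses a stand-in function.

```python
import time


def touch_pages(titles, touch, delay_seconds=10.0, sleep=time.sleep):
    """Call touch(title) for each title, pausing between consecutive pages."""
    done, failed = [], []
    for index, title in enumerate(titles):
        if index:
            sleep(delay_seconds)  # throttle: one page per delay_seconds
        try:
            touch(title)
            done.append(title)
        except Exception as exc:  # keep going; report failures at the end
            failed.append((title, exc))
    return done, failed


def make_pywikibot_touch(code="en", family="wikivoyage"):
    """Build a touch(title) function backed by Pywikibot (assumed installed)."""
    import pywikibot  # third-party; imported lazily so the sketch runs without it

    site = pywikibot.Site(code, family)

    def touch(title):
        pywikibot.Page(site, title).touch()  # null edit: saves the page unchanged

    return touch


# Dry run with a stand-in touch function and no real sleeping.
touched, pauses = [], []
done, failed = touch_pages(
    ["Page A", "Page B", "Page C"], touched.append, sleep=pauses.append
)
print(done, failed, pauses)
```

At 10 seconds per page, 16,000 pages take about 44 hours, which matches the "couple of days" above.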

Just to report, as promised, that the Kiwix ZIM files of Wiktionary are now reflecting the fix implemented here, and are showing updates to pages edited since February. We are looking forward to seeing the automated fix fully updating Wikivoyage and Wiktionary for pages edited prior to February. Many thanks.

@akosiaris thanks a ton for setting that up and running it. So, that solves my problem re Wikivoyage. I don't have any stake in Wiktionary like @Jaifroid, but I assume the same process would work there if the folks at Kiwix wanted to push for that. So, I think I can probably sign off on this thread now. Thanks again for your help with all this over all this time.

Just to report, as promised, that the Kiwix ZIM files of Wiktionary are now reflecting the fix implemented here, and are showing updates to pages edited since February. We are looking forward to seeing the automated fix fully updating Wikivoyage and Wiktionary for pages edited prior to February. Many thanks.

@Jaifroid I don't know how (I am guessing some very widely used templates got edited?) but enwiktionary has only 133 out of 8.5M pages that somehow haven't been updated post 2023-02-07. I'll get the bot to go through them after enwikivoyage is done, but we are already on the order of 0.001%, so I am inclined to say that for almost all intents and purposes you are all set. Unless, that is, you spot some mistake in the logic and I missed some pages.

@akosiaris thanks a ton for setting that up and running it. So, that solves my problem re Wikivoyage. I don't have any stake in Wiktionary like @Jaifroid, but I assume the same process would work there if the folks at Kiwix wanted to push for that. So, I think I can probably sign off on this thread now. Thanks again for your help with all this over all this time.

You are most welcome. Thanks for following up on this old bug report and for the civil discussion.

@akosiaris Thank you - that sounds very positive. Possibly a common template, as you say. I wonder if such templates are shared with other language versions of Wiktionary.

Is the bot you are running language-independent, or would it have to run separately on each language instance of Wikivoyage and Wiktionary?

@akosiaris Thank you - that sounds very positive. Possibly a common template, as you say. I wonder if such templates are shared with other language versions of Wiktionary.

They are not. Global templates are not deployed yet.

Is the bot you are running language-independent, or would it have to run separately on each language instance of Wikivoyage and Wiktionary?

It's language dependent. And I've only targeted the English language. Targeting more languages (many more, that is) is a much more involved process and I don't think I can do that right now.

OK, we can live with that, given that over time the pages will get edited, hopefully, and will catch up.

While English-language versions tend to be the largest and most popular sites, in the case of Wikivoyage it is in fact German Wikivoyage which is about 50% larger than English Wikivoyage (judging by the size of our ZIM archive versions: German 1GB, English 700MB). If you had to target only one other Wikivoyage language it should probably be German...

If you had to target only one other Wikivoyage language it should probably be German...

Doing 1 more wiki is doable, I can put it in the queue once the other ones are done. Extending this to 212 wikis (all of wikivoyages and wiktionaries) is what I can't do.

@Jaifroid enwiktionary done. Regarding de.wikivoyage.org, I see barely 1364 pages (out of 101559) not updated since 2023-02-07. I think for almost all intents and purposes you are set here as well.

By the way, it's pretty interesting that ZIM archive sizes for Wikivoyage end up like that. Page-wise, German Wikivoyage is 40% smaller than English Wikivoyage (~100k vs ~166k pages). Average page size is also smaller (2915 bytes vs 3661 bytes). Maybe it has more images or something, but I am not diving down that rabbit hole right now :-).

Ah, that's very interesting! Must be heavier use of images, then! Good to know that it's very actively updated, so I think we can consider this issue as complete (already closed). Many thanks indeed!