Page MenuHomePhabricator

REST endpoints cannot handle requests from ka.wikipedia.org with Georgian titles
Closed, ResolvedPublic

Description

For more detail: T301599

Services such as /page/mobile-html, /page/summary/, which requires giving an article title to make the requests will return errors, but this only happens when providing Georgian titles:
https://ka.wikipedia.org/api/rest_v1/page/mobile-html/Ნარიმან_ნარიმანოვის_სახლ-მუზეუმი

If we provide an English title and then it is fine.
https://ka.wikipedia.org/api/rest_v1/page/summary/Creation_Engine

Event Timeline

cooltey triaged this task as High priority.Feb 12 2022, 1:01 AM

After a bit of debugging it doesnt happen for all Georgian articles.
For example from my local setup:
This request works:

http://127.0.0.1:8888/ka.wikipedia.org/v1/page/mobile-html/%E1%83%A1%E1%83%A3%E1%83%A0%E1%83%90%E1%83%9B%E1%83%98%E1%83%A1%E1%83%90_%E1%83%93%E1%83%90_%E1%83%AE%E1%83%90%E1%83%A8%E1%83%A3%E1%83%A0%E1%83%98%E1%83%A1_%E1%83%94%E1%83%9E%E1%83%90%E1%83%A0%E1%83%A5%E1%83%98%E1%83%90

This request raised an error:

http://127.0.0.1:8888/ka.wikipedia.org/v1/page/mobile-html/%E1%83%A8%E1%83%98%E1%83%9C%E1%83%90%E1%83%A3%E1%83%A0%E1%83%98_%E1%83%AB%E1%83%90%E1%83%A6%E1%83%9A%E1%83%98
[2022-02-18T12:55:24.803Z] FATAL: mobileapps/472044 on nemoworld-wikimedia: Page was deleted
    HTTPError: Page was deleted
        at request.then.query (/home/nemo/myProjects/mobileapps/node_modules/preq/index.js:228:23)
        at tryCatcher (/home/nemo/myProjects/mobileapps/node_modules/bluebird/js/release/util.js:16:23)
        at Promise._settlePromiseFromHandler (/home/nemo/myProjects/mobileapps/node_modules/bluebird/js/release/promise.js:547:31)
        at Promise._settlePromise (/home/nemo/myProjects/mobileapps/node_modules/bluebird/js/release/promise.js:604:18)
        at Promise._settlePromise0 (/home/nemo/myProjects/mobileapps/node_modules/bluebird/js/release/promise.js:649:10)
        at Promise._settlePromises (/home/nemo/myProjects/mobileapps/node_modules/bluebird/js/release/promise.js:729:18)
        at _drainQueueStep (/home/nemo/myProjects/mobileapps/node_modules/bluebird/js/release/async.js:93:12)
        at _drainQueue (/home/nemo/myProjects/mobileapps/node_modules/bluebird/js/release/async.js:86:9)
        at Async._drainQueues (/home/nemo/myProjects/mobileapps/node_modules/bluebird/js/release/async.js:102:5)
        at Immediate.Async.drainQueues [as _onImmediate] (/home/nemo/myProjects/mobileapps/node_modules/bluebird/js/release/async.js:15:14)
        at processImmediate (internal/timers.js:464:21)

Which is raised from this request to Parsoid:

https://ka.wikipedia.org/api/rest_v1/page/html/%E1%B2%A8%E1%83%98%E1%83%9C%E1%83%90%E1%83%A3%E1%83%A0%E1%83%98_%E1%83%AB%E1%83%90%E1%83%A6%E1%83%9A%E1%83%98

More specifically:
This request:

https://ka.wikipedia.org/w/rest.php/ka.wikipedia.org/v3/page/html/%E1%B2%A8%E1%83%98%E1%83%9C%E1%83%90%E1%83%A3%E1%83%A0%E1%83%98_%E1%83%AB%E1%83%90%E1%83%A6%E1%83%9A%E1%83%98

errors with:

{"message":"The specified revision does not exist.","httpCode":404,"httpReason":"Not Found"}

But works fine for a specific revision:

https://ka.wikipedia.org/w/rest.php/ka.wikipedia.org/v3/page/html/%E1%B2%A8%E1%83%98%E1%83%9C%E1%83%90%E1%83%A3%E1%83%A0%E1%83%98_%E1%83%AB%E1%83%90%E1%83%A6%E1%83%9A%E1%83%98/4234062

But https://ka.wikipedia.org/wiki/%E1%B2%A8%E1%83%98%E1%83%9C%E1%83%90%E1%83%A3%E1%83%A0%E1%83%98_%E1%83%AB%E1%83%90%E1%83%A6%E1%83%9A%E1%83%98 actually *doesn't* exist. So I think it's "just" a deleted page. So 404 for the latest revision is correct AFAICT.

(It is possible to fetch an old revision of a deleted page even if the "latest" version is deleted. And in fact it is in general possible for *specific revisions* in the history of a page to be deleted or blocked as well by administrative action, usually due to libel/copyright vio, so it's always possible for a request for a specific revision to return a 40x error even if the "latest revision" request doesn't, and vice-versa.)

So *a* bug seems to be that when making a request for a specific revision (which exists) something in mobile apps is also doing a request for the *latest* revision (which doesn't exist, in this case) and then failing when that latter request fails? But that might be a completely separate bug from the one in the summary, which seemed to be about a title encoding bug of some sort.

I've tried a few pages on Parsoid API and I am not sure if thats the expected behaviour for parsoid

  • Greece on ka.wikipedia.org
    • curl -v "https://ka.wikipedia.org/api/rest_v1/page/html/%E1%B2%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%A0%E1%83%AB%E1%83%9C%E1%83%94%E1%83%97%E1%83%98" --> 404
    • curl -v "https://ka.wikipedia.org/w/rest.php/ka.wikipedia.org/v3/page/html/%E1%B2%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%A0%E1%83%AB%E1%83%9C%E1%83%94%E1%83%97%E1%83%98 --> 404
    • curl -v "https://ka.wikipedia.org/w/rest.php/ka.wikipedia.org/v3/page/html/%E1%B2%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%A0%E1%83%AB%E1%83%9C%E1%83%94%E1%83%97%E1%83%98/4303904" --> With a hardcoded latest revision it works fine
  • American football on ka.wikipedia.org
    • curl -v "https://ka.wikipedia.org/api/rest_v1/page/html/%E1%B2%90%E1%83%9B%E1%83%94%E1%83%A0%E1%83%98%E1%83%99%E1%83%A3%E1%83%9A%E1%83%98_%E1%83%A4%E1%83%94%E1%83%AE%E1%83%91%E1%83%A3%E1%83%A0%E1%83%97%E1%83%98" --> 404
    • curl -v "https://ka.wikipedia.org/w/rest.php/ka.wikipedia.org/v3/page/html/%E1%B2%90%E1%83%9B%E1%83%94%E1%83%A0%E1%83%98%E1%83%99%E1%83%A3%E1%83%9A%E1%83%98_%E1%83%A4%E1%83%94%E1%83%AE%E1%83%91%E1%83%A3%E1%83%A0%E1%83%97%E1%83%98" --> 404
    • curl -v "https://ka.wikipedia.org/w/rest.php/ka.wikipedia.org/v3/page/html/%E1%B2%90%E1%83%9B%E1%83%94%E1%83%A0%E1%83%98%E1%83%99%E1%83%A3%E1%83%9A%E1%83%98_%E1%83%A4%E1%83%94%E1%83%AE%E1%83%91%E1%83%A3%E1%83%A0%E1%83%97%E1%83%98/4167404" --> With a hardcoded latest revision it works fine
  • Pluto
    • Same behaviour
      • Restbase returns 404
      • rest.php returns 404
      • rest.php using a hardcoded revision works
$ psysh
Psy Shell v0.9.12 (PHP 7.4.25 — cli) by Justin Hileman
New version is available (current: v0.9.12, latest: v0.11.1)
>>> $a = '%E1%B2%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%A0%E1%83%AB%E1%83%9C%E1%83%94%E1%83%97%E1%83%98';
>>> urldecode($a)
=> "Საბერძნეთი"
>>> json_encode(urldecode($a))
=> ""\u1ca1\u10d0\u10d1\u10d4\u10e0\u10eb\u10dc\u10d4\u10d7\u10d8""

But U+1CA1 appears to be an invalid character:
http://www.unicode-symbol.com/u/1CA1.html
even though it renders fine?

Is there some unicode normalization involved here?

Note that:

curl -v "https://ka.wikipedia.org/w/rest.php/ka.wikipedia.org/v3/page/html/blahblahblah/4167404"

also renders fine; that is, the Title is actually ignored in preference to by-id lookup if you provide a revision ID.

U+1CA1 was added in Unicode 11; we're now on Unicode 14 and the unicode-symbol.com site is using Unicode 9. Ignore the fact that it works with a hardcoded revision ID, that's just because we're ignoring the title completely on that codepath.

The greece article actually redirects to საბერძნეთი when you go to that page in the browser, and:

$ psysh 
Psy Shell v0.9.12 (PHP 7.4.25 — cli) by Justin Hileman
New version is available (current: v0.9.12, latest: v0.11.1)
>>> $a = 'საბერძნეთი'
=> "საბერძნეთი"
>>> json_encode($a)
=> ""\u10e1\u10d0\u10d1\u10d4\u10e0\u10eb\u10dc\u10d4\u10d7\u10d8""
>>> urlencode($a)
=> "%E1%83%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%A0%E1%83%AB%E1%83%9C%E1%83%94%E1%83%97%E1%83%98"

curl -L -v "https://ka.wikipedia.org/wiki/%E1%83%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%A0%E1%83%AB%E1%83%9C%E1%83%94%E1%83%97%E1%83%98" -> works
curl -L -v "https://ka.wikipedia.org/api/rest_v1/page/html/%E1%83%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%A0%E1%83%AB%E1%83%9C%E1%83%94%E1%83%97%E1%83%98" redirects with a 301

The redirection via 301 redirects to https://ka.wikipedia.org/api/rest_v1/page/html/%E1%B2%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%A0%E1%83%AB%E1%83%9C%E1%83%94%E1%83%97%E1%83%98 which is where we started, and this URL 404s.

So I *think* what is happening is that restbase/rest.php are trying to apply some unicode normalization of some sort to the title, and this results in a title that is not actually present in the database. IIRC core looks up the title as you literally provide it first and *only if this fails* does it try to normalize it and look up the normalized version.

Don't quote me on that, and I don't know what about the title is prompting the 301 from restbase in particular, but I'd try to diagnose what's causing the 301 first.

After more debugging I think the problem is triggered on restbase title normalization.
With a local restbase setup connected to ka.wikipedia.org wiki and parsoid API I managed to reproduce the following behaviour:

Also siteInfo returns case: first-letter so capitalizing the first letter is expected

For example if I manually comment out the title capitalization snippet, Georgian pages are rendered locally (which I guess brakes other things).

Issue sounds similar to this one: https://phabricator.wikimedia.org/T208139

I think a potential way forward could be to update the capitalization logic here: https://github.com/wikimedia/mediawiki-title/blob/11124b191afdd6e9da029c7f9e0137d6b9c87509/lib/mediawiki.Title.phpCharToUpper.js#L6

The inconsistency is that PHP doesn't capitalize georgian letters same way nodejs does:

PHP repl
>>> mb_strtoupper("ოteststring")
=> "ოTESTSTRING"

Node repl
> x = "ოteststring"
'ოteststring'
> x.toUpperCase()
'ᲝTESTSTRING'

Change 767910 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/restbase/deploy@master] Bump to mediawiki-title@0.7.5

https://gerrit.wikimedia.org/r/767910

We're still receiving reports of this bug. It looks like there is a potential solution. Do you have an estimate of when it will be in production?

It is blocked on @Jgiannelos fixes to mediawiki-title / RESTBase being deployed to production (see Arlo's patch above from Mar 4). All of that took a hit when Petr and Clara quit. @WDoranWMF how should we handle this?

MattCleinman raised the priority of this task from High to Unbreak Now!.Apr 15 2022, 9:16 PM

We continue to get complaints from users, increasing priority of this task. That said, it's coming into the weekend, and this can wait for next week.

Change 767910 merged by Hnowlan:

[mediawiki/services/restbase/deploy@master] Bump to mediawiki-title@0.7.5

https://gerrit.wikimedia.org/r/767910

Fix deployed, it appears to have addressed the issue based on some tests on my end.

I am still getting some failures on the ios app but lets wait for caching to expire.

Verified the issue with a couple of failing articles. Looks like we need to manually purge the cache in order for the restbase fix to actually work.

I think that the mwscript supports this kind of operation but added serviceops because it sounds a bit risky to wipe everything from the cache.

I think that the mwscript supports this kind of operation but added serviceops because it sounds a bit risky to wipe everything from the cache.

I'm not sure which cache are you talking about, but if you want to purge all urls for ka.wikipedia.org's restbase namespace from the edge caches, I'd suggest waiting the cache TTL instead, it's easier.

If you are talking about restbase's cache specifically, I'm not sure it will be correctly purged by just requesting a cache purge at the edge.

I am not sure which caches it invalidates but articles were fixed after I manually run action=purge. I thought this would have the same effect:
https://wikitech.wikimedia.org/wiki/Kafka_HTTP_purging

Waiting for TTL to expire was the initial approach but it looks like it didn't fix our issue.

Restbase content doesn't have a TTL, so waiting for the TTL expire won't work. Generally pages are edited pretty quickly, which has the effect of purging RESTBase, but Georgian wiki is low-traffic so that probably won't work either (at least not in a reasonable time frame).

action=purge technically purges the parser cache (which does have a TTL) but as I understand it there's some sort of linkage behind the scenes which triggers a restbase regeneration at the same time.

That said, we shouldn't be caching 404s. There seems to be something else going on, someone is caching the title lookup or something like that, which is probably being cleared as a side-effect of action=purge (like the restbase regeneration is a side-effect).

Four-ish options, all not great:

  • Do a simple client-side script that runs ?action=purge on every title in georgian wikipedia over the course of a week or so. "Big hammer", but reasonable since it's a small-ish wiki. (We could be fancy and only purge titles with problematic letters.)
  • Use a patch like https://github.com/wikimedia/restbase/pull/1297/commits/10aa15501e5f747bce891b174ca7fb12f46c4179 which effectively blacklists content from a particular timestamp range in restbase. (So a "rolling purge" as it were.)
  • Do some more network snooping to try to figure out exactly what requests are being made from the mobile client. (Either run android studio in a VM and use the VM facilities to capture traffic from the VM, or use pcap on linux, or figure out if Android Studio has built-in network capture utilities...)
  • Throw this over to the mobile team and try to figure out if this is an app-side caching issue, or at least narrow this down to a specific request which is failing, so we can further dig into the precise nature of the problem (aka try to figure out *why* action=purge works).

Something interested that I found while debugging the issue is that accessing an article on the app directly works. The issue is reproduced only when i click from the top articles so maybe the way that the titles on this list get populated are causing the requests to fail.

I need to check how wikifeeds populate the list and if there is some caching there.

Restbase content doesn't have a TTL, so waiting for the TTL expire won't work. Generally pages are edited pretty quickly, which has the effect of purging RESTBase, but Georgian wiki is low-traffic so that probably won't work either (at least not in a reasonable time frame).

Yes, that's why I asked.

action=purge technically purges the parser cache (which does have a TTL) but as I understand it there's some sort of linkage behind the scenes which triggers a restbase regeneration at the same time.

change-propagation listens to the same kafka topics (more or less) as purged, and purges content in restbase as a result.

That said, we shouldn't be caching 404s. There seems to be something else going on, someone is caching the title lookup or something like that, which is probably being cleared as a side-effect of action=purge (like the restbase regeneration is a side-effect).

We do not cache 404 long-term at the edge. Not sure about restbase's own cache.

Four-ish options, all not great:

  • Do a simple client-side script that runs ?action=purge on every title in georgian wikipedia over the course of a week or so. "Big hammer", but reasonable since it's a small-ish wiki. (We could be fancy and only purge titles with problematic letters.)

We run about 2k purges per second. Given kawiki has 450k articles, we can go with say 50 rps and be done in 2.5 hours. But yes, purging only the affected articles specifically would be even better.

cooltey lowered the priority of this task from Unbreak Now! to Medium.Apr 26 2022, 4:27 PM

It looks like the issue comes from cached restbase responses from the page/summary API. For example for:
https://ka.wikipedia.org/api/rest_v1/page/summary/%E1%83%A1%E1%83%90%E1%83%A5%E1%83%90%E1%83%A0%E1%83%97%E1%83%95%E1%83%94%E1%83%9A%E1%83%9D

...
  "titles": {
    "canonical": "Საქართველო",
    "normalized": "საქართველო",
    "display": "საქართველო"
  },
...

Where the canonical title has a stale (wrong) value. I think that if we want to properly fix the issue we need to purge all the articles starting with some specific characters from ka.wikipedia.org because page summary API is used in various places.

It looks like the issue comes from cached restbase responses from the page/summary API. For example for:
https://ka.wikipedia.org/api/rest_v1/page/summary/%E1%83%A1%E1%83%90%E1%83%A5%E1%83%90%E1%83%A0%E1%83%97%E1%83%95%E1%83%94%E1%83%9A%E1%83%9D

...
  "titles": {
    "canonical": "Საქართველო",
    "normalized": "საქართველო",
    "display": "საქართველო"
  },
...

Where the canonical title has a stale (wrong) value. I think that if we want to properly fix the issue we need to purge all the articles starting with some specific characters from ka.wikipedia.org because page summary API is used in various places.

Making the request directly to restbase in production returns the same result:

$ curl -s 'https://restbase.discovery.wmnet:7443/ka.wikipedia.org/v1/page/summary/%E1%83%A1%E1%83%90%E1%83%A5%E1%83%90%E1%83%A0%E1%83%97%E1%83%95%E1%83%94%E1%83%9A%E1%83%9D' | jq .titles
{
  "canonical": "Საქართველო",
  "normalized": "საქართველო",
  "display": "საქართველო"
}

So the place where we need to purge the contents is restbase. I'll de-tag serviceops and let PET and the restbase owners come up with the best strategy to solve the issue.

As far as SREs are concerned, as long as we don't just purge 450k urls in 1 second, we're good to go and you have our green light.

Thanks Joe. Meanwhile we lowered the priority since the main issue is mostly solved.

It seems like the problem has been fixed, but old (and broken) versions exist in the cache. This is now article-specific, and the cache will naturally be updated when each article gets updated. I'd like to look into invalidating the cache in a more complete fashion, so let's keep this open for now.

MattCleinman claimed this task.

@JMinor and @JTannerWMF are okay with natural cache replacement. The bug exists only in cached articles. As articles get edited, the cache will get updated. (There is no automatic refresh of this cache, it turns out.) It looks like about 6k articles per month are edited, and this bug cropped up in January/February of this year…. which means given that the fix was deployed 2 weeks ago, it will (mostly) naturally take care of itself in the next few months. (And if we get complaints on specific articles, we can run open the URL with ?action=purge to immediately invalidate.)