Page MenuHomePhabricator

[versioned maps] Investigate: Look into the remaining "Stable cache miss" events
Closed, ResolvedPublic0 Estimated Story Points

Description

As follow up to T308119: [versioned maps] Investigate: many more "Stable cache miss" events than anticipated, have a look at the few remaining Stable cache miss and try to figure out if there's some clear common issue.

Seems to mainly come from ukwiki, vecwiki or arwiki.

See logstash https://logstash.wikimedia.org/goto/c5d2b2d831901c770c14a40d17eb2776

For context: We're talking about 0.006% of all parsers. So that's considered a low number.

Event Timeline

WMDE-Fisch set the point value for this task to 3.May 25 2022, 12:15 PM

From some brief looks into the uk fails:

  • I first checked superset if request might mainly come from direct calls to the API but that seemed not to be the case
  • Then I looked at the pages that are associated with the revids in the logs:

In all cases I checked I saw the following, when just looking at the page using the title:
Template/file changes in this version are pending review. The stable version was checked on 21 June 2022.
So there's a stable version, that's also the most recent version, but there are still Pending changes because some template dependencies are unreviewed. Also the user will always be redirected to the Pending changes tab by default it seems. [1]

So at that point I assume that in these cases there's some flaw when handling these versions and the cache. I guess the difference between the stable version ( using stable=1 ) and the pending version is, that for the former pending template changes are not parsed but instead only the stable template versions are used. Then the parsed results might be stored in the stable version cache. So we have a revision of the page that itself is considered reviewed, but there's a stable cached version of it and a variant that's not in the stable cache, - Or: In these cases there's no version in the stable cache.

[1] https://uk.wikipedia.org/w/index.php?title=%D0%9A%D0%BE%D0%BB%D0%BE%D0%B4%D1%96%D1%97%D0%B2%D0%BA%D0%B0_(%D0%86%D0%B2%D0%B0%D0%BD%D0%BE-%D0%A4%D1%80%D0%B0%D0%BD%D0%BA%D1%96%D0%B2%D1%81%D1%8C%D0%BA%D0%B8%D0%B9_%D1%80%D0%B0%D0%B9%D0%BE%D0%BD)&uselang=en

I picked an ruwiki example.

  • I see a mapdata API request for page https://ru.wikipedia.org/wiki/Бахмут, revision 124593840. There is only 1 map with 1 point, internally named with group id _9e455800913d474352c98b707f76b94497b8ed1d.
  • The revision id is the latest one, but it's not truly stable because of a template that changed.
  • https://ru.wikipedia.org/w/?oldid=124593840 and https://ru.wikipedia.org/w/?stableid=124593840 do indeed show two different versions. Both use the same revision of the page, but a different revision of the template.
  • In other words: Revision 124593840 was and still is the latest revision. It was marked as stable and still is, as far as I can tell. Still requests via oldid= vs. stableid= show different things. oldid= is from core's normal latest revision cache while stableid= is from FlaggedRev's stable revision cache.

One problem I can identify is that we don't make a difference between requests via oldid= vs. stableid= in Kartographer. But the original code in FlaggedRevs does. We can't show different maps for oldid= vs. stableid= requests. This might even be a bug.

We run into a special case we call $latestRevMayBeSpecial in our code. When the requested oldid is identical to the stable one – which it is in this case – we try to read from FlaggedRev's stable cache. This should succeed in this example. The reason why it might not succeed is most probably because we use different ParserOptions. These are used as part of the cache key. We do a trivial ParserOptions::newFromAnon(), assuming the rendering for anonymous users should be the same as for anyone else. But the code that was supposed to render and cache the stable revision uses WikiPage::makeParserOptions() and takes everything into account that qualifies for a cache split, depending on the situation, e.g. the user's interface language.

My current gut feeling is that we should leave all this as it is and accept the remaining risks:

  1. When a user reviews a revision that became unstable because of an edit to a template, the user might not see the difference when the only change is in a <mapframe> in the edited template. The new map becomes only visible after the latest revision is marked as stable again.
  2. We would accept exactly the cache misses we see now as a permanent solution. Note that each of these cache misses happens only once. What's great about this is that the first request by an anonymous user will be a cache hit right away. The downside is that our CPU usage might be a little higher because we trigger two different parser passes, one of which might not be needed when no anonymous visitor shows up. But that should never happen. Search engine bots and other crawlers are anonymous visitors as well.

As a fallback we trigger a regular parse for anonymous users as if called with oldid=, i.e. this doesn't render the stable version. This will end in the regular parser cache, as it should. Unfortunately it doesn't help us much because the next time we will still not find anything in the stable cache.

Conclusion so far:

  1. The cache misses go only away after an anonymous user triggered the stable rev parsing via a normal page view.
  2. When the edit in the template affects the map, the map's hash will change. This will make our current code path fail. It can't find the stable map in fresh parser run we do as a fallback. #1 will fix this.

In other words:

  • Logged in users looking at a stable version might see a broken map. We don't know how many of the cache misses we see have this consequence for the user. But it's reasonable to assume that it's way below 1%.
  • Anonymous users should be unable to run into this situation because the only way they can see the stable version of the map is by looking at the stable version of the page, which will populate the stable cache with exactly what we need.
  • The unstable version will work for both logged in and anonymous users.

The downside is that some other user had to wait twice as long because they had to wait for two different parser passes.

I think these parser passes will be concurrent, though? A normal page view will go through the FlaggedRevs stable version code path, will wait in a pool for a single render of the page using cascading stable templates. A page view using an explicit "oldid" will check the stable cache--possibly ignoring it whether or not there is content, because of some bug, but never waiting for a parse--and if the cache is missed, then the user waits for a single, pooled historical or latest revision render. How would the user wait for two sequential parser passes?

Absolutely right, thanks! I fixed this in my comment.

Change 830847 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Kartographer@master] Log mapdata API requests with mismatching revision/group ids

https://gerrit.wikimedia.org/r/830847

I just evaluated 500 samples of these "cache miss" log entries. The idea is: These cache misses are not a user-facing problem when the API request is still able to deliver the requested mapdata. Our special code path in the Kartographer codebase might cause a redundant parse and create a redundant parser cache entry in the main parser cache. But that's just a little bit of extra CPU load and usually not even noticeable. As long as the requested GeoJSON group hash can be found in the requested revision everything is fine.

Result: In my sample I couldn't find a single request that would fail to return the requested GeoJSON. In other words, I'm pretty sure >99.9% of the logged cache misses are not user facing in any way.

One thing we can do is to log this particular failure (group id couldn't be found) in the mapdata API module, make it a graph and see if it changes over time.

Change 830847 merged by jenkins-bot:

[mediawiki/extensions/Kartographer@master] Log mapdata API requests with mismatching revision/group ids

https://gerrit.wikimedia.org/r/830847

Change 834353 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Kartographer@master] Fix logspam caused by Wikivoyage's usage of the mapdata API

https://gerrit.wikimedia.org/r/834353

To finally summarize our findings:

  • We had to copy a small part of the FlaggedRevs logic to Kartographer. The only (clean) way to avoid this would be a rewrite of significant parts of FlaggedRevs. We just don't have the resources for this.
  • Our implementation is (intentionally) incomplete. It doesn't cover the special case when $wgFlaggedRevsHandleIncludes is set to 2 (FR_INCLUDES_STABLE). In this mode a stable page might become unstable again when an edit is made to a template that's used on the page. Checking if this is the case is expensive and way to much code to reimplement in Kartographer.
  • In this mode the same revision can be stable and unstable the same time. The only difference is either a URL parameter (oldid= vs. stableid=) or if the user is logged in. Logged in users get the normal MediaWiki behavior. Same when a specific revision is requested via the normal URL parameter oldid=. When the user is anonymous or the special URL parameter stableid= is used FlaggedRevs will render a special version of the revision that uses older, stable versions of templates.
  • There is (intentionally) no code in Kartotherian's map rendering pipeline that is aware of this. It considers only the revision id.
  • We know this is not a problem in probably 99.999% of the cases. The only situation where it would ever make a difference is when an unstable edit is made to a template that renders a map, and the edit causes the map to change. This will change the hash ("group id") that is used to identify the map. Since Kartotherian can't distinguish the unstable vs. stable renderings of the same revision one of the two will show a broken image. This issue will fix itself the moment the edit to the template gets marked as stable.
  • The cache misses we see do not mean that these users see broken images. The vast majority will work out fine. The only effect is that some CPU cycles are wasted on an additional parse that might have been unnecessary. As long as the map's hash didn't change the map rendering will succeed no matter what. And as long as the number of cache misses is low we don't need to worry about performance either.
  • Still it's possible a map will appear broken when all conditions are met. There is quite some confusion which one breaks: for anonymous or logged in users?

Scenario A: Logged in user
Logged in users can decide to look at both versions.

  1. Usually they see the latest, unstable version, just as if FlaggedRevs isn't there. Here the map rendering pipeline works just fine.
  2. When the user looks at the special, stabilized version via stableid= and there is a mismatch between the map hash in the stabilized vs. latest version the map rendering might fail. However, this code is usually not even reached because the .png image from before the page became unstable is still in the cache. This works because the .png file is identified by the revision id (which didn't change, remember?) and the map hash (there are now two, but the old one didn't change).

If the user sees a broken image all they need to to is to stabilize the page again. This should come naturally: the breakage is because of an unstable edit that was made to a template. Stabilizing the template again naturally fixes the breakage.

Scenario B: Anonymous user
Anonymous users usually see stabilized versions only. Map rendering requests for these can indeed fail but are usually still covered by the cached .png images from before the unstable edit was made. Same explanation as #2 above.

But yes, anonymous users might see broken maps when a page was not visited for a long time and the expected .png image is not in the cache. The logging we added via https://gerrit.wikimedia.org/r/830847 is meant to show us how often this happens. Unfortunately there is a bug in the logging. To be fixed via https://gerrit.wikimedia.org/r/834353.

Looking at old revisions from a page's history is entirely unaffected by all of this.

Change 834353 merged by jenkins-bot:

[mediawiki/extensions/Kartographer@master] Fix logspam caused by Wikivoyage's usage of the mapdata API

https://gerrit.wikimedia.org/r/834353

thiemowmde changed the point value for this task from 3 to 0.Sep 29 2022, 2:24 PM

With https://gerrit.wikimedia.org/r/834353 finally being deployed we see a dramatic drop from ~60,000/hour to ~2000/hour. The large number was because of Wikivoyage. Wikivoyage maps can consolidate points from all over a page. This is done via group ids, e.g. one for hotels, one for monuments, and so on. The template the community uses for this feature queries a large set of common group ids, no matter if they all exist on the current page. This is fine and it's not tracked as an error any more.

The vast majority of what's tracked now is from zhwiki. The rest is from a few dozen other wikis, notably ukwiki, eswiki, and enwiki. This is because pages on these wiki can be rendered in different "variants", e.g. traditional and simplified Chinese. A good place to see which languages have this feature is https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/language/converters/. Yes, this includes an English variant called "en-x-piglatin". Another, not 100% identical list is in https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/language/LanguageConverter.php$46.

Here is an example:

The Kartotherian infrastructure doesn't support variants. This is not a new problem but one that existed before we added revision support. Proof: Replace revids= in the query with titles=. Same problem.

This is already tracked in T246314: Map cannot be displayed when Simplified Chinese characters are present.

Open question: Why uk, es, and fr wiki? While I suspect these have variants (e.g. Ukrainian can apparently be rendered with Cyrillic vs. Latin characters) I can't figure out how to demonstrate a failing vs. a successful mapdata API request.

TODO: Why uk, es, and fr wiki? How do these have variants?

I'm not sure if we should spend more time here. The numbers do not seems too bad at the moment and our energy is probably spent better somewhere else. I would like to wrap this up with that last investigation. Thanks again Thiemo!

Agreed on removing the stable cache miss logging. The original reason why it was added is taken care of, the remaining logging is related to something different that's dealt with in other parts of the code.

Change 849055 had a related patch set uploaded (by WMDE-Fisch; author: WMDE-Fisch):

[mediawiki/extensions/Kartographer@master] Remove stable cache miss logging

https://gerrit.wikimedia.org/r/849055

Change 849055 merged by jenkins-bot:

[mediawiki/extensions/Kartographer@master] Remove stable cache miss logging

https://gerrit.wikimedia.org/r/849055