Page MenuHomePhabricator

Investigate mapdata API caching
Closed, ResolvedPublic2 Estimated Story Points

Description

This endpoint should have a 24-hour TTL, but responses seem to show cache headers being disabled and varnish x-cache always reads "miss, pass". This task is to look into the API logs to determine whether mapdata is being cached.

Also verify that the API endpoint is unable to return hidden revisions using the current user's permissions, or that if it does it prevents caching.

Event Timeline

requires using &maxage= (and &uselang=content for non-private caching) when making the api request.
https://www.mediawiki.org/wiki/API:Caching_data

TheDJ added a subscriber: Unknown Object (User).EditedNov 4 2021, 10:21 PM
thiemowmde subscribed.

I confirmed:

  • This is not new, just for the sake of completeness: While it's possible to use revids in the mapdata API, it will never return an older revision. In other words: revids are translated into titles, and everything is done based on the title.
  • I deleted the latest revision by deleting the page and restoring all but the latest revision. The mapdata API worked as expected and did not return data from the deleted revision. Not even while I was logged in as admin.
  • Note it's not possible to hide only the revision text of the latest revision.

Caching:

  • See the comments above. Thanks, @TheDJ!
  • https://www.mediawiki.org/wiki/API:Caching_data as well as the code in e.g. https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/api/ApiMain.php$746 explain that any more aggressive caching (i.e. more aggressive than what web browsers do by default) happens only when the API module's cache mode is "public" (this is the case for mapdata) and when a max age is set.
  • That max age can be set from the client with an additional …&maxage=… or …&smaxage=… parameter in the URL, in seconds. In other words: It's up to the client to tell if they are fine with possibly receiving slightly outdated data.
  • It's also possible to enforce this from the server, per API module. See e.g. https://codesearch.wmcloud.org/search/?q=%3EsetCacheMaxAge.
  • My suggestion: Let's enforce some more aggressive caching (e.g. a full day, 3600 seconds) for all mapdata API responses that include a revision ID. This works because the way we plan to implement this creates two very different ways the API behaves:
    1. Requests without a revid return the latest revision, and will continue to do so. But since the latest revision can change any time, the same API request might return something different any time. We can't enforce any caching for this.
    2. This is the exact opposite when revids are supported. These API requests are guaranteed to always return the exact same data. We can cache these as long as we want. The only exception is when the old revision changes, e.g. is deleted. But this is fine, I believe. This doesn't allow arbitrary attack scenarios. Only users that queried the same page before – with the exact same API query – might still get a cached response. But they have seen this before.

Note: While I believe we should do this no matter what, it won't have much of an impact on our cluster. When the Varnish caching works as expected there won't be more than one mapdata API request anyway. The moment the resulting .png is cached there is no reason to query the mapdata API again, for a while.

Pages with more than one map are no exception. Each map will not only have a different URL, but a different sha1 hash as well (not guaranteed in Wikivoyage mode, but still very likely). In other words: Two mapdata API requests on the same page typically don't share any data anyway. They can't benefit from each other's cache.

Just one more reminder that there are OTHER issues than just revid with the current static map implementation/caching, esp when it concerns language variants T246314: Map cannot be displayed when Simplified Chinese characters are present, and things like including images in the description without a size set: T281007: Maps rendering should not depend on user variable image settings (default width).

As the static map rendered is not a logged in user, it requires all such geojson to be rendered as 'anonymous' at all times, but this is currently not enforced (uselang=content + simple style parser always making use of canonical parser options) (both for anonymous and authenticated requests). I do not think this currently leads to regressions, but it is something to keep in mind every time we touch the caching, we do not want to make things worse.

  • My suggestion: Let's enforce some more aggressive caching (e.g. a full day, 3600 seconds) for all mapdata API responses that include a revision ID.

I love this suggestion!

Note: While I believe we should do this no matter what, it won't have much of an impact on our cluster. When the Varnish caching works as expected there won't be more than one mapdata API request anyway. The moment the resulting .png is cached there is no reason to query the mapdata API again, for a while.

There are multiple pipelines to consider. Mapdata will only be called by the static map generator once per day, per map. However, the browser will pull directly from mapdata on demand, so we should start tracking statistics about how often maps are clicked (or find archival analysis). What you've described will still be a dramatic improvement over the uncached, latest revision mapdata.

The point about deleted revision deserves its own investigation. When a revision is deleted, we have a responsibility to purge any associated data that might be cached. In our case, that means the Varnish cache of the page's maps needs to be purged. Let's see how other API modules handle this case, and what the acceptable lag time is.

the browser will pull directly from mapdata on demand […]

Only when we merge T149855, as far as I understand. Which makes revid support a hard dependency for T149855.

awight set the point value for this task to 2.Nov 10 2021, 11:07 AM

To summarize,

  • When the API encounters an error, either make the response no-cache, or give a very short cache.
  • When mapdata is requested by title, respond with a medium-length cache expiry e.g. 15 minutes.
  • When mapdata is requested by revid, respond with a long cache expiry e.g. 24 hours.

We aren't sure what to do about revision deletion, we might need to do something to respect this and hide oversighted data as soon as possible (and purge the images as well). What is the standard here? This might also not be a problem, because the URLs to the bad data will be hidden once the revision is deleted.

@thiemowmde The MediaWiki-Cache tag currently tracks several MediaWiki components, the most prominent of which is wikimedia/objectcache. It is not a generic/yellow tag for anything caching related. If it should remain tagged, please clarify for which component this is a bug or feature request.

The main idea was to let the people that are watching the MediaWiki-libs-BagOStuff tag have a look if our conclusions from the discussion above make sense. Are there some known best practices when an API module should actively ask to be cached, as outlined above?

Please feel free to remove the tag if that's not how it's meant to be used.

MediaWiki-libs-BagOStuff has neither members nor watchers, which matches how I generally understand component tags to be used.

With regards to caching api.php responses, I recommend taking a look at ApiOpenSearch, which is probably Action API module with the highest (frontend) traffic and uses the correct cache control. I'm personally skeptical of going with maxage parameters and such as those require clients to opt-in, and tend to be difficult to keep track of over time to make sure the right values are passed, and to define or document somewhere what the preferred settings are etc. I'd consider it a secondary strategy to use if there are no other options. If possible, I'd have the API module set its own cache control settings correctly with the visibility and duration set and documented inline by its maintainers. This also makes it clear that the module isn't meant to vary by session and the implicit user's rights or language/skin preferences that would otherwise poison caches. This doesn't mean it can't be localised, just that any JS client has to set uselang explicitly if/where needed.

thiemowmde claimed this task.