
Performance review of proposed Kartographer/Kartotherian revid support
Closed, ResolvedPublic

Description

Problem we are solving:
The WMDE Technical Wishes team is working on improvements to maps, and our current feature focus is to serve historical versions of annotated maps. When serving the static image thumbnails used on high-traffic sites, the Kartographer system can currently render only the latest revision of the map embedded in an article. The lack of versioned maps is especially problematic on wikis such as German Wikipedia, which use FlaggedRevs page stabilization. Kartographer mapframe rendering is currently disabled on these wikis; otherwise, stabilized pages would often include blank maps, since the latest map version doesn't match the page contents.

Planned implementation:
Each hop in the pipeline for retrieving maps will support a revid parameter. In conjunction with other URL parameters, this is enough information to render any historical version of an embedded mapframe.
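For illustration, a versioned static-map request URL might be assembled like this (a sketch only; the parameter names other than revid are simplified stand-ins, not the exact URL scheme):

```javascript
// Sketch: assembling a static map image URL that pins a specific revision.
// All parameter names except `revid` are illustrative, not the real API.
function buildStaticMapUrl(base, { domain, title, revid, groups, lang }) {
  const url = new URL(base);
  url.searchParams.set('domain', domain);
  url.searchParams.set('title', title);
  if (revid !== undefined) {
    // The new piece: pin the map data to a specific revision.
    url.searchParams.set('revid', String(revid));
  }
  url.searchParams.set('groups', groups);
  url.searchParams.set('lang', lang);
  return url.toString();
}
```

Omitting revid yields the legacy, title-only request shape, which is what the fallback path below relies on.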

Examples:
Example map with annotations: here

Example broken map, missing annotations: here. This can be seen on historical pages: here

Preview environment

Two revisions of a static mapframe, distinguished by having different colors of marker:
https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Kartographer_versioned_maps_example&oldid=534876
https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Kartographer_versioned_maps_example&oldid=534875

Which code to review

(Provide links to all proposed changes and/or repositories. It should also describe changes which have not yet been merged or deployed but are planned prior to deployment. E.g. production Puppet, wmf config, or in-flight features expected to complete prior to launch date, etc.).

  • Changes to Kartographer: deployed
    • mapdata API responds to revids
    • Static and dynamic maps are rendered with versioned map URL parameters.
    • All new features are behind config flags: only the mapdata API change is enabled by default, and the rendering changes ship disabled.
    • Will merge to the main branch soon; we expect this to go quietly.
  • mapdata API client library passes through the revids parameter: deployed
  • Kartotherian passes through revid parameters: deployed
    • A new config option, versioned_maps: false, disables pass-through and forces requests to fall back to a title-only mapdata request. The default is to allow versioned requests. This knob is our main safety feature; it converts image requests into their legacy equivalents.
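
A sketch of how such a kill switch could behave (function and field names here are illustrative, not the actual Kartotherian code): when versioned_maps is false, the revid parameter is dropped so the request degrades to its legacy title-only form.

```javascript
// Hypothetical sketch of the versioned_maps kill switch described above.
// When the flag is off, revid is stripped before the mapdata request is
// made, so the request looks exactly like a legacy title-only request.
function mapdataRequestParams(config, params) {
  const out = { title: params.title, groups: params.groups };
  if (config.versioned_maps !== false && params.revid !== undefined) {
    out.revid = params.revid;
  }
  return out;
}
```
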
Performance assessment

Please initiate the performance assessment by answering the below:

  • What work has been done to ensure the best possible performance of the feature?
    • Analyzed various caching layers, including the Varnish front-end and the parser cache: T293841, T292049, T295050, T295363
    • An overall assessment of how each step might have negative impacts and what a rollback looks like: T293843
    • Looking into open and active side-issues which may interact with our changes, such as T269984
    • Each code deployment is expected to have no impact, and each configuration change can be rolled back. In particular, the kartotherian pass-through feature flag can transform new, versioned static map image requests into legacy requests. This means we should be able to avoid a mass purge or denial of service in all anticipated failure modes.
    • If we see the need to make cache invariant to revid (see below), we've already experimented with hiding the parameter from Varnish and have a tentative plan to move the parameters out of the URL and into headers, so that ATS reuses thumbnails more often. If our feature causes any performance regression, implementing this fallback plan will compensate and more.
  • What are likely to be the weak areas (e.g. bottlenecks) of the code in terms of performance?
    • The feature itself is not expected to have any weak areas; it changes some parser cache queries, but these would already be happening, just for the wrong revision ID.
    • An existing hazard is that, in the worst-case scenario such as after a mass purge of the upload cache, the maps server would go to 4x normal load (cache hit rate is currently at 75%), for at least 1 hour (upload varnish expiry). The cluster would be unable to serve this load, probably resulting in timeouts, possibly resulting in broken map thumbnails that slowly heal over several hours. Anybody purging the entire cache could spin up additional, temporary maps servers to avoid this.
    • We may have missed some DoS vectors, for example when receiving malicious mapdata requests.
  • Are there potential optimisations that haven't been performed yet?
    • There are some parameters, such as revid, title, and domain, which should be cache-invariant (T293914), allowing cached maps to be reused across history, pages, and sites, but we haven't evaluated whether this is worth pursuing. We suspect the gain might be mostly negated by low edit velocity relative to cache expiry, and by an unavoidable variation on the lang used to translate labels. The most promising potential optimization is that a map might be reused in a common template, possibly making title + revid invariance a valuable way to collapse all pages to use a single image.
    • We're pursuing better cache policies for both performance and functional reasons, for example mapdata API responses can be cached long-term for a specific revision, medium-term for a legacy title-only request, and very short-term for error responses. (T295604, T295130, T269984)
    • We have not looked into the ATS layer at all; its contribution to the hit rate is unknown and there could be tuning opportunities. A given embedded map thumbnail never changes, so refreshing the base map as slowly as once per month would be fine. Investigation task: T297363.
    • After the revid parameter is deployed, we can analyze web requests and estimate what impact further optimizations would have.
    • Data centers could be balanced. Currently, the main data center takes the full load and codfw nothing at all. Maps probably have a strong regional bias, so caching different thumbnails in each location might be efficient.
  • Please list which performance measurements are in place for the feature and/or what you've measured ad-hoc so far.
    • We have one-off queries against the mediawiki API request log to measure cache performance, mapdata API processing time, and responsiveness. For some results, see tasks linked at the top of this section.
    • Kartotherian map server performance can be monitored using this dashboard and the webrequest log (using superset SQL lab). Any negative impact from the versioned maps feature is expected to show up on the "static snapshot requests" graph, in the form of a slow, one-month slope up to a higher burden as img URLs are replaced incrementally when each cached article expires.
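
As a sanity check on the worst-case estimate above ("4x normal load" at a 75% hit rate): with a 75% cache hit rate the origin normally serves 25% of requests, so a full purge (0% hits) multiplies origin load by the reciprocal of the miss rate.

```javascript
// Back-of-the-envelope check of the "4x normal load" worst case:
// the origin normally serves (1 - hitRate) of all requests, so a full
// purge multiplies its load by 1 / (1 - hitRate).
function purgeLoadMultiplier(hitRate) {
  return 1 / (1 - hitRate);
}
```

At the stated 75% hit rate this gives 1 / 0.25 = 4, matching the estimate above.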

Related Objects

Event Timeline


TODO: Create a section for potential further work, such as further cache key optimization which drops the title and domain parameters, in a way that doesn't cause a switchover spike.

awight renamed this task from [stub] Performance review of proposed Kartographer/Kartotherian revid support to Performance review of proposed Kartographer/Kartotherian revid support.Nov 26 2021, 1:28 PM
awight added a project: Performance-Team.
awight updated the task description. (Show Details)
dpifke subscribed.

Thanks for the advance notice! Please keep us updated on the timeline for this being ready; we will keep it on our radar for next quarter's planning.

After chatting with @ema and considering the short cache expiry we found in T295675, we're probably going to drop the clever Varnish changes and do the initial deployment without any cache tweaks. The new plan is to monitor for an impact on cache hit rate, and maybe do another analysis of web request logs to measure how often revid changes without a mapdata hash change, giving us an idea of the level of cache fragmentation we're causing. If the impact is noticeable, we can follow up with the Varnish changes. These changes will probably be done slightly differently, however: instead of manipulating the vcl_hash, we'll move the revid parameter to a header like X-Revision-ID, hiding it from ATS. Kartotherian will look for this header, but ATS will be invariant to it.
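
A sketch of what the Kartotherian side of that could look like (hypothetical code, not the actual implementation): prefer the X-Revision-ID header when present, fall back to the revid query parameter, so the CDN can cache the URL without revid in its key.

```javascript
// Hypothetical sketch: resolve the effective revision for a request.
// The header takes precedence so ATS can be kept invariant to revid,
// while the query parameter remains as a fallback.
function effectiveRevid(req) {
  const header = req.headers['x-revision-id'];
  if (header !== undefined) {
    return parseInt(header, 10);
  }
  return req.query.revid !== undefined ? parseInt(req.query.revid, 10) : null;
}
```
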

awight updated the task description. (Show Details)

@dpifke The beta cluster preview is ready for demonstration, please see links in the task description. Let us know if we can help in any way, this feature is currently my team's highest priority.

awight added a subscriber: MSantos.

@Krinkle what about using https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes/elementtiming for the map image? We will then automatically get the metric in the synthetic tools, and have the ability to get it from our real users, from browsers that support it, once we start collecting it.

Sounds like a good idea! fwiw, there is also useful timing information in the webrequest log (see T295675), which shows us how long it takes to render maps. My expectation for the feature we're introducing is that it will have no impact on existing sites, and for newly enabled sites with page stabilization it may give slightly lower render times than the latest revision would have, since the stable revision is more likely to already be in the ParserCache.

@Krinkle @dpifke @Peter
Thanks a lot for handling this so far! We were wondering what the current status of this review is and whether you could already say something about the future timeline?
Or, to put it a bit differently, can we treat the relative silence as a good sign? :)

I'll try to complete the perf review this week. I haven't looked at it yet.

@Krinkle any news or updates on this? The suspense is killing us. :)

The below is mostly based on reviewing the code from this diff
https://github.com/wikimedia/mediawiki-extensions-Kartographer/compare/wikimedia:d6b6e4b...wikimedia:5022420

API call to RevisionLookup

The singular RevisionLookup calls look suboptimal. I see Thiemo also pointed this out at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Kartographer/+/730486

I would expect ApiPageSet, ApiBase, and/or the services you call to have a built-in way to get the data you need in a more efficient, less ad-hoc manner. Having said that, I was unable to find anything in the extension, in mapdata.js, or the maps service code that would call this with multiple revisions or multiple titles.

API call to ParserCache

Calling getParserOutput by rev ID stands out as risky. No surprise here given this is central to what this Kartographer feature is going to do, but worth explaining I think.

In this case, you're calling WikiPage::getParserOutput, which uses the ParserOutputAccess service, which as of last year includes a cache for old revisions (T244058). However, this oldid-cache is short-lived and should not be exposed through regular pageviews or their API requests, as that would likely cause a great deal of churn and on-demand page parsing, which would be rather slow for users. The infra will manage by limiting their concurrency and debouncing them via memc, but that's meant to help the infrastructure. It only helps the user insofar as it makes the request not fail (details at T244058).

From the research at T292049 (nicely done, @awight and team!) I know that you're aware of FlaggedRevs having its own parser cache to avoid this kind of churn. But, unlike core's new short-lived by-revision parser cache, the persistent ParserCache in FlaggedRevs appears not to be integrated behind the ParserOutputAccess service.

I have not empirically verified this, but FlaggedRevs hooks into the Skin and OutputPage specifically to render its own content when needed, and when doing so it contacts its own ParserCache and PoolCounter instances. Programmatic access to ParserOutput by other means does not involve FR or its persistent caching backend. This makes historical sense, since FR is mainly a UI-only and view-only feature. When editing pages or querying the API, it doesn't enforce "stable" revisions, so use of latest or by-rev lookups in APIs creates the same demand with or without FR, and FR caching is thus not needed there.

In addition, we've historically avoided rendering a page with data composed by multiple PHP processes that each have to re-create the same parser context as doing so is inherently slow and fragile (parsing is affected by hooks, user preferences, time of day, and embedded templates and extension functions). E.g. viewing a frequently edited page would mean you either get a cache hit (by page) from a newer and incompatible revision, or a cache miss for the currently-viewed revision that is slow to generate and even then may still end up incompatible with what you had when you first loaded the article. All this gets worse for logged-in users who may have preferences that further reduce or remove cache-ability, thus waiting multiple times for the same page while viewing it (once during the HTML req, and again during the API req).

This is why we generally store things on the backend into ParserOutput, to keep it atomically together, and then pass it on to the client as-needed. Everything you need is either there, or computed ahead of time and put somewhere you can find it such as Score and Math artefacts. (see also the complexity, fragility, and subsequent failure around separately storing and loading Cite references; ref T125329, T123328).

As such, I suspect that the new Kartographer API code would be uncached and experience constant re-parsing when called from FR-stable articles. If true, that would likely be a blocking perf concern.

Question: The Kartographer-by-revid feature is motivated by dewiki's use of FlaggedRevs. However, it is not clear to me whether it is intended/required to apply this to any oldid, or whether enabling it only on stable FlaggedRevs pageviews would be an accepted outcome. The answer might change which approach reduces avoidable risk, complexity, and load.

Short-term: caching of maps.wikimedia.org

I don't know this for sure, but I am assuming that in the common case, static maps are generated on-demand (like other MW thumbnails), and thus the maps service calls out to this during pageviews.

The maps.wm.o URL seems to have a weak cache-control of "max-age=3600". I'm assuming these get purged on edit, but even if not, for the new URLs with a revid param, this should be safe to increase a lot. That would reduce exposure to the backend API latency.

Correctness

In reviewing changes to the mapdata JavaScript, I was wondering how this plays into the larger picture. My rough understanding is that mapdata.js is used for rendering during edit previews and on a minority of wikis that don't use static maps rendered by maps.wikimedia.org. Is that correct? If so, that means the backend call to api.php for prop=mapdata from this particular JS is not in the critical path, and its latency and cacheability are thus not of direct concern if it would only be called by the maps service and CDN-cached after that.

This raises an interesting problem. If the server-side rendering didn't exist and all calls to api.php come from the same browser session as the page view, then you get "correctness" for free. I'm not recommending this of course, but it's important to think about. Within MW api.php, we can make sure that users without a session share a CDN cache, and those with a session get their various preferences and hook influences consistently applied to both the HTML request and the API request, and thus you're as close as possible to getting a stable and consistent result. E.g. imagine a template or Lua module varying by user language, some BetaFeature, or some other preference or parser option.

When the maps service makes its API request, all it has is the basic query parameters (wiki-id, page title, revid, map hash). Assuming the map data is only stored by Kartographer's parser hook into the ParserOutput and not in some other persistent storage place identified by its hash, then that means it will likely sometimes not be able to find this data. If the hash is deterministic and only variable by its own content (not by anything else in the parser state or offset etc.) then that's presumably stable and functional so long as we don't support any kind of variability within the map data (e.g. not generated by or with the use of any templates, Lua modules, or parser functions; unless done in a deterministic way that is the same regardless of current time, user, or user prefs etc.)

Push and pull

If the aforementioned limitation of "maps must not vary by any wikitext features or parser options" is not meant to exist, then that creates an interesting problem because that means you basically cannot reliably re-create the data from these parameters.

A different approach to this data long-term might be called for. One slightly less basic than today, but arguably less fragile and simpler to reason about. E.g. store the data somewhere in a database and uniquely identify it directly by its hash, thus removing the problem of different caching layers having to be aligned and the inability to recompute it. On the flip side, it means we would be storing derived data for each revision instead of merely the page as a whole, something we generally haven't done before and which could get out of control quickly. I considered something similar in 2011 as part of T3048, although an important limitation to that proposal was that it would be by revision only, not unbounded beyond that. And that use case has essentially been solved by MCR since then, which allows for relatively easy append-only storage with limited or no duplication of data in permanent storage, and is naturally limited to a single blob by revision. It does this by forcing you to edit it outside the wikitext, where such variance thus doesn't exist in the first place. We haven't really figured out yet how we use MCR in user interfaces; so far we've only used it on Commons for Wikibase, where it kinda works by virtue of being API- and JS-only. It's not editable from the File page without JavaScript, and inaccessible from the regular edit page.

Maps could be a good use case to cross this bridge and figure out how we'd go about letting users add an optional slot to a page, and be able to modify and save both that and the wikitext at the same time. E.g. the maps slot would be a list of map data (possibly visually edited, or as an array of JSON or something) and the wikitext can reference this.

On the other hand, as I'm writing this I realise that if we're okay with this limitation of varying only by revision ID, then the basic approach we have today of using ParserCache and re-parsing as needed, should be fully functional already (albeit slow on cache miss, but that's remedied by forcing canonical options via c and generally assuming that stuff we view is naturally recently ensured to be in PO; which isn't always true but a reasonable long-term objective). And the extension already uses newFromAnon, and so effectively already enforces that mapdata doesn't vary in any way. The user would immediately see it not working if they tried.

So, as much as it'd be fun to re-think this all (e.g. using a LinksUpdate hook that stores JSON blobs in a new kartographer link table by page_id, hash, blob, and links_updated_timestamp, removing outdated entries at the same time; or, entirely differently, modelling after Math and Score in a more integrated way that stores files eagerly), it seems like we don't actually need or want this. Great!

Answers to T292049 questions

In T292049, @awight wrote:

There may be an additional memcache tier "on-host" (not sure what this means) in some cases.

There were until recently two layers of primary ParserCache. First, the ParserCache databases, which are not subject to LRU and store entries for the specified duration. And second, key-value pairs in the main Memcached cluster, where we expect only hot objects to remain for short periods at a time, to improve latency and reduce DB load.

As of last year, there is a third layer. When MW queries Memcached for a ParserCache key, our Mcrouter configuration passes those specific queries through a small localhost Memcached instance, local to each MW appserver and dedicated to keeping a copy of requested ParserCache values for only 10 seconds. This isn't so much for performance as for infrastructure reasons: it reduces the network bandwidth of the memcached servers, since these are the largest objects we send around there.

Both of the Memc layers work transparently and can be forgotten about if you like.
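
To illustrate the tiering, here is a toy read-through model (a simplification only; tier contents, TTL handling, and backfill behavior are assumptions for illustration, not our Mcrouter configuration):

```javascript
// Toy read-through chain mirroring the tiers described above: a tiny
// short-TTL on-host cache, the shared Memcached cluster, and the durable
// ParserCache databases, ordered fastest-first.
class TieredCache {
  constructor(tiers) {
    this.tiers = tiers; // each tier: { get(key), set(key, value) }
  }
  get(key) {
    for (let i = 0; i < this.tiers.length; i++) {
      const value = this.tiers[i].get(key);
      if (value !== undefined) {
        // Backfill faster tiers so the next lookup is cheap.
        for (let j = 0; j < i; j++) this.tiers[j].set(key, value);
        return value;
      }
    }
    return undefined;
  }
}
```
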

In T292049, @awight wrote:

If we need to lower the retention period for a given page, it can be done by calling ParserOutput->updateCacheExpiry().

Is this currently being considered somewhere? I'd be interested to know where it came up around this feature. It's a useful thing to know about indeed, but afaik would not be needed here.

Summary

  • The API seems to have regressed from cached to mostly uncached, which could be a blocker.
  • The maps service request seems rather short in its CDN cache, I'd recommend increasing this if possible for static image requests that use a revid. If not, I'd be interested in understanding why that is unsafe, as I've likely misunderstood other aspects in that case.

Thanks for these detailed notes! I'll try to address issues one at a time for better threading.

API call to RevisionLookup

The singular RevisionLookup calls look suboptimal. I see Thiemo also pointed this out at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Kartographer/+/730486

I would expect ApiPageSet, ApiBase, and/or the services you call to have a built-in way to get the data you need in a more efficient, less ad-hoc manner. Having said that, I was unable to find anything in the extension, in mapdata.js, or the maps service code that would call this with multiple revisions or multiple titles.

The current code uses ApiPageSet#getLiveRevisionIDs

Note also that Kartotherian only makes singular calls to the mapdata API, so any optimizations here will not be needed yet.

API call to ParserCache

[....] unlike core's new short-lived by-revision parser cache, the persistent ParserCache in FlaggedRevs appears not to be integrated behind the ParserOutputAccess service.

This sounds like a problem!

Perhaps we'll have to add special handling to peek at caches and pull from the one that's already populated, if possible. Or better yet, include this logic in the ParserOutputAccess service so that it can transparently use the most efficient data source. I'm creating a new subtask T304306 to investigate this question.

As such, I suspect that the new Kartographer API code would be uncached and experience constant re-parsing when called from FR-stable articles. If true, that would likely be a blocking perf concern.

Agreed.

Question: The Kartographer-by-revid feature is motivated by dewiki's use of FlaggedRevs. However, it is not clear to me whether it is intended/required to apply this to any oldid, or whether enabling it only on stable FlaggedRevs pageviews would be an accepted outcome. The answer might change which approach reduces avoidable risk, complexity, and load.

Our main concern is to support mapdata on the stable revision, so I believe that we can consider limiting to this one use case. However, I don't expect this scope reduction to be much help to performance or code complexity. The general use case is that the end-user views a page containing maps, and these are rendered. If it's a stable or latest revision, the parser output is likely available through the caches discussed already. If it's an arbitrary revision, the page must be parsed in order to render, and by definition the parser output is then available at little cost, as I understand it.

The only exception I can imagine is if a rogue script is scanning through old revisions to pull mapdata, but the exact same risk already exists if this script is scanning through old page revisions to render the page as HTML.

Short-term: caching of maps.wikimedia.org

I don't know this for sure, but I am assuming that in the common case, static maps are generated on-demand (like other MW thumbnails), and thus the maps service calls out to this during pageviews.

The maps.wm.o URL seems to have a weak cache-control of "max-age=3600". I'm assuming these get purged on edit, but even if not, for the new URLs with a revid param, this should be safe to increase a lot. That would reduce exposure to the backend API latency.

Definitely! We have a task to improve this for the API in T301341: Set cache control for mapdata API with old revisions, and I think the same can be done for static map images, since the revid param makes the response much more constant. The static map images might get a slightly lower TTL, since underlying map tile and geoshape changes should be reflected; maybe 1 day is reasonable.
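
A sketch of what that policy could look like (the specific TTL numbers below are placeholders for discussion, not decided values):

```javascript
// Hypothetical cache-control policy for map responses, following the
// tiers discussed above: very short for errors, medium for legacy
// title-only requests, long for revid-pinned responses, with static
// images slightly shorter so tile/geoshape refreshes propagate.
function cacheControlFor({ revid, isError, isStaticImage }) {
  if (isError) return 'max-age=60';                 // very short: errors
  if (revid === undefined) return 'max-age=3600';   // legacy title-only
  if (isStaticImage) return 'max-age=86400';        // revid-pinned image, ~1 day
  return 'max-age=2592000';                         // revid-pinned mapdata, ~30 days
}
```
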

This is also discussed in the future plans section of the task, "Are there potential optimisations that haven't been performed yet?".

Correctness

In reviewing changes to the mapdata JavaScript, I was wondering how this plays into the larger picture. My rough understanding is that mapdata.js is used for rendering during edit previews and on a minority of wikis that don't use static maps rendered by maps.wikimedia.org. Is that correct?

Nearly correct, but it's messier. The JavaScript wgKartographerLiveData config variable is populated any time a page is viewed or previewed: https://github.com/wikimedia/mediawiki-extensions-Kartographer/blob/master/includes/Tag/TagHandler.php#L338 . I suspect that mapdata requests are mostly coming from the kartotherian backend as you say, but the client includes code which will fall back from the JS variable to make a mapdata request, under unknown circumstances: https://github.com/wikimedia/mediawiki-extensions-Kartographer/blob/master/modules/box/data.js
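
A rough approximation of that fallback path (illustrative only; this is not the actual modules/box/data.js logic, and fetchMapdata stands in for the real API call): use the group's data from wgKartographerLiveData when present, otherwise fall back to a mapdata request.

```javascript
// Hypothetical sketch of the client-side fallback described above:
// prefer the inlined wgKartographerLiveData config, and only make a
// mapdata API request when the group is missing from it.
function getGroupData(groupId, liveData, fetchMapdata) {
  if (liveData && Object.prototype.hasOwnProperty.call(liveData, groupId)) {
    return liveData[groupId];
  }
  // Fallback: ask the mapdata API (injected here for testability).
  return fetchMapdata(groupId);
}
```
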

I'm afraid I don't follow the rest of the concerns under this section. I don't see any way for the wrong mapdata to enter the system, either on the client or server. Either we view a historical revision (revid is given) and the mapdata is always the same for this revision, or this is an edit preview and mapdata is present as a JS config variable.

Push and pull

If the aforementioned limitation of "maps must not vary by any wikitext features or parser options" is not meant to exist, then that creates an interesting problem because that means you basically cannot reliably re-create the data from these parameters.

This is a question we hadn't explored, but it's orthogonal to versioned maps IMHO. Either maps are deterministic for a given (revid, language, mapdata geojson hash), which is the assumption we see across the code (for example when caching the JS wgKartographerLiveData), or they're not deterministic, in which case we need to fundamentally change how rendered map images and mapdata are cached. I'm in favor of strengthening the assumption that a map will be deterministic. We can look at harder newFromAnon-style flags if needed.

So, as much as it'd be fun to re-think this all (e.g. using a LinksUpdate hook that stores JSON blobs in a new kartographer link table by page_id, hash, blob, and links_updated_timestamp, removing outdated entries at the same time; or, entirely differently, modelling after Math and Score in a more integrated way that stores files eagerly), it seems like we don't actually need or want this. Great!

/me mops sweat from brow :-D

@Krinkle I'm curious about the workflow here: is the understanding that there are no more performance blockers to our deployment? We're working on T304813: Mapdata API should use the FlaggedRevs stable revision ParserCache when appropriate, which you uncovered above; it's not expected to cause any performance issues until after deployment, since the problem will scale with the number of page-stabilized articles having maps. Still, I'd like to fix the issue before deployment so that we're certain it *can* be fixed.

Change 778279 had a related patch set uploaded (by Awight; author: Awight):

[mediawiki/extensions/Kartographer@master] Workaround for FlaggedRevs stable-revision caching

https://gerrit.wikimedia.org/r/778279

Change 778279 merged by jenkins-bot:

[mediawiki/extensions/Kartographer@master] Workaround for FlaggedRevs stable-revision caching

https://gerrit.wikimedia.org/r/778279

@Krinkle we fixed the API cache issue and tests on the beta cluster seem to show that the way we did it works. See https://gerrit.wikimedia.org/r/778279

It's currently a workaround, because the solution that seems more proper for addressing the MediaWiki-extensions-FlaggedRevs caching needs involves some bigger changes in core and might have some more controversial aspects. See the rough sketch in T304813: Mapdata API should use the FlaggedRevs stable revision ParserCache when appropriate. Still, the current approach seems pretty solid in solving the issue with the API calls.

So we're currently wondering what the next step would be, whether another official review/"go" from your side is needed before moving forward with the deployment, and how we can get that :-).