Estimate impact of our changes on production caches
Closed, Resolved | Public | 2 Estimated Story Points

Description

Questions to ask for each cache:

  • Identify any patchsets which will affect cache and list them (we will eventually pass these on to external reviewers).
  • Exactly which cache is affected? For example, the "upload" cache in varnish?
  • Are there existing or new DoS vulnerabilities? (Document in a private task, if severe.)
  • What is the worst-case impact?
  • Is there a step change when we deploy?
  • What's the impact on wikis that already use mapframes? And on ones where we newly enable?

Cache interactions to review:

  • Varnish for static mapframes
  • ParserCache (main)
  • ParserCache (FlaggedRevs)
  • RevisionOutputCache
  • API (also varnish)

Event Timeline

awight set the point value for this task to 5.

Our changes will affect all wikis that already have <mapframe> enabled. On these wikis, old revisions will start showing correct maps with markers. Here is an example: https://en.wikipedia.org/wiki/Special:Diff/942875073. All revisions after this edit have the same markers and therefore the exact same map, and the markers are visible in all of them. Before that edit the markers were different, and those older revisions currently show an empty map.

Notes:

  • See T295050 for an analysis of the mapdata API's caching behavior.
  • I believe it's impossible to run into any load issues on wikis that currently don't have <mapframe> enabled. There is just nothing to render on these wikis. Maps will be added slowly by the community.
    • A worst-case scenario is a template that, all of a sudden, starts rendering thousands of maps. But this is not new. The same can happen on all wikis that have <mapframe> enabled already.
  • I believe there is nothing to worry about regarding dynamic maps. They will continue to work as before.
    • One special case to watch is old revisions. These currently work because the relevant JSON is embedded in the HTML, where the dynamic JS can find it. There is no reason for us to change this.
    • Warning: T149855 might break this. But again, this is unrelated to what we plan to do with revids.
  • One thing we recently learned is that static maps are relevant for 100% of the wikis. See above. Think of no-JS clients. The following things will happen:
    • All map image URLs in the HTML will change and contain a …&revid=… from now on. This will happen slowly over time, whenever a page is purged from the parser cache or a user triggers a purge earlier. There should be zero extra load on the parser and parser cache.
    • For almost all of these new URLs an existing map image will be found in the Varnish cache. This is because we want Varnish to cache the map images as before, without any revid information. In other words: 100% of the existing map images will be reused. Even the broken, empty ones.
    • New map images are only created when Varnish has dropped them after a while. This happens at the exact same rate as before. It's entirely unrelated to any of the parser or revision caches. The fact that some URLs now contain a revid doesn't change anything here. The revid is stripped and the cache lookup is done based on the old URL.

In other words: As long as we don't actively purge Varnish, nothing will happen. The only effect of the additional revid is that when a map image needs to be re-rendered (but only then), the revid helps render a better image, one that is not empty as it is now. But there are zero additional requests anywhere. There are not even additional .png images that need to be cached. They are all cached already. Some of them are just empty at the moment.
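The lookup behavior described above (strip the revid, then look up by the old URL) can be modeled in a few lines. This is only an illustrative sketch in Python, not the actual Varnish rule, which is not shown in this task. The function name `cache_key`, the host `maps.example.org`, and the query parameters other than `revid` are made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_key(url, ignored_params=frozenset({"revid"})):
    """Return the URL with the ignored query parameters removed.

    Hypothetical model of the cache lookup key. The real normalization
    would live in the Varnish configuration.
    """
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored_params
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Made-up URLs: the same image with and without the new revid parameter.
legacy = "https://maps.example.org/img/a.png?lang=en&title=Berlin"
with_revid = legacy + "&revid=942875073"

# Both URLs must map to the same key, so the cached image is reused.
assert cache_key(with_revid) == cache_key(legacy)
```

If the two keys ever differed for a real pair of URLs, every image would look uncached after deployment, so a check of this shape against sampled production URLs may be worth running before the rule goes live.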

Note we merged T269984 recently. This will have an impact on the servers. With this change, a new image URL without any group information is created every time a page is previewed. These images are currently not in the Varnish cache. This will create a spike the moment it is deployed. The miss rate will equal the rate at which pages are previewed for the first time. It will flatten fast. Every later preview will usually hit cached map images.
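The expected shape of that spike can be illustrated with a toy model: a request misses only the first time its URL is seen, so misses are front-loaded and then fall away. This is a sketch under strong assumptions (no eviction, a made-up request stream, a hypothetical helper name `count_misses_per_window`), not a prediction of real traffic.

```python
def count_misses_per_window(requests, window):
    """Count first-time (uncached) requests in each consecutive window.

    Simplification: nothing is ever evicted, so a URL misses at most once.
    """
    seen = set()
    misses = []
    for start in range(0, len(requests), window):
        count = 0
        for url in requests[start:start + window]:
            if url not in seen:
                seen.add(url)
                count += 1
        misses.append(count)
    return misses


# Made-up preview stream: three map images requested repeatedly.
previews = ["A", "B", "A", "C", "A", "B", "C", "A", "B"]
print(count_misses_per_window(previews, window=3))  # prints [2, 1, 0]
```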

awight changed the point value for this task from 5 to 2. Nov 10 2021, 11:07 AM
awight set Final Story Points to 5.

Minor detail to check: does RevisionOutputCache include mapdata? We want to know if looking at maps on a historical revision causes any unusual load. What are the timeouts on this pipeline? Is it possible that timeouts will cause repeated failures?

The worst-case scenario is that we fail to match Varnish hash IDs with the legacy URLs, so that every image needs to be recreated when we deploy. This will be detectable as a spike in Kartotherian, at which point we revert the Varnish rule.