Page MenuHomePhabricator

Decide on Cache-Control headers for map tiles
Closed, ResolvedPublic

Description

With https://gerrit.wikimedia.org/r/#/c/408832/ we're sending public, max-age=86400, s-maxage=86400

This is not ideal, because there is no purpose to s-maxage if it's set the same as max-age.

Instead, we should be setting stale-while-revalidate so the cache can fetch a new tile in the background.

last-modified (always == epoch(0)) and etags (varies between nodes) headers are still wrong and should be fixed or removed (tracked in T187300).

This task is related to T108435

Details

Related Gerrit Patches:
maps/kartotherian/deploy : masterLower client cache TTL to 15 minutes
maps/tilerator/deploy : masterLower client cache TTL to 15 minutes

Event Timeline

Pnorman created this task.Feb 7 2018, 5:53 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 7 2018, 5:53 PM
Restricted Application added a project: Operations. · View Herald TranscriptFeb 7 2018, 5:53 PM
Gehel added subscribers: ema, BBlack.Feb 8 2018, 9:57 AM

I'm not sure that our varnish configuration honors stale-while-revalidate headers. A quick look through the code shows a test case for the text cluster (maps is behind the upload cluster). @BBlack and @ema might be able to confirm (I'm fully lost in VCL).

Also note that last-modified (always == epoch(0)) and etags (varies between nodes) headers are still wrong.

Yeah it doesn't honor stale-while-invalidate directly at this time. It does implement stale-while-revalidate -like behavior for all cache objects, but it's fixed at 5 minutes under healthy conditions, and up to 60m if there's a major problem going on at the inter-varnish level (e.g. esams can't reach eqiad to refresh).

In the long term we'll eventually transition to asking applications to use something like Surrogate-Control to control our in-house caching and leave Cache-Control purely for external consumption, but we're not there yet.

fgiunchedi triaged this task as Medium priority.Feb 12 2018, 10:04 AM
Gehel updated the task description. (Show Details)Feb 13 2018, 6:09 PM
Restricted Application added a project: Maps. · View Herald TranscriptFeb 14 2018, 12:00 AM
Restricted Application added a project: Discovery. · View Herald TranscriptFeb 14 2018, 12:01 AM
ema added a comment.Feb 14 2018, 10:59 AM

Anything else left to be discussed here?

ema moved this task from Triage to Watching on the Traffic board.Feb 14 2018, 11:00 AM

Because cache-control is in one of the sample config files, we should make sure it's something sensible, even if we use something different in production.

For I'd be inclined towards 1h max-age, then 23h stale-while-invalidate where stale responses can be served.

In the long term we'll eventually transition to asking applications to use something like Surrogate-Control to control our in-house caching and leave Cache-Control purely for external consumption, but we're not there yet.

Will this bring stale-while-revalidate support?

Pnorman claimed this task.Mar 27 2018, 6:17 PM

Going to take another look at a couple of things here

Mholloway moved this task from Backlog to To-do on the Maps-Sprint board.Jul 3 2018, 4:57 PM

As discussed today, let's gradually make these shorter (e.g., 24h, 12h, 6h, 3h, 1h) and closely monitor the effect on load.

These are set (currently hard-coded per source) in config.yaml.j2 in kartotherian/deploy.

Catching up a little here: what I'm seeing right now on tile images, from the public POV, is basically:

Cache-control: public, max-age=86400, s-maxage=86400

It's being passed through straight from kart (no Varnish interference), and implies both Varnish and the Browser can cache for 24h.

The public TTLs perhaps should be much shorter, e.g. 15 minutes; just long enough to help with interactive map scroll/zoom in a typical single session, so that purging on the Varnish side is effective for the user within reasonable time.

Reducing the Varnish-level TTLs seems counter-productive for efficiency at all levels, in the long run. The right answer is probably to have tilerator purge tiles from Varnish when they're updated, and in the shorter term we could work on a simple methodology for banning "bad" content when isolated events happen.

Gehel added a comment.Aug 21 2018, 4:39 PM

Reducing the Varnish-level TTLs seems counter-productive for efficiency at all levels, in the long run. The right answer is probably to have tilerator purge tiles from Varnish when they're updated, and in the shorter term we could work on a simple methodology for banning "bad" content when isolated events happen.

The intuition here (and that intuition might very well be wrong) is that the performance improvement between 1h cache and 24h cache is probably not that much (that is confirmed by @Pnorman's experience on OSM). The previous discussions about proactive cache invalidation was that it is a fairly hard problem, because of the scale of those invalidations (we peak around 500 tiles regenerated per second when we synchronize with OSM). But that previous discussion was a long time ago, I might be wrong on the constraints.

Ideally, we should dig into the request logs, and evaluate the efficiency gain between 1h of cache and 24h. And balance that against the complexity of having proactive invalidation of tiles. Progressively reducing TTL and looking at the increase in number of requests reaching kartotherian seems like a reasonable shortcut (but I'm very open to argue this point).

Having a process to invalidate an area is something we've been looking into. While it is easier than proactive invalidation, it is still not trivial, at least with our understanding of how varnish ban work. We might be missing something here.

Maybe we should take 20 minutes for a face to face meeting to make sure we all have all infos?

Change 454614 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[maps/kartotherian/deploy@master] Lower client cache TTL to 15 minutes

https://gerrit.wikimedia.org/r/454614

Change 454615 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[maps/tilerator/deploy@master] Lower client cache TTL to 15 minutes

https://gerrit.wikimedia.org/r/454615

Change 454615 merged by Gehel:
[maps/tilerator/deploy@master] Lower client cache TTL to 15 minutes

https://gerrit.wikimedia.org/r/454615

Change 454614 merged by Mholloway:
[maps/kartotherian/deploy@master] Lower client cache TTL to 15 minutes

https://gerrit.wikimedia.org/r/454614

Deployed to beta cluster. Note that we won't be able to deploy the updated max-age to production until the production upgrade to Stretch (T198622) is complete.

Gehel added a comment.Oct 2 2018, 9:35 AM

While we validated with @BBlack that the expected invalidation load on varnish is reasonable, we have not checked that this load is also reasonable on eventbus. We should check if it is the case before activating this on production.

Do we want to keep this one open to wait for the stretch migration and checking on the eventbus load, or should we spin up new tasks?

@Gehel, do you mean specifically the additional EventBus load generated by activating resource_change events sent from Tilerator (T109776: Tilerator should purge Varnish cache)? Otherwise it's not clear to me where EventBus is implicated here.

@Gehel, do you mean specifically the additional EventBus load generated by activating resource_change events sent from Tilerator (T109776: Tilerator should purge Varnish cache)? Otherwise it's not clear to me where EventBus is implicated here.

Yes, that's what I meant. maybe I should have added that comment on T109776, but it was also related to the conversation here.

Hey @Mholloway, what is next for this? Resolve, clarify what is needed before resolving, make a follow up task, or something else? Thanks

@Jhernandez Looks like this one has a change waiting to roll out when the Stretch migration is complete, so per our discussion earlier this week let's keep it in To Deploy. Then I think it can be resolved.

MSantos closed this task as Resolved.Apr 12 2019, 4:35 PM