Page MenuHomePhabricator

Add proper expiry headers to kartotherian's responses
Closed, ResolvedPublic

Description

Kartotherian should set expiry header on the tile responses to let Varnish know the policy

This task is connected to T186732

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Yurik moved this task from Backlog to To-do on the Maps-Sprint board.Aug 8 2015, 4:19 AM
MaxSem moved this task from To-do to In progress on the Maps-Sprint board.Aug 19 2015, 10:09 PM
MaxSem claimed this task.
MaxSem moved this task from In progress to To-do on the Maps-Sprint board.Aug 25 2015, 5:44 PM
MaxSem moved this task from To-do to In progress on the Maps-Sprint board.Aug 25 2015, 8:48 PM
MaxSem moved this task from In progress to To-do on the Maps-Sprint board.Sep 1 2015, 7:23 AM
MaxSem moved this task from To-do to Backlog on the Maps-Sprint board.Sep 1 2015, 10:52 PM
Yurik moved this task from All map-related tasks to Tilerator on the Maps board.Sep 13 2015, 6:41 AM
Yurik moved this task from Tilerator to Kartotherian on the Maps board.Feb 7 2016, 10:06 PM
Yurik moved this task from Backlog to Core on the Maps (Kartotherian) board.Apr 21 2016, 9:07 PM
Restricted Application added a subscriber: TerraCodes. · View Herald TranscriptApr 21 2016, 9:07 PM
MaxSem removed MaxSem as the assignee of this task.May 26 2016, 10:45 PM
Yurik moved this task from Backlog to To-do on the Maps-Sprint board.May 26 2016, 10:46 PM
Yurik added a subscriber: BBlack.May 29 2016, 9:56 AM

According to Google's dev site:

Cache-Control header was defined as part of the HTTP/1.1 specification and supersedes previous headers (e.g. Expires) used to define response caching policies. All modern browsers support Cache-Control, hence that is all we will need.

I don't think we return Cache-Control header, so should we retarget this task to implement Cache-Control header?

MaxSem added a comment.Nov 2 2016, 9:22 PM

I propose to start with just making every request Cache-Control:public, max-age=86400, s-maxage=86400 then we can bump the lifetime for tiles once we get purges right. API and static images can stay with 1 day lifetime.

Yurik removed a project: Maps.Dec 15 2016, 4:38 AM
MaxSem renamed this task from Add proper Expiry header to kartotherian's responses to Add proper expiry headers to kartotherian's responses.
MaxSem added a subscriber: Gehel.
Gehel added a comment.EditedMar 9 2017, 6:53 PM

We should also remove the ETag header or make it consistent between nodes as it is part of caching logic. And Last-Modified should not be epoch(0), we should drop it as well.

I chatted with @Gehel about this and, whilst we both agree that it is a must have for T155601: Stabilizing Interactive Products, we don't want to start working on it until we've resolved some of the other "must have" tasks to limit context switching due to having too much work in progress.

MaxSem claimed this task.May 2 2017, 12:11 AM
MaxSem moved this task from To-do to In progress on the Maps-Sprint board.May 2 2017, 11:05 PM

Change 351224 merged by MaxSem:
[maps/kartotherian@master] Add caching headers

https://gerrit.wikimedia.org/r/351224

BBlack added a comment.May 9 2017, 8:13 PM

Are you sure you want the tiles public-cacheable as well? It takes load off of us, but it also puts the purging/invalidation of them on update out of our control (in the users' caches). We might want to sync up on how we want VCL to handle/mask the header as well (and all related things), maybe on IRC or Hangouts when we all get a chance.

Gehel added a comment.May 11 2017, 7:28 PM

What we are sure at this point:

  1. etags are incoherent between servers, they sould either be fixed, or removed
  2. last-modified is epoch(0) which is wrong, it should be fixed or removed
  3. in normal operation, a TTL of 24h on tiles is fine, having delay between OSM updating the underlying data and us publishing updated tiles is fine, exposing a cache-control header should be done
  4. in case of issue (if we expose bad tiles, see T159631), it would be nice to be able to invalidate those tiles faster than the 24h delay

I'd say that 1. and 2. are clearly wrong and need to be fixed. 3 is a low hanging fruit (and we already have the fix). 4 is a nice to have in case of trouble. So that's at least the priority order.

debt moved this task from In progress to Backlog on the Maps-Sprint board.May 30 2017, 7:39 PM
debt added a subscriber: debt.

This is still broken and needs to be fixed, but we don't have a full team to work on it at this time; moving to backlog.

Gehel added a comment.Jun 1 2017, 12:54 PM

It looks like fixing this in the application will be non trivial. We don't even understand where those headers come from and grepping the code does not help (I personally don't understand much of how Node works, but I trust @MaxSem analysis on this). Which leaves 3 options:

  • continue digging into kartotherian, mapnik and find a way to fix this in the correct place: probably long and expensive
  • overwrite the headers in varnish: ugly, varnish role is not to fix application issues
  • add a reverse proxy (nginx) in front of kartotherian, just to rewrite headers: I'm not very keen on adding a component just to rewrite headers

My understanding from @BBlack is that we don't want to add application specific configuration in varnish, my understanding from @MaxSem is that this is something that is already done for multiple applications.

debt removed MaxSem as the assignee of this task.Jun 6 2017, 7:45 PM
debt moved this task from Backlog to To-do on the Maps-Sprint board.

Moving to prioritized as it's on our list of things that do need doing.

My understanding from @BBlack is that we don't want to add application specific configuration in varnish, my understanding from @MaxSem is that this is something that is already done for multiple applications.

Both of these statements are true. We have a history of caving in to providing application-specific hacks (esp for MediaWiki), and we're trying to avoid that as much as possible going forward. The only thing that costs us more in the long run than doing the legwork to fix something properly in the application layer is hacking a fix into the Varnish layer instead. Promises to eventually move the Varnish fixups to the applayer are generally never delivered on, as it's hard to apply pressure to fix something that's already "working". The fixups accumulate into enormous debt and complexity over time at the cache layer, doing jobs that aren't properly the caches' to do.

Pnorman added a subscriber: Pnorman.

Removing easy tag, since it's turning out not to be.

debt added a subscriber: gabriel-wmde.EditedAug 3 2017, 7:21 PM

Hi @GWicke - would you or your team be able to help us out with this? Thanks for any info! :)

debt added a subscriber: GWicke.Aug 3 2017, 7:26 PM
debt lowered the priority of this task from High to Normal.Sep 5 2017, 7:16 PM
debt moved this task from To-do to Stalled/Waiting on the Maps-Sprint board.

Not doing this task will require us to manually flush and update Varnish for tiles that are out of sync. We haven't done this in a while, and only have had to do in a few instances, but it's painful, thus why this ticket was written.

Varnish is currently caching too much for too long, meaning that new tiles will take much longer to update for the end users. It's using an old epoch date (1970-?) for it's caching date (time since downloaded not time since last updated).

When we update all tiles with the new map style, we'll need to do a mass purge of all maps tiles in order to purge Varnish completely, which might cause some issues for at least 24 hours. This might cause some issues with some maps where both styles might be present in the same map for a short period of time (24 hours).

Moving to stalled/waiting until such time we have engineering resources to dig into the code and fix this.

debt moved this task from Stalled/Waiting to To-do on the Maps-Sprint board.Sep 19 2017, 7:12 PM

Moving to prioritized - we need to do this, but don't have resources and I don't want it to drop off the board.

As discussed in sprint, the above commits fix some of the issues, but not all of them.

Hi everyone,
We are building a kartotherian instance, and noticed the issues with cache-related http headers.

I think this PR may help to understand why these are not consistent, and hopefully find a proper fix.
https://github.com/kartotherian/cassandra/pull/2

Gehel added a comment.Oct 30 2017, 3:15 PM

@Amatissart that's great! I don't know that part enough to review it myself, but at least that's a step in the right direction. Maybe @Yurik or @MaxSem could have a look...

Gehel moved this task from To-do to Needs review on the Maps-Sprint board.Dec 7 2017, 7:58 PM

Change 408832 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] maps: backport cache header configuration from upstream

https://gerrit.wikimedia.org/r/408832

Change 408832 merged by Gehel:
[operations/puppet@production] maps: backport cache header configuration from upstream

https://gerrit.wikimedia.org/r/408832

Gehel moved this task from Needs review to Done on the Maps-Sprint board.Feb 8 2018, 9:50 AM

Cache-Control headers are now published. We could still improve the situation (in particular etags and last-modified are wrong), but this is tracked as T186732.

Deskana removed a subscriber: Deskana.Feb 14 2018, 2:16 PM
jmatazzoni added a subscriber: jmatazzoni.

On the collab team "This Quarter" board (where, FYI, we put stuff we're actually working on), should this be in the "Needs Review" or "Code Review Started" or "QA Review" column? And I notice it's not assigned to anyone...

Cache-Control headers are now published. We could still improve the situation (in particular etags and last-modified are wrong), but this is tracked as T186732.

Is there anything left to do on this task, then? Can we close it as resolved?

Gehel added a comment.Feb 26 2018, 7:09 PM

Yep, we can close it.

Side note: it is in the "Done" column on the Maps-Sprint board, which in the workflow we were using means: it is waiting for our product owner to confirm he is happy and knows the work has been done.

Gehel added a comment.Mar 13 2018, 8:44 PM

Check of the cache control header:

curl -v https://maps.wikimedia.org/osm-intl/6/40/18.png > /dev/null 2>| grep Cache-Control
< Cache-Control: public, max-age=86400, s-maxage=86400

@Etonkovidova: you seemed to have an example of where cache control were not working as expected. Could you add that example?

@Gehel - It was on on betalabs which is not quite set up to deal with kartotherian afaik. On https://en.wikipedia.beta.wmflabs.org/wiki/Mavetuna from the Console ( though the Status code: 400 ):

Request URL:  https://maps.wikimedia.org/img/osm-intl,9,37.7533,-122.4248,350x300.png?domain=en.wikipedia.beta.wmflabs.org&title=Mavetuna&groups=_7161fe6360995e63d3bd5da268d6cb9ca6eea240

cache-control:public, s-maxage=30, max-age=30

With curl:

curl -v https://maps.wikimedia.org/img/osm-intl,9,37.7533,-122.4248,350x300.png  2>&1 | grep Cache-Control
< Cache-Control: max-age=3600

@Etonkovidova your test is actually hitting the production maps cluster. The Kartographer fronted is on betalabs, but it is depending on the production tiles / images. It looks this is using the snapshot service (which I don't know much about). So it looks like the snapshot service does not follow the same rules for headers. That's surprising, and probably a mistake, but I'm not familiar enough with the use case to know if there is a good reason to have a lower cache expiration on snapshots.

Etonkovidova closed this task as Resolved.Mar 15 2018, 1:27 PM
Etonkovidova claimed this task.

Thx, @Gehel! Closing this task as Resolved - if some additional work needs to be done later, another ticket can be created.