
Varnish caching around datasets.wikimedia.org is causing breakages
Closed, Resolved · Public

Description

The varnishes in front of datasets.wikimedia.org disagree on which version they're caching. This is a pretty big problem: the timing means it happened while the format of some underlying datasets was being changed, with the result that all the search dashboards are broken.

Event Timeline

Ironholds raised the priority of this task from to Needs Triage.
Ironholds updated the task description. (Show Details)
Ironholds added a subscriber: Ironholds.
Restricted Application added a subscriber: Aklapper. Jun 22 2015, 9:26 PM
Ironholds set Security to None. Jun 22 2015, 9:26 PM
Ironholds edited subscribers, added: Ottomata, BBlack; removed: Aklapper.

Ja, one of the two misc eqiad varnish hosts has invalid cached data. I don't know how to purge this.

curl -H 'Host: datasets.wikimedia.org'  http://cp1043.eqiad.wmnet/aggregate-datasets/search/app_event_counts.tsv | wc -l
721

curl -H 'Host: datasets.wikimedia.org'  http://cp1044.eqiad.wmnet/aggregate-datasets/search/app_event_counts.tsv | wc -l
598

A note that this has now been fixed manually by (I think) Roan for one file but not for both, so the dashboards are still broken.

I'd really appreciate:

  1. This being fixed;
  2. Someone teaching Andrew (or me!) how to fix it in the future and giving us auth to do so.

Having datasets un-updateable or inaccessible for >24 hours is really not kosher, and I'd like to be able to take support for this kind of thing off the plate of Opsen as a whole, at least for this specific use case.

> A note that this has now been fixed manually by (I think) Roan for one file but not for both, so the dashboards are still broken.

I poked Brandon to purge these things, but I can't purge myself, because this requires both root and intimate knowledge of Varnish. Which pretty much narrows it down to just Brandon.
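
For the record, the manual fix on a cache host itself would look roughly like the following. This is only a sketch, assuming shell access to the cp hosts; the ban expression is hypothetical and the exact syntax varies by Varnish version:

# Hypothetical ban, run as root on the affected host (cp1043/cp1044).
# Varnish 3 syntax; on Varnish 4+ it would be:
#   varnishadm 'ban req.url ~ "^/aggregate-datasets/"'
varnishadm ban.url "^/aggregate-datasets/search/app_event_counts.tsv$"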

> Having datasets un-updateable or inaccessible for >24 hours is really not kosher, and I'd like to be able to take support for this kind of thing off the plate of Opsen as a whole, at least for this specific use case.

It's actually pretty much exactly 24 hours. The header being sent is Cache-Control: max-age=86400, public, must-revalidate which means "anyone (public) can cache this for up to 24 hours (86400 seconds), and they may hold on to data older than that but may only serve it to clients if they first check (must-revalidate) with the server that it hasn't changed".
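
For reference, you can confirm what the caches are being told with a headers-only request against either host (same hosts as the curls above; -sI makes curl fetch just the headers):

curl -sI -H 'Host: datasets.wikimedia.org' http://cp1043.eqiad.wmnet/aggregate-datasets/search/app_event_counts.tsv | grep -i '^Cache-Control'
# expect: Cache-Control: max-age=86400, public, must-revalidate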

The culprit is https://gerrit.wikimedia.org/r/#/c/218534/2/modules/statistics/files/datasets.wikimedia.org; it's associated with T101125, so I complained about this problem at T101125#1390258 and reopened that task.

Best I can tell from reading the relevant puppet manifests, the data in question is being rsynced into place every 30 minutes, so I would recommend setting a much shorter caching duration, maybe more like 5 minutes. Note that the cache timeout can be set separately for caching proxies like Varnish and for clients like browsers by setting s-maxage and max-age to different values. If must-revalidate is set (as it is now), repeated requests for unchanged content are cheap, because a compliant client (and certainly Varnish) will send an If-None-Match or If-Modified-Since request and will get a lightweight 304 Not Modified response.
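
To see the cheap-revalidation path in action, you can replay a request conditionally (the If-Modified-Since date below is illustrative; in practice you'd echo back the Last-Modified or ETag from a prior response):

# 1. Grab the validators (Last-Modified / ETag) from a normal response
curl -sI -H 'Host: datasets.wikimedia.org' http://cp1043.eqiad.wmnet/aggregate-datasets/search/app_event_counts.tsv

# 2. Replay conditionally; unchanged content should come back as a 304 Not Modified with no body
curl -sI -H 'Host: datasets.wikimedia.org' -H 'If-Modified-Since: Tue, 23 Jun 2015 00:00:00 GMT' http://cp1043.eqiad.wmnet/aggregate-datasets/search/app_event_counts.tsv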

Ack. James, you lie!

Milimetric removed a subscriber: BBlack.

I'm claiming this task and closing the original task that added caching.

This may seem harsh, but the resolution is to use a cache buster on the end of your URL. Most of our dashboarding was built with that in mind. It makes the most sense to let the client handle the cache busting, that way we can control if you bust hourly, minute-by-minute, or at whatever granularity you need. Here's proof that it would work:

curl -H 'Host: datasets.wikimedia.org'  http://cp1043.eqiad.wmnet/aggregate-datasets/search/app_event_counts.tsv | wc -l
733

curl -H 'Host: datasets.wikimedia.org'  http://cp1044.eqiad.wmnet/aggregate-datasets/search/app_event_counts.tsv | wc -l
733  # it just caught up overnight, but that doesn't mean it'll always match cp1043; you're at the mercy of the cache gods.

curl -H 'Host: datasets.wikimedia.org'  http://cp1043.eqiad.wmnet/aggregate-datasets/search/app_event_counts.tsv?blah | wc -l
739

curl -H 'Host: datasets.wikimedia.org'  http://cp1044.eqiad.wmnet/aggregate-datasets/search/app_event_counts.tsv?blah | wc -l
739

The harsh part is that we won't change the Apache cache headers: we need them set to 24 hours in the most common case, and we don't want to manage individual settings for each use case. Come talk to me if this messes up your quarterly presentations; I'll probably be able to help you with the dashboards.
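
For anyone wiring this into a dashboard fetch, a minimal sketch of the client-side buster, in the same style as the proof curls above (hourly granularity here is just one choice; swap the date format for whatever freshness you need):

# Derive the buster from the clock, so every client in the same hour shares one cache entry
BUST=$(date -u +%Y%m%d%H)
curl -s -H 'Host: datasets.wikimedia.org' "http://cp1043.eqiad.wmnet/aggregate-datasets/search/app_event_counts.tsv?${BUST}" | wc -l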

Milimetric closed this task as Resolved. Jun 23 2015, 4:06 PM

It's nothing to do with my quarterly presentations. Okay, cache busting it is; should be trivial to work out.

Milimetric moved this task from In Progress to Done on the Analytics-Kanban board. Jun 23 2015, 4:22 PM