
Getting different versions of the same file
Closed, Resolved · Public · 5 Estimated Story Points

Description

Downloading the file directly from datasets:

Request URL:https://analytics.wikimedia.org/datasets/periodic/reports/metrics/language/compact-language-links/frwiki.tsv
Request Method:GET
Status Code:200 
Remote Address:208.80.153.248:443
Referrer Policy:no-referrer-when-downgrade

Response Headers
accept-ranges:bytes
access-control-allow-origin:*
age:0
backend-timing:D=137 t=1492615027124791
cache-control:max-age=86400, public, must-revalidate
content-encoding:gzip
content-length:292
content-type:text/tab-separated-values
date:Wed, 19 Apr 2017 15:17:07 GMT
etag:W/"336-54d84a3057177"
last-modified:Wed, 19 Apr 2017 13:02:29 GMT
server:Apache
status:200
strict-transport-security:max-age=31536000; includeSubDomains; preload
vary:Accept-Encoding
via:1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4
x-analytics:WMF-Last-Access=19-Apr-2017;WMF-Last-Access-Global=19-Apr-2017;https=1
x-cache:cp1058 pass, cp2012 pass, cp2012 pass
x-cache-status:pass
x-client-ip:100.12.211.156
x-varnish:216752386, 176187081, 907143

Request Headers
:authority:analytics.wikimedia.org
:method:GET
:path:/datasets/periodic/reports/metrics/language/compact-language-links/frwiki.tsv
:scheme:https
accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
accept-encoding:gzip, deflate, sdch, br
accept-language:en-US,en;q=0.8,ro;q=0.6
cache-control:no-cache
cookie:GeoIP=US:Glenside:40.1012:-75.1780:v4; _ga=GA1.2.1371977825.1449511660; CP=H2; ajs_user_id=null; ajs_group_id=null; _pk_ref.7.521d=%5B%22%22%2C%22%22%2C1488296811%2C%22https%3A%2F%2Fbrowser-reports.wmflabs.org%2F%22%5D; _pk_id.7.521d=409267986a2b9832.1487889276.4.1488296811.1488296811.; _pk_ref.6.521d=%5B%22%22%2C%22%22%2C1488384853%2C%22https%3A%2F%2Fvital-signs.wmflabs.org%2F%22%5D; _pk_id.6.521d=a5925fe50e643cb1.1488283658.5.1488385718.1488384853.; _pk_ref.8.521d=%5B%22%22%2C%22%22%2C1492551078%2C%22https%3A%2F%2Fvital-signs.wmflabs.org%2F%22%5D; _pk_id.8.521d=5590ab72c1de26d1.1464011919.89.1492551078.1492551078.; WMF-Last-Access=19-Apr-2017; WMF-Last-Access-Global=19-Apr-2017
pragma:no-cache
referer:https://analytics.wikimedia.org/datasets/periodic/reports/metrics/language/compact-language-links/
upgrade-insecure-requests:1
user-agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36

But when fetching the file with AJAX from a Dashiki dashboard or with curl, we get a cached version:

Request URL:https://analytics.wikimedia.org/datasets/periodic/reports/metrics/language/compact-language-links/frwiki.tsv
Request Method:GET
Status Code:200 
Remote Address:208.80.153.248:443
Referrer Policy:no-referrer-when-downgrade

Response Headers
accept-ranges:bytes
access-control-allow-origin:*
age:64206
backend-timing:D=126 t=1492550759715201
cache-control:max-age=86400, public, must-revalidate
content-encoding:gzip
content-length:1634
content-type:text/tab-separated-values
date:Wed, 19 Apr 2017 15:16:05 GMT
etag:W/"13af-54cfc3214e824"
last-modified:Wed, 12 Apr 2017 18:15:39 GMT
server:Apache
set-cookie:WMF-Last-Access-Global=19-Apr-2017;Path=/;Domain=.wikimedia.org;HttpOnly;secure;Expires=Sun, 21 May 2017 12:00:00 GMT
set-cookie:WMF-Last-Access=19-Apr-2017;Path=/;HttpOnly;secure;Expires=Sun, 21 May 2017 12:00:00 GMT
set-cookie:CP=H2; Path=/; secure
status:200
strict-transport-security:max-age=31536000; includeSubDomains; preload
vary:Accept-Encoding
via:1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4
x-analytics:https=1;nocookies=1
x-cache:cp1058 miss, cp2012 hit/1, cp2012 hit/5
x-cache-status:hit
x-client-ip:100.12.211.156
x-varnish:195770694, 176042246 151772666, 2156101 10414316

Request Headers
:authority:analytics.wikimedia.org
:method:GET
:path:/datasets/periodic/reports/metrics/language/compact-language-links/frwiki.tsv
:scheme:https
accept:*/*
accept-encoding:gzip, deflate, sdch, br
accept-language:en-US,en;q=0.8,ro;q=0.6
cache-control:no-cache
origin:http://localhost:5000
pragma:no-cache
referer:http://localhost:5000/dist/metrics-by-project-Dashiki%3ACompactLanguageLinks/
user-agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36

Event Timeline

I'm pretty baffled by this. I can't ever reproduce the pass, and these requests are pretty similar. We may want to go ahead and add a rule to always pass /datasets URIs, since this is probably the behavior we want anyway. Or figure out some way to PURGE /datasets whenever the public datasets sync runs.

So any Cookie header forces a pass, while without one (like with curl, etc.) the cached content is served. This is due to a cache::misc setting (where analytics.wikimedia.org lives): it can't know for sure which cookies are used for sessions (and must not be cached or leaked), so if it finds any cookie at all it is conservative and passes.

Forcing a fake cookie to get a pass is not a viable solution in my opinion...

For reference, the cache policy on analytics.wikimedia.org is set up here:

# Cache json, yaml, csv, and tsv files 1 day
# (could be all files but wanted to be more restrictive to start)
<IfModule mod_headers.c>
    <FilesMatch "\.(json|yaml|csv|tsv)$">
        Header set Cache-Control "max-age=86400, public, must-revalidate"
    </FilesMatch>
</IfModule>

# M86400 -> issue conditional request 1 day after modification
<IfModule mod_expires.c>
    ExpiresActive On
    ExpiresDefault M86400
</IfModule>

I'm thinking that if we add an ETag header (I was reading about it here: http://wpcertification.blogspot.com/2010/08/how-to-enable-etag-in-apache-http.html), that'll solve our problem: content should stay cached unless it's updated.

@Milimetric: Our HTTP caching is done by Varnish; Apache just sets the expires headers, which affect the client, not the server.

Milimetric triaged this task as Medium priority. Apr 20 2017, 6:51 PM
Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.
Milimetric set the point value for this task to 5.

We could simply lower Cache-Control's max-age to one hour, in order to force Varnish to check after that time (upon receiving a request) whether the content has changed (via If-Modified-Since, for example, which would end up as a very fast 304 for thorium). This would give us a lot of flexibility (a maximum of one hour of stale data) while keeping the Varnish buffer in front of thorium.
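
For illustration, that tuning could look roughly like this in the Apache config quoted above (a sketch only; the one-hour value is the figure from this comment, and the details of any actual patch may differ):

# Sketch only: same FilesMatch block as above, with max-age lowered to 1 hour
# so Varnish revalidates against thorium (e.g. via If-Modified-Since, usually
# a cheap 304) and stale data is served for at most an hour.
<IfModule mod_headers.c>
    <FilesMatch "\.(json|yaml|csv|tsv)$">
        Header set Cache-Control "max-age=3600, public, must-revalidate"
    </FilesMatch>
</IfModule>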

@Milimetric: from https://developers.google.com/web/fundamentals/performance/optimizing-content-efficiency/http-caching it seems that ETag can only work the way you want if Cache-Control is set to no-cache (so each request would trigger an If-Modified-Since to thorium to check for stale data).
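
For reference, a no-cache + ETag setup like the one that doc describes would look something like this on the Apache side (a hypothetical sketch, not a proposal for deployment):

# Hypothetical sketch of the no-cache + ETag approach from the doc above.
# Clients revalidate on every request; unchanged content comes back as a
# 304 instead of a full body.
<IfModule mod_headers.c>
    <FilesMatch "\.(json|yaml|csv|tsv)$">
        Header set Cache-Control "no-cache"
    </FilesMatch>
</IfModule>
# Apache already emits ETags by default; FileETag just makes the inputs explicit.
FileETag MTime Size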

Yeah, I mean 1-hour caches are fine and accomplish some of what we want. It's really weird, though: with Cache-Control everything seems possible except the one way I figure everyone would want to cache:

cache for 300 days, but if a request comes in and at least 1 hour has passed since a cached response, check if the content changed.

Change 349447 had a related patch set uploaded (by Milimetric):
[operations/puppet@production] Tune cache for analytics.wikimedia.org data files

https://gerrit.wikimedia.org/r/349447

Change 349447 merged by Elukey:
[operations/puppet@production] Tune cache for analytics.wikimedia.org data files

https://gerrit.wikimedia.org/r/349447

Reading this: https://httpd.apache.org/docs/2.4/mod/core.html#fileetag, it looks like the ETag is calculated per resource served. Since the statics for Dashiki are versioned, we really could split our caching: FOREVER for js/css resources and 1 hour for everything else.
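
As a sketch (the extensions and max-age values here are illustrative, not what's deployed), that split could look like:

# Hypothetical sketch only; extensions and values are illustrative.
<IfModule mod_headers.c>
    # Dashiki statics are versioned, so the same URL never changes content:
    # cache them for a very long time.
    <FilesMatch "\.(js|css)$">
        Header set Cache-Control "max-age=31536000, public"
    </FilesMatch>
    # Data files keep a short TTL so updates show up within an hour.
    <FilesMatch "\.(json|yaml|csv|tsv)$">
        Header set Cache-Control "max-age=3600, public, must-revalidate"
    </FilesMatch>
</IfModule>
# The ETag is computed per served file from its metadata.
FileETag MTime Size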

Milimetric moved this task from In Code Review to Done on the Analytics-Kanban board.
Milimetric added a subscriber: Amire80.

@Amire80 this change means we're no longer caching for 1 day, so you'll be able to see changes to files you upload more quickly. Let me know if you have questions.