
Varnish does not honor Cache-Control for Graphoid
Closed, Resolved · Public

Description

It seems graphoid responses keep coming back as cache misses:

yurik@steppenwolf:~/wmf/graphoid/deploy$ curl -I https://graphoid.wikimedia.org/www.mediawiki.org/v1/png/Extension%3AGraph%2FDemo/1647673/3cbe2b968108670c001e230dca4682a9d03f8814.png
HTTP/1.1 200 OK
Server: nginx/1.6.2
Date: Tue, 12 May 2015 16:26:12 GMT
Content-Type: image/png
Connection: keep-alive
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: Accept, X-Requested-With, Content-Type
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
Content-Security-Policy: default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'
X-Content-Security-Policy: default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'
X-WebKit-CSP: default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'
Cache-Control: public, s-maxage=30, max-age=30
Accept-Ranges: bytes
X-Varnish: 1283394768
Age: 0
Via: 1.1 varnish
X-Cache: cp1045 frontend miss (0)

P.S. I have updated Graphoid to set Cache-Control to 30 seconds on both 200 & 400 responses.
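
For illustration, a minimal sketch of what that might look like in an Express-style Node service (written in TypeScript; the route shape and the renderGraph helper are placeholders, not Graphoid's actual code):

// Hypothetical sketch (not Graphoid's actual code): an Express-style handler
// that attaches the same Cache-Control header to both success and error paths.
import express from 'express';

const app = express();
const CACHE_HEADER = 'public, s-maxage=30, max-age=30'; // 30s TTL for Varnish and clients

app.get('/:domain/v1/png/:title/:revid/:hash.png', (req, res) => {
  res.set('Cache-Control', CACHE_HEADER);
  try {
    const png = renderGraph(req.params); // placeholder for the actual rendering step
    res.type('image/png').status(200).send(png);
  } catch (err) {
    // Bad graph definitions are also cacheable for 30s, so repeated requests
    // for a broken graph do not all hit the backend.
    res.status(400).send('Bad request: ' + (err as Error).message);
  }
});

function renderGraph(params: Record<string, string>): Buffer {
  // Stand-in for the real renderer; assumed to throw on invalid input.
  return Buffer.alloc(0);
}

app.listen(19000);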

Event Timeline

Yurik raised the priority of this task to High.
Yurik updated the task description. (Show Details)
Yurik added a project: Graphoid.
Yurik added subscribers: Yurik, BBlack, mobrovac.

The parsoidcache varnish instances graphoid runs through are configured to not cache graphoid, intentionally, AFAICS. We could address that, but first we'd need to look into what kind of cache lifetimes the graphoid service is setting in its reply headers, and whether or how we plan to invalidate stale cache entries, if at all.

If it's not critical, it might be best to defer these issues until the restbase service's entry point subsumes the need for parsoidcache for this service, as I imagine response caching will work entirely differently through that.

What @BBlack says. Performance on a different domain isn't as good either, so moving to /api/rest_v1/ should be better for perf in any case.

  • I don't think graphoid sets any caching headers - we should probably change that.
  • The current hash is based on the graph definition only, but I hope T98837 will get done, in which case we will be able to cache graphs indefinitely.
  • I would recommend introducing at least 15 minutes of caching to reduce graphoid's load in case a graph ends up on a popular page.
  • Lastly, I still fail to understand how /api/rest_v1 would help with caching - are we building our own Varnish? I seriously doubt a generic Node.js service can ever be as fast as a purpose-built caching system.
  • Generally, the application's cache headers are what define the objects' cache TTLs in varnish, unless we add custom rules for them (which we'd rather not).
  • In general, yes: ideally the request itself (URI plus varied headers) identifies a unique and unchanging resource, so cache invalidation is never an issue, in which case we can set long TTLs and let things expire only when pushed out for hotter content (see the sketch after this list).
  • Even with a 15 minute cache TTL, isn't this going to get confusing if an editor is editing the external resources of a graph and can't see their changes for 15 minutes?
  • The point of the restbase API part is that if graphoid is going to land there eventually (and my understanding is that most services - especially ones currently deployed through parsoidcache and sounding like *oid - will do so, perhaps relatively soon), then the way we cache graphoid will change at that point (the caching contracts would be graphoid<->restbase and restbase<->varnish), so it might waste less effort to simply solve this in the new world instead of the old.
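
To make the TTL trade-off in the points above concrete, here is a small illustrative sketch (TypeScript; the constants and helper are assumptions, not an agreed policy) of how the app-side Cache-Control value could follow from whether the URL hash covers all inputs to the render:

// Hypothetical sketch of the TTL reasoning above: if the URL embeds a hash of
// everything that affects the output (content-addressed), the response can be
// treated as immutable and cached for a long time; otherwise keep a short TTL.
const IMMUTABLE_TTL = 30 * 24 * 3600; // long TTL: safe only if the hash covers all inputs
const MUTABLE_TTL = 30;               // short TTL: external data can change under us

function cacheControlFor(hashCoversAllInputs: boolean): string {
  const ttl = hashCoversAllInputs ? IMMUTABLE_TTL : MUTABLE_TTL;
  return `public, s-maxage=${ttl}, max-age=${ttl}`;
}

// Today (the hash covers only the graph definition, not external data):
console.log(cacheControlFor(false)); // "public, s-maxage=30, max-age=30"
// After a T98837-style change that content-addresses all inputs:
console.log(cacheControlFor(true));  // "public, s-maxage=2592000, max-age=2592000"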
Yurik renamed this task from "Verify Varnish caching of the Graphoid content" to "Varnish does not honor Cache-Control for Graphoid". May 12 2015, 4:27 PM
Yurik updated the task description. (Show Details)

Change 210747 had a related patch set uploaded (by Yurik):
Enable Varnish caching for graphoid

https://gerrit.wikimedia.org/r/210747

Change 210754 had a related patch set uploaded (by BBlack):
enable frontend caching for graphoid T98803

https://gerrit.wikimedia.org/r/210754

Change 210754 merged by BBlack:
enable frontend caching for graphoid T98803

https://gerrit.wikimedia.org/r/210754

BBlack claimed this task.
BBlack moved this task from Backlog to Done on the Traffic board.

Seems to be working now:

bblack@palladium:~$ curl -I https://graphoid.wikimedia.org/www.mediawiki.org/v1/png/Extension%3AGraph%2FDemo/1647673/3cbe2b968108670c001e230dca4682a9d03f8814.png
HTTP/1.1 200 OK
Server: nginx/1.6.2
Date: Wed, 13 May 2015 19:31:54 GMT
Content-Type: image/png
Content-Length: 79866
Connection: keep-alive
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: Accept, X-Requested-With, Content-Type
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
Content-Security-Policy: default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'
X-Content-Security-Policy: default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'
X-WebKit-CSP: default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'
Cache-Control: public, s-maxage=30, max-age=30
Accept-Ranges: bytes
X-Varnish: 2306951492 2306950734
Age: 6
Via: 1.1 varnish
X-Cache: cp1058 frontend hit (5)

Change 210747 abandoned by BBlack:
Enable Varnish caching for graphoid

Reason:
Ended up using https://gerrit.wikimedia.org/r/#/c/210754/ instead (which needed a few pre-commits to clean up general structure).

https://gerrit.wikimedia.org/r/210747

  • I would recommend introducing at least 15 minutes of caching to reduce graphoid's load in case a graph ends up on a popular page.

I'd just like to point out that this is the wrong way to look at this. We don't employ caching (just) for "reducing load" and we're not setting the TTL to values that we think would be sufficient to reduce said load.

We use it to deliver content faster and, more importantly, to beat the speed of light by employing geographically distributed caching PoPs. You should design with this in mind. You should also set caching TTLs with that in mind, and in a more scientific manner than "15 mins should be enough for everyone" :)

@faidon, thanks for the comment. I agree that we should apply a more scientific method to it, but at present we have very little data to work with: graphs have only just launched, the load is expected to grow, and that's when I will start optimizing.

I'm not familiar with Graphoid and this may be a superficial, incorrect comment, but: we cache text & media content for 30 days and PURGE when there are changes. I think you have to justify the deviation from that, rather than calling this an "optimization".

Let me state that I don't agree at all with this approach. As specified in the initial deployment task T90487: expected load - very, very low at first. We also agreed with @Yurik that graphoid would be put behind RESTBase, which would then take care of caching / storing rendered graphs.

Given that graphoid has been in production for a very short while, I don't see the need to jump the gun on caching right now.

@faidon, per my discussion with @Anomie in T98940, I do plan to introduce a different storage system for graphs, after which it should be possible to make graphs cacheable for a long time. At this point, graphs use external data that we cannot track for invalidation purposes, hence the relatively short TTL.

I expect everything about this will change when this moves to operating through the restbase entry point on the text balancers anyway. The point of going forward with the immediate changes was to honor the short cacheability header (30s currently) being sent by the app, to absorb any spikes in case someone puts even a single graph on a very popular page in the interim. Before this, varnish was sending all traffic directly to graphoid...

Also re: "Expected load - very very low at first" - I suspect this is because in the initial design, the intent was that the majority of (modern) clients would do their own client-side rendering, with PNGs from the graphoid service only as a fallback. However, @Yurik indicated on IRC that currently *all* clients are using the PNG method, because the JS method's load time was deemed too long, since it had to load some significant external JS libraries into the client first... so that kind of changes everything as well...