[Bug] WDQS GUI caching
Closed, ResolvedPublic

Description

Caching breaks the app when files have breaking changes. Apparently, the browser keeps using stale assets.
Also, the same asset is served with multiple different etags and timestamps from different servers, causing redundant traffic and potentially confusing caches.

We should have a single-file build or change the caching behavior.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Strange ... Having a quick look at the HTTP responses and our nginx configuration, it seems that the only resources on which we set caching headers are the queries themselves (everything under /bigdata/namespace/wdq/sparql), and even then we send max-age=60. We do send appropriate etag and last-modified headers on static resources (CSS, JS, images, ...). Note that we should certainly have those static resources be content-addressed and have long-lasting caching headers.

@Jonas do you have an example of a resource which stays cached for longer than expected?

Side note: nginx computes etags based on the last-modified date and file size. The last-modified date might differ (not by much, but enough) between our 2 servers, which means that we might have inconsistent etags. We should probably disable them...
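For illustration, nginx's default etag is built from the file's mtime and size, both in hex. A minimal Node.js sketch (the concrete values are taken from the vis.js headers quoted later in this task):

// nginx default etag: "<hex mtime>-<hex size>"
const mtime = Date.parse('Mon, 22 Feb 2016 22:23:01 GMT') / 1000; // Last-Modified
const size = 1150537;                                             // bytes on disk
console.log(`"${mtime.toString(16)}-${size.toString(16)}"`);      // "56cb8a45-118e49"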

@Gehel sorry, I can not reproduce the error.
What I can say is that on every deployment with breaking changes the app breaks, so basically every Monday.

Maybe we should put max-age on the GUI files too? A longer one, of course, something like 5 days or at least a day.

I checked the dates on both services and they are within 1s of each other, so time difference should not be an issue.

Note that we should certainly have those static resources be content-addressed and have long-lasting caching headers.

Yes. Otherwise the varnishes have no idea that it changed after a deployment.

Note that requests with If-Modified-Since and If-None-Match headers are not answered from Varnish by consulting the cache for the corresponding normal request.

I checked the dates on both services and they are within 1s of each other, so time difference should not be an issue.

As the etag is based on mtime, if the mtimes are different, the etags are different. Etags don't care about the size of the difference; they are either the same or not. I'm not clear on how Trebuchet works and whether we could ensure that mtimes are identical across hosts...

grunt-hashres or something similar could be used to transform the HTML before publication and ensure that URLs change whenever content changes. (Note: I have no personal experience with grunt-hashres.)
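As a rough idea of what that could look like (a hedged sketch only; the paths and target names are assumptions, not the actual WDQS GUI layout):

// Gruntfile.js: rename assets to content-hashed names and rewrite the
// references in index.html, so URLs change whenever content does.
module.exports = function (grunt) {
  grunt.loadNpmTasks('grunt-hashres');

  grunt.initConfig({
    hashres: {
      options: {
        fileNameFormat: '${name}.${hash}.${ext}', // e.g. vis.a1b2c3d4.js
        renameFiles: true
      },
      prod: {
        src: ['dist/**/*.js', 'dist/**/*.css'], // assumed build output
        dest: 'dist/index.html'                 // file whose references get rewritten
      }
    }
  });
};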

When reloading the query interface multiple times, I only sometimes get the 304 response as expected. Sometimes I get a 200 response with a new copy of the data - and a different timestamp and etag. Here are two responses for https://query.wikidata.org/vendor/wdqs-explorer/vis.js:

HTTP/1.1 304 Not Modified
Date: Tue, 19 Apr 2016 14:16:08 GMT
Content-Type: application/javascript
Server: nginx/1.9.3
Last-Modified: Mon, 22 Feb 2016 22:23:01 GMT
Etag: "56cb8a45-118e49"
X-Served-By: wdqs1001
Content-Encoding: gzip
x-varnish: 690103699, 2499762542, 1219053830
Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
Accept-Ranges: bytes
Age: 0
X-Cache: cp1061 miss(0), cp3009 miss(0), cp3010 frontend miss(0)
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
x-analytics: WMF-Last-Access=19-Apr-2016;https=1
x-client-ip: 87.138.110.76
X-Firefox-Spdy: 3.1

HTTP/1.1 200 OK
Date: Tue, 19 Apr 2016 14:37:50 GMT
Content-Type: application/javascript
Content-Length: 231830
Server: nginx/1.9.4
Last-Modified: Tue, 12 Apr 2016 21:03:03 GMT
Etag: "570d6287-118e49"
X-Served-By: wdqs1002
Content-Encoding: gzip
x-varnish: 690662416, 2499930515, 1219092302 1219089585
Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
Accept-Ranges: bytes
Age: 96
X-Cache: cp1061 miss(0), cp3009 miss(0), cp3010 frontend hit(1)
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
x-analytics: WMF-Last-Access=19-Apr-2016;https=1
x-client-ip: 87.138.110.76
X-Firefox-Spdy: 3.1

So it seems that the responses I get from different varnish or nginx instances are inconsistent.

Those two are from two different servers. But I wonder why they are different. The files are:

-rw-r--r-- 1 root root 1150537 Feb 22 22:23 gui/vendor/wdqs-explorer/vis.js
-rw-r--r-- 1 root root 1150537 Apr 12 21:03 gui/vendor/wdqs-explorer/vis.js

The files are the same, but modified time differs. Probably because the second server was reimaged, so the timestamps are from when it was imaged (not sure why trebuchet does that but that seems to be the case).

@Smalyshev yes, that's what I suspected. But that's not how load balancing for static resources should work. It screws with the logic of If-Modified-Since headers. Varnish should hide the differences; there should be only one state of the virtual resource represented by a URL.

Basically, this kind of load balancing causes more load, because static resources that are already cached in the browser are often delivered again when the request hits a different box, where the resource has a different etag.

The simplest fix would probably be to just force the mtime on all the static files.
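For illustration only (a Node.js sketch; the reference date and the gui directory are assumptions, and in practice this would have to run during deployment on every host):

// Set a fixed mtime on every file in a tree, so all hosts serve identical
// Last-Modified dates and therefore identical nginx etags.
const fs = require('fs');
const path = require('path');

const reference = new Date('2016-04-12T21:03:03Z'); // e.g. the release date

function touchTree(dir) {
  for (const entry of fs.readdirSync(dir)) {
    const p = path.join(dir, entry);
    if (fs.statSync(p).isDirectory()) {
      touchTree(p);
    } else {
      fs.utimesSync(p, reference, reference); // (atime, mtime)
    }
  }
}

touchTree('gui');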

Change 284428 had a related patch set uploaded (by Gehel):
Introduce grunt-hashres to ensure caching coherence of assets

https://gerrit.wikimedia.org/r/284428

The simplest fix would probably be to just force the mtime on all the static files.

Forcing mtime would need to be done by Trebuchet during deployment. Varnish does not have shared state as far as I know, so it will not be able to hide the mtime differences. While it might be possible to do it via trebuchet, I would prefer to remove etags, push for a very large caching time and have a clean cache busting solution (https://gerrit.wikimedia.org/r/284428).

@Gehel grunt-hashres should fix the issue of stale assets being used, but it will not fix the issue of the same asset being served from different servers with different etags.

It seems to me that we should also not send an etag (and maybe no Last-Modified header either), and add a Cache-Control header with "max-age" set to a high value (and no must-revalidate flag). Together with content-based URLs, this should be sufficient.

Do you know how to achieve this?

@daniel the caching header configuration is well within my area of expertise. All of those are set in the nginx conf. So as soon as we have content-based URLs I can take care of it.
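Roughly, something like this (a sketch only; the /gui/ location and the 30-day TTL are assumptions, not our actual config):

location /gui/ {
    etag off;      # drop the mtime/size-based etags that differ across hosts
    expires 30d;   # a long TTL is safe once URLs change with their content
}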

On the content-based URLs and/or resource aggregation side, I'm less sure. We don't really have a build step at the moment, so the solutions we have rely on too much human intervention for my taste (and it's not my time that is going to take the cost, so I'm unwilling to push this on others).

General note: the grunt-hashres change was meant mainly as an example of what I was talking about with Stas. Grunt, JavaScript and frontend dev in general are not really my cup of tea. The important part for me is that we ensure that resources are addressed by content, so that we can start caching strictly based on TTL (dropping etags, last-modified, ...). hashres was just the first thing that mostly did what I was thinking of. If you know of a better solution, let me know!

Change 284709 had a related patch set uploaded (by Jonas Kress (WMDE)):
[WIP] Grunt build script

https://gerrit.wikimedia.org/r/284709

Change 284428 abandoned by Gehel:
Introduce grunt-hashres to ensure caching coherence of assets

Reason:
Abandoning this in favor of https://gerrit.wikimedia.org/r/#/c/284709/ (which does have integration with filerev - I did not see it at first).

https://gerrit.wikimedia.org/r/284428

I know I'm repeating myself, but for some this is probably not a repeat: This is reimplementing part of what Mediawiki provides, while being no further along and providing no advantage over it. Mediawiki's JS module loader correctly implements more than what this task is about. The code needed to display the WDQS GUI on a special page would be small. One thing isn't solved by Mediawiki: a step that runs npm to pull in JS components. I started T133388 (an RFC) for that. But the WDQS GUI repo doesn't implement that either.

I'd suggest moving the WDQS GUI to a Mediawiki extension, as that is required for some things we want to do in the future anyway. @Jonas do you want to go that route?


Yes, as pointed out, deploying WDQS has required more manual steps since the gui repo was created. I would have liked it to become more automated instead. Please remember that this is a production service; other people need to be able to contribute and to deploy security fixes. For my taste, that requires being capable of continuous deployment.

@JanZerebecki I don't see a need for WDQS GUI to be a Mediawiki extension. I don't want to force people to have a Mediawiki installation when using a SPARQL endpoint GUI. Nevertheless there could be an extension using the WDQS GUI.

Mediawiki's JS module loader correctly implements more than what this task is about

Is this loader available outside mediawiki? If not, can we make it available as a separate library so other projects could use it?

The code needed to display the WDQS GUI on a special page would be small.

Of course, you can include any HTML on a special page of a wiki, but why do that? I'm not sure I see the point. That looks like a completely artificial dependency to me.

I'd suggest moving the WDQS GUI to a Mediawiki extension,

I'm not sure how this makes sense. WDQS GUI has nothing to do with Mediawiki and Mediawiki extensions. If we have some build that works for mediawiki exts, we should generalize it to work with other things, not make unrelated projects pretend that they are Mediawiki extensions.

I don't want to force people to have a Mediawiki installation when using a SPARQL endpoint GUI.

I didn't say they need to.

Is this loader available outside mediawiki? If not, can we make it available as a separate library so other projects could use it?

T133460

If we have some build that works for mediawiki exts, we should generalize it to work with other things, not make unrelated projects pretend that they are Mediawiki extensions.

Sure, I didn't imply anything else, any JS components that are created for the GUI should stay JS components.

You left out the most important part:

that is required for some things we want to do in the future anyway.

@Jonas Why do you think we should not allow users to run/write SPARQL queries on wiki?

@JanZerebecki I would love to see SPARQL on wikidata.org and I also like the idea of having SPARQL entities!

Writing/running SPARQL on wiki would be fine, but I'm not sure this would be done using this GUI.

Hi!
Following today's issue with the query GUI, I had a look at the code and thought of some optimizations, without being aware of this discussion, which led me to write a simple build script.
Given that I did it without this conversation in mind, feel free to ignore it or just take it as inspiration. If you want a more elaborate build process, particularly one that is fast in development mode, I would recommend having a look at brunch, which specializes in building this kind of front-end app. See here how it relates to Grunt, Gulp and others.

That gerrit link does not work for me. Is it a draft? Those are not visible to other users.

Part of the cache invalidation problem may be solved, without the other parts discussed in this ticket, by setting the ETag to the latest git commit, or to the git content hash of the latest tree or of each file. Maybe the ETag could be prepared for nginx during scap/deployment.
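For the per-file variant, a minimal Node.js sketch (hypothetical; it computes the same blob hash as git hash-object, which could then be written into the nginx config at deploy time):

// A file's git blob hash: sha1("blob <size>\0" + contents).
const crypto = require('crypto');
const fs = require('fs');

function gitBlobHash(file) {
  const body = fs.readFileSync(file);
  const header = Buffer.from('blob ' + body.length + '\0');
  return crypto.createHash('sha1')
    .update(Buffer.concat([header, body]))
    .digest('hex');
}

console.log(gitBlobHash('gui/vendor/wdqs-explorer/vis.js'));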

Change 288250 had a related patch set (by Maxlath) published:
bundling resources in single js and css files - decreasing network request for js and css from 79 to 2 - reducing files size with minification and maximum compression

https://gerrit.wikimedia.org/r/288250

@JanZerebecki oh sorry, that should be published now

Change 284709 merged by jenkins-bot:
Grunt build script

https://gerrit.wikimedia.org/r/284709

This should be fixed now, since the new build script uses hashes.

Change 288250 abandoned by Jonas Kress (WMDE):
bundling resources in single js and css files - decreasing network request for js and css from 79 to 2 - reducing files size with minification and maximum compression

https://gerrit.wikimedia.org/r/288250