
[RFC] Caching for results of wikidata Sparql queries
Closed, ResolvedPublic

Description

We would like to discuss various caching approaches for the Wikidata Query Service.

Direct Wikipedia Usage
  • The Graph extension allows direct querying of the service to produce graphs. In this case, queries are run from the Graphoid service, or directly from browsers when a graph is interacted with.
  • Graph design - when a graph is being designed in the graph sandbox, a query is made on each keystroke.
Wikidata tools & bots
External sites

Proposals and Ideas

  • Invalidate cache on relevant item change - this is the ideal scenario, but it is highly impractical: we would have to track every item that participated in the query, and also evaluate whether any new item would match the original query.
  • Cache all responses in Varnish for a reasonable duration depending on the server load - e.g. 1 hour or 1 day
  • Invalidate cache by making the same request with an extra URL parameter or an extra header, e.g. &refresh=1 or Refresh: 1 (see the VCL sketch after this list).
    • To prevent DoS, ignore the refresh parameter/header if the cached response is less than a minute old
    • VCL will ignore the extra parameter when constructing the cache key
  • Do not cache unless some request specifies that it is ok to cache
    • We need to be clear why non-caching should be the default behavior
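
A minimal VCL sketch of the refresh idea (Varnish 3 syntax; the parameter name, the 1-hour object TTL, and the one-minute guard are assumptions, not a tested configuration):

sub vcl_recv {
    # Remember the refresh request, then strip the parameter so refreshed
    # and normal requests hash to the same cache object. (A real config
    # needs more careful ?/& handling when removing the parameter.)
    if (req.url ~ "[?&]refresh=1") {
        set req.http.X-Refresh = "1";
        set req.url = regsub(req.url, "[?&]refresh=1", "");
    }
}

sub vcl_hit {
    # Objects are assumed to enter the cache with a 3600s TTL, so
    # obj.ttl > 3540s means the object is less than a minute old and the
    # refresh is ignored, to prevent DoS via forced refreshes.
    if (req.http.X-Refresh == "1" && obj.ttl < 3540s) {
        set obj.ttl = 0s;    # invalidate the cached copy
        return (restart);    # re-run the request; the miss repopulates the cache
    }
}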

Please discuss and vote on what the default behavior should be.

Necessary in any case for caching:

  • move from the misc cluster to one that is meant for higher request rates
  • fix the fact that Varnish is not caching responses because of chunked transfer encoding
  • fix response headers so that caching is allowed for whatever duration is decided

Event Timeline

aude raised the priority of this task from to Medium.
aude updated the task description.
aude subscribed.
aude set Security to None.

Please note that a SPARQL query takes at least a few hundred ms to execute (and up to 30 s), so running one during page save/render is probably not an option.

I agree that the query should go through Varnish, but as usual - what is the purging policy? It would be very complicated to track each piece of data that appears in a result back to its origin and invalidate the result when the original changes. We could introduce an "it's OK to be stale" argument, and allow manual cache flushing.

  • In VCL, check if the request header Allow-Stale is set; if so, return the cached value. Or use some max-age setting.
  • If a request has no special headers, treat it as "cache for a minute maximum" - to prevent DoS (see the VCL sketch below)
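
One way to express both bullets in Varnish 3 VCL, using grace mode (the Allow-Stale header name comes from the bullet above; the durations are illustrative):

sub vcl_recv {
    if (req.http.Allow-Stale) {
        # Client accepts stale results: deliver from cache even if the
        # object expired up to a day ago.
        set req.grace = 24h;
    } else {
        # Default: no stale delivery.
        set req.grace = 0s;
    }
}

sub vcl_fetch {
    # Anti-DoS default: cache every response for one minute at most...
    set beresp.ttl = 60s;
    # ...but keep expired objects around so Allow-Stale clients can still
    # be served from them.
    set beresp.grace = 24h;
}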

question: why is this task limited in scope to the Graph extension?

I think the Graph extension was shown as an example of something that can easily overwhelm the system without proper caching. More elaborate, research-oriented usage is usually different, because it is done interactively by the query developer.

If we want a broad solution (and not just a stopgap that might work for graphs), I think we should go for Query entities again. These would address the problems brought up here and should also allow for better maintainability.

Jonas renamed this task from Caching for results of wikidatasparql queries for Graphs to [RFC] Caching for results of wikidatasparql queries for Graphs. Feb 12 2016, 3:21 PM

I think it is quite obvious that we need at least some caching before the Query entities are finally deployed, because Graph extension is not the only possible source of DoS requests.

There are a lot of use cases where we are fine with cached (old) results. We could make a lot of people very happy with this.

I think it is quite obvious that we need at least some caching before the Query entities are finally deployed, because Graph extension is not the only possible source of DoS requests.

I didn't quite get what you're trying to say… Query entities would inherently give us a way to cache results, and they could easily use their own (private) Blazegraph instance, which would protect them (to a certain degree) from public service outages (something you could of course do for all kinds of internal querying).
Protecting us from a DoS from Graph queries or query entity queries specifically shouldn't be too hard: we can control how often we allow users to purge the data, and all query changes need an edit at some point (which is obviously visible).

If a good caching strategy can be made to work without Query entities, why not reuse it for Scribunto access? 😃

we might want to track usage of whatever entities are involved in a query, and that could be incorporated into cache invalidation (essentially it's arbitrary access)

If a good caching strategy can be made to work without Query entities, why not reuse it for Scribunto access? 😃

Inline queries have a potentially high maintenance cost and don't have a history, therefore I would prefer not to do that without Query entities, even if we can address the performance issues.

The response does not even allow the browser to cache it. Let's solve the low-hanging fruit by setting Cache-Control: max-age=86400 per document, to cache it for a day. This would be very useful when developing graphs using the Graph sandbox (it regenerates the graph on each keystroke). And obviously an ETag, hashing the content, would be good.

Current Response Headers:

access-control-allow-origin:*
age:0
content-encoding:gzip
content-type:application/sparql-results+json
date:Sat, 13 Feb 2016 02:25:06 GMT
server:nginx/1.9.4
set-cookie:WMF-Last-Access=13-Feb-2016;Path=/;HttpOnly;Expires=Wed, 16 Mar 2016 00:00:00 GMT
set-cookie:CP=H2; Path=/
status:200 OK
version:HTTP/1.1
via:1.1 varnish, 1.1 varnish
x-analytics:https=1;nocookies=1
x-cache:cp1069 pass(0), cp3019 frontend pass(0)
x-client-ip:2a02:2698:6c23:9f7f:dddc:236c:e386:a532
x-served-by:wdqs1002
x-varnish:488633255, 212270014
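
For comparison, a cacheable response along the lines suggested above would carry something like this (the ETag and Content-Length values are illustrative, and an un-chunked response is assumed):

access-control-allow-origin:*
cache-control:public, max-age=86400
content-length:18324
content-type:application/sparql-results+json
etag:"1d8f0a63c2"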

Hmm... I'm not sure I want to do this in the generic case. Not all queries should be cached - in fact, there are plenty of queries whose results change and should not be cached, and that's the whole point - such as data-quality queries that may drive bots, etc. On the other hand, there are queries that can be cached. So what if we made some path parameter or header control which caching header nginx returns? That way the client could control whether they want a cached or uncached result (with the default being the current state of affairs).
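
A sketch of that nginx-side switch (the cache parameter name and the TTL are made up for illustration; map goes at the http level, add_header into the server/location block):

# http context: translate an opt-in query parameter into a Cache-Control value
map $arg_cache $wdqs_cache_control {
    default "no-store";              # default: current state, not cacheable
    "1"     "public, max-age=3600";  # client explicitly asked for caching
}

server {
    listen 80;
    location /sparql {
        add_header Cache-Control $wdqs_cache_control;
        proxy_pass http://127.0.0.1:9999;  # Blazegraph backend (illustrative)
    }
}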

On a more general note, I think the better solution for this would be some kind of intermediate data store (either on-wiki or maybe in RESTBase?) that would fetch query data and cache it for various times, and the graphs would use that store.

Yes as Stas says we will need to make it possible (and obvious) for people to get up-to-date results for maintenance tasks, work lists for editathons and current events.

@Smalyshev, I think the question is what to make "default":

  • should query results be cached by default, and not cached when a certain parameter is given, or the other way around?
  • should that parameter be part of the query, or should it be a header?
  • do we want to treat identical queries sent with and without "force" as being the same, so that if I force a query, it updates the cache for the non-forced ones? If so, will we need special varnish handling of this?
  • should non-cached queries still be cached for a much shorter period, like 5 seconds?

My opinion:

should query results be cached by default, and not cached when a certain parameter is given, or the other way around?

Not cached by default. But that may depend on what "default" is :) And on how it behaves in production - if we get too much load, I may change my opinion.

should that parameter be part of the query, or should it be a header?

It can be both, but a query option is a must, since query options are supported much better than headers by various caches, and the same goes for clients.

do we want to treat identical queries sent with and without "force" as being the same, so that if I force a query, it updates the cache for the non-forced ones?

That would be nice, but it depends on which cache we use and how it works. I strongly suspect the sets of cacheable and non-cacheable queries will be largely disjoint anyway, so in practice it may not matter much. As I see it, there are two kinds of queries - those where you don't care about fresh-to-the-second results (e.g. the birthdays of US presidents probably didn't change since yesterday) and those where you do. E.g. the list of entities with a broken "country" property may have changed since I last ran the bot that fixes it, and I want the actual data.

Also, doing one extra query run is not a big deal; thousands of query runs are what we worry about.

should non-cached queries still be cached for a much shorter period, like 5 seconds?

Well, there must be a possibility to run a completely non-cached query. However, whether that should be an option or a no-option situation is open for discussion. Currently I take the "no-option" position, but a lot depends on how exactly we will use it and for what.

@Smalyshev I completely agree with the concept of an intermediate service between the NanoSparqlServer and the client. I think that this service should "broker" requests (based on an options configuration object) and evaluate whether a query is re-executed against the Blazegraph db or the results can be returned from the "cache", i.e. an "offline", "response only" db.

I have been looking at Huginn https://github.com/cantino/huginn recently. This is an application that delegates tasks to agents. This (or similar app) may be suitable for MW extension usage just by using agents or webhooks instead of inline queries.

I think it may make sense to check out if RESTBase can be helpful here. @GWicke, any thoughts?

@Smalyshev, I would need more context to usefully comment on this.

In particular, I wonder if there are a small number of queries that get a lot of hits, and if those queries can be cached for long enough to result in worthwhile hit rates.

When discussing a use case like graphs, there are also a lot more caching layers and change propagation systems to consider.

@GWicke the context is that Yuri is building an interface that allows Graphs to query the SPARQL endpoint. Since running the query each time a graph is displayed is too expensive, we want some intermediate caching store that would hold the results, possibly for a time defined in the query.

As far as I can see, we do not need change propagation there - in fact, I don't think it's even possible, since figuring out which changes affect the result of a query is, in the general case, harder than running the query anew. We just need intermediary storage with expiration. So I wondered whether RESTBase would be a good platform for it.

Clarification: there is a graph design sandbox, which re-renders the graph on every change. In production, the data is pulled by graphoid service (rarely, as it is behind varnish cache), and by client browsers (when users click the graph to make them interactive).

Yurik renamed this task from [RFC] Caching for results of wikidatasparql queries for Graphs to [RFC] Caching for results of wikidata Sparql queries. Feb 15 2016, 9:14 PM
Yurik updated the task description.

Adding @BBlack as he is our Varnish guru

IIRC, the problem we've beat our heads against in past SPARQL-related tickets is the fact that SPARQL clients are using POST method for readonly queries, due to argument length issues and whatnot. On the surface, that's a dealbreaker for caching them as POST isn't cacheable. The conflict here comes from a fundamental limitation of HTTP: the only idempotent/readonly methods have un-ignorable input data length restrictions. There are probably ways to design around that in a scratch design, but SPARQL software is already-written...

@BBlack, for some unknown reason our SPARQL installation uses GET.

@BBlack this is not a POST issue; we are still using GET, since the POST patch was never approved.

I perceive the use of Varnish as not directly related to how an object broker could manage this use case (expensive querying of the wdqs nano sparql api), though it is probably related to any UI elements (i.e. the query editor or results renderer) that may generally be connected to the query service.

If a REST solution (like RESTBase) is used, a client request could either GET the results from cache with an ID or trigger a query event webhook that forwards (and stores) the response from the nanosparql server directly with a callback. The basic API design could be something like GET /query/:owner/:qid or /query/hooks/:owner/:qid, where the first case would just return the results from a db cache and the second would trigger an event that returns (and stores) a payload from the nanosparql server.

A typical use case for this is a static query that returns dynamic results updated on a regular frequency (e.g. daily) from a single client. The payload event handler for the sparql server callback could also be controlled based on client quota and retention policies.

@Christopher we're not sure we need an object broker if we can do with just Varnish. That would be much simpler (fewer moving parts) and faster to implement. We could, though, look at something like Quarry; see T104762.

@Smalyshev, I agree - it would be much better to have a stable mature caching technology for this - makes things much simpler.

I may be wrong, but the headers that are returned from a request to the nginx server wdqs1002 say that varnish 1.1 is already being used there. And, for whatever reason, it misses, because repeating the same query gives the same response time. For example, this one returns in 25180 ms, and 26966 ms on a repeat.

http://query.wikidata.org/sparql?query=PREFIX+wd%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0APREFIX+wdt%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0APREFIX+wikibase%3A+%3Chttp%3A%2F%2Fwikiba.se%2Fontology%23%3E%0APREFIX+p%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2F%3E%0APREFIX+v%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fstatement%2F%3E%0APREFIX+q%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fqualifier%2F%3E%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0A%0ASELECT+%3FcountryLabel+(COUNT(DISTINCT+%3Fchild)+AS+%3Fnumber)%0AWHERE+%7B%0A++%3Fchild+wdt%3AP106%2Fwdt%3AP279*+wd%3AQ855091+.++%0A++%3Fchild+wdt%3AP27+%3Fcountry+.%0A++SERVICE+wikibase%3Alabel+%7B%0A++++bd%3AserviceParam+wikibase%3Alanguage+%22en%22+.%0A++++%3Fcountry+rdfs%3Alabel+%3FcountryLabel%0A++%7D+%0A++%0A%7D+GROUP+BY+%3FcountryLabel+ORDER+BY+DESC(%3Fnumber)
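
URL-decoded for readability, that query is:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?countryLabel (COUNT(DISTINCT ?child) AS ?number)
WHERE {
  ?child wdt:P106/wdt:P279* wd:Q855091 .
  ?child wdt:P27 ?country .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
    ?country rdfs:label ?countryLabel
  }
} GROUP BY ?countryLabel ORDER BY DESC(?number)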

Even though Varnish cache should work to proxy nginx for optimizing delivery of static query results, it lacks several important features of an object broker. Namely, client control of object expiration (TTL) and retrieval of "named query results" from persistent storage. A WDQS service use case may in fact be to compare results from several days ago with current results. Thus, assuming the latest results state is what the client wants may actually not be true.

Possibly, the optimal solution would use the varnish-api-engine (http://info.varnish-software.com/blog/introducing-varnish-api-engine) in conjunction with a WDQS REST API (provided with a modified RESTBase?). Is the varnish-api-engine being used anywhere in WMF? Also, delegating query requests to an API could allow POSTs. Simply with Varnish cache, the POST problem would remain unresolved.

I may be wrong, but the headers that are returned from a request to the nginx server wdqs1002 say that varnish 1.1 is already being used there.

It's varnish 3.0.6 currently (4.x is coming down the road).

And, for whatever reason, it misses, because repeating the same query gives the same response time.

It misses because the response is sent with Transfer-Encoding: chunked. If it were sent un-chunked with a Content-Length, the varnish would have a chance at caching it. However, the next thing you'd run into is that the response doesn't contain any caching-relevant headers (e.g. Expires, Cache-Control, Age). Lacking these, varnish would cache it with our configured default_ttl, which, on the misc cluster where query.wikidata.org is currently hosted, is only 120 seconds.

Even though Varnish cache should work to proxy nginx for optimizing delivery of static query results, it lacks several important features of an object broker. Namely, client control of object expiration (TTL) and retrieval of "named query results" from persistent storage.
A WDQS service use case may in fact be to compare results from several days ago with current results. Thus, assuming the latest results state is what the client wants may actually not be true.

I think all of this is doable. Named query results is something we talked about in the previous discussion re GET length restrictions. POSTing (and/or server-side configuring, either way!) a complex query and saving it as a named query through a separate query-setup interface, then executing the query for results with a GET on just the query name.

I don't think we really want client control of object expiration (at least, not "varnish cache object expiration"), but what we want is the ability to parameterize named queries based on time, right? e.g. a named query that gives a time-series graph might have parameters for start time and duration. You might initially post the complex SPARQL template and save it as fooquery, then later have a client get it as /sparql?saved_query=fooquery&start=201601011234&duration=1w. Varnish would have the chance to cache those based on the query args as separate results, and you could limit the time resolution if you want to enhance cacheability.

If it's for inclusion from a page that wants to graph that data and always show a "current" graph rather than hardcoded start/duration (and I could see use-cases for both in articles), you could support a start time of now with an optional resolution specifier that defaults to 1 day, like &start=now/1d. The response to such a query would set cache-control headers that allow caching at varnish up to 24H (based on now/1d resolution), which means everyone executing that query gets new results about once a day and they all shared a single cached result per day.

The important thing here is there's no need for a client to have control over result object expiration if the query encodes everything that's relevant to expiration and the maximum cache lifetime is set small enough that other effects (e.g. data updates to existing historical data) are negligible in the big picture.
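
Concretely, such an exchange might look like this (the query name, parameters, and header values are all illustrative):

GET /sparql?saved_query=fooquery&start=now/1d&duration=1w HTTP/1.1
Host: query.wikidata.org

HTTP/1.1 200 OK
Content-Type: application/sparql-results+json
Cache-Control: public, max-age=86400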

Possibly, the optimal solution would use the varnish-api-engine (http://info.varnish-software.com/blog/introducing-varnish-api-engine) in conjunction with a WDQS REST API (provided with a modified RESTBase?). Is the varnish-api-engine being used anywhere in WMF? Also, delegating query requests to an API could allow POSTs. Simply with Varnish cache, the POST problem would remain unresolved.

We're not using the Varnish API Engine, and I don't see us pursuing that anytime soon. Most of what it does can be done other ways, and more importantly it's commercial software.

There seems to be some confusion as to whether POST is or isn't still an issue here...

Also, a whole separate issue is that WDQS is currently mapped through our cache_misc cluster. That cluster is for small lightweight miscellaneous infrastructure. WDQS was probably always a poor match for that, but we put it there because at the time it was seen as a lightweight / low-rate service that would mostly be used directly by humans to execute one-off complicated queries. The plans in this ticket sound nothing like that, and cache_misc probably isn't an appropriate home for a complex query service that's going to backend serious query load from wikis and the rest of the world...

@BBlack, thanks for raising these issues, this is definitely something we should address.

If it were sent un-chunked with a Content-Length, the varnish would have a chance at caching it.

It may be how Blazegraph sends it. But maybe we could un-chunk it on the way (in nginx? inside Blazegraph?).

I don't think we really want client control of object expiration (at least, not "varnish cache object expiration"),

Yes, but I was thinking more of a scenario where Varnish has an object in cache but may decide to still go to the backend, or serve from cache, depending on incoming parameters. Is this possible? In general, is it possible in Varnish to decouple the lifetime of an object in the cache from the decision of whether to serve the object from cache or from the backend? E.g. so that a client that is OK with being served an "old" object gets it from cache, but otherwise the object is served only if it's relatively "fresh", even though it can be retained longer for less demanding clients. Not sure we'd need it that complex, but it would be nice to know whether it's possible.

You might initially post the complex SPARQL template and save it as fooquery

I am not sure we want to go into explicitly saving named queries, as I am not sure it can scale. That said, I haven't explored saved queries too deeply yet, so I'll keep it in mind and try to look into it ASAP. But so far I think caching is the better alternative, and also much simpler.

The plans in this ticket sound nothing like that, and cache_misc probably isn't an appropriate home for a complex query service

Correct. We probably need to consider this.

I am not sure we should mix response caching and stored procedures (named queries) in the same task. A stored proc could add additional settings for caching, but I feel this is outside the scope of this ticket.

@BBlack, my understanding is that Wikidata has almost no "timeseries" data - requested with before <= X < after. Time-based filters rarely relate to "now". For example, a query could get all the generals who were involved in the War of 1812 - fixed dates. Of course there could be a query like "current president", but that's not time-based; that's a preferred value, i.e. the regular stale-cache problem common to all queries.

It seems there are two types of queries - "viewers" that draw pretty pictures, and "editors" that are used for editing Wikidata and finding errors. While "viewers" can tolerate stale data (hours/days), "editors" need more recent data (minutes). "Editors" do not need the absolutely latest data, because if a query tells you that Q55 has bad data, you will still load Q55 to see what it has before editing it. And even then, there could be a race condition. Btw, with an ETag, even if something is cached in Varnish for a minute, the client may not have to re-download it hours later if the new result has the same ETag as the last one - Varnish should take care of that.
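
That is the standard conditional-request flow (ETag value illustrative): the client revalidates instead of re-downloading, and gets a cheap 304 when nothing changed:

GET /sparql?query=<same query as before> HTTP/1.1
If-None-Match: "5e8dd316726b033f"

HTTP/1.1 304 Not Modified
ETag: "5e8dd316726b033f"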

We can distinguish between "editor" and "viewer" queries with an extra URL param or a header (we can support both). The query param could be handled and removed in VCL. My concern: should the result be cached by default for the "viewer" use case, requiring "editors" to supply an extra param, or the other way around? My opinion is that developers will go out of their way to make sure their tools work correctly (e.g. by adding a "no-caching" parameter), but in general will not deal with performance ("it already works fast enough, I don't see a problem, I won't look closely at the documentation to discover the 'caching is OK' parameter").

@Smalyshev, I just spoke with @Milimetric in Analytics - he has a similar problem, in the sense that an external API is used by graphs. And we agreed that it's not really a problem and doesn't even need that much caching, because there will be (relatively) very few graphs that query the backend, and when they do, the results will be cached in the Varnish that handles graph images. So I suspect this won't be a problem here any time soon either.

Since running the query each time graph is displayed is too expensive, we want some intermediate caching store that would store the results, possibly for the time defined in the query.

Is the graph extension actually re-requesting the data on each view, or would this only happen on parser cache miss / edit?

I'm still not sure how effective query service caching can be in this context:

In particular, I wonder if there are a small number of queries that get a lot of hits, and if those queries can be cached for long enough to result in worthwhile hit rates.

@GWicke afaik the data is only requested when the graph is interacted with, plus on parser cache miss and on edit.

some of my concerns are about the effect of including a graph (with a slowish query) on page save timing, e.g. when just fixing a typo on a wikipedia page. It would still be nice to be able to purge the graph, on request or at some time interval.

I also want the query service to be able to handle whatever added load this brings, and caching could help protect it.

some of my concerns are about the effect of including a graph (with a slowish query) on page save timing, e.g. when just fixing a typo on a wikipedia page.

To get sensible hit rates for relatively rare events like edits, we would need to cache results for a long time, on the order of weeks. Would this be acceptable without automatic purging?

Data is requested in these cases:

  • user views a page with a graph -> request to graphoid -> varnish cache miss -> graphoid re-renders the graph
  • user views a page with an interactive graph and clicks it to interact with it
  • editor clicks "page preview"
  • editor changes a graph and saves -> graph has a new hash -> (same as first bullet because it's a cache miss)

Data is NOT requested:

  • during the page save
  • after page save that did not change the graph itself (image has the same hash, so it is still cached)
  • if the graph is part of a template and tons of pages get updated - nothing is requested until a page is visited by some user

user views a page with an interactive graph and clicks it to interact with it

This is the scenario which IMHO could drive the most load. The rest is either rare or already protected by cache, like this one:

user views a page with a graph -> request to graphoid -> varnish cache miss -> graphoid re-renders the graph

So it seems the concern is mainly about interactive graphs. Maybe we could start with non-interactive ones?

That said, regardless of graphs, I wouldn't mind having some caching model, at least short-term, both as a performance measure and as low-key "accidental DoS" protection (we've had cases in the past where some broken bot sent a million instances of the same query; I'd like Varnish to be the one talking to such bots in the future).

That's relatively easy - I could disable the wikidataquery: protocol in the browser while still allowing it on the Graphoid backend. Thing is, I highly doubt there are that many people who will click on an interactive graph - simply because it requires programming skills to create one, and most of the time users will not click on it even when reading a page. So I think we are being overly cautious here without a reason.

@Jonas – Did you mean to tag this as TechCom-RFC or do you not want their input? Right now it's just generally tagged as #RfC, which essentially is merely "needs discussion".

@Jdforrester-WMF I'm not sure we have anything for ArchCom here yet. We need more discussion to form a coherent proposal, and also to discuss it with ops.

After talking about it, the first task to enable this seems to be to try to make nginx buffer WDQS results - at least the smaller ones.

Change 274864 had a related patch set uploaded (by Smalyshev):
Add caching headers for nginx

https://gerrit.wikimedia.org/r/274864

Change 274864 merged by Gehel:
Add caching headers for nginx

https://gerrit.wikimedia.org/r/274864

Should be working now, with a 1-minute lifetime for now. We'll see how it works and whether we want to change it.
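
For reference, the nginx side of such a change would look roughly like this (a sketch, not the literal patch - see the Gerrit change above for the real thing; the backend address is illustrative):

location /sparql {
    proxy_pass http://127.0.0.1:9999;               # Blazegraph backend
    proxy_buffering on;                             # buffer the upstream response
    add_header Cache-Control "public, max-age=60";  # the 1-minute lifetime
}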

Smalyshev claimed this task.