
Provide a 5-minute timeout in WDQS for trusted users using OAuth
Open, LowPublic

Description

Currently, WDQS's timeout prevents many legitimate queries involving large datasets from completing (e.g. various counting queries tied to large wikis; see the example at the bottom). The performance and attack concerns behind the timeout are understood.

This ticket requests that, in addition to the anonymous public service, an OAuth-dependent service be offered for logged-in users, permitting users with a "trusted query runner" bit (see below) to run queries with a significantly longer timeout -- perhaps 5 minutes?

The bit could be defined and managed by the Wikidata community. That community could either come up with criteria for giving out the bit, or just empower Wikidata admins to give out the bit at their discretion.

Obviously, Ops and Search must retain responsibility for the cluster's health, so they would be empowered to remove the "trusted query runner" bit from any user seen to be abusing or attacking the servers.

Example query that times out on larger wikis:
http://tinyurl.com/y973aawx

I can run this query -- which really impresses people, like an instant x-ray of a Wikipedia's coverage of people -- for smaller languages, but not for, say, nl, de, fr, zh, ru, en.

Event Timeline

Ijon created this task. Nov 6 2017, 10:42 PM
Restricted Application added projects: Wikidata, Discovery. Nov 6 2017, 10:42 PM
Restricted Application added a subscriber: Aklapper.
Smalyshev added a subscriber: Smalyshev.
Base added a subscriber: Base. Nov 7 2017, 12:07 AM
Base added a comment. Edited Nov 7 2017, 1:04 AM

While I support the request completely, the linked query can be optimised by using just one label language and dropping the slow label service. I was able to get results for a Wikipedia as big as German: http://tinyurl.com/ydczsfpo

Jane023 added a subscriber: Jane023. Nov 7 2017, 8:52 AM

I have come to understand (mostly through trial and error, and then asking around) that to reduce the query time you need to start with the group that has the fewest items among those you want to query. As these "sub-groups" get bigger and bigger, I need to become ever more creative to split them in order to run the same query on a yearly basis. It would be nice if these large "sub-groups" could somehow have their own reporting instances in query-land, e.g. all women items, all people-born-in-city items, all items-with-sitelinks-in-a-specific-project (language WP, language WS, Commons), etc.

While I support the request completely, the linked query can be optimised by using just one label language and dropping the slow label service. I was able to get results for a Wikipedia as big as German: http://tinyurl.com/ydczsfpo

You can even keep the label service if you do the heavy lifting in a subquery: http://tinyurl.com/y9zxr5jp – see T166139: Query optimizer for labels should be using sub queries for doing that optimization automatically.

thiemowmde triaged this task as Low priority. Dec 6 2017, 11:04 PM
doctaxon added a comment. Edited Feb 23 2018, 3:46 PM

"The number of entities in Wikidata has grown considerably.

In my opinion, the SPARQL timeout period should be extended, because many queries that once ran no longer complete today due to query timeouts.

Maybe we should get a SPARQL server that allows a longer timeout period."


Okay, thank you for merging, but this task is very old, too.

@Aklapper Can you assign this task to someone at the WMF who maintains the SPARQL servers? I guess the timeout duration could be increased server-side.

Thank you
Martin ...

@doctaxon: I usually do not assign tasks to developers as I am not a manager of some unlimited pool of developers. :) Many tasks remain open until someone volunteers to investigate and provide a patch. The associated project tags provide information and links about individuals and teams interested in these areas.

Maybe we should get a SPARQL server that allows a longer timeout period.

The problem with that, besides the obvious resource constraints (hardware costs money; supporting service instances costs money and needs people), is that a longer timeout means more heavy queries running there. That means the server would be more loaded, so the next heavy query would take even longer, since it would be competing for server resources with the heavy queries already running. This has the potential to evolve into a complete gridlock.

There's of course a point where it's still OK because the server has enough resources. But I don't think queries that return many millions of results should be our target, at least not for the generic endpoint, and maybe not at all. We may have a case for such queries, but I'd like to hear it and discuss it before we approach the solution for this.

How about a separate server just for "heavy" queries? For everyone? If that goes down occasionally, hey, so what?

To be extra fancy, maybe trigger it with a SPARQL comment, "# timeout:10" or something; every query with such a comment goes to the separate server?
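Magnus's comment-triggered routing idea could be sketched as a small front-end rule. This is purely illustrative: the `HEAVY_ENDPOINT` URL, the `# timeout:N` syntax (interpreted here as seconds), and the clamping policy are all assumptions, not anything that exists in WDQS.

```python
import re

# Hypothetical routing rule: queries carrying a "# timeout:N" comment go to
# a separate "heavy query" endpoint; everything else stays on the normal one.
FAST_ENDPOINT = "https://query.wikidata.org/sparql"        # real public endpoint
HEAVY_ENDPOINT = "https://query-heavy.example.org/sparql"  # invented for this sketch

TIMEOUT_RE = re.compile(r"^\s*#\s*timeout:\s*(\d+)\s*$", re.MULTILINE)

def route(query: str, default_timeout: int = 60, max_timeout: int = 600):
    """Return (endpoint, effective_timeout) for a SPARQL query string."""
    m = TIMEOUT_RE.search(query)
    if m is None:
        return FAST_ENDPOINT, default_timeout
    requested = int(m.group(1))
    # Clamp so a comment cannot request an unbounded timeout.
    return HEAVY_ENDPOINT, min(requested, max_timeout)
```

The clamp matters: without it, the comment mechanism would reintroduce the "unlimited execution time" problem discussed below.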

@bd808 @chasemp What is your opinion on this?

@bd808 @chasemp What is your opinion on this?

@Smalyshev and his team are in charge of the WDQS software and servers.

Ah! I didn't know about that, thank you

@Smalyshev wrote:

This will always be the case, we will never be able to serve arbitrary requests that require unlimited time to perform.

@Smalyshev: ... but you didn't explain the reason why not. We really don't need requests with unlimited time, just requests with a longer timeout duration. Not unlimited, but a longer limit. @Magnus has made a good suggestion. Why don't we discuss it, with the goal of finding a good way to serve bigger requests? Wikidata is growing more and more; if we stand still, SPARQL will be useless in the near future. I hereby invite @MichaelSchoenitzer_WMDE to participate in this discussion, because he has a lot of knowledge about SPARQL.

Please let's discuss Magnus' suggestion stated above, because I think it is a really good solution.

Thank you very much, Martin

but you didn't explain the reason why not.

Because it is a shared service, and running queries with no limit on execution time will deny other users opportunity to use this service.

if we stand still, then SPARQL will be useless in near future

I do not think running 3-5M queries a day qualifies as "useless". Let's not exaggerate here - not allowing an extreme outlier query to run does not make the service "useless". We could discuss solutions for extreme outliers, but we should understand what we're talking about here and approach it realistically.

Semantics and drama aside, I think the point is that some useful queries are already creaking, or failing entirely.

Species without image times out, and I'm not even requesting taxon name (and we all know to stay away from the SERVICE labels for long queries by now).

Women without image runs sometimes, but fails on other occasions, so it will probably stop working soon, as it does for men.

I would not call either of those "extreme outliers"; I use both in scripts (taxa limited to 700K, sadly), for very practical reasons. As far as I can tell, they do not fail because of complexity, but because of data size. 30 seconds is an arbitrary limit that was set a while ago. It keeps the servers running, but with data size growing, more and more queries will begin to fail. These failing queries will only make up a small percentage of queries run, but they are very useful. So we would like a mechanism to run those few useful, simple, un-optimizable queries. A separate server that runs fewer but longer queries would do that.

@Smalyshev @Magnus

Thank you, Magnus, this was my opinion too, but my English is not as good as yours, so ...

Let's discuss Magnus' suggestion, it sounds really good.

Species without image times out

I ran this one: http://tinyurl.com/yaak4f5n and it worked in 50s, which is close to the limit, but still under it. It produced 1874294 results. I would call a query that produces almost 2M results an outlier, though I'll discard "extreme" to avoid emotional labels. The point is, queries that produce millions of results are heavy. They consume a lot of CPU, I/O and memory capacity. We can run one of them occasionally, but we can't run an unlimited number of them all the time. We've had downtimes in the past because people set up a bot that ran such queries (and worse) in a loop, and that effectively killed the service.

That said, I recognize the need for such queries to be run, and I am looking for a solution. Potential solutions could be:

  1. Separate gated access point (e.g. with OAuth) for select users that we can trust with self-policing that allows longer timeouts (needs some R&D time to implement)
  2. Separate service (needs some budgeting and resourcing)
  3. Offlining expensive queries - i.e. running them by bot in a controlled manner, so they do not overload the service, and storing the results somewhere.
  4. Different way of doing the same things (i.e. do we need the list of all 1.8M entities needing images? What would we do with this 1.8M list? A LIMIT 500 query on the same finishes very fast)
  5. Other ideas?

I'll think about this more and see what solutions we can arrive at. More ideas welcome.
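The offlining idea (option 3 above) can be sketched as a tiny token-based job queue: submit a query, get a token, poll until the result is ready, with queries executed one at a time so heavy work never piles up. This is a minimal in-memory sketch, not WDQS code; the class name, the `execute` callable, and the single-worker policy are all assumptions for illustration.

```python
import queue
import threading
import uuid

class OfflineQueryRunner:
    """Sketch of "offlining": submit returns a token, poll returns the
    result once the single background worker has executed the query."""

    def __init__(self, execute):
        self._execute = execute          # callable: query text -> result
        self._results = {}
        self._queue = queue.Queue()
        # One worker thread => at most one heavy query runs at a time.
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, query: str) -> str:
        token = uuid.uuid4().hex
        self._queue.put((token, query))
        return token

    def poll(self, token: str):
        """Return the result if finished, else None."""
        return self._results.get(token)

    def _worker(self):
        while True:
            token, query = self._queue.get()
            self._results[token] = self._execute(query)
            self._queue.task_done()
```

A real service would also need result expiry, persistence across restarts, and per-user quotas, which this sketch deliberately omits.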

As a side note, we're running on reduced capacity now (due to T188045: wdqs1004 broken) so this may not be the most suitable time to test heavy queries until that one is resolved :) I haven't seen any disruptions yet (that's why we have redundant configuration) but I think it'd be wise not to push it too far right now.

30 seconds is an arbitrary limit that was set a while ago.

Ah yes, also minor point, but we've been on 60s limit for a year now. Once T178492: Create a more controlled WDQS cluster is done, and production services use that endpoint, we could talk about bumping the limit again if necessary.

Maybe a 120s limit? I think we have to test it, but how?

I can see the timeout brings a lot of frustration.
And I think we can agree that bumping the limit every few months cannot be the solution.

As a first step I would propose to create a process where one can submit a query that times out, so we can better understand the problem and the needs of the users.
This will also help to optimize the queries and the query optimizer which will help to avoid a lot of timeouts.

After we have gained more insights we could discuss what needs of the users should be satisfied with the services and which needs should/can be satisfied with other services.

Magnus added a comment. Edited Feb 27 2018, 8:57 AM

Thanks @Smalyshev Yes these queries are outliers, but there is a good reason to run them (in this case, find missing images for items), and just by Wikidata growth, the number of outliers (queries that timeout) will increase, to the point where they are no longer outliers. Technical improvements (improved software, better hardware) can buffer some of this, but not all. As you show, some queries can be re-formulated (though the reason why one works but not the other remains a mystery to most users), but again, not forever.

As for my preferences for implementation:

  1. OAuth-regulated access point - rather not. It complicates script based queries. Can work around that, but...
  2. Separate service - yes, though ideally I'd just use the same URL and add a keyword to the SPARQL (we used to have /* SLOW_OK */ on the original toolserver SQL queries), which would then use a separate server in the background, and be transparent for the end user
  3. Offlining - if I could submit my query, get a token back, and then can query the token once every 5 min or so if it's done, and get the result back then, that would be fine with me
  4. What I do with the 1.8M items is I run full-text searches on them on Commons and Flickr, in this case. I am perfectly happy to cache the result myself for days or weeks and work on that list, but I need to get the list in the first place. LIMIT and OFFSET do not work for these when the full query doesn't, I tried
  5. For my own purposes, we could even have a Wikidata page with "named queries" (heading with the name, paragraph with the query, or something) that gets run on a regular basis; offlining, with a twist. This could produce static JSON files (maybe compressed) somewhere, under the query name. Additionally, the query service could offer returning the data via a fake query ("# NAMED QUERY Bob" etc), but that's optional, though it would have the benefit of using the same interface
  6. Transparent offlining - queries with a keyword get run with increased timeout (say, 10min), but only one of them at a time; the connection stays open though, and eventually returns, even if it takes hours. Feels dirty, though :-(
doctaxon added a comment. Edited Feb 27 2018, 2:47 PM

@Jonas
This is a query that runs into a timeout. It should list all Swedish women who have no sitelink to dewiki, counting the sitelinks and listing some properties for each item; it is needed for a dewiki community project. I cannot limit it because of the required double ORDER BY, so the query has to run with an unlimited result. With a longer timeout, the query would succeed. It would be great if you could optimize it, if possible. Can you estimate how much time the query would take to complete?

select
   ?sitelinks
   ?itemLabel
   ?itemDescription
   (group_concat(distinct ?P106l; separator=', ') as ?P106s) 
   (sample(?P18) as ?P18s)
   (group_concat(distinct ?P569l; separator=', ') as ?P569s)
   (group_concat(distinct ?P569prec; separator=', ') as ?P569precs)
   (group_concat(distinct ?P569cal; separator=', ') as ?P569cals)
   (group_concat(distinct ?P19l; separator=', ') as ?P19s) 
   (group_concat(distinct ?P570l; separator=', ') as ?P570s)
   (group_concat(distinct ?P570prec; separator=', ') as ?P570precs)
   (group_concat(distinct ?P570cal; separator=', ') as ?P570cals)
   (group_concat(distinct ?P20l; separator=', ') as ?P20s) 
   (group_concat(distinct ?P27l; separator=', ') as ?P27s) 
   ?item
with {
   select ?item ?sitelinks
   where {
      hint:Query hint:optimizer "None".
      ?item wdt:P27 wd:Q183; wdt:P21 wd:Q6581072; wdt:P31 wd:Q5; wikibase:sitelinks ?sitelinks .
      filter not exists {[] schema:about ?item; schema:isPartOf <https://de.wikipedia.org/>}
   }
} as %subquery
where {
   hint:Query hint:optimizer "None".
   include %subquery.
   optional {?item wdt:P106 ?P106. optional {?P106 wdt:P2521 ?P106l. filter(lang(?P106l) = 'de')}}
   optional {?item wdt:P18 ?P18.}
   optional {?item wdt:P27/rdfs:label ?P27l.}
   optional {?item wdt:P569 ?P569.}
   optional {?item p:P569/psv:P569 [wikibase:timePrecision ?P569prec; wikibase:timeCalendarModel ?P569cal]}
   optional {?item wdt:P570 ?P570.}
   optional {?item p:P570/psv:P570 [wikibase:timePrecision ?P570prec; wikibase:timeCalendarModel ?P570cal]}
   optional {?item wdt:P19/rdfs:label ?P19l.}
   optional {?item wdt:P20/rdfs:label ?P20l.}
   bind(substr(str(?P569), 1, 10) AS ?P569l).
   bind(substr(str(?P570), 1, 10) AS ?P570l).
   bind(xsd:integer(substr(str(?item), 33)) as ?num).
   service wikibase:label {bd:serviceParam wikibase:language 'de,[AUTO_LANGUAGE]'.}
}
group by ?sitelinks ?itemLabel ?itemDescription ?item
order by desc(?sitelinks) asc(?num)

@doctaxon use MINUS instead of FILTER NOT EXISTS, and stop killing the optimizer, and then %subquery runs in less than 5 seconds.

SELECT ?item ?sitelinks WHERE {
  ?item wdt:P27 wd:Q183;
        wdt:P21 wd:Q6581072;
        wdt:P31 wd:Q5;
        wikibase:sitelinks ?sitelinks.
  MINUS {
    ?article schema:about ?item;
             schema:isPartOf <https://de.wikipedia.org/>.
  }
}

It only returns some two thousand results, so the surrounding query shouldn’t be a problem either…

Oh, and the outer query is trying to get all the labels of the place of birth, place of death, country of citizenship, etc., in all languages. That’s also a terrible idea.

This runs in ~20 s: P6752 – it’s not equivalent, but I hope it helps.

It only returns some two thousand results, so the surrounding query shouldn’t be a problem either…

No, the "minus" variation gives a timeout, too.

?item wdt:P27 wd:Q183 was my mistake; these are the German women. I need the Swedish ones, sorry: ?item wdt:P27 wd:Q34

Oh, and the outer query is trying to get all the labels of the place of birth, place of death, country of citizenship, etc., in all languages. That’s also a terrible idea.

Not in all languages, but in German or the first alternative language. I want to get at least one alternative.

Separate service - yes, though ideally I'd just use the same URL

If it will be separate service, it will be separate URL. Having two distinct services under the same URL would not really work well with how our LVS is set up, unless we do some complex tricks with redirects, which I don't really want to get into.

LIMIT and OFFSET do not work for these when the full query doesn't, I tried

LIMIT worked for me; OFFSET of course wouldn't work better than the original once the numbers get bigger, since OFFSET essentially needs to run through the whole data set to get to the offset. If you need the whole dataset, LIMIT/OFFSET would only make the issue worse, quadratically. However, if dealing with a subset of the data is enough for the use case, then LIMIT might help. Of course, it depends on the use case (e.g. I am still not sure what the 2M results are being used for, specifically). But LIMIT/OFFSET is certainly not going to solve this case.
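The "quadratically worse" remark about paging through a full result set with LIMIT/OFFSET can be made concrete with a rough cost model. This is a back-of-the-envelope sketch, not a measurement of Blazegraph: it just counts the rows the engine must touch if every OFFSET forces a re-scan from the start.

```python
def offset_paging_cost(total_rows: int, page_size: int) -> int:
    """Rows touched to fetch all pages when each OFFSET re-scans from row 0."""
    work = 0
    for offset in range(0, total_rows, page_size):
        work += offset + page_size   # skip `offset` rows, then read one page
    return work
```

For 1000 rows in pages of 100, the engine touches 5500 rows instead of 1000 in a single pass; the gap widens quadratically as the result set grows, which is why paging cannot substitute for a longer timeout when the whole dataset is needed.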

For my own purposes, we could even have a Wikidata page with "named queries"

Making a proposal for something like this has long been on my TODO list, see also https://commons.wikimedia.org/wiki/User:TabulistBot - but I haven't got it into working condition due to lack of time. I could pick it up.

Transparent offlining - queries with a keyword get run with increased timeout (say, 10min), but only one of them at a time

I don't think this is feasible - there's no real way to ensure the "only one" part, since the servers are completely independent, and even "one per server" is not completely trivial, though it can probably be done with some query parameter magic. More worrying is that this requires support for very long requests across the whole pipeline, from the frontend down to Blazegraph, and I'm not sure how well this would actually work - HTTP is not really meant for hour-long requests, and it would be a real shame to run one only to discover there's no way to deliver the data back because the connection died in the meantime.

Additionally, this doesn't really solve the issue as one query, provided it's big enough, can take down the whole server (Java handles OOM really badly, which is mostly mitigated by the fact that most queries will time out before they can clog enough memory, but if we remove timeout the risk grows significantly). If we had it in some kind of gated setup, it'd be fine to take this risk, but exposing it to the Internet where people are not exactly known for not trying to break things just for lulz seems too big a risk for me.

So, in general, this seems to need some thought yet... I'd like to hear more details of what exactly is done with those 2M results, maybe I'll have some other ideas.

Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board. Mar 5 2018, 4:15 PM

So far all the examples of queries that are timing out either produce massive amounts of data or are very complex. But a simple pattern of timing-out query that I see very often is the following:

SELECT ?item ?sitelinks WHERE {
  ?item wdt:P31 wd:Q;
        wikibase:sitelinks ?sitelinks .
  MINUS { ?item ^schema:about/schema:isPartOf <https://de.wikipedia.org/> }
} ORDER BY DESC(?sitelinks) LIMIT 10

Getting the items of one area with the most sitelinks, optionally missing a specific sitelink. This is a widely used pattern, often used to find out what the most "important" topics of an area are, to find white spots in Wikipedia's coverage, or to generate work lists. These queries time out as soon as the number of articles in the area is too big, which is quite often the case and becomes more common as Wikidata grows. It doesn't matter what the LIMIT is.

The query can be tuned a bit if you can guess the number of sitelinks needed to get into the top 10. But that is hard and only a small improvement, and therefore often not sufficient. For example, this query runs sometimes, but most of the time it times out:

SELECT ?item ?sitelinks WHERE {
  ?item wdt:P31 wd:Q5;
        wikibase:sitelinks ?sitelinks .
  filter (?sitelinks > 40).
  MINUS { ?item ^schema:about/schema:isPartOf <https://de.wikipedia.org/> }
} ORDER BY DESC(?sitelinks) LIMIT 10

Due to the growth of Wikidata it will eventually always time out. Do you see any other tricks to improve this type of query?

So far all the examples for queries that are timing out are queries that produce massive amount of data or are very complex

That's the reason why queries time out. There's no magic to it - queries that time out are those that require the engine to process a huge amount of data. That can be because the engine's interpretation of the query is wrong (such as optimizer bugs), because the query is written badly, or because the query genuinely requires the engine to do a lot of work. In the first two cases, you can try to rewrite the query in a way that directs the engine to do less work for the same result. In the last case, there's nothing to be done - if you want a shared service to consume more resources for you, everybody else gets proportionally less. We may indeed need to look into how to serve resource-heavy requests too, but please realize that the apparent textual simplicity of a query does not mean it doesn't process millions of entities. That's pretty much always the reason a query is slow (bugs excluded, of course).

That's the reason why queries time out. There's no magic to it - queries that time out are those that require the engine to process a huge amount of data.

Well, that's obvious. But what I was referring to is that while the above examples produce huge amounts of data in their output, and can therefore never be significantly faster by definition, the example I gave does not produce a big amount of data as output.
It is slow because the SPARQL query first generates a list of all humans – which is huge – then sorts the list, and only then applies the limit. The filter is also only applied after generating the list, so it does not improve the query time much. There is currently no way to avoid generating the huge interim list in the first place – but there could be!
So in contrast to the queries by Magnus and others, this type of query could be faster.
A (not very nice, but working) way to do so would be to add binned or qualitative information to the RDF database containing an estimate of the number of sitelinks, so that one could add something like:

?item wikibase:popularity wikibase:verypopular.

to restrict the list to items with more than some constant number of sitelinks. Not very pretty, but something like that could be implemented to improve this very common and still very inefficient pattern described above.
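The binned-popularity idea amounts to precomputing a coarse bucket from the sitelink count at indexing time, so a query can pre-filter before sorting. A minimal sketch of such a binning function follows; the bucket names and thresholds are invented for illustration (the `wikibase:popularity` predicate above does not exist in the real RDF export).

```python
# Hypothetical bins, ordered from most to least popular. In a real
# implementation these labels would become RDF objects attached to each item.
BINS = [(100, "verypopular"), (40, "popular"), (10, "known")]

def popularity_bin(sitelinks: int) -> str:
    """Map a raw sitelink count to a coarse, pre-filterable bucket."""
    for threshold, label in BINS:
        if sitelinks >= threshold:
            return label
    return "other"
```

A query for the top-10 items could then restrict itself to the "verypopular" bucket and only fall back to wider buckets if fewer than 10 results come back, avoiding the full scan over all humans.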

But what I was referring to is that while above examples produce huge amount of data in the output and can therefore never be significantly faster by definition, the example I gave does not give big amounts of data as output.

Yes, as you correctly noted, what matters is not the output but the quantity of data the query has to process. A large output usually means a large amount of data, but the converse is not true - even queries outputting one item, or nothing at all, can require large working sets.

Maybe you see any other tricks to improve this type of query?

Adding a query hint helps:

SELECT ?item ?sitelinks WHERE {
  ?item wdt:P31 wd:Q5;
        wikibase:sitelinks ?sitelinks. hint:Prior hint:rangeSafe true. # tell BlazeGraph that we’re not mixing ints/floats/whatever here
  FILTER(?sitelinks >= 40)
  MINUS { ?item ^schema:about/schema:isPartOf <https://de.wikipedia.org/> }
}
ORDER BY DESC(?sitelinks)
LIMIT 10

However, you still need the FILTER, unfortunately.

Ijon added a comment. Jun 4 2018, 5:02 PM

Yes. Another example of a query timing out without lots of output is this simple count of scholarly articles (millions)

SELECT (COUNT(?article) AS ?count)
WHERE {
?article wdt:P31/wdt:P279* wd:Q13442814
}

I think such counts are very reasonable queries to occasionally want to run, so yes, we do still need a solution -- with all appropriate measures (only trusted users, quotas, whatever) -- for this issue.

@Ijon: Counting and generating lists of items with a given P31 can now also be done with the new search features: haswbstatement:P31=Q13442814. Maybe this feature can be improved in the future to also support searching including subclasses (P279).
SPARQL might not be the right tool for everything.

Another example of a query timing out without lots of output is this simple count of scholarly articles (millions)

I think this one should be easy to do by first fetching all types with something like http://tinyurl.com/y9pnxg82 and then running counts for each of the types. Though there might be intersections...
That said, I see that the direct count is slow too. Not sure why; it should be much faster. I'll look into it.
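The "there might be intersections" caveat is the crux of the decomposition approach: an item can be an instance of several types in the subclass tree, so summing per-type counts over-counts it. Combining the per-type result sets as a union fixes that. A minimal sketch, with hypothetical per-type item sets standing in for the per-type SPARQL results:

```python
def count_instances(items_by_type: dict[str, set[str]]) -> int:
    """Count distinct items across all types: union of the per-type
    result sets, not the sum of their sizes."""
    seen: set[str] = set()
    for items in items_by_type.values():
        seen |= items
    return len(seen)

# An item appearing under two types ("b" below) is counted once, whereas
# summing len() per type would count it twice.
```

At WDQS scale the per-type sets would be millions of identifiers, so a real implementation might stream sorted ID lists or use a probabilistic counter instead of an in-memory set, but the union-not-sum principle is the same.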

OK, so this query is fast:

SELECT (COUNT(*) AS ?count)
WHERE {
?article wdt:P31 wd:Q13442814 .
}

and this is slow:

SELECT (COUNT(?article) AS ?count)
WHERE {
?article wdt:P31 wd:Q13442814 .
}

I think for @Ijon's case the former works as well. Also, this one:

SELECT (count(*) as ?cnt) WHERE {
  SERVICE gas:service {
     gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" .
     gas:program gas:in wd:Q13442814 . # one or more times, specifies the initial frontier.
     gas:program gas:out ?type . # exactly once - will be bound to the visited vertices.
     gas:program gas:linkType wdt:P279 .
     gas:program gas:traversalDirection "Reverse" .
  }
  ?article wdt:P31 ?type .
}

works for me, though it takes very close to a minute so may work or fail depending on circumstance. Also this one seems to be of the same speed:

SELECT (COUNT(*) AS ?count)
WHERE {
?article wdt:P31/wdt:P279* wd:Q13442814 .
  hint:Prior hint:gearing "reverse" .
}
Ijon added a comment. Jun 25 2018, 11:41 PM

Thanks for the insight on the difference between those two versions. That is indeed usable, for this particular count.

But as to the more general request: we do need some way to work around the timeout, sometimes, without abusing our personally knowing @Smalyshev.

@Ijon I appreciate the issue, but having an unauthenticated endpoint with an extended timeout rather defeats the purpose of the timeout, and adding authentication is not trivial, so it has to wait until I get to it. Unless somebody wants to contribute a patch :)

Any developers working on this task should notice T236500: large number of 504 errors from ulsfo.