
WDQS: Automatically kill queries > 60s
Closed, Declined · Public

Description

Per last week's conversation with @dcausse, the timeout for WDQS is 60s. That means any query that doesn't finish within 60s will never return results to the client. As such, there's no point in allowing it to continue.

Creating this ticket to:

  • Validate that this is actually safe (what about workloads that add triples? If we kill these queries, will we lose that data? And if so, do we care?)
  • Create automation that kills queries that run for > 60s.
  • Verify operation

Event Timeline

I've been looking at data on queries over 60s on our end to understand user behaviour. I noticed a lot of queries go over the strict 60000 ms limit, but the exact query time hovers around 60000 ms plus 5–10 ms at most, and only some (ca. 1000 queries) go way above.

Firstly, how come there are queries that hover just slightly over the 60s mark? Is it because the timeout isn't actually exactly 60s?

How can I find queries that truly were successful, where the user did not see the timeout? I'm trying to find out how many external users (if any) might have "special" privileges. Is it just being assigned a 200 instead of a 500 HTTP status? Is there any criteria for being allowed to run queries that go so far over the limit?

Thank you for indulging my questions! Much appreciated

A few random points:

  • We don't have a way to reliably kill queries in Blazegraph. There is an endpoint to kill individual queries, but in quite a few cases, it doesn't actually work (or at least not in the way you would expect)
  • Timeouts are weird and complex. We might be implementing read timeouts and not request length timeouts. If we still see some data being exchanged, that resets the timeout to zero. In some cases, Blazegraph is able to stream results before the end of the computation, and that might lead to overall query length being longer than the read timeout.
  • We also have timeouts implemented at a number of different layers (Blazegraph, nginx, envoy, haproxy, ats, ...) all with potentially slightly different semantics and configurations.
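The read-timeout point above can be illustrated with a toy sketch (hypothetical Python, not WDQS code): a per-read timeout is reset by every chunk received, so a streamed response can legally take longer overall than the configured "timeout".

```python
# Sketch: per-read timeouts vs total-request timeouts.
# Each received chunk resets the read timeout, so a slowly streamed
# response can exceed the timeout value overall without ever timing out.
import socket
import threading
import time

def slow_server(conn, chunks, delay):
    """Send chunks with a pause before each one, then close."""
    for c in chunks:
        time.sleep(delay)
        conn.sendall(c)
    conn.close()

def read_all(conn, per_read_timeout):
    # settimeout() applies to each blocking recv(), not the whole exchange
    conn.settimeout(per_read_timeout)
    data = b""
    while True:
        chunk = conn.recv(1024)  # only times out if a single gap exceeds the limit
        if not chunk:
            return data
        data += chunk

a, b = socket.socketpair()
t = threading.Thread(target=slow_server, args=(b, [b"x"] * 5, 0.05))
t.start()
start = time.monotonic()
# Each 0.05s gap is under the 0.2s read timeout, so no timeout fires,
# yet the total exchange takes ~0.25s -- longer than the "timeout".
out = read_all(a, per_read_timeout=0.2)
elapsed = time.monotonic() - start
t.join()
a.close()
```

This mirrors what Blazegraph can do when it streams partial results: as long as *some* data keeps flowing, a read timeout never fires, and overall query length exceeds it.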

Ah wow okay, that helps give me more context. Would you have recommendations then, what might be a good enough heuristic to look for queries that time out?

I think that the only reliable way to detect timeouts is to parse the response body. And even then, it's probably not trivial.

Your heuristic of anything > 60 second overall query time is good! Anything > 59 seconds is probably better. At least this would track with the intent of our timeouts, if not with the actual implementation.

I would expect a data analyst to tell us how good this heuristic is, maybe by looking at a sample of individual queries to see if we can categorize them reliably as timed out or not timed out. And I would also expect an analyst to tell us how we can improve this heuristic! I'm not trying to avoid responsibility here, I'm just trying to be explicit that this is a behaviour of our system that we don't entirely understand, and external insight is probably better than our current understanding.

Ah wow okay, that helps give me more context. Would you have recommendations then, what might be a good enough heuristic to look for queries that time out?

You should look at the raw query logs in event.wdqs_external_sparql_query, and filter records with a status code of 500 and query_time > 60 sec. That's still an approximation, since Blazegraph does not return a dedicated status code for timeouts.
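As a minimal sketch, that filter could look like this (field names here are assumptions for illustration, not the real event schema):

```python
def likely_timeouts(records, threshold_ms=60_000):
    """Approximate timeout detection: HTTP 500 combined with a query time
    over the limit. Still imprecise, since 500 covers other failures too."""
    return [
        r for r in records
        if r["http_status"] == 500 and r["query_time_ms"] > threshold_ms
    ]

# Hypothetical sample records
records = [
    {"http_status": 200, "query_time_ms": 1_200},    # fast, successful
    {"http_status": 500, "query_time_ms": 60_004},   # likely a timeout
    {"http_status": 500, "query_time_ms": 30},       # failed fast, not a timeout
]
```

Requiring both signals avoids flagging fast 500s (parse errors, server faults) as timeouts, at the cost of missing timeouts that were somehow returned with a 200.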

However, when a query times out, Blazegraph logs a java.util.concurrent.TimeoutException. Searching for instances of java.util.concurrent.TimeoutException in Logstash should give you more precision. Devs on your team should have access.
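The equivalent of that Logstash search, sketched over plain log lines (the log lines below are made up for illustration; in practice this would be a Kibana/Logstash query for the same string):

```python
# Count Blazegraph timeout exceptions in a set of log lines.
MARKER = "java.util.concurrent.TimeoutException"

def count_timeouts(lines):
    """Number of log lines containing the Blazegraph timeout exception."""
    return sum(1 for line in lines if MARKER in line)

# Hypothetical sample log lines
sample = [
    "2025-12-01 12:00:01 INFO  query 123 completed",
    "2025-12-01 12:01:02 ERROR query 456 java.util.concurrent.TimeoutException",
    "2025-12-01 12:02:03 ERROR query 789 java.util.concurrent.TimeoutException",
]
```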

This is strictly for queries that were timed out by the database itself.
As @Gehel pointed out, we might still have other sources of timeouts, e.g. nginx not receiving *any* response from the database and returning an error. This usually happens in Blazegraph failure scenarios.

You should look at the raw query logs in event.wdqs_external_sparql_query, and filter records with a status code of 500 and query_time > 60 sec. That's still an approximation, since Blazegraph does not return a dedicated status code for timeouts.

Yeah this is what I've done in the past!

But I'll give it a shot looking at the specific error in Logstash, and to @Gehel: no issue at all! I appreciate your input. I'll also go gather insights from data analysts and see what we can come up with.

Thanks both

BTracy-WMF triaged this task as Medium priority. Dec 3 2025, 8:54 PM
BTracy-WMF moved this task from Incoming to Analysis on the Wikidata-Query-Service board.

Let's not spend time improving Blazegraph operations; we're moving to a different backend. It is unlikely that we can have a robust implementation of killing queries on Blazegraph anyway.