
Investigate Superset Druid Timeouts
Closed, Resolved · Public

Description

@SCherukuwada reported a timeout when querying against druid. Example query: https://superset.wikimedia.org/r/942 The current query timeout is 10s. This task is to investigate extending the Druid timeout duration and providing general guidance on writing queries against Druid.

Event Timeline

BTullis added subscribers: elukey, Ottomata, Milimetric and 2 others.

I'm looking into this issue now, to see if there is anything I can do to help.

It seems that the Druid query timeout on both the 'analytics' and 'public' clusters has been set to different values over time.
As a reminder, the Druid 'public' cluster specifically supports public-facing applications, primarily AQS.

Here is a brief summary of what I have gleaned about how these timeout values have changed over time, which might inform decision making on how to proceed.

  • To begin with, we only had one Druid cluster, which was created in 2016.
  • At the time, the timeout for queries was set to 5 minutes.
  • Then, as part of the design process for Wikistats 2.0, a decision was made in 2017 to create a second Druid cluster specifically for hosting public data.
  • When the servers for this public Druid cluster were deployed in October 2017, the same 5-minute query timeout was applied to them: (link).
  • Both clusters continued with this 5-minute timeout until April 2019, when an incident (T219910: AQS alerts due to big queries issued to Druid for the edit API) caused a number of alerts to fire, due to a high volume of expensive queries being made to AQS on the public cluster.
  • As a result, stricter timeouts were rapidly deployed (T219910#5103549), reducing the maximum timeout on the public Druid cluster from 5 minutes to 30 seconds.
  • These changes were deployed to the public cluster in April 2019 and subsequently applied one week later to the analytics cluster.
  • Then, in June 2019, another incident occurred on the public cluster (incident report and ticket: T226035: Dropping data from druid takes down aqs hosts), during which we experienced an 11-minute outage of Wikistats 2.0. (This would have been longer, but for @elukey being available to help on a Saturday.)
  • As a result of this incident, even stricter timeouts were applied, going from 30 seconds to 5 seconds. The change was applied to the public cluster and to the analytics cluster on the same day, June 26th 2019.
  • Four months later, in October 2019, a ticket was raised by a Superset user stating that certain dashboards failed to load: T234684: Superset not able to load a reading dashboard
  • The timeout was increased on the analytics cluster from 5 seconds to its current value of 10 seconds on the 8th of October 2019. The public Druid cluster retained its 5-second query timeout.

Given this history and the fact that AQS traffic has been successfully isolated into its own cluster, I think it would be reasonable to increase the timeout for the analytics cluster once more, whilst leaving the public cluster at 5 seconds.

I would suggest that perhaps we revert the analytics cluster back to its 30 second timeout value and be on the lookout for any unexpected consequences of this.
It might even be reasonable to increase it to 3 minutes, to match the timeout that Presto is currently configured with in Superset, but I'd be surprised if we needed to do so, given that 10 seconds has caused few complaints since 2019.

More than happy to take advice and guidance from other team members, who may know more about the behaviour of Druid than I was able to ascertain from the records above.

BTullis triaged this task as Medium priority.Dec 7 2021, 11:07 AM
BTullis moved this task from Next Up to In Progress on the Data-Engineering-Kanban board.

Thanks a lot @BTullis for the thorough report!
I support moving the query timeout for the internal Druid cluster to 2'55" (so that the query fails just before the Superset timeout :)
If this leads to resource contention we'll reevaluate, but it seems low probability.
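For illustration, here is a minimal sketch of what a 2'55" budget looks like from the client side: Druid accepts a per-query timeout (in milliseconds) in the query context, and 2'55" is 175000 ms. The broker URL and datasource below are placeholders, and this per-query context knob is separate from the cluster-wide default that the puppet change would adjust.

```
# Illustrative sketch only: issue a Druid native query with a per-query timeout
# of 175000 ms (2'55"), so the query fails inside Druid before Superset's
# 3-minute limit. Hostname, port and datasource are placeholders.
import requests

DRUID_BROKER = "http://druid-broker.example:8082"  # placeholder broker endpoint

query = {
    "queryType": "timeseries",
    "dataSource": "example_datasource",            # placeholder datasource
    "granularity": "hour",
    "intervals": ["2021-12-01/2021-12-08"],
    "aggregations": [{"type": "count", "name": "rows"}],
    # Query context: cap this query at 175 s.
    "context": {"timeout": 175000},
}

resp = requests.post(f"{DRUID_BROKER}/druid/v2/", json=query, timeout=180)
resp.raise_for_status()
print(resp.json())
```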

Change 744811 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Increase the timeout for Druid on the analytics cluster

https://gerrit.wikimedia.org/r/744811

Very nice summary Ben :)

I agree that the timeout can be increased in the Analytics cluster, but I'd suggest doing it in steps to see how the cluster behaves. IIUC we want to increase the timeout between Brokers and Historicals, and this has the consequence of consuming connection slots on the Historical and Broker side. For example:

  • a request comes in and hits a broker; one of the broker's HTTP threads for handling queries becomes busy (IIRC Jetty doesn't use any fancy ioloop or similar, so one request == one thread dedicated to it)
  • the broker uses one of the threads in its connection pool to the historicals to fetch data from (one or more) historical(s).
  • the target historical(s) assign one of their spare HTTP threads to this new request, until it is completed. Max working time for a historical is 10s (if I recall correctly from the yaml config), so there is a fail-safe.

The historicals have some extra buffer in their HTTP threads, so they can handle more connections than the sum of the brokers' threads.

If the requests are small and short, there is generally no problem. If the requests are longer (and higher timeouts encourage them), it may happen that a few users with big, long queries saturate the connection pools at various layers, causing impact to other people. Due to the 10s timeout at the Historical level, it is quite likely that the Brokers would be maxed out by requests while the Historicals are free (because their connections have already timed out).

It may seem that we have a ton of free space (25 HTTP slots * 5 brokers), but if you think about multiple charts in the same dashboard, people hitting refresh, etc., it is not that big :)
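To make that concrete, here is a rough back-of-envelope sketch using the figures from the comment above (25 HTTP threads per broker, 5 brokers); the dashboard and user counts are invented purely for illustration.

```
# Rough back-of-envelope of broker HTTP slot consumption. The thread counts come
# from the comment above; the dashboard/user figures are made up for illustration.
broker_count = 5
http_threads_per_broker = 25                           # one in-flight query == one Jetty thread
total_slots = broker_count * http_threads_per_broker   # 125 concurrent queries cluster-wide

charts_per_dashboard = 15      # Superset fires roughly one Druid query per chart
concurrent_refreshes = 10      # users loading/refreshing dashboards at the same time
slots_demanded = charts_per_dashboard * concurrent_refreshes   # 150

print(f"slots available: {total_slots}, slots demanded: {slots_demanded}")

# With a 10 s timeout, each slot is freed quickly and the shortfall is transient.
# With a 175 s timeout, a handful of big queries can hold slots long enough to
# queue or reject everyone else's queries, even while the historicals sit idle.
```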
I haven't touched Druid in a while, so what I wrote may not be up to date or correct (apologies if so), but you surely got the message that I wanted to deliver: pay attention to queues building up as connections pile up.

To wrap up - I think that increasing the timeout is good, but I'd suggest testing 30s/60s first and then proceeding to 2'55".

Nice catch @elukey! I'd suggest using a different timeout for answers-from-historical (smaller) and for the overall query (larger). The broker does some work (aggregation) and can query multiple (possibly many) historical nodes, so having different values for them would make sense to me :)
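As a sketch of what that two-layer setup could look like, the snippet below expresses the two knobs as the Druid runtime property I believe controls the per-process default query timeout; the property name and the values are assumptions for illustration, not the actual WMF hiera keys or the values eventually chosen.

```
# Illustrative only: two separate timeout layers, expressed via the Druid runtime
# property I recall from the docs (to be verified against the docs / puppet).
druid_timeouts_ms = {
    # On the brokers: the overall query budget seen by Superset; individual
    # queries can still override it via the "timeout" query-context key.
    "broker": {"druid.server.http.defaultQueryTimeout": 175_000},     # 2'55"

    # On the historicals: how long a single historical may spend answering the
    # broker; keeping this smaller frees historical threads earlier, as noted above.
    "historical": {"druid.server.http.defaultQueryTimeout": 30_000},  # 30 s
}

for role, props in druid_timeouts_ms.items():
    for prop, ms in props.items():
        print(f"{role}: {prop} = {ms} ms")
```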

Thanks both, that's a really useful set of insights. Looking into the configuration, there are a lot of different ways to configure timeouts and lots of possible combinations!
I now agree that it would be sensible to increase the value incrementally and keep a close eye on the queues. I've changed the CR to a 30 second timeout and we'll see where that gets us.

Change 744811 merged by Btullis:

[operations/puppet@production] Increase the timeout for Druid on the analytics cluster

https://gerrit.wikimedia.org/r/744811

Mentioned in SAL (#wikimedia-analytics) [2021-12-09T11:08:05Z] <btullis> roll restarting druid historical daemons on analytics cluster T297148

odimitrijevic raised the priority of this task from Medium to High.Dec 9 2021, 3:18 PM
odimitrijevic moved this task from Incoming (new tickets) to Ops Week on the Data-Engineering board.