
Superset Timeout Logging
Closed, Declined · Public

Description

We currently don't have a way to determine when Superset timeouts happen. This could be part of the Superset query logging task. Even if no additional work is needed, as part of this task let's triage the current timeouts and document how dashboard creators can go about troubleshooting them.

Event Timeline

odimitrijevic triaged this task as High priority.
odimitrijevic moved this task from Incoming (new tickets) to Visualize on the Data-Engineering board.
odimitrijevic added a subscriber: razzi.

I'm not 100% sure that this ticket is necessary any more. For context, it was created at a time when we had only 5 presto workers and our users were experiencing frequent failures to render graphs using the Presto/Hive backend in Superset.
We made many adjustments to try to alleviate the issue, mostly by increasing the timeout period: e.g T299141, T294768, and T294771.

We also created some documentation around the problem: T294046: Write document about "Fast Enough Superset" and the resulting document.

The most notable improvement came about as a result of tripling the size of the presto cluster: T323783: Add an-presto10[06-15] to the presto cluster
However, adding more presto workers initially made the performance problems worse; it took several more tickets to find the cause of the performance issue: T325809: Presto is unstable with more than 5 worker nodes.

We eventually found that the problem was caused by the large flood of kerberos requests associated with the intra-cluster presto authentication: T329831: KDC performance tuning for TCP requests

Since we increased the number of Kerberos KDC worker daemons, Presto cluster performance has stabilised and we have not had any recent reports of Superset queries timing out.
We have a ticket about creating a general purpose presto query logger: T269832: Add a presto query logger which is likely to provide a more comprehensive mechanism for gathering data about particularly slow or problematic presto queries.

If we do wish to go ahead with this ticket as it stands, then we might want to look into the custom statsd logging options, as explained here: https://superset.apache.org/docs/installation/event-logging/#statsd-logging
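For reference, wiring that up would be a small configuration change, roughly like the sketch below. This is based on the linked documentation for Superset's statsd support; the host, port, and prefix values are placeholders for our environment, not confirmed settings:

```python
# superset_config.py (sketch -- values below are placeholders)
from superset.stats_logger import StatsdStatsLogger

# Superset picks up STATS_LOGGER from its config module and emits
# counters/timers (including query timing) to the configured statsd daemon.
STATS_LOGGER = StatsdStatsLogger(
    host="localhost",   # statsd daemon host (placeholder)
    port=8125,          # default statsd UDP port
    prefix="superset",  # prefix applied to all emitted metric names
)
```

We would still need to decide where those metrics land (e.g. Prometheus via a statsd exporter, or Graphite) before this gives us usable timeout visibility.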

However, I'm not yet convinced of the value of it compared with some of our other work, given that our users are not complaining about Superset performance at the moment. @Gehel what do you think? Should we decline the ticket?

I'll be bold and decline this ticket, but please feel free to reopen it if anyone feels strongly that we should be instrumenting Superset like this.
I think that the better way for us to gain visibility into query timeouts is probably as part of T269832, although that only covers Presto, not Druid.