
Superset Timeout Logging
Closed, Declined · Public

Description

We currently don't have a way to determine when Superset timeouts happen. This could be part of the Superset query logging task. Even if no additional work is needed, as part of this task let's triage the current timeouts and document how dashboard creators can go about troubleshooting them.

Event Timeline

odimitrijevic triaged this task as High priority.
odimitrijevic moved this task from Incoming (new tickets) to Visualize on the Data-Engineering board.
odimitrijevic added a subscriber: razzi.

I'm not 100% sure that this ticket is necessary any more. For context, it was created at a time when we had only 5 presto workers and our users were experiencing frequent failures to render graphs using the Presto/Hive backend in Superset.
We made many adjustments to try to alleviate the issue, mostly by increasing the timeout period: e.g T299141, T294768, and T294771.

We also created some documentation around the problem: T294046: Write document about "Fast Enough Superset" and the resulting document.

The most notable improvement came about as a result of tripling the size of the presto cluster: T323783: Add an-presto10[06-15] to the presto cluster
However, adding more presto workers initially made the performance problems worse; it took several more tickets to find the cause of the performance issue: T325809: Presto is unstable with more than 5 worker nodes.

We eventually found that the problem was caused by the large flood of kerberos requests associated with the intra-cluster presto authentication: T329831: KDC performance tuning for TCP requests

Since we increased the number of Kerberos KDC worker daemons, Presto cluster performance has stabilised and we have not had any recent reports of Superset queries timing out.
We have a ticket about creating a general purpose presto query logger: T269832: Add a presto query logger which is likely to provide a more comprehensive mechanism for gathering data about particularly slow or problematic presto queries.

If we do wish to go ahead with this ticket as it stands, then we might want to look into the custom statsd logging options, as explained here: https://superset.apache.org/docs/installation/event-logging/#statsd-logging
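For reference, wiring that up would be a small configuration change, roughly like the sketch below. This is based on the linked documentation for Superset's statsd support; the host, port, and prefix values are placeholders for our environment, not confirmed settings:

```python
# superset_config.py (sketch -- values below are placeholders)
from superset.stats_logger import StatsdStatsLogger

# Superset picks up STATS_LOGGER from its config module and emits
# counters/timers (including query timing) to the configured statsd daemon.
STATS_LOGGER = StatsdStatsLogger(
    host="localhost",   # statsd daemon host (placeholder)
    port=8125,          # default statsd UDP port
    prefix="superset",  # prefix applied to all emitted metric names
)
```

We would still need to decide where those metrics land (e.g. Prometheus via a statsd exporter, or Graphite) before this gives us usable timeout visibility.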

However, I'm not yet convinced of the value of it compared with some of our other work, given that our users are not complaining about Superset performance at the moment. @Gehel what do you think? Should we decline the ticket?

I'll be bold and decline this ticket, but please feel free to reopen it if anyone feels strongly that we should be instrumenting Superset like this.
I think that the better way for us to gain visibility into query timeouts is probably as part of T269832, although that only covers Presto, not Druid.