
Increase Superset Timeout
Closed, ResolvedPublic

Authored By
odimitrijevic
Nov 1 2021, 2:59 PM

Description

This task is to increase the Superset timeout to 3m from the current 1m in order to mitigate the dashboard timeouts until a different solution is found to improve the query performance.

Event Timeline

odimitrijevic triaged this task as High priority.
odimitrijevic moved this task from Incoming (new tickets) to Visualize on the Data-Engineering board.

Also change Presto timeout in Superset (currently 55s)

Interestingly, this query times out after just 10 seconds: https://superset.wikimedia.org/r/824

image.png (144×802 px, 24 KB)

I wonder if the 55s that is currently set is consumed across all the worker nodes, i.e. whether 55s of aggregate CPU time gets used up in about 10 seconds of wall-clock time.

EDIT: this is a Druid query, so its timeout settings are independent of the Presto ones.
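(For reference, Druid enforces its own per-query timeout via the query context, in milliseconds. A sketch of what such a context looks like — the value here is hypothetical, picked only because it would match the observed ~10 second cutoff:

{
  "context": {
    "timeout": 10000
  }
}
)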

Change 737738 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: Make sqllab timeout 3 minutes

https://gerrit.wikimedia.org/r/737738

Mentioned in SAL (#wikimedia-analytics) [2021-11-09T16:30:06Z] <razzi> set superset presto timeout to 170s: {"connect_args":{"session_props":{"query_max_run_time":"170s"}}} for T294771
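For context, that connect_args JSON is the sort of thing Superset feeds to the SQLAlchemy engine for the Presto database. A hedged sketch of where it would sit in the database's "Extra" settings — the surrounding engine_params nesting is my assumption, not confirmed from the patch:

{
  "engine_params": {
    "connect_args": {
      "session_props": {
        "query_max_run_time": "170s"
      }
    }
  }
}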

Change 737738 merged by Razzi:

[operations/puppet@production] superset: Make sqllab timeout 3 minutes

https://gerrit.wikimedia.org/r/737738

Deployed the change to config, will test it shortly

image.png (193×1 px, 43 KB)

The query has been running for over a minute, so it looks to be working

The setting we changed appears to apply only to SQL Lab; the Superset dashboards continue to time out after 1 minute.

image.png (822×1 px, 191 KB)

Change 740712 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: set webserver timeout to 180 seconds

https://gerrit.wikimedia.org/r/740712

Change 740712 merged by Razzi:

[operations/puppet@production] superset: set webserver timeout to 180 seconds

https://gerrit.wikimedia.org/r/740712

Change 742808 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: set webserver timeout to 180 seconds

https://gerrit.wikimedia.org/r/742808

Change 742808 merged by Razzi:

[operations/puppet@production] superset: set webserver timeout to 180 seconds

https://gerrit.wikimedia.org/r/742808

@odimitrijevic and I have been testing this; the change is not doing what we want (increasing the timeout to 3 minutes), and we came across a new error when we gave the Superset server some traffic to test the timeout:

error: Traceback (most recent call last):
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/superset/views/base.py", line 182, in wraps
    return f(self, *args, **kwargs)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_appbuilder/security/decorators.py", line 149, in wraps
    permission_str, self.class_permission_name
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_appbuilder/security/manager.py", line 1478, in has_access
    if current_user.is_authenticated:
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/werkzeug/local.py", line 347, in __getattr__
    return getattr(self._get_current_object(), name)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/werkzeug/local.py", line 306, in _get_current_object
    return self.__local()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_login/utils.py", line 26, in <lambda>
    current_user = LocalProxy(lambda: _get_user())
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_login/utils.py", line 335, in _get_user
    current_app.login_manager._load_user()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_login/login_manager.py", line 359, in _load_user
    return self.reload_user()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_login/login_manager.py", line 321, in reload_user
    user = self.user_callback(user_id)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_appbuilder/security/manager.py", line 2007, in load_user
    return self.get_user_by_id(int(pk))
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_appbuilder/security/sqla/manager.py", line 236, in get_user_by_id
    return self.get_session.query(self.user_model).get(pk)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 1018, in get
    return self._get_impl(ident, loading.load_on_pk_identity)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 1135, in _get_impl
    return db_load_fn(self, primary_key_identity)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/loading.py", line 286, in load_on_pk_identity
    return q.one()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3490, in one
    ret = self.one_or_none()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3459, in one_or_none
    ret = list(self)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3535, in __iter__
    return self._execute_and_instances(context)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3557, in _execute_and_instances
    querycontext, self._connection_from_session, close_with_result=True
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3572, in _get_bind_args
    mapper=self._bind_mapper(), clause=querycontext.statement, **kw
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3550, in _connection_from_session
    conn = self.session.connection(**kw)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1145, in connection
    execution_options=execution_options,
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1151, in _connection_for_bind
    engine, execution_options
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 433, in _connection_for_bind
    conn = bind._contextual_connect()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2302, in _contextual_connect
    self._wrap_pool_connect(self.pool.connect, None),
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
    return fn()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 364, in connect
    return _ConnectionFairy._checkout(self)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
    rec = pool._do_get()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/pool/impl.py", line 132, in _do_get
    code="3o7r",
sqlalchemy.exc.TimeoutError: QueuePool limit of size 10 overflow 10 reached, connection timed out, timeout 30 (Background on this error at: http://sqlalche.me/e/13/3o7r)
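Note that the pool being exhausted here is Superset's own metadata-database connection pool (size 10, overflow 10, 30s wait, per the message), not the analytics datastore. If we needed headroom for load tests, something along these lines in superset_config.py should raise the limits — a sketch only, assuming Superset forwards Flask-SQLAlchemy engine options to create_engine(); the option names are standard SQLAlchemy pool parameters, the values are illustrative:

# Illustrative values only; assumes these are passed through to create_engine().
SQLALCHEMY_ENGINE_OPTIONS = {
    "pool_size": 20,      # the error above shows a current pool size of 10
    "max_overflow": 20,   # extra connections allowed beyond pool_size (currently 10)
    "pool_timeout": 60,   # seconds to wait for a free connection (currently 30)
}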

Currently what we're seeing is an upstream request timeout on dashboards that time out, such as the IP Masking dashboard (open it in multiple tabs if it doesn't time out on its own):

image.png (728×1 px, 154 KB)

@oljadimitrijevic had the idea to lower the Superset timeout to something like 15 seconds to check that the setting applies at all; once we're sure Superset itself is timing out, we can find where Superset logs its timeouts (adding logging for that if necessary) and pinpoint which part of the stack times out soonest.

A useful trick for testing timeouts is to use a SQL datastore (like mysql_staging) and run SELECT SLEEP(1000). This is easy to do in SQL Lab, and can also be done in a chart using the following steps:

  • open charts: https://superset.wikimedia.org/chart/list/
  • click new chart
  • select a mysql datasource such as staging.referer_data
  • find and select chart type "Big Number"
  • click create chart
  • click "+ Add metric"
  • click "CUSTOM SQL"
  • enter sleep(1000) (or similar) and click "Save"
  • open devtools and its Network tab if you want to capture timing
  • click "RUN QUERY"

Then sit back, relax, and wait for the timeout.
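If you'd rather time this outside the browser, a small script along these lines works too. A sketch only: it sends an unauthenticated request, so without a session cookie you may just get a login redirect, but it still shows at what point some hop drops the connection (the URL is the test chart linked just below):

import time
import requests  # third-party; pip install requests

url = "https://superset.wikimedia.org/r/922"  # test-timeout chart
start = time.monotonic()
try:
    # Client-side timeout deliberately huge, so any cutoff we observe is server-side.
    resp = requests.get(url, timeout=1000)
    print(f"HTTP {resp.status_code} after {time.monotonic() - start:.1f}s")
except requests.exceptions.RequestException as exc:
    print(f"Dropped after {time.monotonic() - start:.1f}s: {exc}")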

You can also use my test timeout chart here: https://superset.wikimedia.org/r/922

image.png (562×1 px, 75 KB)

BTullis moved this task from Paused to In Progress on the Data-Engineering-Kanban board.
BTullis added a subscriber: razzi.

I'm assigning this ticket to myself to see if I can make any more sense of it. Hope that's OK @razzi.

SGTM @BTullis hope you can figure it out

Here's my understanding of what could be timing out:

client
  • javascript makes an ajax request (I tested with curl --connect-timeout 1000 — which only caps the TCP connect phase, and curl applies no total-time limit by default — and it still timed out after 60s, so the cutoff isn't client-side)
infra
  • Apache Traffic Server (ATS) (not sure if superset is behind this)
  • Envoy (does TLS termination; it appears to have no timeout by default)
an-tool1010
  • Apache (/etc/apache2/apache2.conf has Timeout 300)
  • Gunicorn (/etc/superset/gunicorn_config.py has timeout = 185)
  • Superset

/etc/superset/superset_config.py has:

wikimedia_superset_timeout = datetime.timedelta(minutes=3).seconds  # 180

SQLLAB_TIMEOUT = wikimedia_superset_timeout
SUPERSET_WEBSERVER_TIMEOUT = wikimedia_superset_timeout

Potentially some of these settings are incorrect. It would be good to figure out what exactly is timing out through logging / selectively changing config.

Thanks @razzi. I think, but I'm not 100% sure, that it is ATS that is dropping the connection.
Using the Superset staging environment through an SSH tunnel, thereby bypassing ATS and the TLS offloading but still passing through Apache and Gunicorn, I was able to get a connection to time out at 180 seconds.
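(For anyone repeating the tunnel test, it looks roughly like this — the hostname and ports are illustrative, it assumes Apache listens on port 80 on the backend, and you need production SSH access via your usual bastion config:

# Forward a local port straight to the origin, bypassing ATS and the TLS proxy.
ssh -N -L 8080:localhost:80 an-tool1010.eqiad.wmnet &
# Time a slow request through the tunnel.
time curl -s --max-time 1000 -o /dev/null http://localhost:8080/
)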

image.png (276×1 px, 68 KB)

However, when it passes through ATS as well, I never get above 120 seconds, which is the client keepalive timeout set for ATS:
https://wikitech.wikimedia.org/wiki/HTTP_timeouts#Frontend_TLS

I will continue to investigate.

Interesting @BTullis! I never got more than 65 seconds from the client's perspective (dev tools), so ATS might be timing out at 120 seconds as you suggest, but it appears something else is timing out at 60. IMO a timeout of 2 minutes would be fine, if we can get it to actually do that; the timeout is currently 1 minute, so going up to 2 minutes already doubles it.

On that same article you linked, I see a clue:

image.png (314×1 px, 53 KB)

@razzi, you're spot on! Many thanks.
At first glance I thought that we didn't run envoy on an-tool1010, since trafficserver was already doing TLS offloading at the edge, but I was wrong.

On every caching proxy server (taking cp1075 as an example) we have this record in the trafficserver remap.config file:

btullis@cp1075:/etc/trafficserver$ sudo grep superset remap.config 
map http://superset.wikimedia.org https://an-tool1010.eqiad.wmnet

It uses a plaintext source_url (http://superset.wikimedia.org) but maps it to an HTTPS origin_url (https://an-tool1010.eqiad.wmnet).

Sure enough, when we look at an-tool1010 we can see that envoy is running on port 443.

btullis@an-tool1010:/etc/envoy$ sudo netstat -ltnp|grep 443
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      12555/envoy

And here is that 65 second timeout mentioned above, in /etc/envoy/envoy.yaml:

route_config:
  virtual_hosts:
  - domains: ['*']
    name: non_sni_port_80
    routes:
    - match: {prefix: /}
      route: {cluster: local_port_80, timeout: 65.0s}

So I'll look at how to change this value first, then see if we get hit by trafficserver's 120 second timeout after that.

It looks as though we might be able to change envoy's configuration on the fly, to test this before applying a puppet change.
https://wikitech.wikimedia.org/wiki/Envoy#Runtime_configuration
I'll look to see whether or not there might be any benefit to this.
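One related trick worth knowing: Envoy's router supports a per-request timeout override header, x-envoy-upstream-rq-timeout-ms, which it honors for requests it treats as internal. Whether that works here depends on how our listener is configured, but if it does, it would let us test a longer timeout with no config change at all — a sketch:

curl -sv -o /dev/null -H 'x-envoy-upstream-rq-timeout-ms: 200000' https://an-tool1010.eqiad.wmnet/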

We use envoy in front of all UIs that we run, since ATS backends can call our endpoints from any caching PoP, potentially traversing non-secure paths between DCs (and hence requiring TLS).

The profile that we use is profile::tlsproxy::envoy; profile::tlsproxy::envoy::upstream_response_timeout should be the value to tune (to change the default 65 second value).
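In hiera that should be a one-line override. A sketch — I haven't checked the exact hiera path or value the merged patch used; 185 is illustrative, chosen to sit just above Superset's own 180s limit and match the Gunicorn timeout:

# hieradata/hosts/an-tool1010.yaml (path and value illustrative)
profile::tlsproxy::envoy::upstream_response_timeout: 185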

Change 743947 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Increase envoy upstream timeout for superset

https://gerrit.wikimedia.org/r/743947

Change 743947 merged by Btullis:

[operations/puppet@production] Increase envoy upstream timeout for superset

https://gerrit.wikimedia.org/r/743947

Thanks elukey. It works! And we didn't get hit by the 2 minute keepalive timeout in ATS either. My query ran for the full 180 seconds, up to the timeout configured within Superset itself.

image.png (501×1 px, 71 KB)

razzi removed a project: User-razzi.

Go team! This ticket wins the award for "most hours spent per line of code changed". Thanks @BTullis @elukey as always