This task is to increase the Superset timeout from the current 1 minute to 3 minutes, in order to mitigate dashboard timeouts until a different solution is found to improve query performance.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | odimitrijevic | T294259 Presto/Superset User Experience Improvement
Resolved | | BTullis | T294771 Increase Superset Timeout
Event Timeline
Interestingly, this query times out after just 10 seconds: https://superset.wikimedia.org/r/824
I wonder whether the 55s limit that is currently set is counted across all the nodes, so that 55s of CPU time is consumed in about 10 real seconds.
EDIT: this is a Druid query, so its timeout settings are independent
Change 737738 had a related patch set uploaded (by Razzi; author: Razzi):
[operations/puppet@production] superset: Make sqllab timeout 3 minutes
Mentioned in SAL (#wikimedia-analytics) [2021-11-09T16:30:06Z] <razzi> set superset presto timeout to 170s: {"connect_args":{"session_props":{"query_max_run_time":"170s"}}} for T294771
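For context, the connect_args blob above is what ends up being passed to SQLAlchemy's create_engine() for the Presto connection. A minimal sketch of the equivalent outside Superset, assuming the PyHive Presto dialect and a purely hypothetical coordinator URI:

    # The URI below is a placeholder; only the connect_args structure mirrors the
    # session property from the SAL entry above.
    from sqlalchemy import create_engine

    engine = create_engine(
        "presto://an-coord1001.eqiad.wmnet:8281/analytics_hive",  # hypothetical host/catalog
        connect_args={"session_props": {"query_max_run_time": "170s"}},
    )
    with engine.connect() as conn:
        print(conn.execute("SELECT 1").scalar())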
Change 737738 merged by Razzi:
[operations/puppet@production] superset: Make sqllab timeout 3 minutes
The setting we changed appears to apply only to SQL Lab; the Superset dashboards continue to time out after 1 minute.
Change 740712 had a related patch set uploaded (by Razzi; author: Razzi):
[operations/puppet@production] superset: set webserver timeout to 180 seconds
Change 740712 merged by Razzi:
[operations/puppet@production] superset: set webserver timeout to 180 seconds
Change 742808 had a related patch set uploaded (by Razzi; author: Razzi):
[operations/puppet@production] superset: set webserver timeout to 180 seconds
Change 742808 merged by Razzi:
[operations/puppet@production] superset: set webserver timeout to 180 seconds
@odimitrijevic and I have been testing this; the change is not doing what we want (increasing the timeout to 3 minutes), and we came across a new error while giving the Superset server some traffic to test the timeout:
error: Traceback (most recent call last):
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/superset/views/base.py", line 182, in wraps
    return f(self, *args, **kwargs)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_appbuilder/security/decorators.py", line 149, in wraps
    permission_str, self.class_permission_name
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_appbuilder/security/manager.py", line 1478, in has_access
    if current_user.is_authenticated:
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/werkzeug/local.py", line 347, in __getattr__
    return getattr(self._get_current_object(), name)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/werkzeug/local.py", line 306, in _get_current_object
    return self.__local()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_login/utils.py", line 26, in <lambda>
    current_user = LocalProxy(lambda: _get_user())
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_login/utils.py", line 335, in _get_user
    current_app.login_manager._load_user()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_login/login_manager.py", line 359, in _load_user
    return self.reload_user()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_login/login_manager.py", line 321, in reload_user
    user = self.user_callback(user_id)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_appbuilder/security/manager.py", line 2007, in load_user
    return self.get_user_by_id(int(pk))
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_appbuilder/security/sqla/manager.py", line 236, in get_user_by_id
    return self.get_session.query(self.user_model).get(pk)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 1018, in get
    return self._get_impl(ident, loading.load_on_pk_identity)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 1135, in _get_impl
    return db_load_fn(self, primary_key_identity)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/loading.py", line 286, in load_on_pk_identity
    return q.one()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3490, in one
    ret = self.one_or_none()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3459, in one_or_none
    ret = list(self)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3535, in __iter__
    return self._execute_and_instances(context)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3557, in _execute_and_instances
    querycontext, self._connection_from_session, close_with_result=True
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3572, in _get_bind_args
    mapper=self._bind_mapper(), clause=querycontext.statement, **kw
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3550, in _connection_from_session
    conn = self.session.connection(**kw)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1145, in connection
    execution_options=execution_options,
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1151, in _connection_for_bind
    engine, execution_options
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 433, in _connection_for_bind
    conn = bind._contextual_connect()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2302, in _contextual_connect
    self._wrap_pool_connect(self.pool.connect, None),
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
    return fn()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 364, in connect
    return _ConnectionFairy._checkout(self)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
    rec = pool._do_get()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/pool/impl.py", line 132, in _do_get
    code="3o7r",
sqlalchemy.exc.TimeoutError: QueuePool limit of size 10 overflow 10 reached, connection timed out, timeout 30 (Background on this error at: http://sqlalche.me/e/13/3o7r)
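As an aside, that last error is about the Superset metadata-database connection pool rather than Presto: the pool (size 10, overflow 10, 30 s wait) was simply exhausted while we piled on test traffic. If it keeps recurring, one option would be to raise the pool limits. A hedged sketch, assuming the Flask-SQLAlchemy version behind Superset honours SQLALCHEMY_ENGINE_OPTIONS, with purely illustrative values:

    # Illustrative values only, for /etc/superset/superset_config.py; not something
    # that has been applied as part of this task.
    SQLALCHEMY_ENGINE_OPTIONS = {
        "pool_size": 20,     # default 10, per the QueuePool error above
        "max_overflow": 20,  # default 10, per the QueuePool error above
        "pool_timeout": 60,  # default 30 seconds, per the QueuePool error above
    }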
Currently what we're seeing is an upstream request timeout on dashboards that time out, such as the IP Masking Dashboard (open it in multiple tabs if it doesn't time out on its own).
@oljadimitrijevic had the idea to change the Superset timeout to something like 15 seconds to see whether that timeout applies. Once we're sure Superset is timing out, we can find where Superset logs its timeouts (adding logging for that if necessary) so that we can pinpoint which part of the stack is timing out soonest.
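A minimal sketch of that test, assuming the timeout settings live in /etc/superset/superset_config.py as shown later in this task (the 15-second figure is only the suggestion above, not something we have applied):

    # Temporary test values only, to make the Superset-level timeout unambiguous;
    # revert once we have confirmed which layer reports the failure first.
    SQLLAB_TIMEOUT = 15
    SUPERSET_WEBSERVER_TIMEOUT = 15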
A useful trick for testing timeouts is to use a SQL datastore (like mysql_staging) and run select sleep(1000). This can be done easily in SQL Lab, as well as in a chart using the following steps:
- open charts: https://superset.wikimedia.org/chart/list/
- click new chart
- select a mysql datasource such as staging.referer_data
- find and select chart type "Big Number"
- click create chart
- click "+ Add metric"
- click "CUSTOM SQL"
- enter sleep(1000) or similar; click "save"
- open devtools if you're using the network tab to capture timing
- click "RUN QUERY"
- sit back and relax and wait for the timeout.
You can also use my test timeout chart here: https://superset.wikimedia.org/r/922
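For reference, here is a rough sketch of measuring, from the client side, how long such a request is held open before something in the stack cuts it off, assuming python3 with the requests library; an authenticated Superset session would be needed in practice, so treat this purely as an illustration of the measurement:

    # Hypothetical measurement script; the URL is the test timeout chart linked above.
    import time
    import requests

    url = "https://superset.wikimedia.org/r/922"
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=600)  # client limit well above any server-side timeout
        print(f"completed after {time.monotonic() - start:.1f}s with HTTP {response.status_code}")
    except requests.exceptions.RequestException as exc:
        print(f"gave up after {time.monotonic() - start:.1f}s: {exc!r}")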
I'm assigning this ticket to myself to see if I can make any more sense of it. Hope that's OK @razzi.
SGTM @BTullis hope you can figure it out
Here's my understanding of what could be timing out:
client
- javascript makes an ajax request (I tested with curl --connect-timeout 1000 and it still timed out after 60 seconds, though)
infra
- Apache Traffic Server (ATS) (not sure if superset is behind this)
- Envoy (does TLS termination; it appears to have no timeout by default)
an-tool1010
- Apache (/etc/apache2/apache2.conf has Timeout 300)
- Gunicorn (/etc/superset/gunicorn_config.py has timeout = 185)
- Superset
/etc/superset/superset_config.py has:
    wikimedia_superset_timeout = datetime.timedelta(minutes=3).seconds
    SQLLAB_TIMEOUT = wikimedia_superset_timeout
    SUPERSET_WEBSERVER_TIMEOUT = wikimedia_superset_timeout
Potentially some of these settings are incorrect. It would be good to figure out exactly what is timing out, through logging and/or selectively changing config.
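One thing we can rule out is the arithmetic in superset_config.py itself; the timedelta expression really does evaluate to 180 seconds (a quick check, using nothing beyond the standard library):

    # Sanity check of the value used in /etc/superset/superset_config.py
    import datetime

    wikimedia_superset_timeout = datetime.timedelta(minutes=3).seconds
    print(wikimedia_superset_timeout)  # prints 180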
Thanks @razzi. I think, but I'm not 100% sure, that it is ATS that is dropping the connection.
I used the Superset staging environment through an SSH tunnel, thereby bypassing ATS and the TLS offloading but still passing through Apache and Gunicorn, and I was able to get a connection to time out at 180 seconds.
However, when it passes through ATS as well I never get above 120 seconds, which is what is set for the client keepalive timeout for ATS:
https://wikitech.wikimedia.org/wiki/HTTP_timeouts#Frontend_TLS
I will continue to investigate.
Interesting @BTullis! I never got more than 65 seconds from the perspective of the client (dev tools), so I think ATS might be timing out at 120 seconds as you suggest, but it appears something else is timing out at 60. IMO a timeout of 2 minutes would be fine, if we can get it to actually do that; considering the timeout is currently 1 minute, going up to 2 minutes would double it.
@razzi, you're spot on! Many thanks.
At first glance I thought that we didn't run envoy on an-tool1010, since trafficserver was already doing TLS offloading at the edge, but I was wrong.
On every caching proxy server (taking cp1075 as an example) we have this record in the trafficserver remap.config file:
    btullis@cp1075:/etc/trafficserver$ sudo grep superset remap.config
    map http://superset.wikimedia.org https://an-tool1010.eqiad.wmnet
It uses a plaintext source_url (http://superset.wikimedia.org) but maps it to an HTTPS origin_url (https://an-tool1010.eqiad.wmnet).
Sure enough, when we look at an-tool1010 we can see that envoy is running on port 443.
    btullis@an-tool1010:/etc/envoy$ sudo netstat -ltnp|grep 443
    tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      12555/envoy
Here is that 65 second timeout mentioned above, in /etc/envoy/envoy.yaml too:
    route_config:
      virtual_hosts:
      - domains: ['*']
        name: non_sni_port_80
        routes:
        - match: {prefix: /}
          route: {cluster: local_port_80, timeout: 65.0s}
So I'll look at how to change this value first, then see if we get hit by trafficserver's 120 second timeout after that.
It looks as though we might be able to change envoy's configuration on the fly, to test this before applying a puppet change.
https://wikitech.wikimedia.org/wiki/Envoy#Runtime_configuration
I'll look to see whether or not there might be any benefit to this.
We use Envoy in front of all the UIs that we run, since ATS backends can call our endpoints from any caching PoP, potentially traversing non-secure paths between DCs (and hence requiring TLS).
The profile that we use is profile::tlsproxy::envoy; profile::tlsproxy::envoy::upstream_response_timeout should be the value to tune (to change the default 65 second value).
Change 743947 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Increase envoy upstream timeout for superset
Change 743947 merged by Btullis:
[operations/puppet@production] Increase envoy upstream timeout for superset
Thanks elukey. It works! And we didn't get hit by the 2 minute keepalive timeout in ATS either. My query ran for the full 180 seconds, until it hit the timeout configured within Superset itself.