
Increase Superset Timeout
Closed, ResolvedPublic

Authored By
odimitrijevic
Nov 1 2021, 2:59 PM

Description

This task is to increase the Superset timeout to 3m from the current 1m in order to mitigate the dashboard timeouts until a different solution is found to improve the query performance.

Event Timeline

odimitrijevic triaged this task as High priority.
odimitrijevic moved this task from Incoming (new tickets) to Visualize on the Data-Engineering board.

Also change Presto timeout in Superset (currently 55s)

Interestingly, this query times out after just 10 seconds: https://superset.wikimedia.org/r/824

image.png (144×802 px, 24 KB)

I wonder if the 55s that is currently set is consumed across all the worker nodes, i.e. whether 55s of aggregate CPU time gets used up in about 10 seconds of wall-clock time.

EDIT: this is a Druid query, so its timeout settings are independent of the Presto ones.
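(For reference, Druid enforces its own per-query timeout via the query context, in milliseconds. A sketch of what such a context looks like — the value here is hypothetical, picked only because it would match the observed ~10 second cutoff:

{
  "context": {
    "timeout": 10000
  }
}
)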

Change 737738 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: Make sqllab timeout 3 minutes

https://gerrit.wikimedia.org/r/737738

Mentioned in SAL (#wikimedia-analytics) [2021-11-09T16:30:06Z] <razzi> set superset presto timeout to 170s: {"connect_args":{"session_props":{"query_max_run_time":"170s"}}} for T294771
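For context, that connect_args JSON is the sort of thing Superset feeds to the SQLAlchemy engine for the Presto database. A hedged sketch of where it would sit in the database's "Extra" settings — the surrounding engine_params nesting is my assumption, not confirmed from the patch:

{
  "engine_params": {
    "connect_args": {
      "session_props": {
        "query_max_run_time": "170s"
      }
    }
  }
}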

Change 737738 merged by Razzi:

[operations/puppet@production] superset: Make sqllab timeout 3 minutes

https://gerrit.wikimedia.org/r/737738

Deployed the change to config, will test it shortly

image.png (193×1 px, 43 KB)

The query has been running for over a minute, so it looks to be working

The setting we changed appears to apply only to SQL Lab; the Superset dashboards continue to time out after 1 minute.

image.png (822×1 px, 191 KB)

Change 740712 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: set webserver timeout to 180 seconds

https://gerrit.wikimedia.org/r/740712

Change 740712 merged by Razzi:

[operations/puppet@production] superset: set webserver timeout to 180 seconds

https://gerrit.wikimedia.org/r/740712

Change 742808 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: set webserver timeout to 180 seconds

https://gerrit.wikimedia.org/r/742808

Change 742808 merged by Razzi:

[operations/puppet@production] superset: set webserver timeout to 180 seconds

https://gerrit.wikimedia.org/r/742808

@odimitrijevic and I have been testing this; the change is not doing what we want (increasing the timeout to 3 minutes), and we came across a new error when we gave the Superset server some traffic to test the timeout:

error: Traceback (most recent call last):
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/superset/views/base.py", line 182, in wraps
    return f(self, *args, **kwargs)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_appbuilder/security/decorators.py", line 149, in wraps
    permission_str, self.class_permission_name
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_appbuilder/security/manager.py", line 1478, in has_access
    if current_user.is_authenticated:
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/werkzeug/local.py", line 347, in __getattr__
    return getattr(self._get_current_object(), name)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/werkzeug/local.py", line 306, in _get_current_object
    return self.__local()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_login/utils.py", line 26, in <lambda>
    current_user = LocalProxy(lambda: _get_user())
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_login/utils.py", line 335, in _get_user
    current_app.login_manager._load_user()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_login/login_manager.py", line 359, in _load_user
    return self.reload_user()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_login/login_manager.py", line 321, in reload_user
    user = self.user_callback(user_id)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_appbuilder/security/manager.py", line 2007, in load_user
    return self.get_user_by_id(int(pk))
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/flask_appbuilder/security/sqla/manager.py", line 236, in get_user_by_id
    return self.get_session.query(self.user_model).get(pk)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 1018, in get
    return self._get_impl(ident, loading.load_on_pk_identity)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 1135, in _get_impl
    return db_load_fn(self, primary_key_identity)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/loading.py", line 286, in load_on_pk_identity
    return q.one()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3490, in one
    ret = self.one_or_none()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3459, in one_or_none
    ret = list(self)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3535, in __iter__
    return self._execute_and_instances(context)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3557, in _execute_and_instances
    querycontext, self._connection_from_session, close_with_result=True
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3572, in _get_bind_args
    mapper=self._bind_mapper(), clause=querycontext.statement, **kw
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3550, in _connection_from_session
    conn = self.session.connection(**kw)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1145, in connection
    execution_options=execution_options,
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1151, in _connection_for_bind
    engine, execution_options
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 433, in _connection_for_bind
    conn = bind._contextual_connect()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2302, in _contextual_connect
    self._wrap_pool_connect(self.pool.connect, None),
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
    return fn()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 364, in connect
    return _ConnectionFairy._checkout(self)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
    rec = pool._do_get()
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/sqlalchemy/pool/impl.py", line 132, in _do_get
    code="3o7r",
sqlalchemy.exc.TimeoutError: QueuePool limit of size 10 overflow 10 reached, connection timed out, timeout 30 (Background on this error at: http://sqlalche.me/e/13/3o7r)
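Note that the pool being exhausted here is Superset's own metadata-database connection pool (size 10, overflow 10, 30s wait, per the message), not the analytics datastore. If we needed headroom for load tests, something along these lines in superset_config.py should raise the limits — a sketch only, assuming Superset forwards Flask-SQLAlchemy engine options to create_engine(); the option names are standard SQLAlchemy pool parameters, the values are illustrative:

# Illustrative values only; assumes these are passed through to create_engine().
SQLALCHEMY_ENGINE_OPTIONS = {
    "pool_size": 20,      # the error above shows a current pool size of 10
    "max_overflow": 20,   # extra connections allowed beyond pool_size (currently 10)
    "pool_timeout": 60,   # seconds to wait for a free connection (currently 30)
}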

Currently what we're seeing is an upstream request timeout on dashboards that time out, such as the IP Masking dashboard (open it in multiple tabs if it doesn't time out on its own):

image.png (728×1 px, 154 KB)

@oljadimitrijevic had the idea to lower the Superset timeout to something like 15 seconds to check that the setting applies at all; once we're sure Superset itself is timing out, we can find where Superset logs its timeouts (adding logging for that if necessary) and pinpoint which part of the stack times out soonest.

A useful trick for testing timeouts is to use a SQL datastore (like mysql_staging) and run SELECT SLEEP(1000). This is easy to do in SQL Lab, and can also be done in a chart using the following steps:

  • open charts: https://superset.wikimedia.org/chart/list/
  • click new chart
  • select a mysql datasource such as staging.referer_data
  • find and select chart type "Big Number"
  • click create chart
  • click "+ Add metric"
  • click "CUSTOM SQL"
  • enter sleep(1000) (or similar) and click "Save"
  • open devtools and its Network tab if you want to capture timing
  • click "RUN QUERY"

Then sit back, relax, and wait for the timeout.
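If you'd rather time this outside the browser, a small script along these lines works too. A sketch only: it sends an unauthenticated request, so without a session cookie you may just get a login redirect, but it still shows at what point some hop drops the connection (the URL is the test chart linked just below):

import time
import requests  # third-party; pip install requests

url = "https://superset.wikimedia.org/r/922"  # test-timeout chart
start = time.monotonic()
try:
    # Client-side timeout deliberately huge, so any cutoff we observe is server-side.
    resp = requests.get(url, timeout=1000)
    print(f"HTTP {resp.status_code} after {time.monotonic() - start:.1f}s")
except requests.exceptions.RequestException as exc:
    print(f"Dropped after {time.monotonic() - start:.1f}s: {exc}")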

You can also use my test timeout chart here: https://superset.wikimedia.org/r/922

image.png (562×1 px, 75 KB)

BTullis moved this task from Paused to In Progress on the Data-Engineering-Kanban board.
BTullis added a subscriber: razzi.

I'm assigning this ticket to myself to see if I can make any more sense of it. Hope that's OK @razzi.

SGTM @BTullis hope you can figure it out

Here's my understanding of what could be timing out:

client
  • javascript makes an ajax request (I tested with curl --connect-timeout 1000 — which only caps the TCP connect phase, and curl applies no total-time limit by default — and it still timed out after 60s, so the cutoff isn't client-side)
infra
  • Apache Traffic Server (ATS) (not sure if superset is behind this)
  • Envoy (does TLS termination; it appears to have no timeout by default)
an-tool1010
  • Apache (/etc/apache2/apache2.conf has Timeout 300)
  • Gunicorn (/etc/superset/gunicorn_config.py has timeout = 185)
  • Superset

/etc/superset/superset_config.py has:

wikimedia_superset_timeout = datetime.timedelta(minutes=3).seconds  # 180

SQLLAB_TIMEOUT = wikimedia_superset_timeout
SUPERSET_WEBSERVER_TIMEOUT = wikimedia_superset_timeout

Potentially some of these settings are incorrect. It would be good to figure out what exactly is timing out through logging / selectively changing config.

Thanks @razzi. I think, but I'm not 100% sure, that it is ATS that is dropping the connection.
Using the Superset staging environment through an SSH tunnel, thereby bypassing ATS and the TLS offloading but still passing through Apache and Gunicorn, I was able to get a connection to time out at 180 seconds.
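(For anyone repeating the tunnel test, it looks roughly like this — the hostname and ports are illustrative, it assumes Apache listens on port 80 on the backend, and you need production SSH access via your usual bastion config:

# Forward a local port straight to the origin, bypassing ATS and the TLS proxy.
ssh -N -L 8080:localhost:80 an-tool1010.eqiad.wmnet &
# Time a slow request through the tunnel.
time curl -s --max-time 1000 -o /dev/null http://localhost:8080/
)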

image.png (276×1 px, 68 KB)

However, when it passes through ATS as well, I never get above 120 seconds, which is the client keepalive timeout set for ATS:
https://wikitech.wikimedia.org/wiki/HTTP_timeouts#Frontend_TLS

I will continue to investigate.

Interesting @BTullis! I never got more than 65 seconds from the client's perspective (dev tools), so ATS might be timing out at 120 seconds as you suggest, but it appears something else is timing out at 60. IMO a timeout of 2 minutes would be fine, if we can get it to actually do that; the timeout is currently 1 minute, so going up to 2 minutes already doubles it.

On that same article you linked, I see a clue:

image.png (314×1 px, 53 KB)

@razzi, you're spot on! Many thanks.
At first glance I thought that we didn't run envoy on an-tool1010, since trafficserver was already doing TLS offloading at the edge, but I was wrong.

On every caching proxy server (taking cp1075 as an example) we have this record in the trafficserver remap.config file:

btullis@cp1075:/etc/trafficserver$ sudo grep superset remap.config 
map http://superset.wikimedia.org https://an-tool1010.eqiad.wmnet

It uses a plaintext source_url (http://superset.wikimedia.org) but maps it to an HTTPS origin_url (https://an-tool1010.eqiad.wmnet).

Sure enough, when we look at an-tool1010 we can see that envoy is running on port 443.

btullis@an-tool1010:/etc/envoy$ sudo netstat -ltnp|grep 443
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      12555/envoy

And here is that 65 second timeout mentioned above, in /etc/envoy/envoy.yaml:

route_config:
  virtual_hosts:
  - domains: ['*']
    name: non_sni_port_80
    routes:
    - match: {prefix: /}
      route: {cluster: local_port_80, timeout: 65.0s}

So I'll look at how to change this value first, then see if we get hit by trafficserver's 120 second timeout after that.

It looks as though we might be able to change envoy's configuration on the fly, to test this before applying a puppet change.
https://wikitech.wikimedia.org/wiki/Envoy#Runtime_configuration
I'll look to see whether or not there might be any benefit to this.
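One related trick worth knowing: Envoy's router supports a per-request timeout override header, x-envoy-upstream-rq-timeout-ms, which it honors for requests it treats as internal. Whether that works here depends on how our listener is configured, but if it does, it would let us test a longer timeout with no config change at all — a sketch:

curl -sv -o /dev/null -H 'x-envoy-upstream-rq-timeout-ms: 200000' https://an-tool1010.eqiad.wmnet/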

We use envoy in front of all UIs that we run, since ATS backends can call our endpoints from any caching PoP, potentially traversing non-secure paths between DCs (and hence requiring TLS).

The profile that we use is profile::tlsproxy::envoy; profile::tlsproxy::envoy::upstream_response_timeout should be the value to tune (to change the default 65 second value).
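In hiera that should be a one-line override. A sketch — I haven't checked the exact hiera path or value the merged patch used; 185 is illustrative, chosen to sit just above Superset's own 180s limit and match the Gunicorn timeout:

# hieradata/hosts/an-tool1010.yaml (path and value illustrative)
profile::tlsproxy::envoy::upstream_response_timeout: 185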

Change 743947 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Increase envoy upstream timeout for superset

https://gerrit.wikimedia.org/r/743947

Change 743947 merged by Btullis:

[operations/puppet@production] Increase envoy upstream timeout for superset

https://gerrit.wikimedia.org/r/743947

Thanks elukey. It works! And we didn't get hit by the 2 minute keepalive timeout in ATS either. My query ran for the full 180 seconds, up to the timeout configured within Superset itself.

image.png (501×1 px, 71 KB)

razzi removed a project: User-razzi.

Go team! This ticket wins the award for "most hours spent per line of code changed". Thanks @BTullis @elukey as always