
Superset not able to load a reading dashboard
Closed, Resolved · Public · 5 Story Points

Description

[John K] Much of my original reading dashboard...which used to work, is throwing up performance errors. You can look at each chart to try and figure out why. Unlike the browser family issue, where there are hundreds of potential values, for referers there are < 10.
https://superset.wikimedia.org/superset/dashboard/13/

Event Timeline

Nuria created this task. · Oct 4 2019, 6:27 PM
Restricted Application added a subscriber: Aklapper. · Oct 4 2019, 6:27 PM
Nuria added a comment. (Edited) · Oct 4 2019, 10:25 PM

I see errors reaching 1001, but I thought 1003 was the one answering to Superset?

The broker.log shows a similar error on druid1003:

Traceback (most recent call last):
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/superset/viz.py", line 415, in get_df_payload
    df = self.get_df(query_obj)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/superset/viz.py", line 200, in get_df
    self.results = self.datasource.query(query_obj)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/superset/connectors/druid/models.py", line 1356, in query
    query_str = self.get_query_str(client=client, query_obj=query_obj, phase=2)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/superset/connectors/druid/models.py", line 958, in get_query_str
    return self.run_query(client=client, phase=phase, **query_obj)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/superset/connectors/druid/models.py", line 1233, in run_query
    client.topn(**pre_qry)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/pydruid/client.py", line 130, in topn
    return self._post(query)
  File "/srv/deployment/analytics/superset/venv/lib/python3.7/site-packages/pydruid/client.py", line 569, in _post
    separators=(',', ': '))))
OSError: HTTP Error 500: Internal Server Error 
 Druid Error: {'error': 'Unknown exception', 'errorMessage': 'Query[2b1c0b03-0b6a-46e9-836c-0dd53ce6bb86] url[http://druid1001.eqiad.wmnet:8083/druid/v2/] timed out.', 'errorClass': 'io.druid.java.util.common.RE', 'host': None}
Nuria added a comment. (Edited) · Oct 4 2019, 10:43 PM

This dashboard also includes pageviews by browser family (top 50) over a 4-year timespan. If I execute that query on 1001 it does not work at first try. If I shorten the timespan to two years it works, and when I later execute the larger query (4-year timespan) it does work too. So this is an issue with Druid (not Superset) not being able to load 4 years of segments from pageview_hourly:

curl -X POST 'http://localhost:8082/druid/v2/?pretty' -H 'Content-Type:application/json' -H 'Accept:application/json' -d '{
  "aggregations": [
    {
      "fieldName": "view_count",
      "name": "sum__view_count",
      "type": "doubleSum"
    }
  ],
  "dataSource": "pageviews_daily",
  "dimension": "ua_browser_family",
  "filter": {
    "dimension": "agent_type",
    "type": "selector",
    "value": "user"
  },
  "granularity": "all",
  "intervals": "2015-10-04T00:00:00+00:00/2019-10-04T00:00:00+00:00",
  "metric": "sum__view_count",
  "postAggregations": [],
  "queryType": "topN",
  "threshold": 50
}'

Nuria added a comment. (Edited) · Oct 4 2019, 11:18 PM

So the query itself is expensive, but once segments are partially loaded it is tenable. I would think that once I have executed the larger query on the brokers by hand, Superset should benefit from the "pre-loading" of segments and be able to execute the query too, just like I can on the command line (takes 200ms), but no, that's not happening.

elukey added a subscriber: elukey. · Oct 7 2019, 3:22 PM

I see errors reaching 1001, but I thought 1003 was the one answering to Superset?

Yep, the broker used is on 1003, but the errors refer to druid1001:8083, which is a historical (port 8082 is the broker one, 8083 the historical).

The broker.log shows a similar error on druid1003 (same traceback as quoted above).

We have strict timeouts for both Druid clusters; we might need to tune some of the following:

profile::druid::broker::properties:
  [..]
  # To avoid excessive queueing on the Historicals,
  # the Broker fails fast if it doesn't receive any data
  # from the Historical after 5s (as opposed to minutes
  # like set in the defaults) or if a query takes more than
  # 5s to complete.
  druid.broker.http.readTimeout: PT5S
  druid.server.http.defaultQueryTimeout: 5000

profile::druid::historical::properties:
  [..]
  # Sane value to abort a query that takes more than 5s to complete
  druid.server.http.defaultQueryTimeout: 5000

Will dig a bit more into the logs..
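(Side note, not from the ticket: besides the cluster-wide `druid.server.http.defaultQueryTimeout`, Druid also honors a per-query `timeout` in milliseconds set in the query `context`, which can be useful when testing a longer limit for one query without restarting anything. A minimal sketch, assuming the topN query from the curl example above; the `with_timeout` helper is hypothetical:)

```python
import json

def with_timeout(query: dict, timeout_ms: int) -> dict:
    """Return a copy of a Druid query with a per-query timeout override.

    Druid reads context.timeout (milliseconds) for this query instead of
    the cluster-wide druid.server.http.defaultQueryTimeout.
    """
    q = dict(query)
    q["context"] = {**q.get("context", {}), "timeout": timeout_ms}
    return q

# The topN query from the comment above, for reference:
topn = {
    "queryType": "topN",
    "dataSource": "pageviews_daily",
    "dimension": "ua_browser_family",
    "metric": "sum__view_count",
    "threshold": 50,
    "granularity": "all",
    "intervals": "2015-10-04T00:00:00+00:00/2019-10-04T00:00:00+00:00",
    "aggregations": [
        {"type": "doubleSum", "fieldName": "view_count", "name": "sum__view_count"}
    ],
}

payload = json.dumps(with_timeout(topn, 10_000))
# POST `payload` to http://localhost:8082/druid/v2/ as in the curl example.
```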

Milimetric reassigned this task from Milimetric to Nuria.
Milimetric triaged this task as High priority.
Milimetric moved this task from Incoming to Ops Week on the Analytics board.
Milimetric added a project: Analytics-Kanban.
Milimetric added a subscriber: Milimetric.

Change 541512 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::analytics::worker: increase query timeout to 10s

https://gerrit.wikimedia.org/r/541512

Change 541512 merged by Elukey:
[operations/puppet@production] role::druid::analytics::worker: increase query timeout to 10s

https://gerrit.wikimedia.org/r/541512

Mentioned in SAL (#wikimedia-operations) [2019-10-08T08:33:08Z] <elukey> roll restart druid historicals and brokers on druid100[1-3] to pick up new settings - T234684

elukey added a comment. · Oct 8 2019, 9:29 AM

OK, so now we have a more reasonable timeout for the Druid Analytics cluster (10s), but the issue persists, so we can rule out timeouts.

elukey added a comment. · Oct 8 2019, 9:40 AM

Correction: some graphs in the dashboard now load (after a few retries to warm up Druid's cache), but a couple still yield errors. For example, the bottom graph says:

HTTP Error 500: Internal Server Error
Druid Error: {'error': 'Resource limit exceeded', 'errorMessage': 'query[b31cc31c-56c5-43c4-9f52-d26a39ec79ab] url[http://druid1003.eqiad.wmnet:8083/druid/v2/] timed out or max bytes limit reached.', 'errorClass': 'io.druid.query.ResourceLimitExceededException', 'host': 'druid1003.eqiad.wmnet:8083'}
Query is:
{
  "aggregations": [
    {
      "fieldName": "view_count",
      "name": "sum__view_count",
      "type": "doubleSum"
    }
  ],
  "dataSource": "pageviews_daily",
  "dimension": "project",
  "filter": {
    "dimension": "agent_type",
    "type": "selector",
    "value": "user"
  },
  "granularity": "all",
  "intervals": "2015-05-08T00:00:00+00:00/2019-10-08T09:33:31+00:00",
  "metric": "sum__view_count",
  "postAggregations": [],
  "queryType": "topN",
  "threshold": 5000
}

HTTP Error 500: Internal Server Error
Druid Error: {'error': 'Resource limit exceeded', 'errorMessage': 'query[a85ac3b1-a6d1-4bda-9818-4d0db5adb8ab] url[http://druid1002.eqiad.wmnet:8083/druid/v2/] timed out or max bytes limit reached.', 'errorClass': 'io.druid.query.ResourceLimitExceededException', 'host': 'druid1002.eqiad.wmnet:8083'}
Query is

The query id (b31cc31c-56c5-43c4-9f52-d26a39ec79ab) in druid1003's historical logs reports:

2019-10-08T09:33:41,831 INFO io.druid.query.ChainedExecutionQueryRunner: Query timeout, cancelling pending results for query id [b31cc31c-56c5-43c4-9f52-d26a39ec79ab]
2019-10-08T09:33:41,833 WARN io.druid.server.QueryLifecycle: Exception while processing queryId [b31cc31c-56c5-43c4-9f52-d26a39ec79ab]
io.druid.query.QueryInterruptedException: Timeout waiting for task.
        at io.druid.query.ChainedExecutionQueryRunner$1.make(ChainedExecutionQueryRunner.java:174) ~[druid-processing-0.12.3.jar:0.12.3]
        at io.druid.java.util.common.guava.BaseSequence.toYielder(BaseSequence.java:65) ~[java-util-0.12.3.jar:0.12.3]
        at io.druid.java.util.common.guava.MappedSequence.toYielder(MappedSequence.java:49) ~[java-util-0.12.3.jar:0.12.3]
        at io.druid.java.util.common.guava.WrappingSequence$2.get(WrappingSequence.java:87) ~[java-util-0.12.3.jar:0.12.3]
        at io.druid.java.util.common.guava.WrappingSequence$2.get(WrappingSequence.java:83) ~[java-util-0.12.3.jar:0.12.3]

elukey added a comment. (Edited) · Oct 9 2019, 3:19 PM

The "There was an issue etc.." error that pops up at the very beginning of the page load seems to be related to something else:

https://github.com/apache/incubator-superset/issues/6824

Will try to follow up..

elukey added a comment. (Edited) · Oct 9 2019, 3:26 PM

The only chart currently not working is the "Pageviews by browser family" one, which consistently times out around 10s. The dashboard should probably load less data at once; is that possible?

EDIT: selecting up to 1y of data works very well now; more causes timeouts. I think that a 10s timeout for the Druid Analytics cluster is OK, but I'd be reluctant to increase it.

Nuria added a comment. · Oct 9 2019, 3:41 PM

@elukey I think increasing the timeout would do little here: the 4-year query for browsers is not executable (cold cache) in Druid, so Superset will not be able to do it either. Now, the puzzling part is that once the cache is warm (and I warmed it up by hand) Superset should be able to load that data. This is probably an issue bigger than this ticket.

Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board. · Oct 9 2019, 4:18 PM

Pinging @JAllemandou in case he has other ideas

@elukey I think increasing the timeout would do little here: the 4-year query for browsers is not executable (cold cache) in Druid, so Superset will not be able to do it either. Now, the puzzling part is that once the cache is warm (and I warmed it up by hand) Superset should be able to load that data. This is probably an issue bigger than this ticket.

The charts that fail to load report HTTP 500s generated by Druid, and that we can find on the Druid logs as well. So I don't think that Superset could do more, but maybe I am missing something :(

Here is my understanding of the different things at play here:

  • Caching is done by the broker per query, per segment. This means that for a given set of query parameters (excluding interval boundaries), the cache uses/stores results for that query for every segment required by the query interval.
  • When using a line chart and a groupBy with a single dimension, Superset issues 2 queries: the first to gather the list of values to use (topN over the given time period, with ALL granularity), and a second to get split-by-time results for the precomputed list of values (topN over the given time period, with the selected granularity and an OR filter for the values list). It is interesting to notice that when using groupBy with 2 dimensions, Superset issues a single groupBy query, since the double-query optimization can only be done over one dimension.
  • When the query for a long time period fails, it fails on the second query (the one with the time split and value filter). The first one, which precomputes values, is always successful (at least in my tests). Something else to notice: it is this first query that Superset displays when using "View query".
  • Preloading the segment-result cache for a long time period (4 years for instance) means running the second query for a small-enough time period (1 year worked for me in our case), allowing the query to succeed and therefore the cache to be loaded for the queried segments, then reissuing the query for a longer time period (2 years for instance), and so on, up to having cached data for most of the needed segments. I confirm that this strategy works: I have managed to get a 4-year chart in Superset.
  • The gotcha here is that this strategy cannot usually be transferred to Superset: when asking Superset for a 1-year chart, then a 2-year chart, it issues 2 queries each time, and the list of values parameterizing the second query will likely differ between the 1-year chart and the 2-year chart, meaning no caching is actually done for the 2-year chart when querying for the 1-year chart, as the query parameters are not the same. I overcame this limitation and managed to draw a 4-year chart in Superset by issuing manual queries to Druid with the correct parameters.
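(Editor's sketch, not from the ticket: the two-phase pattern described above can be roughed out as follows. The function names are illustrative, and an `in` filter stands in for the OR-of-selectors that Superset actually generates; the exact payloads may differ.)

```python
# Phase one finds the top values; phase two fetches time-split results
# restricted to exactly those values. The value list therefore becomes
# part of the phase-two cache key, which is why cache warmed for one
# time period rarely helps a different one.
BASE = {
    "dataSource": "pageviews_daily",
    "dimension": "project",
    "metric": "sum__view_count",
    "aggregations": [
        {"type": "doubleSum", "fieldName": "view_count", "name": "sum__view_count"}
    ],
    "intervals": "2018-10-08T00:00:00+00:00/2019-10-08T00:00:00+00:00",
}

def phase_one(threshold: int) -> dict:
    """topN over the whole period, ALL granularity: just find the top values."""
    return {**BASE, "queryType": "topN", "granularity": "all", "threshold": threshold}

def phase_two(values: list, granularity: str = "day") -> dict:
    """topN split by time, filtered to the values found in phase one."""
    return {
        **BASE,
        "queryType": "topN",
        "granularity": granularity,
        "threshold": len(values),
        "filter": {"type": "in", "dimension": BASE["dimension"], "values": values},
    }

# e.g. with values extracted from the phase-one response:
q2 = phase_two(["en.wikipedia", "de.wikipedia"])
```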

Something else to notice: the series limit parameter is important when using a group by, as it defines how many values are shown, and therefore how many topN values are requested and then filtered in the case of a single-dimension group by.

Finally, it could be fun to add functionality to Superset to incrementally warm the cache for a given time period - I however have no clue how complicated that would be :)
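(Editor's sketch, not from the ticket: the incremental warm-up described above boils down to issuing the same query over progressively widening intervals ending at the same point, so each run only has to compute the newly added segments. The helper below just builds the interval strings; actually posting each query to the broker, as in the curl example earlier, is left out. Names are illustrative.)

```python
from datetime import datetime, timedelta

def widening_intervals(end: datetime, steps_years=(1, 2, 3, 4)):
    """Yield ISO-8601 intervals of increasing length, all ending at `end`.

    Running the same query (same dimension, filter, threshold and value
    list) over each interval in order lets the broker's per-segment cache
    fill gradually, so the final, longest query mostly hits cache.
    """
    for years in steps_years:
        start = end - timedelta(days=365 * years)  # approximation, ignores leap days
        yield f"{start.isoformat()}+00:00/{end.isoformat()}+00:00"

end = datetime(2019, 10, 8)
intervals = list(widening_intervals(end))
# For each interval, POST the identical topN query to the broker,
# substituting only the "intervals" field.
```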

Nuria added a comment. · Oct 10 2019, 3:11 PM

@JAllemandou what a great piece of insight. I think I am going to close this ticket because:

  • bumping the timeout seemed to resolve the less intense queries
  • pageviews per browser family for 4 years is not doable on our system as it is now, I think for two reasons: the length of data (number of segments) and the cardinality of the OS field, which is in the thousands in just a few days.
Nuria moved this task from In Progress to Done on the Analytics-Kanban board. · Oct 10 2019, 3:11 PM
Nuria set the point value for this task to 5.

@elukey Thanks for making this adjustment! That makes a huge difference.

Nuria closed this task as Resolved. · Oct 14 2019, 3:30 PM