Optimize intermediate session length data set and dashboard
Closed, Resolved · Public

Description

The newly created Superset dashboard for session length (https://superset.wikimedia.org/superset/dashboard/232) has very few days of data, and is already quite slow, with some charts timing out after 1 minute. After some troubleshooting we identified a couple of root causes and possible solutions.

Root causes:

  1. The Superset/Presto calls to approx_percentile() are being repeated once per percentile metric. We should optimize so that approx_percentile() is only called once and returns a list of percentiles.
  2. The data in the intermediate table is small in size, but has a very large number of rows (11M rows/day). This slows down all queries, especially the percentile ones.
  3. Bonus (not a root cause yet, but it will be in the future): Presto doesn't use partition pruning when querying the intermediate table. Over time, the base data will become impossible to query.
  4. Some calculations are being done in Superset/Presto that could be done in the backend. For instance: date concatenation, bucket classification and project_family regexp extraction.
  5. Maybe the fact that there are 6 charts on the same dashboard (simultaneous loading) slows down queries overall.
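As a sketch of root cause #1 (with made-up numbers, not the real dashboard data): Presto's approx_percentile() also accepts an array of percentiles and returns an array, so the per-metric calls can collapse into a single one. The same idea in plain Python:

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile of an already-sorted list; p in [0, 1]."""
    k = round(p * (len(sorted_values) - 1))
    return sorted_values[k]

def percentiles(sorted_values, ps):
    """One scan, many percentiles -- analogous to Presto's
    approx_percentile(x, ARRAY[0.5, 0.9]) returning an array."""
    return [percentile(sorted_values, p) for p in ps]

# Hypothetical session-length ticks, already sorted.
session_ticks = [1, 2, 2, 3, 5, 8, 13, 21, 34, 55]

# Slow pattern: one call (and one scan of the data) per chart metric.
p50 = percentile(session_ticks, 0.50)
p90 = percentile(session_ticks, 0.90)

# Optimized pattern: a single call returns every percentile at once.
p50_opt, p90_opt = percentiles(session_ticks, [0.50, 0.90])
```

In Presto itself the fix is analogous: one `approx_percentile(tick, ARRAY[0.5, 0.9])` call instead of a separate call per percentile metric.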

Possible mitigations:

  • Change the way we call approx_percentile in Superset/Presto. DONE! This solves #1. Thanks Connie!
  • Reduce the number of events collected by the clients. This is IN PROGRESS in T276630 and T276636. It will help in mitigating #2.
  • Using Iceberg in Hadoop to handle time intervals transparently will solve #3. This project is ongoing (T262256) and will take some time to finish. However, #3 is a long-term issue (on the scale of years), so Iceberg will likely be in place before we experience any problems there.
  • Optimize the way we store data in the intermediate table by aggregating it, using weights so that Presto can still calculate percentiles on top of it. This will greatly reduce the number of rows in the intermediate table (~500x), solving #2. IN PROGRESS in this same task.
  • Move some calculations that are done in Superset to the intermediate table (precompute them). This would mitigate #4. Let's avoid this if possible, because we might want to change these calculations in the future, e.g. change the bucket distribution for our charts. If we still see slowness after solving #2, then let's work on this.
  • Move the session count charts to another dashboard. Would that be OK?
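A minimal sketch of the weighted-aggregation mitigation for #2 (the data and helper below are illustrative, not the actual refinery schema): store one row per distinct tick value with an event-count weight, and walk the cumulative weights to recover percentiles:

```python
from collections import Counter

# Raw layout: one row per event (this is what makes the table 11M rows/day).
raw_ticks = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 5, 5, 8]

# Aggregated layout: one row per distinct tick value, plus a weight column.
aggregated = sorted(Counter(raw_ticks).items())  # [(tick, weight), ...]

def weighted_percentile(pairs, p):
    """Percentile over (value, weight) pairs sorted by value: return the
    first value whose cumulative weight reaches p * total_weight."""
    total = sum(w for _, w in pairs)
    threshold = p * total
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= threshold:
            return value
    return pairs[-1][0]

median = weighted_percentile(aggregated, 0.5)  # same answer as over raw rows
p90 = weighted_percentile(aggregated, 0.9)
```

Presto has a weighted variant, approx_percentile(x, w, percentage), so the dashboard queries can read the aggregated table directly: since many events share a tick value, row counts drop sharply while the computed percentiles stay the same.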

Event Timeline

I think solving #2 will be enough for the dashboard to perform fine for several months, maybe a couple of years.
I'm already working on that. It should be a small change to the way we store data in the intermediate table, which will need a couple of adjustments to the dashboard queries.
It won't alter the contents of the data set, just the format. Thus, the results (data, charts, etc.) of the dashboard will not change.

In the meantime, thinking of the future performance of this data set, we should work on a dynamic sampling rate and Iceberg!

Change 672541 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Optimize data format of session length intermediate table

https://gerrit.wikimedia.org/r/672541

@cchen once we merge and deploy the optimization above, the session length dashboard will need some adjustments.
Maybe we can set up a meeting and pair on them, I think it will be the fastest. It shouldn't be a big thing, maybe 30 minutes.

Change 672541 merged by Mforns:
[analytics/refinery@master] Optimize data format of session length intermediate table

https://gerrit.wikimedia.org/r/672541

These also seem promising:

  • Improve the SQL Lab to Explore flow UX, which mentions "In the Change Datasource modal allow the user to select virtual datasets, as well as physical datasets". I'm guessing that would enable multiple charts based on the same underlying query/virtual datasource, rather than the 1-to-1 chart-query relationship that slows down Presto-based dashboards. This improvement appears to have made it into Superset 1.0.
  • Async Queries via Celery, which talks about running long queries asynchronously and storing the results (curious if @elukey has thoughts on this).

Adding @razzi as well as the new owner of Superset :)

My 2c about Celery: it is something we may want to do during the next fiscal year, if there is a desire/need for long queries. Before transitioning fully to Machine Learning, I am focusing on testing Alluxio for Presto, and on thinking about how to expand the current cluster to offer a more powerful way of querying datasets (not limited to small ones, as it is right now). In theory, Alluxio should let us speed up Presto-based queries a lot, but of course not Druid-based ones (even though we have recently done a lot on the Druid caching side to improve it). So in theory Celery will still be useful for long queries; in practice, I hope to reduce the gap we have now with aggressive caching on the Presto side :D

As the main problem that we had in the session length dashboard has been significantly mitigated by https://gerrit.wikimedia.org/r/672541, I consider this task done.
There are other suggested optimizations that are tackled in other tasks, those will continue to be worked on there.
For the suggestions that don't have their own task yet, please create one if you think they should be implemented in the future.
Thanks!