The newly created Superset dashboard for session length (https://superset.wikimedia.org/superset/dashboard/232) has very few days of data, and is already quite slow, with some charts timing out after 1 minute. After some troubleshooting we identified a couple root causes and possible solutions.
Root causes:
- The Superset/Presto calls to approx_percentile() are being repeated once per percentile metric. We should optimize so that approx_percentile() is only called once and returns a list of percentiles.
- The data of the intermediate table is small in size, but quite big in number of rows (11M rows/day). This slows down all queries, especially the percentile ones.
- Bonus (not a root cause yet, but will be in the future): Presto doesn't use partition pruning when querying the intermediate table. With time, the base data will become impossible to query.
- Some calculations are being done in Superset/Presto that could be done in the backend. For instance: date concatenation, bucket classification and project_family regexp extraction.
- Maybe the fact that there are 6 charts on the same dashboard (simultaneous loading) slows down queries overall.
Possible mitigations:
- Change the way we call approx_percentile in Superset/Presto. DONE! This solves #1. Thanks Connie!
- Reduce the number of events collected by the clients. This is IN PROGRESS in T276630 and T276636. It will help in mitigating #2.
- Using Iceberg in Hadoop to handle time intervals transparently will solve #3. This project is on-going T262256 and will take some time to be finished. However, #3 is a long term issue (in years), so likely Iceberg will be in place before we experience any problems there.
- Optimizing the way we store data in the intermediate table, by aggregating it. Using weights to still allow Presto to calculate percentiles on top of that. This will greatly reduce the number of rows in the intermediate table (x500), solving #2. IN PROGRESS in this same task.
- Move some calculations that are done in Superset, to the intermediate table (precompute). This would mitigate #4. Let's avoid this if possible, because we might want to change these calculations in the future, i.e. change the bucket distribution for our charts. If after solving #2, we still see slowness, then let's work on this.
- Move the session count charts to another dashboard. Would that be OK?