
Alluxio for Improved Superset Query Performance
Open, Medium, Public

Description

As a user of Superset, I wish to experience faster dashboard rendering and fewer timeouts so that I can view reports quickly.

The solution identified is to use Presto's built-in Alluxio SDK as a discrete cache for HDFS files on each Presto worker node.

An earlier iteration of this plan was attempted in 2021, in which we intended to use a distributed Alluxio cache service. This failed because we were unable to connect Alluxio to a kerberised Hive metastore.

This version of the plan differs from the previous attempt in that Alluxio is only ever used locally on each Presto worker node, using a JAR file provided with Presto itself.
The caches are unaware of each other, and the only client of each cache is the Presto server running on the same machine.
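
For illustration, the Presto documentation describes enabling this kind of worker-local cache through Hive catalog properties along the following lines. This is only a sketch: the cache size and directory are placeholder values, not settings agreed for this cluster, and the soft-affinity line is an assumption about how we would keep requests for the same file on the same worker.

```properties
# Hive catalog properties file on each worker (values are placeholders)
cache.enabled=true
cache.type=ALLUXIO
# Assumed limit; to be sized against the local disk actually available
cache.alluxio.max-cache-size=500GB
# Assumed path; any local directory on fast storage would do
cache.base-directory=file:///mnt/flash/data

# Set on the coordinator so that splits for a given file tend to be
# scheduled on the same worker, which lets its local cache be reused
hive.node-selection-strategy=SOFT_AFFINITY
```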

Event Timeline

Adding 2 things:

  • Alluxio is built using 3 different leader-follower systems: core (caching), job (data movement), and catalog (Hive tables)
  • The performance test on the test cluster will probably show no visible improvement, given the relatively small size of the test cluster and of the data it holds. We should nonetheless be able to trace execution and confirm that the flow is the expected one.
Ottomata added a subtask: Unknown Object (Task). Sep 16 2021, 5:16 PM
wiki_willy closed subtask Unknown Object (Task) as Declined. Sep 30 2021, 8:01 PM

Should we decline this ticket now, or mark it as resolved, or re-title it?

BTullis claimed this task.
BTullis added subscribers: brouberol, Stevemunene, Ahoelzl.

I'm re-opening this ticket, as we have made significant advances on the use of the built-in Alluxio SDK cache: https://prestodb.io/docs/current/cache/local.html
Two child tickets (T266641: [Data Platform] Test Alluxio as cache layer for Presto, and T342343: Upgrade Presto to version 0.283) are under way, so I think it makes sense to bring back this ticket to track any follow-up or ancillary work.

Speaking to @odimitrijevic about this ticket the other day, we discussed that it would be good to establish a baseline against which to measure any performance improvements when we do enable caching in Presto.
Would we want to have duplicate catalogs (one with caching, one without), for example, so that we can gauge the difference that it makes?
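
As a rough sketch of how such a comparison could be scripted, the helper below times the same query several times against each catalog and reports the ratio of the median runtimes. Everything here is hypothetical: `run_query` stands in for whatever client we would use to submit SQL to Presto, and the catalog names in the usage note are invented for the example.

```python
import statistics
import time
from typing import Callable, List, Sequence


def time_query(run_query: Callable[[str], object], sql: str, repeats: int = 5) -> List[float]:
    """Run the same SQL `repeats` times and return wall-clock durations in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(sql)  # hypothetical client call; its result is ignored here
        timings.append(time.perf_counter() - start)
    return timings


def median_speedup(uncached: Sequence[float], cached: Sequence[float]) -> float:
    """Return median(uncached) / median(cached); values above 1.0 favour the cache."""
    cached_median = statistics.median(cached)
    if cached_median <= 0:
        raise ValueError("cached timings must have a positive median")
    return statistics.median(uncached) / cached_median
```

In use, the same statement would be formatted once per catalog, for example with the invented names `hive_nocache` and `hive_cached`, and the two timing lists passed to `median_speedup`. The first run against the cached catalog warms the cache, so it may be worth discarding before comparing.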

It also occurred to me that we have some cache metrics available to monitor via JMX: https://prestodb.io/docs/current/cache/local.html#monitoring
We should make sure that we have those available with a suitable Grafana dashboard.

Gehel lowered the priority of this task from High to Medium. Dec 7 2023, 1:50 PM
Gehel moved this task from Incoming to Misc on the Data-Platform-SRE board.
BTullis removed a project: Epic.
Gehel subscribed.

I think that the latest Superset deployments have caching enabled. This ticket might not be useful anymore.

Actually, this brings a different level of caching and could reduce network pressure in some instances (T364893#9800673).