Our Presto cluster performs adequately under single-query workloads but becomes unresponsive under concurrent usage. This proposal introduces resource groups, query limits, and prioritization tuned for our 15-node (~3.2 TB RAM) cluster.
In this ticket I'm proposing an optimization principle that should take our Presto users to a hopefully happier place. The intended purpose of the proposed configuration changes here is for them to be discussed and modified or even discarded if necessary.
Some of the Presto configuration proposed here already exists, but the values should be updated IMO since the existing ones were calibrated for the 5-node cluster.
Key design principle
Prevent cluster resource clogging by:
- Limiting per-query resource usage
- Enforcing strict concurrency caps
- Using queueing and prioritization
Proposed configuration
- Cluster-level query memory limits
We should not allow a single query to consume huge portions of the cluster.
Existing config:
query.max-memory: 200GB query.max-memory-per-node: 20GB query.max-total-memory-per-node: 40GB
Update to:
query.max-memory=400GB query.max-memory-per-node=12GB query.max-total-memory-per-node=16GB query.max-execution-time=30m
Why:
- 12GB per node × 15 nodes = 180GB active memory per query
- Total cap (with overhead) ~400GB/query
- Introduce max execution time limit, to prevent long-term hogging of the cluster
- CPU/task concurrency tuning
Presto can oversubscribe easily when it comes to concurrent task threads, so it's probably a good idea to impose a sensible limit.
task.concurrency=4 task.max-worker-threads=32
- Spill-to-disk
I'm not sure if we have this enabled as I don't see it in the Puppet config, but this is critical since as memory pressure grows the cluster stalls.
spill-enabled=true spiller-spill-path=/mnt/presto-spill query.max-spill-per-node=100GB
- Guardrails for “bad” queries
query.max-scan-physical-bytes=2TB
This prevents accidental “scan the entire data lake” queries from killing the cluster.
- Resource groups
The proposal here is to categorize queries into three different groups:
- High priority (e.g. frequently used Superset dashboard queries, possibly GrowthBook queries)
- Standard
- Heavy
By assigning different scheduling weights and memory/concurrency limits, we can modify the cluster behavior under concurrent usage by different users. Give priority to users/queries that need it, and put heavy queries on the backburner since it's expected they will take time to complete.
{
"rootGroups": [
{
"name": "global",
"softMemoryLimit": "100%",
"hardConcurrencyLimit": 40,
"maxQueued": 200,
"schedulingPolicy": "weighted",
"subGroups": [
{
"name": "high_priority",
"softMemoryLimit": "35%",
"hardConcurrencyLimit": 12,
"maxQueued": 40,
"schedulingWeight": 6
},
{
"name": "standard",
"softMemoryLimit": "45%",
"hardConcurrencyLimit": 20,
"maxQueued": 100,
"schedulingWeight": 3
},
{
"name": "heavy",
"softMemoryLimit": "20%",
"hardConcurrencyLimit": 5,
"maxQueued": 60,
"schedulingWeight": 1
}
]
}
]
}- User/workload selectors
Once we have different resource groups, we can route different users/workloads to them:
{
"selectors": [
{
"source": "superset_user",
"group": "global.high_priority"
},
{
"user": "growthbook_user",
"group": "global.high_priority"
},
{
"clientTags": ["heavy"],
"group": "global.heavy"
},
{
"queryType": "SELECT",
"group": "global.standard"
}
]
}Users can set tags this way to route their queries to specific resource groups:
SET SESSION client_tags = 'heavy';