Page MenuHomePhabricator

Presto cluster improvements for concurrency and workload
Open, Needs TriagePublic

Description

Our Presto cluster performs adequately under single-query workloads but becomes unresponsive under concurrent usage. This proposal introduces resource groups, query limits, and prioritization tuned for our 15-node (~3.2 TB RAM) cluster.

In this ticket I'm proposing an optimization principle that should take our Presto users to a hopefully happier place. The intended purpose of the proposed configuration changes here is for them to be discussed and modified or even discarded if necessary.

Some of the Presto configuration proposed here already exists, but the values should be updated IMO since the existing ones were calibrated for the 5-node cluster.

NOTE: This work is orthogonal to the investigation into hardware and networking bottlenecks of the Presto cluster. True performance improvements will most likely come out of that work, but configuring the cluster properly will set us on the correct course for an ever-increasing number of Presto users.

Key design principle

Prevent cluster resource clogging by:

  • Limiting per-query resource usage
  • Enforcing strict concurrency caps
  • Using queueing and prioritization

Proposed configuration

  1. Cluster-level query memory limits

We should not allow a single query to consume huge portions of the cluster.

Existing config:

query.max-memory: 200GB
query.max-memory-per-node: 20GB
query.max-total-memory-per-node: 40GB

Update to:

query.max-memory=400GB
query.max-memory-per-node=12GB
query.max-total-memory-per-node=16GB
query.max-execution-time=30m

Why:

  • 12GB per node × 15 nodes = 180GB active memory per query
  • Total cap (with overhead) ~400GB/query
  • Introduce max execution time limit, to prevent long-term hogging of the cluster
  1. CPU/task concurrency tuning

Presto can oversubscribe easily when it comes to concurrent task threads, so it's probably a good idea to impose a sensible limit.

task.concurrency=4
task.max-worker-threads=32
  1. Spill-to-disk

I'm not sure if we have this enabled as I don't see it in the Puppet config, but this is critical since as memory pressure grows the cluster stalls.

spill-enabled=true
spiller-spill-path=/mnt/presto-spill
query.max-spill-per-node=100GB
  1. Guardrails for “bad” queries
query.max-scan-physical-bytes=2TB

This prevents accidental “scan the entire data lake” queries from killing the cluster.

  1. Resource groups

The proposal here is to categorize queries into three different groups:

  • High priority (e.g. frequently used Superset dashboard queries, possibly GrowthBook queries)
  • Standard
  • Heavy

By assigning different scheduling weights and memory/concurrency limits, we can modify the cluster behavior under concurrent usage by different users. Give priority to users/queries that need it, and put heavy queries on the backburner since it's expected they will take time to complete.

{
  "rootGroups": [
    {
      "name": "global",
      "softMemoryLimit": "100%",
      "hardConcurrencyLimit": 40,
      "maxQueued": 200,
      "schedulingPolicy": "weighted",
      "subGroups": [
        {
          "name": "high_priority",
          "softMemoryLimit": "35%",
          "hardConcurrencyLimit": 12,
          "maxQueued": 40,
          "schedulingWeight": 6
        },
        {
          "name": "standard",
          "softMemoryLimit": "45%",
          "hardConcurrencyLimit": 20,
          "maxQueued": 100,
          "schedulingWeight": 3
        },
        {
          "name": "heavy",
          "softMemoryLimit": "20%",
          "hardConcurrencyLimit": 5,
          "maxQueued": 60,
          "schedulingWeight": 1
        }
      ]
    }
  ]
}
  1. User/workload selectors

Once we have different resource groups, we can route different users/workloads to them:

{
  "selectors": [
    {
      "source": "superset_user",
      "group": "global.high_priority"
    },
    {
      "user": "growthbook_user",
      "group": "global.high_priority"
    },
    {
      "clientTags": ["heavy"],
      "group": "global.heavy"
    },
    {
      "queryType": "SELECT",
      "group": "global.standard"
    }
  ]
}

Users can set tags this way to route their queries to specific resource groups:

SET SESSION client_tags = 'heavy';

Details

Event Timeline

Gehel added a subscriber: JAllemandou.

@JAllemandou should have a look into this when back from vacation

Nice! This is not unlike our Yarn policies, which follow similar principles. I especially like the prioritization of Superset and GrowthBook, which indeed require close to real-time access.

The only concern I have is on the following:

  1. Guardrails for “bad” queries
query.max-scan-physical-bytes=2TB

An example (presumably) legitimate use case: big ad-hoc aggregations that would take 1-5 minutes in Presto but 30+ minutes in Spark. I do these when I need to get some numbers quick on, say, wmf_content.mediawiki_content_history_v1, which is one of our biggest tables. If this max-scan-physical-bytes setting will just fail my query then the value of the Presto cluster, for me, would go down, especially if it fails at runtime rather than at query planning?

Nice! This is not unlike our Yarn policies, which follow similar principles. I especially like the prioritization of Superset and GrowthBook, which indeed require close to real-time access.

Yes, exactly. Prioritizing Superset and GrowthBook would be a very desirable change.

The only concern I have is on the following:

  1. Guardrails for “bad” queries
query.max-scan-physical-bytes=2TB

An example (presumably) legitimate use case: big ad-hoc aggregations that would take 1-5 minutes in Presto but 30+ minutes in Spark. I do these when I need to get some numbers quick on, say, wmf_content.mediawiki_content_history_v1, which is one of our biggest tables. If this max-scan-physical-bytes setting will just fail my query then the value of the Presto cluster, for me, would go down, especially if it fails at runtime rather than at query planning?

Any of these settings are up for discussion, so we could definitely increase the number for the max-scan if not exclude the feature setting altogether. I wonder how much data your Presto queries actually end up scanning, would it be possible to check that? I think that information is available on the Presto UI stats page of the query.

Change #1285926 had a related patch set uploaded (by Aleksandar Mastilovic; author: Aleksandar Mastilovic):

[operations/puppet@production] Presto memory tuning, resource groups

https://gerrit.wikimedia.org/r/1285926