
Add an-presto10[06-15] to the presto cluster
Closed, Resolved · Public · 5 Estimated Story Points

Description

We have 10 presto servers that are ready to be added to the cluster. Their setup/racking task was: T306835
https://github.com/wikimedia/puppet/blob/production/manifests/site.pp#L173-L181

# Analytics Presto nodes.
node /^an-presto100[1-5]\.eqiad\.wmnet$/ {
    role(analytics_cluster::presto::server)
}

# New an-presto nodes in eqiad T306835
node /^an-presto10(0[6-9]|1[0-5])\.eqiad\.wmnet/ {
    role(insetup::data_engineering)
}

There were a couple of delays caused by:

  1. a dependency on having Hadoop packages ready for bullseye T310643
  2. an issue with getting the H750 RAID controller to work properly

However, these have both been resolved, so adding these servers to the presto cluster should now just be a matter of updating the site.pp file linked above.
This will triple our current presto compute capacity.
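As a quick sanity check, the new node regex can be exercised from a shell to confirm it matches all ten hostnames (purely illustrative):

for i in $(seq -w 06 15); do echo "an-presto10${i}.eqiad.wmnet"; done \
  | grep -cE '^an-presto10(0[6-9]|1[0-5])\.eqiad\.wmnet'
# prints 10 if every host from an-presto1006 to an-presto1015 matches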

Event Timeline

BTullis set the point value for this task to 1. Nov 24 2022, 5:27 PM
BTullis moved this task from Backlog to Shared Data Infra on the Data-Engineering-Planning board.
BTullis moved this task from Backlog to To be discussed on the Shared-Data-Infrastructure board.

Change 861368 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Add an-presto1006 to presto cluster

https://gerrit.wikimedia.org/r/861368

Change 862240 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[labs/private@master] Add dummy keytabs for new presto1006-1015 servers

https://gerrit.wikimedia.org/r/862240

Change 862240 merged by Stevemunene:

[labs/private@master] Add dummy keytabs for new presto1006-1015 servers

https://gerrit.wikimedia.org/r/862240

Change 861368 merged by Stevemunene:

[operations/puppet@production] Add an-presto1006 to presto cluster

https://gerrit.wikimedia.org/r/861368

We do not have the right build for bullseye, so we need to upgrade the packages for it. Here is a snippet from the logs:

Error: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/presto/jvm.config20221201-1884923-gv9eoa.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/presto/manifests/server.pp, line: 95)
Error: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/presto/jvm.config20221201-1884923-gv9eoa.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/presto/manifests/server.pp, line: 95)
Wrapped exception:
No such file or directory - A directory component in /etc/presto/jvm.config20221201-1884923-gv9eoa.lock does not exist or is a dangling symbolic link
Error: /Stage[main]/Presto::Server/File[/etc/presto/jvm.config]/ensure: change from 'absent' to 'file' failed: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/presto/jvm.config20221201-1884923-gv9eoa.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/presto/manifests/server.pp, line: 95)
Notice: /Stage[main]/Profile::Kerberos::Client/File[/etc/krb5.conf]/mode: mode changed '0644' to '0444' (corrective)
Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install presto-cli' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package presto-cli
Error: /Stage[main]/Presto::Server/Package[presto-cli]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install presto-cli' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package presto-cli
Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install presto-server' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package presto-server
Error: /Stage[main]/Presto::Server/Package[presto-server]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install presto-server' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package presto-server

I think that we're in luck here, because the presto debs that we created are not compiled for a specific operating system. They use the binary tarballs, which contain jar files.
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/presto/+/refs/heads/debian/debian/README.Debian#5

Therefore I believe that we can use the following process to make them available on bullseye:
https://wikitech.wikimedia.org/wiki/Reprepro#Copying_between_distributions

I'll tag @elukey and @Ottomata for visibility, but I can't personally see any problems with this approach.

Here are the versions of the presto package, but note that these are only the source packages.

btullis@apt1001:~$ sudo -i reprepro ls presto
presto | 0.246-wmf-1 | stretch-wikimedia | source
presto |   0.273.3-1 |  buster-wikimedia | source

We can check to see which binary packages are created from that source package.

btullis@apt1001:~$ sudo -i reprepro listfilter buster-wikimedia 'Source ( == presto)'
buster-wikimedia|main|amd64: presto-cli 0.273.3-1
buster-wikimedia|main|amd64: presto-server 0.273.3-1
buster-wikimedia|main|i386: presto-cli 0.273.3-1
buster-wikimedia|main|i386: presto-server 0.273.3-1

That's presto-cli and presto-server, which matches your output above @Stevemunene.

I feel confident enough that we can copy these packages to the bullseye-wikimedia distribution.

One final check to make sure that the packages are not already present in the destination repo:

btullis@apt1001:~$ sudo -i reprepro listmatched bullseye-wikimedia presto*
btullis@apt1001:~$

Run the command to copy them and re-check the output:

btullis@apt1001:~$ sudo -i reprepro copysrc bullseye-wikimedia buster-wikimedia presto
Exporting indices...
btullis@apt1001:~$ sudo -i reprepro listmatched bullseye-wikimedia presto*
bullseye-wikimedia|main|amd64: presto-cli 0.273.3-1
bullseye-wikimedia|main|amd64: presto-server 0.273.3-1
bullseye-wikimedia|main|i386: presto-cli 0.273.3-1
bullseye-wikimedia|main|i386: presto-server 0.273.3-1
bullseye-wikimedia|main|source: presto 0.273.3-1

@Stevemunene - you should be able to run puppet again on an-presto1006 and see if the packages are installed.
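Something along these lines should confirm it (the exact commands are just a suggestion):

sudo run-puppet-agent                        # retry the previously failed package installs
apt-cache policy presto-server presto-cli    # should now show 0.273.3-1 from bullseye-wikimedia
dpkg -l 'presto-*'                           # check that both packages ended up installed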

The presto-server service fails to run on the Debian 11 boxes due to a Python issue: a dependency requires the unversioned /usr/bin/python, which is not present by default on bullseye.

image.png (217 KB)

Installed the python-is-python3 package and the presto dependency was then able to compile.
However, presto-server.service still could not start because of a permission issue:

Dec 02 11:18:08 an-presto1006 presto-server[2138247]: ERROR: [Errno 13] Permission denied: '/srv/presto/var/run'

The directory /srv/presto/var/run was owned by root:root instead of presto:presto. This was fixed by deleting the directory with sudo rm -rf /srv/presto/var.

A subsequent start of presto-server.service regenerated the directory with the right permissions:

stevemunene@an-presto1006:~$ ls -lrth /srv/presto/var
total 8.0K
drwxr-xr-x 2 presto presto 4.0K Dec  2 12:05 run
drwxr-xr-x 2 presto presto 4.0K Dec  2 12:05 log
stevemunene@an-presto1006:~$
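For future reference, the manual remediation on an-presto1006 amounts to roughly the following (a sketch of the steps described above, not a script that was run verbatim):

sudo apt-get install -y python-is-python3    # provide the unversioned /usr/bin/python that the dependency wants
sudo rm -rf /srv/presto/var                  # drop the root-owned directory so the service can recreate it
sudo systemctl start presto-server.service   # recreates /srv/presto/var/{run,log} as presto:presto
sudo journalctl -u presto-server.service -n 20 --no-pager   # confirm the service came up cleanly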

Thus an-presto1006 was successfully added to the presto cluster. Moving on to the next server, this time with the python-is-python3 package preinstalled.

Change 863327 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Add an-presto1007 to presto cluster

https://gerrit.wikimedia.org/r/863327

Change 863327 merged by Stevemunene:

[operations/puppet@production] Add an-presto1007 to presto cluster

https://gerrit.wikimedia.org/r/863327

@Stevemunene nice job! One suggestion - if possible let's incorporate the fixes that you applied manually into Puppet, so that future deployments won't require manual steps. During the lifetime of a host we reimage it to upgrade the OS etc., and remembering all the manual steps that were done is basically impossible :)

python-is-python3

Oh, cool! I wonder if we can get away with installing this everywhere by default?

@Ottomata This might affect the rare packages still using python2, or deployments that have already set up their own symlinks to python3. We should discuss the potential implications for the deployments already done.
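A few quick checks could inform that discussion; these are only illustrative and the state will vary per host:

readlink -f /usr/bin/python 2>/dev/null || echo "no /usr/bin/python on this host"
dpkg -S /usr/bin/python 2>/dev/null          # which package, if any, currently owns the symlink
dpkg -l | grep -E '^ii  python2' || echo "no python2-era packages installed"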

Mentioned in SAL (#wikimedia-analytics) [2022-12-05T11:45:43Z] <steve_munene> restarting presto-server.service on an-presto1007 T323783

an-presto1007 is now part of the cluster. The delay in joining was caused by the timing between the puppet run and the addition of the related keytabs.
Next: setting up the python-is-python3 automation for the presto servers, then proceeding with adding an-presto1008-15 to the cluster.
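Incidentally, cluster membership can be confirmed straight from Presto's system tables. A sketch (the coordinator URL and port here are assumptions, and Kerberos authentication may be required depending on where it is run from):

presto --server https://an-coord1001.eqiad.wmnet:8281 \
  --execute "SELECT node_id, http_uri, state FROM system.runtime.nodes"
# the new worker should be listed with state 'active'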

Change 864895 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Add python-is-python3 package

https://gerrit.wikimedia.org/r/864895

Change 864895 merged by Stevemunene:

[operations/puppet@production] Add python-is-python3 package

https://gerrit.wikimedia.org/r/864895

Change 865043 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Add an-presto1008-1015 to presto cluster

https://gerrit.wikimedia.org/r/865043

Change 865043 merged by Stevemunene:

[operations/puppet@production] Add an-presto1008-1015 to presto cluster

https://gerrit.wikimedia.org/r/865043

This is done, all servers successfully joined the cluster.

image.png (149 KB)

This change has been reverted and we are back to the original an-presto100[1-5], due to an incident where the Superset Presto backend was unable to access some charts.
More details are available in T325331.

Change 870819 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Disable the presto server on the 10 new hosts

https://gerrit.wikimedia.org/r/870819

Change 870819 merged by Btullis:

[operations/puppet@production] Disable the presto server on the 10 new hosts

https://gerrit.wikimedia.org/r/870819

Shall we move this back to in-progress, @Stevemunene? Have we got any theories as to why the cluster was less stable with more workers?
I think that @JAllemandou said he had an idea of why it might be.

BTullis triaged this task as Medium priority. Jan 18 2023, 4:06 PM

Found a discussion on the Presto GitHub revolving around a similar issue: the number of worker nodes that a cluster can support is limited by the resources (CPU and memory) available on the coordinator.
This was first discussed in Scale to larger clusters #10174, with the general suggestion being to scale the coordinator.
The proposed remedy was the Presto Disaggregated Coordinator, discussed in Scaling The Presto Coordinator #13814 and in Design Disaggregated Presto Coordinators #15453, which provides a design for the feature. Also mentioned was an error similar to the one we were facing: "In certain high QPS use cases, we have found that workers can become starved of splits, by excessive CPU being spent on task updates. This bottleneck in the coordinator is alleviated by reducing the concurrency, but this leaves the cluster under-utilized."
"Furthermore, because of the de-facto constraint that there must be one coordinator per cluster, this limits the size of the worker pool to whatever number can handle the QPS from conditions 1 and 2. This means it's very difficult to deploy large high-QPS clusters that can take on queries of moderate complexity (such as high stage count queries)." This activity can be seen by viewing the Presto stats (running queries vs abandoned queries) around the time T325331 was brought to attention.

The proposal was to split the single coordinator into multiple coordinators, with each coordinator taking a subset of the queries for a single cluster. The idea is to extract resource scheduling from the coordinator and provide it as a service for Presto; users can still deploy a resource manager and a query coordinator as a single JVM for small deployments. This is implemented via the Presto Disaggregated Coordinator mentioned above and should help solve our current issue. The feature is live, and some recommended configurations, as well as more details on the implementation, are available in the Disaggregated Coordinator documentation.

Thanks @Stevemunene - that's good research. However, I'm not sure that it's totally applicable to our situation, as they're talking about large Presto clusters. For example, #13814 mentions:

Coordinators can barely scale to around 800 to 1000 workers on a Hardware with 40 core CPU and 256GB RAM.

In our case we currently have 5 active workers and it becomes very unstable at 10 workers, then totally broken as we get towards 15 workers. That is well below the numbers that they're talking about.

Also, the disaggregated coordinator design #15453 talks about:

large, high-QPS clusters

Well, I would not say that we are in a high QPS environment. Most of the time our presto cluster is doing nothing. In the last 24 hours it hasn't gone above 25 QPS, and there were still some failures, at up to 8 per second, even with five workers.

image.png (94 KB)

Also the coordinator (an-coord1001) doesn't seem to be particularly stressed.

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-coord1001&var-datasource=thanos&var-cluster=analytics

image.png (157 KB)

So I'm wondering if it's something more fundamental about our cluster that is the issue. We already tried adding more Java heap to the coordinator process (https://gerrit.wikimedia.org/r/c/869214) but it didn't make any significant difference, as far as we could tell.

Not sure where else to look in the short term. Maybe there would be someone on their Slack who might have some ideas? https://prestodb.io/community.html

EChetty changed the point value for this task from 1 to 5. Jan 25 2023, 1:38 PM
EChetty added a subscriber: EChetty.

TB: End of this sprint. (End of next week)

EChetty raised the priority of this task from Medium to High. Jan 25 2023, 1:41 PM

Here are some configuration parameters that we believe are going to be useful for testing this:

From this page: https://aws.amazon.com/blogs/big-data/top-9-performance-tuning-tips-for-prestodb-on-amazon-emr/

  • task.concurrency = number of vCPUs per worker = 48 (12 cores x 2 CPUs x 2 with hyperthreading)
  • query.max-concurrent-queries <- not sure if this is available in our version
  • task.max-worker-threads = node vCPUs * 4 (i.e. 192) - default is number of node vCPUs * 2
  • node-scheduler.max-splits-per-node = 500 (default 100)
  • task.http-response-threads = 5000 (default 100)
  • node-scheduler.max-splits-per-node = 150 (default 100)

Let's try with 500 as shown in the code example instead of 150 please :)

Also

  • query.max-total-memory-per-node: increase from 24 GB to 40 GB
  • query.max-memory-per-node: increase from 12 GB to 20 GB
  • query.max-memory: increase from 62 GB to 200 GB (40 GB max memory per node x 5 nodes). When we increase to 15 nodes, increase this value to 600 GB.
  • node-scheduler.max-splits-per-node = 150 (default 100)

Let's try with 500 as shown in the code example instead of 150 please :)

Agreed! I have edited the list above with the new value of 500.

Change 883583 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Tuning Presto Config for scaling

https://gerrit.wikimedia.org/r/883583

Change 883583 merged by Stevemunene:

[operations/puppet@production] Tuning Presto Config for scaling

https://gerrit.wikimedia.org/r/883583

Updated the configs for the Presto server acting as coordinator and the Presto servers acting as worker nodes with the tuning values below:

query.max-memory: 200GB
query.max-memory-per-node: 20GB
query.max-total-memory-per-node: 40GB
task.concurrency: 48
# task.max-worker-threads is the Node vCPUs * 4
task.max-worker-threads: 192
node-scheduler.max-splits-per-node: 500
task.http-response-threads: 5000

After review, the changes were merged to production and the puppet agent was run (sudo run-puppet-agent) on the relevant hosts: an-coord1001 - Presto server (acting as coordinator), and an-presto100[1-5] - Presto servers (acting as workers).
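For the record, this kind of rollout can be driven from a cumin host; a hedged sketch (host selectors and commands are illustrative):

sudo cumin 'an-coord1001.eqiad.wmnet' 'run-puppet-agent'
sudo cumin 'an-presto100[1-5].eqiad.wmnet' 'run-puppet-agent'
# the coordinator needs a service restart to pick up the new properties (see the SAL entry below)
sudo cumin 'an-coord1001.eqiad.wmnet' 'systemctl restart presto-server.service'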

Mentioned in SAL (#wikimedia-analytics) [2023-01-25T16:54:57Z] <steve_munene> Restarting presto-server.service on presto coordinator an-coord1001 for T323783

Change 883628 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Reduce the presto task concurrency from 48 to 32

https://gerrit.wikimedia.org/r/883628

Change 883628 merged by Btullis:

[operations/puppet@production] Reduce the presto task concurrency from 48 to 32

https://gerrit.wikimedia.org/r/883628

Change 883642 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Increase the presto cluster size to 15 hosts again

https://gerrit.wikimedia.org/r/883642

Change 883642 merged by Btullis:

[operations/puppet@production] Increase the presto cluster size to 15 hosts again

https://gerrit.wikimedia.org/r/883642

Other ideas of things to tweak:

Mentioned in SAL (#wikimedia-analytics) [2023-01-30T16:41:44Z] <btullis> started an-presto1006-1015 again, but disabled the presto service on them once again T323783 and T325809

I think that we should resolve this ticket and carry out the problem-solving in T325809: Presto is unstable with more than 5 worker nodes instead.

These servers have been added to the cluster and removed several times, so it's the instability problem that remains to be solved.