Add an-presto10[06-15] to the presto cluster
Open, Needs Triage · Public · 1 Estimated Story Points

Description

We have 10 presto servers that are ready to be added to the cluster. Their setup/racking task was: T306835
https://github.com/wikimedia/puppet/blob/production/manifests/site.pp#L173-L181

# Analytics Presto nodes.
node /^an-presto100[1-5]\.eqiad\.wmnet$/ {
    role(analytics_cluster::presto::server)
}

# New an-presto nodes in eqiad T306835
node /^an-presto10(0[6-9]|1[0-5])\.eqiad\.wmnet/ {
    role(insetup::data_engineering)
}

There were a couple of delays caused by:

  1. a dependency on having Hadoop packages ready for bullseye T310643
  2. an issue with the H750 RAID controller not working properly

However, both of these have been resolved, so adding these servers to the presto cluster should now only require an update to the site.pp file linked above.
This will triple our current presto compute capacity.
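
As a quick sanity check of the two node patterns quoted above, the following Python sketch enumerates the hostnames an-presto1001 through an-presto1015 and reports which pattern each one matches. Puppet evaluates node regexes with Ruby's engine rather than Python's; the sketch assumes the two behave the same for these simple patterns and single-line hostnames. The helper name `classify` is illustrative, and the capacity factor assumes every node contributes equally.

```python
import re

# Patterns copied from the site.pp excerpt above.
EXISTING = re.compile(r"^an-presto100[1-5]\.eqiad\.wmnet$")
NEW = re.compile(r"^an-presto10(0[6-9]|1[0-5])\.eqiad\.wmnet")


def classify(hostnames):
    """Split hostnames into those matched by the existing and the new pattern."""
    existing = [h for h in hostnames if EXISTING.search(h)]
    new = [h for h in hostnames if NEW.search(h)]
    return existing, new


hosts = [f"an-presto{n}.eqiad.wmnet" for n in range(1001, 1016)]
existing, new = classify(hosts)
total = len(existing) + len(new)
print(f"existing={len(existing)} new={len(new)} total={total}")
print(f"capacity factor by node count: {total / len(existing):.0f}x")
```

With 5 existing and 10 new nodes the total is 15, which is where the "triple" figure comes from.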

Event Timeline

BTullis set the point value for this task to 1. (Thu, Nov 24, 5:27 PM)
BTullis moved this task from Backlog to Shared Data Infra on the Data-Engineering-Planning board.
BTullis moved this task from Backlog to To be discussed on the Shared-Data-Infrastructure board.

Change 861368 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Add an-presto1006 to presto cluster

https://gerrit.wikimedia.org/r/861368

Change 862240 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[labs/private@master] Add dummy keytabs for new presto1006-1015 servers

https://gerrit.wikimedia.org/r/862240

Change 862240 merged by Stevemunene:

[labs/private@master] Add dummy keytabs for new presto1006-1015 servers

https://gerrit.wikimedia.org/r/862240

Change 861368 merged by Stevemunene:

[operations/puppet@production] Add an-presto1006 to presto cluster

https://gerrit.wikimedia.org/r/861368

We do not have the right build for bullseye, so we need to upgrade the packages for it. Here is a snippet from the logs.

Error: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/presto/jvm.config20221201-1884923-gv9eoa.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/presto/manifests/server.pp, line: 95)
Error: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/presto/jvm.config20221201-1884923-gv9eoa.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/presto/manifests/server.pp, line: 95)
Wrapped exception:
No such file or directory - A directory component in /etc/presto/jvm.config20221201-1884923-gv9eoa.lock does not exist or is a dangling symbolic link
Error: /Stage[main]/Presto::Server/File[/etc/presto/jvm.config]/ensure: change from 'absent' to 'file' failed: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/presto/jvm.config20221201-1884923-gv9eoa.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/presto/manifests/server.pp, line: 95)
Notice: /Stage[main]/Profile::Kerberos::Client/File[/etc/krb5.conf]/mode: mode changed '0644' to '0444' (corrective)
Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install presto-cli' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package presto-cli
Error: /Stage[main]/Presto::Server/Package[presto-cli]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install presto-cli' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package presto-cli
Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install presto-server' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package presto-server
Error: /Stage[main]/Presto::Server/Package[presto-server]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install presto-server' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package presto-server
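
To summarise a log like this, a small Python sketch can pull out the package names that apt could not locate. This is only an illustration: the `LOG` sample contains just the four `E:` lines from the snippet above, and the function name `missing_packages` is made up for this example.

```python
import re

# The four "E:" lines from the Puppet log above, in their original order.
LOG = """\
E: Unable to locate package presto-cli
E: Unable to locate package presto-cli
E: Unable to locate package presto-server
E: Unable to locate package presto-server
"""


def missing_packages(log_text):
    """Return the sorted, de-duplicated package names apt could not locate."""
    pattern = re.compile(r"^E: Unable to locate package (\S+)$", re.MULTILINE)
    return sorted(set(pattern.findall(log_text)))


print(missing_packages(LOG))  # ['presto-cli', 'presto-server']
```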

I think that we're in luck here, because the presto debs that we created are not compiled for a specific operating system. They use the binary tarballs, which contain jar files.
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/presto/+/refs/heads/debian/debian/README.Debian#5

Therefore I believe that we can use the following process to make them available on bullseye:
https://wikitech.wikimedia.org/wiki/Reprepro#Copying_between_distributions

I'll tag @elukey and @Ottomata for visibility, but I can't personally see any problems with this approach.

Here are the versions of the presto package, but note that this is only a source package.

btullis@apt1001:~$ sudo -i reprepro ls presto
presto | 0.246-wmf-1 | stretch-wikimedia | source
presto |   0.273.3-1 |  buster-wikimedia | source

We can check to see which binary packages are created from that source package.

btullis@apt1001:~$ sudo -i reprepro listfilter buster-wikimedia 'Source ( == presto)'
buster-wikimedia|main|amd64: presto-cli 0.273.3-1
buster-wikimedia|main|amd64: presto-server 0.273.3-1
buster-wikimedia|main|i386: presto-cli 0.273.3-1
buster-wikimedia|main|i386: presto-server 0.273.3-1

That's presto-cli and presto-server, which matches your output above @Stevemunene.

I feel confident enough that we can copy these packages to the bullseye-wikimedia distribution.

One final check to make sure that the packages are not already present in the destination repo.

btullis@apt1001:~$ sudo -i reprepro listmatched bullseye-wikimedia presto*
btullis@apt1001:~$

Run the command to copy them and re-check the output.

btullis@apt1001:~$ sudo -i reprepro copysrc bullseye-wikimedia buster-wikimedia presto
Exporting indices...
btullis@apt1001:~$ sudo -i reprepro listmatched bullseye-wikimedia presto*
bullseye-wikimedia|main|amd64: presto-cli 0.273.3-1
bullseye-wikimedia|main|amd64: presto-server 0.273.3-1
bullseye-wikimedia|main|i386: presto-cli 0.273.3-1
bullseye-wikimedia|main|i386: presto-server 0.273.3-1
bullseye-wikimedia|main|source: presto 0.273.3-1
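
The before/after comparison can also be done mechanically. The Python sketch below parses the two listings shown above (pasted in as strings) and confirms that every binary package and version from buster-wikimedia is now present in bullseye-wikimedia. The `dist|component|arch: name version` layout is inferred from the output in this task, and `parse_listing` is a hypothetical helper, not part of reprepro.

```python
# Output copied from the reprepro listfilter and listmatched runs above.
BUSTER = """\
buster-wikimedia|main|amd64: presto-cli 0.273.3-1
buster-wikimedia|main|amd64: presto-server 0.273.3-1
buster-wikimedia|main|i386: presto-cli 0.273.3-1
buster-wikimedia|main|i386: presto-server 0.273.3-1
"""

BULLSEYE = """\
bullseye-wikimedia|main|amd64: presto-cli 0.273.3-1
bullseye-wikimedia|main|amd64: presto-server 0.273.3-1
bullseye-wikimedia|main|i386: presto-cli 0.273.3-1
bullseye-wikimedia|main|i386: presto-server 0.273.3-1
bullseye-wikimedia|main|source: presto 0.273.3-1
"""


def parse_listing(text):
    """Parse 'dist|component|arch: name version' lines into (arch, name, version) tuples."""
    entries = set()
    for line in text.strip().splitlines():
        location, _, package = line.partition(": ")
        _dist, _component, arch = location.split("|")
        name, version = package.split()
        entries.add((arch, name, version))
    return entries


buster = parse_listing(BUSTER)
bullseye = parse_listing(BULLSEYE)
missing = buster - bullseye
extra = bullseye - buster
print("missing from bullseye-wikimedia:", sorted(missing))
print("only in the bullseye-wikimedia listing:", sorted(extra))
```

Nothing is missing. The only extra entry is the source package, which shows up because the second listing was produced with listmatched rather than the Source filter.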

@Stevemunene - you should now be able to run puppet again on an-presto1006 and check whether the packages are installed.

Starting presto-server fails on the Debian 11 boxes because of a Python issue: a dependency requires the unversioned /usr/bin/python.

image.png (543×1 px, 217 KB)
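
For anyone wanting to check a host before starting the service, here is a minimal Python sketch that reports whether the unversioned interpreter path exists and what it resolves to. On Debian 11 the python-is-python3 package provides /usr/bin/python as a symlink to python3. The function name `describe_interpreter` is purely illustrative.

```python
import os


def describe_interpreter(path="/usr/bin/python"):
    """Report whether `path` exists and, if so, what it resolves to."""
    if not os.path.lexists(path):
        return f"{path} is missing"
    return f"{path} -> {os.path.realpath(path)}"


print(describe_interpreter())
```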

I installed the python-is-python3 package, after which the presto dependency was able to compile.
However, presto-server.service still could not start because of a permission issue:

Dec 02 11:18:08 an-presto1006 presto-server[2138247]: ERROR: [Errno 13] Permission denied: '/srv/presto/var/run'

The directory /srv/presto/var/run was owned by root:root instead of presto:presto. This was fixed by deleting its parent directory with sudo rm -rf /srv/presto/var

A subsequent start of presto-server.service recreated the directories with the right ownership:

stevemunene@an-presto1006:~$ ls -lrth /srv/presto/var
total 8.0K
drwxr-xr-x 2 presto presto 4.0K Dec  2 12:05 run
drwxr-xr-x 2 presto presto 4.0K Dec  2 12:05 log
stevemunene@an-presto1006:~$
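
The same ownership check could be scripted for the remaining hosts. The following Python sketch lists any of the presto runtime directories that exist but are not owned by presto:presto. The directory names come from the ls output above; the helper names `owner_of` and `wrong_owner` are illustrative, and paths that do not exist yet are skipped silently.

```python
import grp
import os
import pwd


def owner_of(path):
    """Return 'user:group' for path, falling back to numeric ids."""
    st = os.stat(path)
    try:
        user = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        user = str(st.st_uid)
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)
    return f"{user}:{group}"


def wrong_owner(paths, expected="presto:presto"):
    """Map each existing path whose owner differs from `expected` to its owner."""
    result = {}
    for path in paths:
        if os.path.exists(path):
            owner = owner_of(path)
            if owner != expected:
                result[path] = owner
    return result


# Directories taken from the ls output above.
paths = ["/srv/presto/var", "/srv/presto/var/run", "/srv/presto/var/log"]
for path, owner in wrong_owner(paths).items():
    print(f"{path} is owned by {owner}, expected presto:presto")
```

On a healthy host this prints nothing.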

With that, an-presto1006 was successfully added to the presto cluster. Moving on to the next server, this time with the python-is-python3 package preinstalled.

Change 863327 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Add an-presto1007 to presto cluster

https://gerrit.wikimedia.org/r/863327

Change 863327 merged by Stevemunene:

[operations/puppet@production] Add an-presto1007 to presto cluster

https://gerrit.wikimedia.org/r/863327

@Stevemunene nice job! One suggestion: if possible, let's incorporate the fixes that you applied manually into puppet, so that future deployments won't require manual steps. During the lifetime of a host we reimage it to upgrade the OS etc., and remembering all the manual steps done is basically impossible :)

python-is-python3

Oh, cool! I wonder if we can get away with installing this everywhere by default?

@Ottomata This might affect the rare packages that still use python2, or deployments that have already set up their own symlinks to python3. We should discuss the potential implications for existing deployments further.

Mentioned in SAL (#wikimedia-analytics) [2022-12-05T11:45:43Z] <steve_munene> restarting presto-server.service on an-presto1007 T323783

an-presto1007 is now part of the cluster. The delay in joining the cluster was caused by the timing between the puppet run and the addition of the related keytabs.
Next: setting up the python-is-python3 automation for the presto servers, then proceeding with adding an-presto1008-15 to the cluster.

Change 864895 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Add python-is-python3 package

https://gerrit.wikimedia.org/r/864895

Change 864895 merged by Stevemunene:

[operations/puppet@production] Add python-is-python3 package

https://gerrit.wikimedia.org/r/864895

Change 865043 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Add an-presto1008-1015 to presto cluster

https://gerrit.wikimedia.org/r/865043