
Upgrade Hadoop test cluster to Bullseye
Closed, Resolved · Public · 5 Estimated Story Points

Authored By: EChetty
Feb 10 2023, 11:33 AM

Description

Hadoop-test cluster servers to be upgraded

  • hadoop-workers-test - 3 - cumin 'P{F:lsbdistcodename = buster} and A:hadoop-worker-test' - 2 out of 3 done
  • hadoop-coordinator-test - 1 - an-test-coord1001.eqiad.wmnet
  • hadoop-coordinator-standby-test - 1 - an-test-coord1002.eqiad.wmnet
  • hadoop-master-test - 1 - an-test-master1001.eqiad.wmnet
  • hadoop-standby-test - 1 - an-test-master1002.eqiad.wmnet
  • hadoop-client-test - 1 - an-test-client1001.eqiad.wmnet. Replace with an-test-client1002.eqiad.wmnet
  • Recommission an-test-worker1001, which is currently in the host exclude list

We need to make sure that these servers do not format various volumes during the reinstall.

  • an-test-worker100[1-3] - /srv/hadoop
  • an-test-coord1001 - /srv/
  • an-test-master1001 - /srv/

The contents of /home on an-test-client will be lost, so we should ask users whether they would like to back up anything before it is reinstalled.
Here are the largest home directories, according to sudo ncdu -x /home:

image.png (304×380 px, 27 KB)
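If users want a quick look without the interactive ncdu interface, a one-liner along these lines should produce a similar report (a sketch; the flags assume GNU coreutils):

```shell
# List the largest home directories, biggest first (non-interactive
# alternative to ncdu; numbers will differ per host)
sudo du -sh /home/* 2>/dev/null | sort -rh | head -n 10
```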

The way in which we configure the Debian installer not to format volumes is shown here:
https://phabricator.wikimedia.org/rOPUP8457d3f0007143f0772e9a8dae0b5d088c3d7978

All of the reuse-partition recipes should already be in place for all of the servers listed here:
https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/install_server/files/autoinstall/netboot.cfg$95

...but it's worth checking that they look good and work as expected.

There is an optional reuse-parts-test.cfg file that pauses the installer before committing the changes to disk. I'm not sure whether it makes sense to use it on these test servers, but it's worth knowing about.
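For reference, the generic debian-installer knobs involved look roughly like this. This is a hedged illustration of standard d-i preseed options only, not the actual contents of the reuse-parts recipes linked above:

```
# Illustrative preseed fragment only -- the real recipes live in
# modules/install_server/files/autoinstall/
d-i partman/early_command string ...   # site-specific logic marking partitions for reuse
d-i partman/confirm_nooverwrite boolean true
d-i partman/confirm boolean true
```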

These are the remaining errors running puppet on a Hadoop test worker:

  • Error: /Stage[main]/Profile::Python37/Package[python3.7]/ensure: change from 'purged' to 'present' failed
  • Error: /Stage[main]/Ores::Base/Package[enchant]/ensure: change from 'purged' to 'present' failed
  • Error: /Stage[main]/Ores::Base/Package[myspell-de-at]/ensure: change from 'purged' to 'present' failed - plus myspell-de-ch, myspell-de-de
  • Error: /Stage[main]/Conda_analytics/Package[conda-analytics]/ensure: change from 'purged' to 'present' failed
  • Error: /Stage[main]/Bigtop::Hive/File[/usr/lib/hive/bin/ext/hiveserver2.sh]/ensure: change from 'absent' to 'file' failed
  • Error: /Stage[main]/Profile::Hadoop::Spark2/Package[spark2]/ensure: change from 'purged' to 'present' failed
  • Error: /Stage[main]/Profile::Hadoop::Spark2/File[/etc/spark2/conf/hive-site.xml]/ensure: change from 'absent' to 'link' failed
  • Error: /Stage[main]/Bigtop::Hadoop::Nodemanager/Systemd::Service[hadoop-yarn-nodemanager]/Service[hadoop-yarn-nodemanager]/ensure: change from 'stopped' to 'running' failed

Event Timeline

There are a very large number of changes, so older changes are hidden.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-worker1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-worker1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-worker1003 (FAIL)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-worker1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-worker1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye

@MoritzMuehlenhoff this host is still not booting into PXE.

I've updated the BIOS, iDrac, and NIC to the latest versions.

image.png (891×1 px, 134 KB)

I can see that it's sending a DHCP request, but it doesn't seem to be getting through to install1004.

image.png (509×1 px, 111 KB)

I can keep investigating, but I thought I'd let you know.

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-worker1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Adding @Papaul - does that ring a bell? I think we had some systems recently where we could not use the most recent NIC firmware but were forced to use an older version. Is this maybe one of them?

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-worker1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

I tried downgrading the NIC firmware from 21.81.3 to 21.80.9 but that didn't solve the issue.

image.png (69×1 px, 28 KB)

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye

I tried downgrading again, from 21.80.8 to 21.60.16, but that didn't help either.

image.png (66×1 px, 27 KB)

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-worker1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

@jbond - I'm wondering if you might have any insight into why an-test-worker1003 seems so reluctant to get a DHCP address during PXE boot.
I've tried a variety of different NIC firmware versions and BIOS versions, but whatever I do it doesn't receive an address, so it just boots back into the existing OS after the cookbook removes it from the puppetdb.

If it's too much of a pain to investigate, I could just do an in-place dist-upgrade for this host instead. I'm beginning to wonder if it's an issue with the DHCP automation, or something related to the switches.
If you have any clues, I'd be grateful.

@BTullis if the server is not in production I can take a look.

Yes please, @Papaul - You can do whatever you like with the server. It's currently out of the puppetdb because the reimage cookbook failed. I can't get a DHCP response from install1004 for it.

@BTullis hey, I looked at the server yesterday and everything on the server looks good, so I'm working with the network team to see why the server is not getting any DHCP response. Will let you know.

@BTullis it looks like we found the issue. @cmooney has the fix at https://gerrit.wikimedia.org/r/c/operations/homer/public/+/936036, so I am waiting on the merge to re-test.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye

Awesome. Many thanks @Papaul and @cmooney - Reimage under way now.

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-worker1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye completed:

  • an-test-worker1003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307061449_btullis_3020294_an-test-worker1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

The hadoop-test workers are all upgraded to bullseye.

btullis@cumin1001:~$ sudo cumin 'P{F:lsbdistcodename = buster} and A:hadoop-worker-test'
No hosts found that matches the query

We can start the upgrade of the production workers now.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1070.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1070.eqiad.wmnet with OS bullseye completed:

  • analytics1070 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307171110_btullis_1698106_analytics1070.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1072.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1072.eqiad.wmnet with OS bullseye executed with errors:

  • analytics1072 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1072.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1072.eqiad.wmnet with OS bullseye completed:

  • analytics1072 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307171413_btullis_1734840_analytics1072.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-master1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-master1002.eqiad.wmnet with OS bullseye completed:

  • an-test-master1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308091349_btullis_1404631_an-test-master1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 947366 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove the manual check of reuse recipe on an-test-master hosts

https://gerrit.wikimedia.org/r/947366

Change 947366 merged by Btullis:

[operations/puppet@production] Remove the manual check of reuse recipe on an-test-master hosts

https://gerrit.wikimedia.org/r/947366

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-master1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-master1001.eqiad.wmnet with OS bullseye completed:

  • an-test-master1001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308091550_btullis_1426374_an-test-master1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

I've upgraded both an-test-master servers to bullseye and failed back to the an-test-master1001 as the active namenode.

Interestingly, all three of the workers are showing up in the default rack now.

btullis@an-test-master1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology
Rack: /default-rack
   10.64.36.111:50010 (an-test-worker1002.eqiad.wmnet)
   10.64.5.38:50010 (an-test-worker1001.eqiad.wmnet)
   10.64.53.21:50010 (an-test-worker1003.eqiad.wmnet)

I'll look into where this is going wrong.

Oh, pretty easy to see where it's going wrong.

btullis@an-test-master1002:/etc/hadoop/conf$ ./net-topology.sh an-test-worker1001.eqiad.wmnet
/usr/bin/env: ‘python’: No such file or directory

I know that @SLyngshede-WMF has been working on refactoring this script in https://gerrit.wikimedia.org/r/c/operations/puppet/+/929643 so I'm inclined to push forward on that work, instead of tinkering with the old version of the script.

There is also one Icinga check that appears not to be functioning correctly.

btullis@alert1001:~$ /usr/lib/nagios/plugins/check_nrpe -2 -u -H 10.64.5.39 -c check_hadoop-hdfs-active-namenode -t 10
NRPE: Unable to read output

Yet on the server, that NRPE command looks OK.

btullis@an-test-master1001:~$ cat /etc/nagios/nrpe.d/check_hadoop-hdfs-namenode.cfg
# File generated by puppet. DO NOT edit by hand
command[check_hadoop-hdfs-namenode]=/usr/lib/nagios/plugins/check_procs -c 1:1 -C java -a "org.apache.hadoop.hdfs.server.namenode.NameNode"
btullis@an-test-master1001:~$ /usr/lib/nagios/plugins/check_procs -c 1:1 -C java -a "org.apache.hadoop.hdfs.server.namenode.NameNode"
PROCS OK: 1 process with command name 'java', args 'org.apache.hadoop.hdfs.server.namenode.NameNode' | procs=1;;1:1;0;

Ah, my mistake. I had the wrong check. Now it's easy to see why the check is failing.

btullis@an-test-master1001:~$ cat /etc/nagios/nrpe.d/check_hadoop-hdfs-active-namenode.cfg 
# File generated by puppet. DO NOT edit by hand
command[check_hadoop-hdfs-active-namenode]=/usr/bin/sudo /usr/local/bin/kerberos-run-command hdfs /usr/local/bin/check_hdfs_active_namenode
btullis@an-test-master1001:~$ /usr/bin/sudo /usr/local/bin/kerberos-run-command hdfs /usr/local/bin/check_hdfs_active_namenode
/usr/bin/env: ‘python’: No such file or directory
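On bullseye there is no unversioned /usr/bin/python unless a python-is-python* package provides it, so any #!/usr/bin/env python shebang fails like this. The kind of one-line shebang change needed is sketched here, using a scratch file standing in for the real check script:

```shell
# Scratch file stands in for /usr/local/bin/check_hdfs_active_namenode
printf '#!/usr/bin/env python\nprint("ok")\n' > /tmp/check_sketch.py
# Rewrite the shebang to an interpreter that exists on bullseye
sed -i '1s|python$|python3|' /tmp/check_sketch.py
head -n 1 /tmp/check_sketch.py   # now reads '#!/usr/bin/env python3'
```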

Change 947421 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Use python3 for the check_hdfs_active_namenode script

https://gerrit.wikimedia.org/r/947421

Change 947421 merged by Btullis:

[operations/puppet@production] Use python3 for the check_hdfs_active_namenode script

https://gerrit.wikimedia.org/r/947421

Change 947811 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Temporarily disable the gobblin jobs on hadoop_test

https://gerrit.wikimedia.org/r/947811

Change 947812 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Re-enable the gobblin timers on hadoop_test

https://gerrit.wikimedia.org/r/947812

Change 947811 merged by Btullis:

[operations/puppet@production] Temporarily disable the gobblin jobs on hadoop_test

https://gerrit.wikimedia.org/r/947811

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-coord1001.eqiad.wmnet with OS bullseye

OK, there's still an error on an-test-coord1001 because of a conflict between python-is-python2 (which is added by hive) and python-is-python3 (which is added by presto-server).
I remember discussing it here: T336281#8847022, but I didn't come up with a proper solution.
Now I need to fix it properly.

Change 947824 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Don't install python-is-python3 to presto servers

https://gerrit.wikimedia.org/r/947824

Change 947824 merged by Btullis:

[operations/puppet@production] Don't install python-is-python3 to presto servers

https://gerrit.wikimedia.org/r/947824

Change 947826 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove settings relating to oozie on an-test-coord1001

https://gerrit.wikimedia.org/r/947826

Change 947826 merged by Btullis:

[operations/puppet@production] Remove settings relating to oozie on an-test-coord1001

https://gerrit.wikimedia.org/r/947826

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-coord1001.eqiad.wmnet with OS bullseye completed:

  • an-test-coord1001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308101148_btullis_1658804_an-test-coord1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Hive isn't happy with the MariaDB connector instead of the MySQL connector.
We knew that this was a possibility, but now this is confirmed.

I edited /etc/hive/conf/hive-site.xml and changed the following:

<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

to

<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.mariadb.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

Unfortunately, when I try to start the hive-metastore service I get:

2023-08-10T14:13:23,772  WARN [pool-10-thread-3] metastore.MetaStoreDirectSql: Self-test query [select "DB_ID" from "DBS"] failed; direct SQL is disabled
javax.jdo.JDODataStoreException: Error executing SQL query "select "DB_ID" from "DBS"".

and:

Caused by: java.sql.SQLSyntaxErrorException: (conn=27) You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '"DBS"' at line 1

It looks like I'm going to have to try forward-porting the MySQL connector again, which is what we did before for the buster upgrade.
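For context, the syntax error comes down to identifier quoting: Hive's direct SQL wraps identifiers in double quotes, which works via the MySQL driver (Hive enables the ANSI_QUOTES mode when it recognises a MySQL backend), but a stock MariaDB session treats double quotes as string delimiters. A hedged SQL illustration, assuming a metastore database with the standard DBS table; shown to explain the error, not as a recommended workaround:

```
-- Default MariaDB session: double quotes delimit strings, so this is a syntax error
select "DB_ID" from "DBS";

-- With ANSI_QUOTES, double quotes delimit identifiers and the query parses
SET SESSION sql_mode = CONCAT(@@sql_mode, ',ANSI_QUOTES');
select "DB_ID" from "DBS";
```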

Change 947857 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Create component/libmysql-java for bullseye

https://gerrit.wikimedia.org/r/947857

Change 947880 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Use the libmysql-java component on bullseye as well

https://gerrit.wikimedia.org/r/947880

Change 947881 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Use the libmariadb-java connector for sqoop

https://gerrit.wikimedia.org/r/947881

Change 947857 merged by Btullis:

[operations/puppet@production] Create component/libmysql-java for bullseye

https://gerrit.wikimedia.org/r/947857

I've now been able to copy the libmysql-java package to bullseye.

btullis@apt1001:~$ sudo -i reprepro -C component/libmysql-java copy bullseye-wikimedia buster-wikimedia libmysql-java
Exporting indices...
btullis@apt1001:~$ sudo -i reprepro -C component/libmysql-java list bullseye-wikimedia
bullseye-wikimedia|component/libmysql-java|amd64: libmysql-java 5.1.49-0+deb9u1
bullseye-wikimedia|component/libmysql-java|i386: libmysql-java 5.1.49-0+deb9u1

Change 947880 merged by Btullis:

[operations/puppet@production] Use the libmysql-java component on bullseye as well

https://gerrit.wikimedia.org/r/947880

Change 947812 merged by Btullis:

[operations/puppet@production] Re-enable the gobblin timers on hadoop_test

https://gerrit.wikimedia.org/r/947812

I think that this is all finished now, except for the issue with all servers appearing in the default rack, which we're going to fix with: https://gerrit.wikimedia.org/r/c/operations/puppet/+/929643 when @SLyngshede-WMF is around sometime. We won't be able to upgrade the production coordinators until this is fixed, but that's OK.

@BTullis I think that was one of my regressions with my updated script. I think I stripped out the /default/rack bit, but added it back in.

The old script has a slightly weird feature where it adds /{site}/default/rack when you pass it fewer than two hosts, so when you query using just one host it will always appear to be in the default rack. We can just remove that again.
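The quirk described above can be sketched like this (entirely hypothetical function name and rack mapping, just to illustrate the fewer-than-two-hosts special case):

```shell
# Hypothetical sketch of the old behaviour: with fewer than two arguments
# every host is reported in the site default rack (mapping is invented)
resolve_rack() {
  if [ "$#" -lt 2 ]; then
    for h in "$@"; do echo "$h /eqiad/default/rack"; done
  else
    for h in "$@"; do echo "$h /eqiad/A/5"; done
  fi
}
resolve_rack an-test-worker1001.eqiad.wmnet   # prints the default rack
```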

Thanks @SLyngshede-WMF - I don't think that change in behaviour matters much, to be honest.
It's more that the old script is failing (because there is no /usr/bin/python) after I upgraded the test cluster hadoop masters to bullseye.

btullis@an-test-master1001:~$ /etc/hadoop/conf.analytics-test-hadoop/net-topology.sh
/usr/bin/env: ‘python’: No such file or directory

Without this working, the default settings from hadoop seem to kick in, which gives us the following.

btullis@an-test-master1001:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology
Rack: /default-rack
   10.64.36.111:50010 (an-test-worker1002.eqiad.wmnet)
   10.64.5.38:50010 (an-test-worker1001.eqiad.wmnet)
   10.64.53.21:50010 (an-test-worker1003.eqiad.wmnet)

Our Icinga check still works in the same way, even with these default settings.

btullis@an-test-master1001:~$ sudo -u hdfs kerberos-run-command hdfs bash -x /usr/local/lib/nagios/plugins/check_hdfs_topology
+ hdfs dfsadmin -printTopology
+ egrep -q 'Rack:.*default.*'
+ '[' 0 -eq 1 ']'
+ echo 'CRITICAL: There is at least one node in the default rack.'
CRITICAL: There is at least one node in the default rack.
+ exit 2

I just thought it better to push on and get your updated script deployed, rather than go back and fix the old one.

Change 901670 abandoned by Btullis:

[operations/puppet@production] Upload the spark3-assemly file to HDFS on the test cluster

Reason:

Change of approach. We will be generating the assembly from GitLab-CI and uploading manually.

https://gerrit.wikimedia.org/r/901670

Change 956383 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Retain python2 on the test hadoop standby role

https://gerrit.wikimedia.org/r/956383

Change 956383 merged by Btullis:

[operations/puppet@production] Retain python2 on the test hadoop standby role

https://gerrit.wikimedia.org/r/956383

Change 957862 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Remove mention of an-test-client1001

https://gerrit.wikimedia.org/r/957862

Change 957862 merged by Stevemunene:

[operations/puppet@production] Remove mention of an-test-client1001

https://gerrit.wikimedia.org/r/957862