This ticket will track the upgrade of the analytics Hadoop workers to bullseye.
There are currently 91 hosts in this cluster, although at the time of writing, 6 are due to be decommissioned.
BTullis | Mar 20 2023, 12:36 PM
[Attached screenshots: F37730472, F37727493, F37721279, F37667677, F37656999, F37652889, F37652969, F37626140 (Aug 24 2023 to Sep 18 2023)]
| Status | Assigned | Task |
|---|---|---|
| Open | None | T291916 Tracking task for Bullseye migrations in production |
| Resolved | BTullis | T288804 Upgrade the Data Engineering infrastructure to Debian Bullseye |
| Resolved | Stevemunene | T332570 Upgrade hadoop workers to bullseye |
an-worker1117 is stuck at install with the error "no root filesystem is defined". Looking into this.
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1117.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1117.eqiad.wmnet with OS bullseye completed:
Looking into this: we recently changed the partman recipe for an-worker1117 to reuse-analytics-hadoop-worker-12dev.cfg, which at #L15-L17 expects to find 3 partitions: root, journalnode, and swap. However, because the recent reimage failed while an-worker1117 was pointing at the wrong partition, the journalnode partition was not available, as seen below.
```
@an-worker1117:~# lsblk
NAME                          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                             8:0    0 446.6G  0 disk
├─sda1                          8:1    0   953M  0 part /boot
├─sda2                          8:2    0     1K  0 part
└─sda5                          8:5    0 445.7G  0 part
  ├─an--worker1117--vg-swap   254:0    0   9.3G  0 lvm  [SWAP]
  ├─an--worker1117--vg-root   254:1    0  55.9G  0 lvm  /
  └─an--worker1117--vg-unused 254:2    0 291.4G  0 lvm
```
This was fixed by running the script available at #Worker_Nodes to remove the unused partition and create the journalnode partition. The result is shown below:
```
@an-worker1117:~# lsblk
NAME                               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                  8:0    0 446.6G  0 disk
├─sda1                               8:1    0   953M  0 part /boot
├─sda2                               8:2    0     1K  0 part
└─sda5                               8:5    0 445.7G  0 part
  ├─an--worker1117--vg-swap        254:0    0   9.3G  0 lvm  [SWAP]
  ├─an--worker1117--vg-root        254:1    0  55.9G  0 lvm  /
  └─an--worker1117--vg-journalnode 254:2    0    10G  0 lvm
```
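The relevant part of that fix can be sketched as a dry-run helper that only prints the LVM commands involved (the helper name and structure are illustrative, not the actual #Worker_Nodes script; the VG name is taken from an-worker1117 above):

```shell
#!/bin/bash
# Dry-run sketch of the journalnode fix: print (rather than execute) the
# LVM commands that remove the stray 'unused' LV and recreate the 10G
# journalnode LV. Illustrative only; not the real script.
print_journalnode_fix() {
  local vg="$1"
  echo "lvremove --yes ${vg}/unused"
  echo "lvcreate -L 10G -n journalnode ${vg}"
}

print_journalnode_fix an-worker1117-vg
```

Printing instead of executing keeps the sketch safe to run anywhere; on a real host you would run the underlying `lvremove`/`lvcreate` commands as root.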
Retried the reimage with sudo cookbook sre.hosts.reimage --os bullseye -t T332570 an-worker1117 --new, since the host had disappeared from PuppetDB after being down for too long. The reimage was successful; moving on to an-worker1118 onwards.
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1118.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1119.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1118.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1119.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1120.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1121.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1120.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1121.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1122.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1123.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1123.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1122.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1124.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1125.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1124.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1125.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1126.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1127.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1126.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1127.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1128.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1129.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1129.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1128.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1129.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1129.eqiad.wmnet with OS bullseye executed with errors:
The sre.hosts.downtime cookbook failed in the second-to-last step of the reimage with:
Error: Could not prepare for execution: The puppet agent command does not take parameters
================
PASS | | 0% (0/1) [01:11<?, ?hosts/s]
FAIL |█████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [01:11<00:00, 71.75s/hosts]
100.0% (1/1) of nodes failed to execute command 'run-puppet-agent...et --attempts 60': alert1001.wikimedia.org
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'run-puppet-agent...et --attempts 60'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Exception raised while executing cookbook sre.hosts.downtime:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 212, in run
raw_ret = runner.run()
File "/srv/deployment/spicerack/cookbooks/sre/hosts/downtime.py", line 116, in run
self.puppet.run(quiet=True, attempts=60, timeout=600)
File "/usr/lib/python3/dist-packages/spicerack/puppet.py", line 200, in run
self._remote_hosts.run_sync(Command(command, timeout=timeout), batch_size=batch_size)
File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 496, in run_sync
return self._execute(
File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 702, in _execute
raise RemoteExecutionError(ret, "Cumin execution failed", worker.get_results())
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1129.eqiad.wmnet with reason: host reimage
Unable to downtime the new host on Icinga/Alertmanager; the sre.hosts.downtime cookbook returned 99. The host seems to be back in Icinga after a while; monitoring for any abnormalities.
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1130.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1131.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1130.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1131.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye executed with errors:
an-worker1132 seems to be stuck on the Debian install, as seen below. Power cycling the server and retrying the reimage.
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye
The error I see is this:
I also logged in with sudo install_console an-worker1132.eqiad.wmnet and verified that there is no journalnode volume.
```
~ # lvs
  LV     VG               Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root   an-worker1132-vg -wi-a-----  <55.88g
  swap   an-worker1132-vg -wi-a-----    9.31g
  unused an-worker1132-vg -wi-a----- <291.36g
~ # vgs
  VG               #PV #LV #SN Attr   VSize    VFree
  an-worker1132-vg   1   3   0 wz--n- <445.69g 89.14g
```
There is a script referenced here, which has some commands for manually creating a journalnode, if required.
https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Standard_Worker_Installation
I'll run the steps manually and retry the reimage.
I have executed:
lvcreate -L 10g -n journalnode an-worker1132-vg
Now we can see that there is a 10 GB journalnode volume.
```
# lvs
  LV          VG               Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  journalnode an-worker1132-vg -wi-a-----  10.00g
  root        an-worker1132-vg -wi-a----- <55.88g
  swap        an-worker1132-vg -wi-a-----   9.31g
```
I'll try the reimage again.
The installation looks to be proceeding as expected now. I will check the other nodes to see if any others will experience the same issue.
I ran sudo cumin A:hadoop-worker "lvs | grep journalnode" from cumin1001, and it looks like this is the only host affected by this issue. There are a few discrepancies in the VG name or size, which suggests they were caused by manual configuration in the past, but nothing serious.
```
1.3% (1/78) of nodes failed to execute command 'lvs | grep journalnode': an-worker1132.eqiad.wmnet
98.7% (77/78) success ratio (< 100.0% threshold) for command: 'lvs | grep journalnode'. Aborting.: an-worker[1078-1095,1097-1131,1133-1148].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet
98.7% (77/78) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.: an-worker[1078-1095,1097-1131,1133-1148].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet
```
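Locally, the per-host side of that check can be sketched as a small filter over LV names (a hypothetical helper, not part of the cookbook; in practice it would read `lvs --noheadings -o lv_name` output):

```shell
#!/bin/bash
# Hypothetical helper: report whether a journalnode LV appears among the
# LV names read on stdin (one per line).
has_journalnode() {
  grep -qw journalnode && echo present || echo missing
}

# Example with the LVs seen on an-worker1132 before the fix:
printf 'root\nswap\nunused\n' | has_journalnode   # → missing
```

A `missing` result on a worker would mean it needs the same journalnode LV created before reimaging with the reuse recipe.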
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1133.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1134.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1133.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1134.eqiad.wmnet with OS bullseye completed:
Seeing some HDFS corrupt blocks from 2023-09-07 10:03 UTC in Grafana.
A quick check on the master nodes shows 0 corrupt files:
```
@an-master1001:~$ sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks
Connecting to namenode via https://an-master1001.eqiad.wmnet:50470/fsck?ugi=hdfs&listcorruptfileblocks=1&path=%2F
The filesystem under path '/' has 0 CORRUPT files
```
```
@an-master1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks
Connecting to namenode via https://an-master1001.eqiad.wmnet:50470/fsck?ugi=hdfs&listcorruptfileblocks=1&path=%2F
The filesystem under path '/' has 0 CORRUPT files
```
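If this check were to be scripted, the corrupt-file count could be pulled out of the fsck summary line with a small parser (a hypothetical helper, not something used on the cluster):

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric corrupt-file count from the
# summary line of `hdfs fsck / -list-corruptfileblocks` output.
corrupt_count() {
  sed -n 's/.*has \([0-9][0-9]*\) CORRUPT files.*/\1/p'
}

echo "The filesystem under path '/' has 0 CORRUPT files" | corrupt_count   # → 0
```

A non-zero result here would be the signal to investigate before reimaging more workers.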
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1135.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1136.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1135.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1136.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1137.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1137.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1138.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1139.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1138.eqiad.wmnet with OS bullseye executed with errors:
Mentioned in SAL (#wikimedia-analytics) [2023-09-14T14:13:33Z] <stevemunene> powercycle an-worker1138, investigating failures related to reimage T332570
an-worker1138 is currently facing an error
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1139.eqiad.wmnet with OS bullseye completed:
To track this down you can restart the reimage and follow along via the mgmt/serial console. If the error from the screenshot above happens again, just keep the error dialogue open and connect into the Debian installer from puppetmaster1001.eqiad.wmnet with:
sudo ssh -4 -i /root/.ssh/new_install -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no an-worker1138.eqiad.wmnet
The file /var/syslog probably has a few hints about what exactly failed. Happy to take a look as well.
Thanks @MoritzMuehlenhoff.
I managed to get access to the instance via regular ssh and confirmed that the right volumes exist:
```
sda                                  8:0    0 446.6G  0 disk
├─sda1                               8:1    0   953M  0 part /boot
├─sda2                               8:2    0     1K  0 part
└─sda5                               8:5    0 445.7G  0 part
  ├─an--worker1138--vg-root        254:0    0  55.9G  0 lvm  /
  ├─an--worker1138--vg-swap        254:1    0   9.3G  0 lvm  [SWAP]
  └─an--worker1138--vg-journalnode 254:2    0    10G  0 lvm  /var/lib/hadoop/journal
```
Restarting the reimage and following along to see where else the issue could be.
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1138.eqiad.wmnet with OS bullseye
Following the install via IPMI with ipmitool -I lanplus -H "an-worker1138.mgmt.eqiad.wmnet" -U root -E sol activate
Reimage seems to have been successful this time round. Waiting for the first puppet run to complete.
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1138.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1140.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1141.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1140.eqiad.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1141.eqiad.wmnet with OS bullseye completed:
an-worker1140 failed on its first puppet run, but succeeded in the subsequent runs. The failure:
```
----- OUTPUT of 'run-puppet-agent --quiet' -----
================
PASS |                                                                                                                                                             |   0% (0/1) [00:02<?, ?hosts/s]
FAIL |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:02<00:00, 2.40s/hosts]
100.0% (1/1) of nodes failed to execute command 'run-puppet-agent --quiet': cumin1001.eqiad.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'run-puppet-agent --quiet'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
```
All the subsequent runs were successful; manual puppet runs were also successful:
```
Notice: Applied catalog in 49.25 seconds
stevemunene@an-worker1140:~$
```
The host is also fully back and running, according to Icinga and the HDFS NameNode web interface.
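This pattern of a transient first failure followed by clean retries is exactly what a generic retry wrapper handles (a sketch only; the reimage cookbook has its own retry logic, and `run-puppet-agent` is merely the command such a wrapper would run):

```shell
#!/bin/bash
# Generic retry helper (illustrative, not cookbook code): run a command
# up to N times, succeeding on the first attempt that passes.
retry() {
  local attempts="$1"; shift
  local i
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    sleep 1   # brief pause between attempts
  done
  return 1
}

# Hypothetical usage on a host: retry 5 run-puppet-agent --quiet
retry 3 true && echo "succeeded"   # → succeeded
```

The return code lets the caller distinguish "eventually succeeded" from "failed all attempts", which mirrors how the cookbook decides whether to abort.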
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1142.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1143.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1142.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1143.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1144.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1145.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1144.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1145.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1146.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1146.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1147.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1148.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1147.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1148.eqiad.wmnet with OS bullseye completed:
We have successfully completed the Hadoop worker upgrades to Bullseye.
```
sudo cumin --no-progress a:hadoop-worker 'cat /etc/debian_version'
86 hosts will be targeted: an-worker[1078-1095,1097-1156].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet
OK to proceed on 86 hosts? Enter the number of affected hosts to confirm or "q" to quit: 86
===== NODE GROUP =====
(86) an-worker[1078-1095,1097-1156].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7
================
100.0% (86/86) success ratio (>= 100.0% threshold) for command: 'cat /etc/debian_version'.
100.0% (86/86) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
```