
Add kafka-stretch100[1-2] to the hadoop cluster
Closed, Resolved (Public)

Assigned To
Authored By
Gehel
Dec 20 2023, 10:47 AM

Description

These servers were originally intended for a Kafka cluster, but we have decided not to proceed with that. It makes sense to rename them and add them to the Hadoop cluster instead.
We shall use the names an-worker117[6-7].

  • Rename the hosts with the appropriate hostname
    • kafka-stretch1001 -> an-worker1176.eqiad.wmnet
    • kafka-stretch1002 -> an-worker1177.eqiad.wmnet
  • Assign the servers the right partman recipe (already in analytics-flex.cfg)
  • Create the servers' dummy keytabs.
  • Create the servers' Kerberos keytabs.
  • Reimage the hosts (see the command sketch after this list).
  • Run the hadoop-init-worker.py cookbook to set up the remaining partitions.
  • Set up the JournalNode on each host.
  • Add the hosts to net_topology with the right rack assignment.
  • Add the hosts to role(analytics_cluster::hadoop::worker).
  • Roll-restart the Hadoop masters to pick up the right rack assignment for the new hosts.
  • Add the hosts to the analytics VLAN.
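
A rough sketch of the main cookbook invocations, run from a cumin host. The rename argument order and the init-worker cookbook path are assumptions based on the logs below; only the reimage flags are confirmed later in this task.

# Rename (argument order assumed; the cookbook log below shows the old and new names)
sudo cookbook sre.hosts.rename kafka-stretch1001 an-worker1176
# Reimage under the new name; --new is suggested by the rename cookbook output
sudo cookbook sre.hosts.reimage --os bullseye --new -t T353788 an-worker1176
# Initialise the remaining data partitions (cookbook path assumed)
sudo cookbook sre.hadoop.init-hadoop-workers 'an-worker1176.eqiad.wmnet'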

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change #1072655 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[labs/private@master] Add new an worker keytabs

https://gerrit.wikimedia.org/r/1072655

Change #1072660 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] hdfs: Add new worker hosts to net_topology

https://gerrit.wikimedia.org/r/1072660

Change #1072661 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] hdfs: Assign the worker role to new hadoop workers

https://gerrit.wikimedia.org/r/1072661

Mentioned in SAL (#wikimedia-analytics) [2024-09-17T13:22:18Z] <stevemunene> rename kafka-stretch to an-worker1176 T353788

Cookbook cookbooks.sre.hosts.rename started by stevemunene@cumin1002 from kafka-stretch1001 to an-worker1176 completed:

  • kafka-stretch1001 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Disabled puppet and its timer
    • ✔️ Disabled debmonitor-client timer
    • ✔️ Netbox updated
    • ✔️ BMC Hostname updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Mentioned in SAL (#wikimedia-analytics) [2024-09-17T13:34:39Z] <stevemunene> rename kafka-stretch1002 to an-worker1177 T353788

Cookbook cookbooks.sre.hosts.rename started by stevemunene@cumin1002 from kafka-stretch1002 to an-worker1177 completed:

  • kafka-stretch1002 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Disabled puppet and its timer
    • ✔️ Disabled debmonitor-client timer
    • ✔️ Netbox updated
    • ✔️ BMC Hostname updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Change #1072655 merged by Stevemunene:

[labs/private@master] Add new an worker keytabs

https://gerrit.wikimedia.org/r/1072655

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1176 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details,You can also try typing "sudo install-console an-worker1176.eqiad.wmnet" to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-worker1177.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker1177.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1177 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details,You can also try typing "sudo install-console an-worker1177.eqiad.wmnet" to get a root shellbut depending on the failure this may not work.

For both hosts, the reimage cookbook has been unable to restart since the initial failure at the "populate exported resources in PuppetDB" stage.

stevemunene@cumin1002:~$ sudo cookbook sre.hosts.reimage --os bullseye -t T353788 an-worker1177
==> ATTENTION: Destructive action for an-worker1177. Proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
Exception raised while initializing the Cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 199, in run
    runner = self.instance.get_runner(args)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 109, in get_runner
    return ReimageRunner(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 190, in __init__
    self.new_puppet_server = self._get_puppet_server()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 234, in _get_puppet_server
    has_puppet7 = self.puppet_server.hiera_lookup(self.fqdn, "profile::puppet::agent::force_puppet7")
  File "/usr/lib/python3/dist-packages/spicerack/puppet.py", line 599, in hiera_lookup
    result = self.server_host.run_sync(command, is_safe=True, print_output=False, print_progress_bars=False)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 514, in run_sync
    return self._execute(
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 720, in _execute
    raise RemoteExecutionError(ret, "Cumin execution failed", worker.get_results())
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
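
The traceback shows the cookbook failing while it asks the Puppet server for the profile::puppet::agent::force_puppet7 hiera key. A hedged way to reproduce that lookup manually on the Puppet server (exact host and invocation are assumptions):

sudo puppet lookup --node an-worker1177.eqiad.wmnet profile::puppet::agent::force_puppet7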

Change #1075012 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] hdfs: add new an-workers to insetup role

https://gerrit.wikimedia.org/r/1075012

Change #1075012 merged by Stevemunene:

[operations/puppet@production] hdfs: add new an-workers to insetup role

https://gerrit.wikimedia.org/r/1075012

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1176 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details,You can also try typing "sudo install-console an-worker1176.eqiad.wmnet" to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-worker1177.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker1177.eqiad.wmnet with OS bullseye completed:

  • an-worker1177 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202409240855_stevemunene_1711859_an-worker1177.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye completed:

  • an-worker1176 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202409240903_stevemunene_1712498_an-worker1176.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

I'm having some challenges with the disk configuration as I run the Hadoop init cookbook.

Ensure some MegaCLI specific settings.
----- OUTPUT of '/usr/sbin/megacl...ADRA -LALL -aALL' -----

No Logical drive configured.

Exit Code: 0x01

================
PASS |                                                                                                                                                                                                                                                                         |   0% (0/1) [00:00<?, ?hosts/s]
FAIL |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  2.53hosts/s]
100.0% (1/1) of nodes failed to execute command '/usr/sbin/megacl...ADRA -LALL -aALL': an-worker1176.eqiad.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Failed to run spicerack.remote.RemoteHosts.run_async: Cumin execution failed (exit_code=2)
==> What do you want to do? "retry" the last command, manually fix the issue and "skip" the last command to continue the execution or completely "abort" the execution.
>

Standard Hadoop workers have 12 x 3.5" 4 TB hard disks in the front bays, each configured as a single RAID0 volume, and 2 x 2.5" SSDs in the rear-mounted flex bays, configured as a hardware RAID1 volume. The OS and JournalNode partitions are installed on the 2.5" RAID1 volume, which leaves all of the space on the twelve 4 TB HDDs for DataNode use, as per Standard_Worker_Installation. Our hosts, however, have the below:

image.png (388×2 px, 53 KB)

We shall need to adjust the configuration to be compatible with what we have as we proceed to add the hosts to the cluster. The init Hadoop worker cookbook fails on the above step, with the full error as shown below:

stevemunene@an-worker1176:~$ sudo megacli -LDInfo -Lall -aALL
                                     

Adapter 0 -- Virtual Drive Information:
Adapter 0: No Virtual Drive Configured.

Exit Code: 0x00

OK, we can destroy that RAID10 virtual drive and recreate the disks as 12 x RAID0 logical volumes.

The fact that the megacli command isn't working is probably related to these servers being new. This means that we will have to use the perccli64 utility to manage the RAID arrays instead.

https://wikitech.wikimedia.org/wiki/PERCCli

The megacli utility had a very useful option, CfgEachDskRaid0, which added a RAID0 logical volume for each unallocated physical disk.

I am not aware of such an option for perccli64, but there will be some kind of command that does what we need. We could see if DC Ops have something in the cookbook that they use to provision new an-worker nodes for us.

There were some similar notes here, but not the exact command required to create the twelve raid0 arrays. T353776#9508961
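
For comparison, a hedged sketch of the two CLIs side by side; the megacli line is the remembered incantation and may need its cache flags adjusted, while the perccli64 line is the form applied further below:

# megacli: one RAID0 logical volume per unconfigured physical disk
sudo megacli -CfgEachDskRaid0 WB RA Direct NoCachedBadBBU -aALL
# perccli64 equivalent using the 'each' keyword
sudo perccli64 /c0 add vd each r0 wb ra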

Checking the current setup with

stevemunene@an-worker1176:~$ sudo perccli64 /c0/dall show all
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-32-amd64
Controller = 0
Status = Success
Description = Show Drive Group Succeeded


TOPOLOGY :
========

------------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type   State BT       Size PDC  PI SED DS3  FSpace TR 
------------------------------------------------------------------------------
 0 -   -   -        -   RAID10 Optl  N   21.829 TB dflt N  N   dflt N      N  
 0 0   -   -        -   RAID1  Optl  N   21.829 TB dflt N  N   dflt N      N  
 0 0   0   64:0     2   DRIVE  Onln  N    3.637 TB dflt N  N   dflt -      N  
 0 0   1   64:1     0   DRIVE  Onln  N    3.637 TB dflt N  N   dflt -      N  
 0 0   2   64:2     3   DRIVE  Onln  N    3.637 TB dflt N  N   dflt -      N  
 0 0   3   64:3     8   DRIVE  Onln  N    3.637 TB dflt N  N   dflt -      N  
 0 0   4   64:4     9   DRIVE  Onln  N    3.637 TB dflt N  N   dflt -      N  
 0 0   5   64:5     6   DRIVE  Onln  N    3.637 TB dflt N  N   dflt -      N  
 0 0   6   64:6     1   DRIVE  Onln  N    3.637 TB dflt N  N   dflt -      N  
 0 0   7   64:7     4   DRIVE  Onln  N    3.637 TB dflt N  N   dflt -      N  
 0 0   8   64:8     5   DRIVE  Onln  N    3.637 TB dflt N  N   dflt -      N  
 0 0   9   64:9     10  DRIVE  Onln  N    3.637 TB dflt N  N   dflt -      N  
 0 0   10  64:10    7   DRIVE  Onln  N    3.637 TB dflt N  N   dflt -      N  
 0 0   11  64:11    11  DRIVE  Onln  N    3.637 TB dflt N  N   dflt -      N  
 1 -   -   -        -   RAID1  Optl  N  446.625 GB dflt N  N   dflt N      N  
 1 0   -   -        -   RAID1  Optl  N  446.625 GB dflt N  N   dflt N      N  
 1 0   0   64:12    12  DRIVE  Onln  N  446.625 GB dflt N  N   dflt -      N  
 1 0   1   64:13    13  DRIVE  Onln  N  446.625 GB dflt N  N   dflt -      N  
------------------------------------------------------------------------------

DG=Disk Group Index|Arr=Array Index|Row=Row Index|EID=Enclosure Device ID
DID=Device ID|Type=Drive Type|Onln=Online|Rbld=Rebuild|Optl=Optimal|Dgrd=Degraded
Pdgd=Partially degraded|Offln=Offline|BT=Background Task Active
PDC=PD Cache|PI=Protection Info|SED=Self Encrypting Drive|Frgn=Foreign
DS3=Dimmer Switch 3|dflt=Default|Msng=Missing|FSpace=Free Space Present
TR=Transport Ready


VD LIST :
=======

----------------------------------------------------------------
DG/VD TYPE   State Access Consist Cache Cac sCC       Size Name 
----------------------------------------------------------------
0/239 RAID10 Optl  RW     Yes     RWBD  -   OFF  21.829 TB      
1/238 RAID1  Optl  RW     Yes     RWBD  -   OFF 446.625 GB      
----------------------------------------------------------------

VD=Virtual Drive| DG=Drive Group|Rec=Recovery
Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|Dgrd=Degraded
Optl=Optimal|dflt=Default|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady
B=Blocked|Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack
FWB=Force WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency

Total VD Count = 2

DG Drive LIST :
=============

------------------------------------------------------------------------------
EID:Slt DID State DG       Size Intf Med SED PI SeSz Model            Sp Type 
------------------------------------------------------------------------------
64:0      2 Onln   0   3.637 TB SAS  HDD N   N  512B MG08SDA400NY     U  -    
64:1      0 Onln   0   3.637 TB SAS  HDD N   N  512B MG08SDA400NY     U  -    
64:2      3 Onln   0   3.637 TB SAS  HDD N   N  512B MG08SDA400NY     U  -    
64:3      8 Onln   0   3.637 TB SAS  HDD N   N  512B MG08SDA400NY     U  -    
64:4      9 Onln   0   3.637 TB SAS  HDD N   N  512B MG08SDA400NY     U  -    
64:5      6 Onln   0   3.637 TB SAS  HDD N   N  512B MG08SDA400NY     U  -    
64:6      1 Onln   0   3.637 TB SAS  HDD N   N  512B MG08SDA400NY     U  -    
64:7      4 Onln   0   3.637 TB SAS  HDD N   N  512B MG08SDA400NY     U  -    
64:8      5 Onln   0   3.637 TB SAS  HDD N   N  512B MG08SDA400NY     U  -    
64:9     10 Onln   0   3.637 TB SAS  HDD N   N  512B MG08SDA400NY     U  -    
64:10     7 Onln   0   3.637 TB SAS  HDD N   N  512B MG08SDA400NY     U  -    
64:11    11 Onln   0   3.637 TB SAS  HDD N   N  512B MG08SDA400NY     U  -    
64:12    12 Onln   1 446.625 GB SATA SSD N   N  512B HFS480G3H2X069N  U  -    
64:13    13 Onln   1 446.625 GB SATA SSD N   N  512B HFS480G3H2X069N  U  -    
------------------------------------------------------------------------------

EID=Enclosure Device ID|Slt=Slot No|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild

Total Drive Count = 14

We can proceed to delete the raid10 configured drive with

sudo perccli64 /c0/v239 del force

Then reconfigure them as below, based on T353776#9506632 and the documentation

sudo perccli64 /c0 add vd each r0 wb ra NoCachedBadBBU
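
Before rerunning the init cookbook, the new layout can be sanity-checked with the standard (read-only) PERCCli show commands:

sudo perccli64 /c0/vall show
sudo perccli64 /c0/dall show all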

Done on both with

stevemunene@an-worker1177:~$ sudo perccli64 /c0/v239 del force
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-32-amd64
Controller = 0
Status = Success
Description = Delete VD succeeded

stevemunene@an-worker1177:~$ sudo perccli64 /c0 add vd each r0 wb ra
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-32-amd64
Controller = 0
Status = Success
Description = Add VD Succeeded.


stevemunene@an-worker1177:~$ sudo perccli64 /c0 show
Generating detailed summary of the adapter, it may take a while to complete.

CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-32-amd64
Controller = 0
Status = Success
Description = None

Product Name = PERC H750 Adapter
Serial Number = 26N03Q5
SAS Address =  5f4ee08039b91e00
PCI Address = 00:af:00:00
System Time = 09/30/2024 12:34:20
Mfg. Date = 06/28/22
Controller Time = 09/30/2024 12:34:19
FW Package Build = 52.16.1-4405
BIOS Version = 7.16.00.0_0x07100501
FW Version = 5.160.02-3552
Driver Name = megaraid_sas
Driver Version = 07.714.04.00-rc1
Current Personality = RAID-Mode 
Vendor Id = 0x1000
Device Id = 0x10E2
SubVendor Id = 0x1028
SubDevice Id = 0x2176
Host Interface = PCI-E
Device Interface = SAS-12G
Bus Number = 175
Device Number = 0
Function Number = 0
Domain ID = 0
Security Protocol = None
Drive Groups = 13

TOPOLOGY :
========

-----------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT       Size PDC  PI SED DS3  FSpace TR 
-----------------------------------------------------------------------------
 0 -   -   -        -   RAID1 Optl  N  446.625 GB dflt N  N   dflt N      N  
 0 0   -   -        -   RAID1 Optl  N  446.625 GB dflt N  N   dflt N      N  
 0 0   0   64:12    12  DRIVE Onln  N  446.625 GB dflt N  N   dflt -      N  
 0 0   1   64:13    13  DRIVE Onln  N  446.625 GB dflt N  N   dflt -      N  
 1 -   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 1 0   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 1 0   0   64:2     0   DRIVE Onln  N    3.637 TB dflt N  N   dflt -      N  
 2 -   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 2 0   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 2 0   0   64:11    1   DRIVE Onln  N    3.637 TB dflt N  N   dflt -      N  
 3 -   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 3 0   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 3 0   0   64:0     2   DRIVE Onln  N    3.637 TB dflt N  N   dflt -      N  
 4 -   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 4 0   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 4 0   0   64:1     3   DRIVE Onln  N    3.637 TB dflt N  N   dflt -      N  
 5 -   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 5 0   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 5 0   0   64:7     4   DRIVE Onln  N    3.637 TB dflt N  N   dflt -      N  
 6 -   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 6 0   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 6 0   0   64:6     5   DRIVE Onln  N    3.637 TB dflt N  N   dflt -      N  
 7 -   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 7 0   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 7 0   0   64:9     6   DRIVE Onln  N    3.637 TB dflt N  N   dflt -      N  
 8 -   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 8 0   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 8 0   0   64:8     7   DRIVE Onln  N    3.637 TB dflt N  N   dflt -      N  
 9 -   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 9 0   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
 9 0   0   64:4     8   DRIVE Onln  N    3.637 TB dflt N  N   dflt -      N  
10 -   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
10 0   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
10 0   0   64:3     9   DRIVE Onln  N    3.637 TB dflt N  N   dflt -      N  
11 -   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
11 0   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
11 0   0   64:5     10  DRIVE Onln  N    3.637 TB dflt N  N   dflt -      N  
12 -   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
12 0   -   -        -   RAID0 Optl  N    3.637 TB dflt N  N   dflt N      N  
12 0   0   64:10    11  DRIVE Onln  N    3.637 TB dflt N  N   dflt -      N  
-----------------------------------------------------------------------------

DG=Disk Group Index|Arr=Array Index|Row=Row Index|EID=Enclosure Device ID
DID=Device ID|Type=Drive Type|Onln=Online|Rbld=Rebuild|Optl=Optimal|Dgrd=Degraded
Pdgd=Partially degraded|Offln=Offline|BT=Background Task Active
PDC=PD Cache|PI=Protection Info|SED=Self Encrypting Drive|Frgn=Foreign
DS3=Dimmer Switch 3|dflt=Default|Msng=Missing|FSpace=Free Space Present
TR=Transport Ready

Virtual Drives = 13

VD LIST :
=======

----------------------------------------------------------------
DG/VD  TYPE  State Access Consist Cache Cac sCC       Size Name 
----------------------------------------------------------------
12/227 RAID0 Optl  RW     Yes     RWBD  -   OFF   3.637 TB      
11/228 RAID0 Optl  RW     Yes     RWBD  -   OFF   3.637 TB      
10/229 RAID0 Optl  RW     Yes     RWBD  -   OFF   3.637 TB      
9/230  RAID0 Optl  RW     Yes     RWBD  -   OFF   3.637 TB      
8/231  RAID0 Optl  RW     Yes     RWBD  -   OFF   3.637 TB      
7/232  RAID0 Optl  RW     Yes     RWBD  -   OFF   3.637 TB      
6/233  RAID0 Optl  RW     Yes     RWBD  -   OFF   3.637 TB      
5/234  RAID0 Optl  RW     Yes     RWBD  -   OFF   3.637 TB      
4/235  RAID0 Optl  RW     Yes     RWBD  -   OFF   3.637 TB      
3/236  RAID0 Optl  RW     Yes     RWBD  -   OFF   3.637 TB      
2/237  RAID0 Optl  RW     Yes     RWBD  -   OFF   3.637 TB      
0/238  RAID1 Optl  RW     Yes     RWBD  -   OFF 446.625 GB      
1/239  RAID0 Optl  RW     Yes     RWBD  -   OFF   3.637 TB      
----------------------------------------------------------------

VD=Virtual Drive| DG=Drive Group|Rec=Recovery
Cac=CacheCade|OfLn=OffLine|Pdgd=Partially Degraded|Dgrd=Degraded
Optl=Optimal|dflt=Default|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady
B=Blocked|Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency

Physical Drives = 14

Change #1072660 merged by Stevemunene:

[operations/puppet@production] hdfs: Add new worker hosts to net_topology

https://gerrit.wikimedia.org/r/1072660

Change #1072661 merged by Stevemunene:

[operations/puppet@production] hdfs: Assign the worker role to new hadoop workers

https://gerrit.wikimedia.org/r/1072661

Done with all the steps for this, pending a roll restart of the Hadoop masters to pick up the new hosts. The restart is scheduled for 2024-10-01 10:00 UTC, when the cluster isn't too busy.
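
For reference, a hedged sketch of the roll-restart step; the cookbook name is an assumption based on the Hadoop administration docs, with the argument selecting the analytics cluster:

sudo cookbook sre.hadoop.roll-restart-masters analytics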

Mentioned in SAL (#wikimedia-analytics) [2024-10-01T11:50:30Z] <stevemunene> roll restart hadoop analytics master to pick up new hosts T353788

Noticed that the new an-worker hosts (e.g. an-worker1177) do not appear in the HDFS topology, per the output from

stevemunene@an-master1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology

image.png (166×1 px, 98 KB)

The hosts also do not appear on the DataNode interface
image.png (1×3 px, 408 KB)

The NameNode service shows that it was successfully restarted by the cookbook
image.png (956×3 px, 1 MB)

Looking into why this is the case

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1176 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Failed to migrate host to the new VLAN, sre.hosts.move-vlan cookbook returned 94
    • The reimage failed, see the cookbook logs for the details,You can also try typing "sudo install-console an-worker1176.eqiad.wmnet" to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye

The hosts an-worker117[67] are not yet part of the analytics Hadoop cluster because they cannot connect to any of the master nodes. Looking further into this, we found that the main cause was that the hosts did not belong to the analytics VLAN.
Reimaging the hosts once more.

Change #1077913 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Change an-worker117[67] to use reuse partman recipe.

https://gerrit.wikimedia.org/r/1077913

The reimage with the initial recipe fails at:

image.png (956×1 px, 355 KB)

Change #1077913 merged by Stevemunene:

[operations/puppet@production] Change an-worker117[67] to use reuse partman recipe.

https://gerrit.wikimedia.org/r/1077913

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1176 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details,You can also try typing "sudo install-console an-worker1176.eqiad.wmnet" to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1176 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details,You can also try typing "sudo install-console an-worker1176.eqiad.wmnet" to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-worker1177.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker1177.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1177 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details,You can also try typing "sudo install-console an-worker1177.eqiad.wmnet" to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-worker1177.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker1177.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1177 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details,You can also try typing "sudo install-console an-worker1177.eqiad.wmnet" to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1176 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details,You can also try typing "sudo install-console an-worker1176.eqiad.wmnet" to get a root shellbut depending on the failure this may not work.

cookbooks.sre.hosts.decommission executed by stevemunene@cumin1002 for hosts: an-worker1176.eqiad.wmnet

  • an-worker1176.eqiad.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by stevemunene@cumin1002 for hosts: an-worker1177.eqiad.wmnet

  • an-worker1177.eqiad.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Following the guide at https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Move_existing_server_between_rows/racks,_changing_IPs to get the hosts onto the right VLAN. The above errors are mainly due to the fact that the two hosts were not in PuppetDB.
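
Once the move is complete, a quick sanity check is that the hosts resolve to addresses inside the analytics subnets defined in Netbox:

host an-worker1176.eqiad.wmnet
host an-worker1177.eqiad.wmnet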

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-worker1177.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1176 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details,You can also try typing "sudo install-console an-worker1176.eqiad.wmnet" to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker1177.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1177 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details,You can also try typing "sudo install-console an-worker1177.eqiad.wmnet" to get a root shellbut depending on the failure this may not work.

Successfully followed the steps outlined in the Server Lifecycle guide above; currently working on the reimaging, which has been failing with partition-related errors.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-worker1177.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker1176.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1176 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details,You can also try typing "sudo install-console an-worker1176.eqiad.wmnet" to get a root shellbut depending on the failure this may not work.

The hosts reimage, but with the wrong partitions.
During the reimage process I get a "No root disks found" error. However, when I redo the partitioning step of the installation, the hosts boot into Debian 11. The hosts then fail on the Puppet dry-run step of the reimage with the error below:

image.png (255×1 px, 48 KB)

20:35:37	err	Error while evaluating a Function Call:

Number of datanode mountpoints (0) below threshold: 10, please check.

…in /srv/puppet_code/environments/production/modules/profile/manifests/hadoop/common.pp, line: 413, column: 9.
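
That check effectively counts the mounted DataNode directories, so a hedged way to approximate what Puppet sees on the host (read-only):

mount | grep -c '/var/lib/hadoop/data'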

Checking fstab

root@an-worker1177:~# cat /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# systemd generates mount units based on this file, see systemd.mount(5).
# Please run 'systemctl daemon-reload' after making changes here.
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
/dev/mapper/an--worker1177--vg-root /               ext4    errors=remount-ro 0       1
# /boot was on /dev/sda1 during installation
UUID=8a5a8145-54a8-4e2f-8e14-11862a1951c1 /boot           ext4    defaults        0       2
/dev/mapper/an--worker1177--vg-journalnode /var/lib/hadoop/journal ext4    defaults        0       2
/dev/mapper/an--worker1177--vg-swap none            swap    sw              0       0
root@an-worker1176:~# cat /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# systemd generates mount units based on this file, see systemd.mount(5).
# Please run 'systemctl daemon-reload' after making changes here.
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
/dev/mapper/an--worker1176--vg-root /               ext4    errors=remount-ro 0       1
# /boot was on /dev/sda1 during installation
UUID=268c772a-2499-4b54-8e7c-66362550e61a /boot           ext4    defaults        0       2
/dev/mapper/an--worker1176--vg-journalnode /var/lib/hadoop/journal ext4    defaults        0       2
/dev/mapper/an--worker1176--vg-swap none            swap    sw              0       0
root@an-worker1177:~# lsblk -i
NAME                               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                  8:0    0   3.6T  0 disk 
`-sda1                               8:1    0   3.6T  0 part 
sdb                                  8:16   0   3.6T  0 disk 
`-sdb1                               8:17   0   3.6T  0 part /boot
sdc                                  8:32   0   3.6T  0 disk 
`-sdc1                               8:33   0   3.6T  0 part 
sdd                                  8:48   0   3.6T  0 disk 
`-sdd1                               8:49   0   3.6T  0 part 
sde                                  8:64   0   3.6T  0 disk 
`-sde1                               8:65   0   3.6T  0 part 
sdf                                  8:80   0   3.6T  0 disk 
`-sdf1                               8:81   0   3.6T  0 part 
sdg                                  8:96   0   3.6T  0 disk 
`-sdg1                               8:97   0   3.6T  0 part 
sdh                                  8:112  0   3.6T  0 disk 
`-sdh1                               8:113  0   3.6T  0 part 
sdi                                  8:128  0   3.6T  0 disk 
`-sdi1                               8:129  0   3.6T  0 part 
sdj                                  8:144  0   3.6T  0 disk 
`-sdj1                               8:145  0   3.6T  0 part 
sdk                                  8:160  0 446.6G  0 disk 
|-sdk1                               8:161  0   953M  0 part 
|-sdk2                               8:162  0     1K  0 part 
`-sdk5                               8:165  0 445.7G  0 part 
  |-an--worker1177--vg-swap        254:0    0   9.3G  0 lvm  [SWAP]
  |-an--worker1177--vg-root        254:1    0  55.9G  0 lvm  /
  `-an--worker1177--vg-journalnode 254:2    0    10G  0 lvm  /var/lib/hadoop/journal
sdl                                  8:176  0   3.6T  0 disk 
`-sdl1                               8:177  0   3.6T  0 part 
sdm                                  8:192  0   3.6T  0 disk 
`-sdm1                               8:193  0   3.6T  0 part
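
So the data disks have partitions but no filesystems or mounts yet, which is what the mountpoint check above is complaining about. A minimal sketch of what the init-worker cookbook is expected to do per data disk (device letter, label, and mount options are illustrative):

sudo mkfs.ext4 -L hadoop-c /dev/sdc1
sudo mkdir -p /var/lib/hadoop/data/c
echo 'LABEL=hadoop-c /var/lib/hadoop/data/c ext4 defaults,noatime 0 2' | sudo tee -a /etc/fstab
sudo mount /var/lib/hadoop/data/c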

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker1177.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1177 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details,You can also try typing "sudo install-console an-worker1177.eqiad.wmnet" to get a root shellbut depending on the failure this may not work.

The hosts are now set back to Active in Netbox, are part of the Hadoop cluster, and are catching up to the production hosts. Keeping an eye on this.

image.png (510×2 px, 112 KB)

Stevemunene updated the task description. (Show Details)

Ah, there is a slight problem with an-worker1176. It won't take long to sort out.

image.png (421×819 px, 92 KB)

The partition table on /dev/sda looks like it was created for the operating system disk, which makes sense.
We can see here that /dev/sda1 is only 1 GB in size, with 3.6TB unused in /dev/sda2

btullis@an-worker1176:~$ lsblk /dev/sda
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0  3.6T  0 disk 
├─sda1   8:1    0  953M  0 part /var/lib/hadoop/data/m
└─sda2   8:2    0  3.6T  0 part

I think that we will have to modify the partition table and resize the file system.
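
A minimal sketch of one way to do that, assuming the filesystem on /dev/sda1 can be taken offline and grown in place and that /dev/sda2 holds no data (these commands are an assumption, not the agreed fix):

sudo umount /var/lib/hadoop/data/m
sudo parted /dev/sda --script rm 2
sudo parted /dev/sda --script resizepart 1 100%
sudo e2fsck -f /dev/sda1
sudo resize2fs /dev/sda1
sudo mount /var/lib/hadoop/data/m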