
Bring an-worker11[57-75] into service
Closed, Resolved (Public)

Assigned To
Authored By
Gehel
Dec 20 2023, 9:29 AM

Description

We have 19 new Hadoop workers to bring into service. These were racked in T349936 Q2:rack/setup/install an-worker11[57-75].
For these we are going to:

  • Assign the servers the right partman recipe, analytics-flex.cfg (this was done during initial setup).
  • Install Debian Bullseye on the hosts.
  • Create the servers' dummy keytabs.
  • Create the servers' Kerberos keytabs.
  • Run the sre.hadoop.init-hadoop-workers cookbook to set up the remaining partitions (see the sketch after this list).
  • Set up a JournalNode on each.
  • Add the hosts to net_topology with the right rack assignment.
  • Add the hosts to role(analytics_cluster::hadoop::worker).
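
For the cookbook step, the invocation from a cumin host would presumably look something like the following (a hypothetical sketch; the exact arguments should be checked against the cookbook's --help):

# Hypothetical invocation, run from a cumin host; exact arguments may differ:
sudo cookbook sre.hadoop.init-hadoop-workers 'an-worker11[57-75].eqiad.wmnet'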

Event Timeline

Gehel triaged this task as High priority.Dec 20 2023, 9:30 AM
Gehel moved this task from Incoming to Hardware refresh on the Data-Platform-SRE board.

Change 993675 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[labs/private@master] Add dummy keytabs for new an-worker1157-1175

https://gerrit.wikimedia.org/r/993675

Change 993742 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] hdfs: Add new worker hosts to net_topology

https://gerrit.wikimedia.org/r/993742

Change 993743 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] hdfs: Assign the right role to new hadoop workers

https://gerrit.wikimedia.org/r/993743

Change 993675 merged by Stevemunene:

[labs/private@master] Add dummy keytabs for new an-worker1157-1175

https://gerrit.wikimedia.org/r/993675

The sre.hadoop.init-hadoop-workers cookbook fails when creating new partitions. Running the cookbook for an-worker1157 fails with the details below:

Creating ext4 disk partitions.
Working on /dev/sdb
----- OUTPUT of '/sbin/parted /de...ript mklabel gpt' -----                                                                                                                                                                                                                     
Error: Could not stat device /dev/sdb - No such file or directory.                                                                                                                                                                                                              
================                                                                                                                                                                                                                                                                
PASS |                                                                                                                                                                                                                                          |   0% (0/1) [00:00<?, ?hosts/s]
FAIL |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  2.38hosts/s]
100.0% (1/1) of nodes failed to execute command '/sbin/parted /de...ript mklabel gpt': an-worker1157.eqiad.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Failed to run spicerack.remote.RemoteHosts.run_async: Cumin execution failed (exit_code=2)

Currently looking into this

It looks like the operating system can only see one disk.

btullis@an-worker1157:~$ lsblk
NAME                          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                             8:0    0 446.6G  0 disk 
├─sda1                          8:1    0   953M  0 part /boot
├─sda2                          8:2    0     1K  0 part 
└─sda5                          8:5    0 445.7G  0 part 
  ├─an--worker1157--vg-swap   254:0    0   9.3G  0 lvm  [SWAP]
  ├─an--worker1157--vg-root   254:1    0  55.9G  0 lvm  /
  └─an--worker1157--vg-unused 254:2    0 291.4G  0 lvm  
btullis@an-worker1157:~$

This is going to require the RAID controller commands to be run to create a RAID0 volume on each of the physical disks in the array. Maybe this is something that DC Ops used to do before handing over the machines to us. Let's see if we can find a reference to the command anywhere.

@Stevemunene - Here is a command to run. It looks like this happened during the last batch of workers as well. https://phabricator.wikimedia.org/T343762#9157092

sudo cumin 'an-worker11[57-75].eqiad.wmnet' 'sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0'

Let's add a reference to the CfgEachDskRaid0 megacli command to https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration so that it's easier to find next time.

I found a brief mention of it here: https://wikitech.wikimedia.org/wiki/MegaCli#Replace_individual_disks_in_JBOD but I think it would be useful for us on the Hadoop/Administration page.

I suppose we could add it to the sre.hadoop.init-hadoop-workers cookbook, if it doesn't find 12 data drives. What do you think?
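
A rough shell-level sketch of that check (the cookbook itself is Python, and the drive-count threshold here is an assumption, so this is only illustrative):

# Illustrative only: if fewer than 12 data disks are visible to the OS,
# create a RAID0 virtual drive per physical disk (12 data disks + 1 OS disk assumed).
DISKS=$(lsblk -dn -o NAME,TYPE | awk '$2 == "disk"' | wc -l)
if [ "$DISKS" -lt 13 ]; then
    sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0
fi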

Thanks @BTullis, we can add it to the cookbook for future reference. I did some further reading on the RAID configuration input options used, from the StorCLI Reference Manual:
RA: Read Ahead.
WB: Write Back.
Direct: Direct I/O. Sets the logical drive cache policy; Direct I/O is the default.
CachedBadBBU: Enables caching when the BBU is not functioning; disabled is the default.

Adding to this, I had to run the sre.hadoop.init-hadoop-workers cookbook first to install megacli on all the hosts, then pass the megacli command to create a RAID0 volume on each of the physical disks in the array. So we can edit the cookbook to check the number of drives on the hosts and run the command to create the volumes if needed. However, comparing the parameters passed with those available in the cookbook's MegaCLI-specific settings, there are some variations in the WB value and NoCachedBadBBU which I would like to look into a bit further.
I am currently looking into an error from the cookbook:

(19) an-worker[1157-1175].eqiad.wmnet                                                                                                                                                                                                                                           
----- OUTPUT -----                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                
No Logical drive configured.                                                                                                                                                                                                                                                    

Exit Code: 0x01

================                                                                                                                                                                                                                                                                
PASS |                                                                                                                                                                                                                                         |   0% (0/19) [00:00<?, ?hosts/s]
FAIL |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (19/19) [00:00<00:00, 22.61hosts/s]
100.0% (19/19) of nodes failed to execute command '/usr/sbin/megacl...ADRA -LALL -aALL': an-worker[1157-1175].eqiad.wmnet
0.0% (0/19) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.

This is from running the command

stevemunene@an-worker1157:~$ sudo megacli -LDSetProp ADRA -LALL -aALL
                                     
No Logical drive configured.

Exit Code: 0x01

Host rebooted by stevemunene@cumin1002 with reason: None

The new an-worker1157-1175 hosts do not have any virtual drive configured; however, the datanode disks/partitions were initialized as expected. Comparing the new host an-worker1157 with a previously initialized host, an-worker1156:

an-worker1157

stevemunene@an-worker1157:~$ sudo megacli -LDInfo -Lall -aALL
Adapter 0 -- Virtual Drive Information:
Adapter 0: No Virtual Drive Configured.

Exit Code: 0x00

an-worker1156 shows the 12 virtual drives (result trimmed for legibility):

stevemunene@an-worker1156:~$ sudo megacli -LDInfo -Lall -aALL
                                     

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0
Size                : 3.637 TB
Sector Size         : 512
Parity Size         : 0
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 1
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: No


Virtual Drive: 1 (Target Id: 1)
Name                :
RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0
Size                : 3.637 TB
Sector Size         : 512
Parity Size         : 0
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 1
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: No

Performing an lsblk -i on the two hosts:
an-worker1157

stevemunene@an-worker1157:~$ lsblk -i
NAME                               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                  8:0    0 446.6G  0 disk 
|-sda1                               8:1    0   953M  0 part /boot
|-sda2                               8:2    0     1K  0 part 
`-sda5                               8:5    0 445.7G  0 part 
  |-an--worker1157--vg-swap        254:0    0   9.3G  0 lvm  [SWAP]
  |-an--worker1157--vg-root        254:1    0  55.9G  0 lvm  /
  `-an--worker1157--vg-journalnode 254:2    0    10G  0 lvm  /var/lib/hadoop/journal
sdb                                  8:16   0   3.6T  0 disk 
`-sdb1                               8:17   0   3.6T  0 part /var/lib/hadoop/data/m
sdc                                  8:32   0   3.6T  0 disk 
`-sdc1                               8:33   0   3.6T  0 part /var/lib/hadoop/data/l
sdd                                  8:48   0   3.6T  0 disk 
`-sdd1                               8:49   0   3.6T  0 part /var/lib/hadoop/data/k
sde                                  8:64   0   3.6T  0 disk 
`-sde1                               8:65   0   3.6T  0 part /var/lib/hadoop/data/j
sdf                                  8:80   0   3.6T  0 disk 
`-sdf1                               8:81   0   3.6T  0 part /var/lib/hadoop/data/i
sdg                                  8:96   0   3.6T  0 disk 
`-sdg1                               8:97   0   3.6T  0 part /var/lib/hadoop/data/h
sdh                                  8:112  0   3.6T  0 disk 
`-sdh1                               8:113  0   3.6T  0 part /var/lib/hadoop/data/g
sdi                                  8:128  0   3.6T  0 disk 
`-sdi1                               8:129  0   3.6T  0 part /var/lib/hadoop/data/f
sdj                                  8:144  0   3.6T  0 disk 
`-sdj1                               8:145  0   3.6T  0 part /var/lib/hadoop/data/e
sdk                                  8:160  0   3.6T  0 disk 
`-sdk1                               8:161  0   3.6T  0 part /var/lib/hadoop/data/d
sdl                                  8:176  0   3.6T  0 disk 
`-sdl1                               8:177  0   3.6T  0 part /var/lib/hadoop/data/c
sdm                                  8:192  0   3.6T  0 disk 
`-sdm1                               8:193  0   3.6T  0 part /var/lib/hadoop/data/b

an-worker1156

stevemunene@an-worker1156:~$ lsblk -i
NAME                               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                  8:0    0   3.6T  0 disk 
`-sda1                               8:1    0   3.6T  0 part /var/lib/hadoop/data/b
sdb                                  8:16   0   3.6T  0 disk 
`-sdb1                               8:17   0   3.6T  0 part /var/lib/hadoop/data/c
sdc                                  8:32   0   3.6T  0 disk 
`-sdc1                               8:33   0   3.6T  0 part /var/lib/hadoop/data/e
sdd                                  8:48   0   3.6T  0 disk 
`-sdd1                               8:49   0   3.6T  0 part /var/lib/hadoop/data/d
sde                                  8:64   0   3.6T  0 disk 
`-sde1                               8:65   0   3.6T  0 part /var/lib/hadoop/data/f
sdf                                  8:80   0   3.6T  0 disk 
`-sdf1                               8:81   0   3.6T  0 part /var/lib/hadoop/data/h
sdg                                  8:96   0   3.6T  0 disk 
`-sdg1                               8:97   0   3.6T  0 part /var/lib/hadoop/data/g
sdh                                  8:112  0   3.6T  0 disk 
`-sdh1                               8:113  0   3.6T  0 part /var/lib/hadoop/data/i
sdi                                  8:128  0   3.6T  0 disk 
`-sdi1                               8:129  0   3.6T  0 part /var/lib/hadoop/data/l
sdj                                  8:144  0   3.6T  0 disk 
`-sdj1                               8:145  0   3.6T  0 part /var/lib/hadoop/data/j
sdk                                  8:160  0   3.6T  0 disk 
`-sdk1                               8:161  0   3.6T  0 part /var/lib/hadoop/data/k
sdl                                  8:176  0   3.6T  0 disk 
`-sdl1                               8:177  0   3.6T  0 part /var/lib/hadoop/data/m
sdm                                  8:192  0 447.1G  0 disk 
|-sdm1                               8:193  0   953M  0 part /boot
|-sdm2                               8:194  0     1K  0 part 
`-sdm5                               8:197  0 446.1G  0 part 
  |-an--worker1156--vg-swap        254:0    0   9.3G  0 lvm  [SWAP]
  |-an--worker1156--vg-root        254:1    0  55.9G  0 lvm  /
  `-an--worker1156--vg-journalnode 254:2    0    10G  0 lvm  /var/lib/hadoop/journal

Another key difference is the current cache policy, which is set by https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/hadoop/init-hadoop-workers.py#L137-L182 and which is the step that failed on the new hosts an-worker1157-1175.

an-worker1157: Current Cache Policy: WriteBack, ReadAhead, Direct, Write Cache OK if Bad BBU

an-worker1156: Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
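
For reference, the per-logical-drive property commands that the cookbook appears to run (an assumption based on the '-LDSetProp ADRA -LALL -aALL' seen in the Cumin output above and on an-worker1156's current policy) would be roughly the following. They only work where megacli can actually see the logical drives, which is not the case on the new hosts:

# Approximate per-LD settings (assumption; matches an-worker1156's current policy):
sudo megacli -LDSetProp ADRA -LALL -aALL            # adaptive read-ahead
sudo megacli -LDSetProp WB -LALL -aALL              # write-back cache
sudo megacli -LDSetProp Direct -LALL -aALL          # direct I/O
sudo megacli -LDSetProp NoCachedBadBBU -LALL -aALL  # no write cache if the BBU is bad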

Hmm, maybe the RAID controller on the new hosts has been set somehow to IT/JBOD mode, instead of RAID?
We normally have to create a RAID0 logical volume on them before the operating system sees them.

I wonder if we can find out whether JBOD mode is enabled on the controller.
There are some examples here of how to set it, but I can't find a command for finding out whether or not it is already enabled: https://serverfault.com/questions/818486/does-enabling-jbod-mode-on-lsi-based-controllers-affect-existing-logical-disks-a
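
If I remember correctly, JBOD is exposed as an adapter property, so something like this might show it (untested on these hosts, and the property may not be supported by every controller):

# Query whether JBOD mode is enabled on adapter 0:
sudo megacli -AdpGetProp EnableJBOD -aALL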

I saw some comments on RAID config issues here https://phabricator.wikimedia.org/T349936#9360470 by @Papaul from the rack/setup task, but they are not as detailed.

From the iDRAC interface we can verify that the hosts have been set to RAID0 and that the virtual drives are visible as expected.

image.png (704×1 px, 47 KB)

image.png (1×1 px, 179 KB)

I am still trying to understand why they are not visible via megacli commands. I am exploring reimaging an-worker1157 with the default cookbook and setting up the disks to be visible with:
sudo megacli -CfgEachDskRaid0 WB RA Direct NoCachedBadBBU -a0

Oh, I think I might know what is going on.

We will probably have to start managing the RAID configuration with the perccli64 command on these servers, instead of megacli.
It seems that megacli remains partially compatible with the controller, but not fully compatible.

See: https://wikitech.wikimedia.org/wiki/PERCCli for some useful information and commands to run.

We can see the status of the controller here:

btullis@an-worker1157:~$ sudo perccli64 show all
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-27-amd64
Status Code = 0
Status = Success
Description = None

Number of Controllers = 1
Host Name = an-worker1157
Operating System  = Linux 5.10.0-27-amd64

System Overview :
===============

---------------------------------------------------------------------------
Ctl Model           Ports PDs DGs DNOpt VDs VNOpt BBU sPR DS EHS ASOs Hlth 
---------------------------------------------------------------------------
  0 PERCH755Adapter    16  14  13     0  13     0 Opt On  -  N      0 Opt  
---------------------------------------------------------------------------

We can see the status of all of the logical volumes here:

btullis@an-worker1157:~$ sudo perccli64 /c0/vall show
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-27-amd64
Controller = 0
Status = Success
Description = None


Virtual Drives :
==============

----------------------------------------------------------------
DG/VD  TYPE  State Access Consist Cache Cac sCC       Size Name 
----------------------------------------------------------------
0/226  RAID1 Optl  RW     Yes     RWBD  -   OFF 446.625 GB      
12/228 RAID0 Optl  RW     Yes     RFWBD -   OFF   3.637 TB      
11/229 RAID0 Optl  RW     Yes     RFWBD -   OFF   3.637 TB      
10/230 RAID0 Optl  RW     Yes     RFWBD -   OFF   3.637 TB      
9/231  RAID0 Optl  RW     Yes     RFWBD -   OFF   3.637 TB      
8/232  RAID0 Optl  RW     Yes     RFWBD -   OFF   3.637 TB      
7/233  RAID0 Optl  RW     Yes     RFWBD -   OFF   3.637 TB      
6/234  RAID0 Optl  RW     Yes     RFWBD -   OFF   3.637 TB      
5/235  RAID0 Optl  RW     Yes     RFWBD -   OFF   3.637 TB      
4/236  RAID0 Optl  RW     Yes     RFWBD -   OFF   3.637 TB      
3/237  RAID0 Optl  RW     Yes     RFWBD -   OFF   3.637 TB      
2/238  RAID0 Optl  RW     Yes     RFWBD -   OFF   3.637 TB      
1/239  RAID0 Optl  RW     Yes     RFWBD -   OFF   3.637 TB      
----------------------------------------------------------------

That wikitech page doesn't list the commands for setting the options on the logical volumes, but I think that we can probably find them and add them to that page.
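
The perccli64 equivalents for setting the cache options per virtual drive are probably along these lines (unverified here, so worth confirming against the PERCCli documentation before adding them to the page):

# Hypothetical perccli64 commands to set the cache options on all virtual drives:
sudo perccli64 /c0/vall set wrcache=wb       # write-back cache
sudo perccli64 /c0/vall set rdcache=ra       # read-ahead
sudo perccli64 /c0/vall set iopolicy=direct  # direct I/O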

Thanks @BTullis, I will update Wikitech and the cookbook once I have the right commands.

Change 993742 merged by Stevemunene:

[operations/puppet@production] hdfs: Add new worker hosts to net_topology

https://gerrit.wikimedia.org/r/993742

Change 993743 merged by Stevemunene:

[operations/puppet@production] hdfs: Assign the right role to new hadoop workers

https://gerrit.wikimedia.org/r/993743

Mentioned in SAL (#wikimedia-analytics) [2024-02-06T11:45:17Z] <stevemunene> add new an-workers to analytics_cluster hadoop worker role analytics_cluster::hadoop::worker T353776

The hosts have been added to net_topology and assigned the right role. The hosts are also running OK without any RAID-related alerts. However, some hosts are in the default rack, so we shall need to run a roll restart of the masters: sudo cookbook sre.hadoop.roll-restart-masters analytics. This was checked with
stevemunene@an-master1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -printTopology
The hosts on the default rack are some of the newly added hosts:

Rack: /eqiad/default/rack
   10.64.153.2:50010 (an-worker1157.eqiad.wmnet)
   10.64.153.3:50010 (an-worker1158.eqiad.wmnet)
   10.64.153.4:50010 (an-worker1159.eqiad.wmnet)
   10.64.155.4:50010 (an-worker1162.eqiad.wmnet)
   10.64.157.2:50010 (an-worker1163.eqiad.wmnet)
   10.64.157.3:50010 (an-worker1164.eqiad.wmnet)
   10.64.157.4:50010 (an-worker1165.eqiad.wmnet)
   10.64.161.2:50010 (an-worker1166.eqiad.wmnet)
   10.64.161.4:50010 (an-worker1168.eqiad.wmnet)
   10.64.165.2:50010 (an-worker1172.eqiad.wmnet)
   10.64.165.4:50010 (an-worker1174.eqiad.wmnet)

Keeping an eye on the cluster:
  • HDFS Capacity remaining
  • HDFS Available vs used space
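
For a quick CLI view of the same capacity numbers (assuming the same kerberos-run-command wrapper used above), something like this should work:

# Print the cluster-wide capacity summary from the active namenode:
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -report | head -n 20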

Mentioned in SAL (#wikimedia-analytics) [2024-02-06T13:29:49Z] <stevemunene> roll restart hadoop masters to pick up the right rack assignment for new hosts T353776

Hosts are visible on the NameNode UI and should rebalance with time.

image.png (2×2 px, 626 KB)

Having an issue with the an-masters roll restart: the NameNode failover from an-master1003 to an-master1004 fails with

Run manual HDFS Namenode failover from an-master1003-eqiad-wmnet to an-master1004-eqiad-wmnet.
----- OUTPUT of 'kerberos-run-com...1004-eqiad-wmnet' -----                                                                                                                               
Operation failed: Unable to determine service address for namenode 'an-master1003-eqiad-wmnet'
Run manual HDFS Namenode failover from an-master1003-eqiad-wmnet to an-master1004-eqiad-wmnet.
----- OUTPUT of 'kerberos-run-com...1004-eqiad-wmnet' -----                                                                                                                               
Operation failed: Unable to determine service address for namenode 'an-master1003-eqiad-wmnet'

Oh, this is odd. We have the following in /etc/hadoop/conf/hdfs-site.xml on an-master1003:

  <property>
    <name>dfs.ha.namenodes.analytics-hadoop</name>
    <value>an-master1003-eqiad-wmnet,an-master1004-eqiad-wmnet</value>
  </property>
...
    <property>
    <name>dfs.namenode.servicerpc-address.analytics-hadoop.an-master1003-eqiad-wmnet</name>
    <value>an-master1003.eqiad.wmnet:8040</value>
  </property>
    <property>
    <name>dfs.namenode.servicerpc-address.analytics-hadoop.an-master1004-eqiad-wmnet</name>
    <value>an-master1004.eqiad.wmnet:8040</value>
  </property>

...and the same on an-master1004. I notice that the indenting is a little bit off, but I don't think that should affect it in XML.

This command also runs successfully on both nodes:

btullis@an-master1003:~$ sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getAllServiceState
an-master1003.eqiad.wmnet:8040                     active    
an-master1004.eqiad.wmnet:8040                     standby

I wonder what would happen if we tried to run the failover commands from the cookbook ourselves.
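
For reference, the manual failover using the service IDs from the hdfs-site.xml excerpt above would presumably be something like this (the cookbook may pass different arguments):

# Hypothetical manual failover from an-master1003 to an-master1004:
sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -failover \
    an-master1003-eqiad-wmnet an-master1004-eqiad-wmnet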

Sure, I'll try the manual failover and restart of the services, probably during our sync.

Mentioned in SAL (#wikimedia-analytics) [2024-02-07T12:24:55Z] <stevemunene> restart jvm services on an-master1004 for T353776 and to pick up new JDK

Mentioned in SAL (#wikimedia-analytics) [2024-02-07T12:36:34Z] <stevemunene> failover hadoop namenode to an-master1004 for jvm service restart to pick up new JDK and T353776

Mentioned in SAL (#wikimedia-analytics) [2024-02-07T12:48:30Z] <stevemunene> restart jvm services on an-master1003 for T353776 and to pick up new JDK

Mentioned in SAL (#wikimedia-analytics) [2024-02-07T13:01:37Z] <stevemunene> failover hadoop namenode back to an-master1003 after the jvm service restart to pick up new JDK and T353776

The hosts are slowly balancing into the cluster, which should help with the low-capacity warnings we were getting.

image.png (1×1 px, 285 KB)

The NameNode services have also been restarted, and all the hosts have the right rack assignment. We can move this to Done.

Mentioned in SAL (#wikimedia-analytics) [2024-02-08T09:50:11Z] <stevemunene> failover hadoop namenode back to an-master1003 T353776