
Swap an existing Journal node analytics1069 with an-worker1142
Closed, Resolved · Public

Description

analytics1069 is a JournalNode and is among the nodes to be decommissioned under T317861.
We would like to make sure that the JournalNode placement remains row-aware, in case we were to lose a whole row (as we did in the recent switch upgrade).
Checking the current JournalNodes:

journalnode_hosts:
  - an-worker1080.eqiad.wmnet  # Row A4
  - an-worker1078.eqiad.wmnet  # Row A2
  - analytics1072.eqiad.wmnet  # Row B2
  - an-worker1090.eqiad.wmnet  # Row C4
  - analytics1069.eqiad.wmnet  # Row D8

We would like the new node to be in row E or F, so we chose an-worker1142, which is in Row E1.
From here we follow the steps documented at: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Swap_an_existing_Journal_node_with_a_new_one_in_a_running_HA_Hadoop_Cluster
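Once the swap is complete, the list should look roughly like this (an-worker1142's row is taken from the rack information above; the other entries are unchanged):

journalnode_hosts:
  - an-worker1080.eqiad.wmnet  # Row A4
  - an-worker1078.eqiad.wmnet  # Row A2
  - analytics1072.eqiad.wmnet  # Row B2
  - an-worker1090.eqiad.wmnet  # Row C4
  - an-worker1142.eqiad.wmnet  # Row E1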

Event Timeline

Mentioned in SAL (#wikimedia-analytics) [2023-06-08T06:42:58Z] <stevemunene> stop hadoop-hdfs-journalnode on analytics1069 in order to swap the journal node with an-worker1142 T338336

First, check for the journalnode partition on the new JournalNode host, an-worker1142. It is available, as seen below:

stevemunene@an-worker1142:~$ sudo lvs
  LV          VG               Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  journalnode an-worker1142-vg -wi-ao----  10.00g                                                    
  root        an-worker1142-vg -wi-ao---- <55.88g                                                    
  swap        an-worker1142-vg -wi-ao----   9.31g                                                    
stevemunene@an-worker1142:~$
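As an extra sanity check (a hedged sketch, not part of the recorded session), one can also confirm that the journalnode logical volume is actually mounted at the path the daemon will use before copying any data onto it:

# Hypothetical check on an-worker1142: the journalnode LV from the 'lvs'
# output above should be mounted here with roughly 10G of space.
sudo df -h /var/lib/hadoop/journal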

On the node to be decommissioned, analytics1069:
Disabled Puppet with:

sudo disable-puppet 'Journal node is about to be decommissioned thus, swap the journal node with another -T338336 - ${USER}'

Shut down the daemon:
sudo systemctl stop hadoop-hdfs-journalnode
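Before copying any data, it is worth confirming the daemon is really down (a hedged sketch; this check is not part of the recorded session):

# On analytics1069: should print 'inactive' once the unit has stopped.
sudo systemctl is-active hadoop-hdfs-journalnode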

Then transferred the data using the transfer.py script:

stevemunene@cumin1001:~$ sudo transfer.py analytics1069.eqiad.wmnet:/var/lib/hadoop/journal an-worker1142.eqiad.wmnet:/var/lib/hadoop/journal
2023-06-08 06:44:48  INFO: About to transfer /var/lib/hadoop/journal from analytics1069.eqiad.wmnet to ['an-worker1142.eqiad.wmnet']:['/var/lib/hadoop/journal'] (265661459 bytes)
2023-06-08 06:44:57  WARNING: Original size is 265661459 but transferred size is 265149459 for copy to an-worker1142.eqiad.wmnet
2023-06-08 06:44:58  INFO: Parallel checksum of source on analytics1069.eqiad.wmnet and the transmitted ones on an-worker1142.eqiad.wmnet match.
2023-06-08 06:44:59  INFO: 265149459 bytes correctly transferred from analytics1069.eqiad.wmnet to an-worker1142.eqiad.wmnet
2023-06-08 06:45:00  INFO: Cleaning up....
stevemunene@cumin1001:~$
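The size warning above is worth a second look even though the parallel checksums matched. A simple way to compare the two copies by hand (assumed paths; these commands are not from the recorded session):

# Run on both analytics1069 and an-worker1142 and compare the byte counts.
sudo du -sb /var/lib/hadoop/journal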

Change 928349 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Swap journal node analytics1069 with an-worker1142

https://gerrit.wikimedia.org/r/928349

@Stevemunene I think that the copy didn't lead to the results that we expected:

elukey@analytics1069:~$ ls -l /var/lib/hadoop/journal/
total 20
drwxr-xr-x 3 hdfs hdfs  4096 Jun  8 06:43 analytics-hadoop
drwx------ 2 root root 16384 Jun 14  2018 lost+found

vs

elukey@an-worker1142:~$ ls -l /var/lib/hadoop/journal/
total 20
drwxr-xr-x 4 hdfs hdfs  4096 Jun 14  2018 journal
drwx------ 2 root root 16384 Sep 12  2022 lost+found

Before proceeding you'd need to fix the extra journal directory on 1142 :)
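A likely fix (hypothetical commands; the exact ones run are not recorded in this task) is to move the copied data up one level on an-worker1142 and drop the stray nested directory:

# On an-worker1142: flatten the accidental journal/journal nesting left by the copy.
sudo mv /var/lib/hadoop/journal/journal/analytics-hadoop /var/lib/hadoop/journal/
# Remove the now-redundant nested directory (it may still hold a copied lost+found).
sudo rm -r /var/lib/hadoop/journal/journal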

Sorted the journal folder; the layout on the two hosts now matches:

stevemunene@analytics1069:~$ ls -l /var/lib/hadoop/journal/
total 20
drwxr-xr-x 3 hdfs hdfs  4096 Jun  8 06:43 analytics-hadoop
drwx------ 2 root root 16384 Jun 14  2018 lost+found
stevemunene@analytics1069:~$

and

stevemunene@an-worker1142:~$ ls -l /var/lib/hadoop/journal/
total 8
drwxr-xr-x 3 hdfs hdfs 4096 Jun  8 06:43 analytics-hadoop
drwx------ 2 root root 4096 Jun 14  2018 lost+found

Change 928349 merged by Stevemunene:

[operations/puppet@production] Swap journal node analytics1069 with an-worker1142

https://gerrit.wikimedia.org/r/928349

Mentioned in SAL (#wikimedia-analytics) [2023-06-08T17:12:44Z] <btullis> running the sre.hadoop.roll-restart-masters cookbook for the analytics cluster, to pick up the new journalnode for T338336

After adding the new journal host to Puppet and removing the old one, the next step is to apply Puppet on the NameNodes and restart them. This is done by first running Puppet on the new host, then running Puppet on the NameNodes. Once done, restart the NameNodes with the cookbook:

sudo cookbook sre.hadoop.roll-restart-masters analytics

and follow the prompts.

Summary of active/standby statuses after the restarts:
Checking Master/Standby status.

Master status for HDFS:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----                                                                                                                               
active                                                                                                                                                                                    
================                                                                                                                                                                          
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.85s/hosts]
FAIL |                                                                                                                                                    |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Master status for Yarn:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----                                                                                                                               
active                                                                                                                                                                                    
================                                                                                                                                                                          
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.53s/hosts]
FAIL |                                                                                                                                                    |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Standby status for HDFS:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----                                                                                                                               
standby                                                                                                                                                                                   
================                                                                                                                                                                          
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:04<00:00,  4.56s/hosts]
FAIL |                                                                                                                                                    |   0% (0/1) [00:04<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Standby status for Yarn:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----                                                                                                                               
standby                                                                                                                                                                                   
================                                                                                                                                                                          
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.62s/hosts]
FAIL |                                                                                                                                                    |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Restart MapReduce historyserver on the master.
----- OUTPUT of 'systemctl restar...ce-historyserver' -----                                                                                                                               
================                                                                                                                                                                          
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:11<00:00, 11.75s/hosts]
FAIL |                                                                                                                                                    |   0% (0/1) [00:11<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl restar...ce-historyserver'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Deleted silence ID 92df8eeb-5655-4cf5-a629-e174230f83c4
END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop analytics cluster: Restart of jvm daemons.
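As a final sanity check (a sketch based on the directory layout shown earlier, not output captured in this task), one can confirm that the new JournalNode is up and receiving edits:

# On an-worker1142: the daemon should be active, and new edit files should
# keep appearing under the analytics-hadoop nameservice directory.
sudo systemctl is-active hadoop-hdfs-journalnode
sudo ls -lt /var/lib/hadoop/journal/analytics-hadoop/current | head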