
Refresh an-master100[1-2] with an-master100[3-4]
Closed, Resolved (Public)

Description

This ticket will track the replacement of the two current Hadoop masters, an-master100[1-2], with the new servers, an-master100[3-4].

There is an outline of a procedure here, but it will need careful validation:
https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Migrating_to_new_HA_NameNodes

Procedure

This will need a full cluster restart, I believe.

The documented procedure looks sound to me.

The rough outline of the process is:

  • Bring up an-master1003 in its correct role, but in the hdfs-site.xml we only tell it about an-master1001 and an-master1003
  • Bring up an-master1004 in its correct role, but in the hdfs-site.xml we only tell it about an-master1001 and an-master1004
  • Bootstrap the namenodes, so that they are receiving updates about edits from the journalnodes and are thus populated.
  • Prepare the patches to replace an-master1001 with an-master1003 and an-master1002 with an-master1004 globally.
  • Announce downtime for the cluster and wait patiently.
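The first two steps of the outline give each new host a temporary, two-member view of the HA pair. A minimal sketch of what those per-host overrides might contain is below. It uses the standard `dfs.ha.namenodes.<nameservice>` and `dfs.namenode.rpc-address.<nameservice>.<id>` properties from hdfs-site.xml; the nameservice name `example-hadoop`, the namenode ID scheme and the `ha_properties` helper are assumptions for illustration and are not taken from the puppet patches.

```python
# Hypothetical nameservice name; the ticket does not state the real one.
NAMESERVICE = "example-hadoop"


def nn_id(host):
    """Derive a namenode ID from a hostname (this naming scheme is an assumption)."""
    return host.replace(".", "-")


def ha_properties(hosts, rpc_port=8020):
    """Build the hdfs-site.xml HA properties for the given list of namenode hosts."""
    ids = [nn_id(h) for h in hosts]
    props = {f"dfs.ha.namenodes.{NAMESERVICE}": ",".join(ids)}
    for host, ident in zip(hosts, ids):
        props[f"dfs.namenode.rpc-address.{NAMESERVICE}.{ident}"] = f"{host}:{rpc_port}"
    return props


# Temporary view for each new host: an-master1001 plus the host itself.
views = {
    "an-master1003": ha_properties(
        ["an-master1001.eqiad.wmnet", "an-master1003.eqiad.wmnet"]
    ),
    "an-master1004": ha_properties(
        ["an-master1001.eqiad.wmnet", "an-master1004.eqiad.wmnet"]
    ),
}

for host, props in views.items():
    print(host)
    for key, value in sorted(props.items()):
        print(f"  {key} = {value}")
```

The final patches then replace both old hosts everywhere, so that every node sees only an-master1003 and an-master1004.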

At the appointed time:

Acceptance Criteria

  • an-master1003 is running in the role analytics_cluster::hadoop::master
  • an-master1004 is running in the role analytics_cluster::hadoop::standby
  • an-master1001 is ready for decommissioning
  • an-master1002 is ready for decommissioning
  • Wikitech documentation has been updated

Event Timeline


Change 947421 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Use python3 for the check_hdfs_active_namenode script

https://gerrit.wikimedia.org/r/947421

Change 947421 merged by Btullis:

[operations/puppet@production] Use python3 for the check_hdfs_active_namenode script

https://gerrit.wikimedia.org/r/947421

BTullis triaged this task as High priority. Nov 15 2023, 9:45 AM
BTullis renamed this task from Upgrade hadoop master to bullseye to Refresh an-master1001 with an-master1003. Jan 9 2024, 3:23 PM
BTullis claimed this task.
BTullis updated the task description.
BTullis removed subscribers: Ottomata, nfraison, Stevemunene.
BTullis renamed this task from Refresh an-master1001 with an-master1003 to Refresh an-master1001 and an-master1002 with an-master1003 and an-master1004. Jan 9 2024, 5:20 PM
BTullis updated the task description.

Change 989213 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Bring an-master1003 into service as a hadoop::master

https://gerrit.wikimedia.org/r/989213

Change 989214 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Bring an-master1004 into service as a hadoop::standby

https://gerrit.wikimedia.org/r/989214

BTullis renamed this task from Refresh an-master1001 and an-master1002 with an-master1003 and an-master1004 to Refresh an-master100[1-2] with an-master100[3-4]. Jan 9 2024, 5:30 PM

I have created the kerberos principals and keytabs for the new hosts with the following file:

btullis@krb1001:~$ cat T332573_new_hadoop_masters.txt
an-master1003.eqiad.wmnet,create_princ,HTTP
an-master1003.eqiad.wmnet,create_princ,analytics
an-master1003.eqiad.wmnet,create_princ,hdfs
an-master1003.eqiad.wmnet,create_princ,mapred
an-master1003.eqiad.wmnet,create_princ,yarn
an-master1003.eqiad.wmnet,create_keytab,HTTP
an-master1003.eqiad.wmnet,create_keytab,analytics
an-master1003.eqiad.wmnet,create_keytab,hdfs
an-master1003.eqiad.wmnet,create_keytab,mapred
an-master1003.eqiad.wmnet,create_keytab,yarn
an-master1004.eqiad.wmnet,create_princ,HTTP
an-master1004.eqiad.wmnet,create_princ,analytics
an-master1004.eqiad.wmnet,create_princ,hdfs
an-master1004.eqiad.wmnet,create_princ,mapred
an-master1004.eqiad.wmnet,create_princ,yarn
an-master1004.eqiad.wmnet,create_keytab,HTTP
an-master1004.eqiad.wmnet,create_keytab,analytics
an-master1004.eqiad.wmnet,create_keytab,hdfs
an-master1004.eqiad.wmnet,create_keytab,mapred
an-master1004.eqiad.wmnet,create_keytab,yarn

I then ran this command: sudo generate_keytabs.py --realm WIKIMEDIA T332573_new_hadoop_masters.txt

This shows 8 new principals, each with a keytab.

I will now sync these keytabs to the private repo and create dummy versions in the labs/private repo.
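The spec file passed to generate_keytabs.py above follows a regular host, action, service pattern, so it could also be generated rather than written by hand. This is only a sketch: the `keytab_spec_lines` helper is hypothetical, and the host, action and service values are copied from the file shown above.

```python
from itertools import product

HOSTS = ["an-master1003.eqiad.wmnet", "an-master1004.eqiad.wmnet"]
ACTIONS = ["create_princ", "create_keytab"]
SERVICES = ["HTTP", "analytics", "hdfs", "mapred", "yarn"]


def keytab_spec_lines(hosts=HOSTS, actions=ACTIONS, services=SERVICES):
    """Return one 'host,action,service' line per combination, grouped by host then action."""
    return [f"{h},{a},{s}" for h, a, s in product(hosts, actions, services)]


if __name__ == "__main__":
    # Print the lines in the same order as T332573_new_hadoop_masters.txt.
    print("\n".join(keytab_spec_lines()))
```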

Change 989222 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add dummy keytabs for new hadoop master servers

https://gerrit.wikimedia.org/r/989222

Change 989222 merged by Btullis:

[labs/private@master] Add dummy keytabs for new hadoop master servers

https://gerrit.wikimedia.org/r/989222

Change 989213 merged by Btullis:

[operations/puppet@production] Bring an-master1003 into service as a hadoop::master

https://gerrit.wikimedia.org/r/989213

Change 989490 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Allow deep merging of hadoop config overrides

https://gerrit.wikimedia.org/r/989490

Change 989490 merged by Btullis:

[operations/puppet@production] Allow deep merging of hadoop config overrides

https://gerrit.wikimedia.org/r/989490

This is going quite well, so far.

I have brought up an-master1003 in the analytics_cluster::hadoop::master role and bootstrapped the namenode process.
Currently the hadoop-yarn-resourcemanager and hadoop-mapreduce-historyserver services are failing to start, but I think that these are minor issues.

I can see that the namenode processes are syncing the edits from the journalnodes and keeping the fsimage up-to-date.

I will now work on an-master1004 and try to get it to the same state.

Icinga downtime and Alertmanager silence (ID=e7c141a8-f6e4-4385-b205-cac59bd12d90) set by btullis@cumin1002 for 7 days, 0:00:00 on 2 host(s) and their services with reason: Bringing new nameservers into service

an-master[1003-1004].eqiad.wmnet

Change 989214 merged by Btullis:

[operations/puppet@production] Bring an-master1004 into service as a hadoop::standby

https://gerrit.wikimedia.org/r/989214

Change 989901 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add data for the new an-master100[3-4]

https://gerrit.wikimedia.org/r/989901

Change 990598 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Bump the namenode heap value for the new nameservers

https://gerrit.wikimedia.org/r/990598

Change 990600 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the hadoop nameservers

https://gerrit.wikimedia.org/r/990600

Change 990598 merged by Btullis:

[operations/puppet@production] Bump the namenode heap value for the new nameservers

https://gerrit.wikimedia.org/r/990598

Change 990605 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Temporarily disable gobblin ingestion

https://gerrit.wikimedia.org/r/990605

Change 990627 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] spark-history: update an-master hostnames

https://gerrit.wikimedia.org/r/990627

Change 990629 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Temporarily disable systemd jobs on an-launcher1002

https://gerrit.wikimedia.org/r/990629

Change 990605 merged by Btullis:

[operations/puppet@production] Temporarily disable gobblin ingestion

https://gerrit.wikimedia.org/r/990605

Change 990629 merged by Btullis:

[operations/puppet@production] Temporarily disable systemd jobs on an-launcher1002

https://gerrit.wikimedia.org/r/990629

BTullis updated the task description.

All currently running production pipelines have completed. There are some user-submitted jobs still running, but we are well within the maintenance window, so we will proceed to put HDFS into safe mode.

Mentioned in SAL (#wikimedia-analytics) [2024-01-15T10:54:43Z] <btullis> putting HDFS into safe mode for T332573

Icinga downtime and Alertmanager silence (ID=e51f2667-db1a-4c38-b125-d23138404c64) set by btullis@cumin1002 for 7 days, 0:00:00 on 97 host(s) and their services with reason: Bringing new nameservers into service

an-worker[1078-1095,1097-1175].eqiad.wmnet
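The host count in the downtime entry above can be cross-checked by expanding the bracketed range expression. The sketch below handles only the simple `prefix[a-b,c-d]suffix` form seen in this ticket; `expand_hosts` is a hypothetical helper written for illustration, not the cumin implementation.

```python
import re


def expand_hosts(expr):
    """Expand 'prefix[a-b,c-d]suffix' into a list of hostnames."""
    m = re.fullmatch(r"([^\[\]]+)\[([\d,\-]+)\]([^\[\]]*)", expr)
    if m is None:
        # No bracketed range: treat the expression as a single hostname.
        return [expr]
    prefix, ranges, suffix = m.groups()
    hosts = []
    for part in ranges.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a bare number is a one-element range
        width = len(start)  # keep any zero padding
        for n in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{str(n).zfill(width)}{suffix}")
    return hosts


if __name__ == "__main__":
    workers = expand_hosts("an-worker[1078-1095,1097-1175].eqiad.wmnet")
    print(len(workers))  # 18 + 79 hosts; an-worker1096 is not in the range
```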

Icinga downtime and Alertmanager silence (ID=8958a01b-be93-491e-8968-a18db034f488) set by btullis@cumin1002 for 7 days, 0:00:00 on 8 host(s) and their services with reason: Bringing new nameservers into service

analytics[1070-1077].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=d1c834b7-213d-439c-9ec0-27ed5a825a70) set by btullis@cumin1002 for 7 days, 0:00:00 on 4 host(s) and their services with reason: Bringing new nameservers into service

an-master[1001-1004].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=a1f1c00f-92c4-4d89-8bf3-0cb5bc2f3d7d) set by btullis@cumin1002 for 7 days, 0:00:00 on 4 host(s) and their services with reason: Bringing new nameservers into service

an-coord[1001-1004].eqiad.wmnet

Change 990600 merged by Btullis:

[operations/puppet@production] Update the hadoop nameservers

https://gerrit.wikimedia.org/r/990600

Mentioned in SAL (#wikimedia-analytics) [2024-01-15T11:16:04Z] <btullis> running puppet on journal nodes first for T332573

Mentioned in SAL (#wikimedia-analytics) [2024-01-15T11:20:46Z] <btullis> running puppet on an-master1003 to set it to active for T332573

We have now got to a state where the two new nameservers are up and running.

btullis@an-master1003:~$ sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getAllServiceState
an-master1003.eqiad.wmnet:8040                     active    
an-master1004.eqiad.wmnet:8040                     standby
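A check like the one above could be scripted by parsing the `-getAllServiceState` output and asserting that exactly one namenode is active. This is a minimal sketch that works on the two-column text shown above; the function names are hypothetical, and running the command itself is left out.

```python
def parse_service_states(output):
    """Map 'host:port' to its reported HA state from -getAllServiceState output."""
    states = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2:
            host_port, state = fields
            states[host_port] = state
    return states


def has_single_active(states):
    """True if exactly one service is active and every other one is standby."""
    values = list(states.values())
    return values.count("active") == 1 and all(
        v in ("active", "standby") for v in values
    )


SAMPLE = (
    "an-master1003.eqiad.wmnet:8040                     active\n"
    "an-master1004.eqiad.wmnet:8040                     standby\n"
)

if __name__ == "__main__":
    states = parse_service_states(SAMPLE)
    print(states, has_single_active(states))
```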

We will exit safe mode.

btullis@an-master1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode get
Safe mode is ON in an-master1003.eqiad.wmnet/10.64.36.15:8020
Safe mode is ON in an-master1004.eqiad.wmnet/10.64.53.14:8020
btullis@an-master1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave
Safe mode is OFF in an-master1003.eqiad.wmnet/10.64.36.15:8020
Safe mode is OFF in an-master1004.eqiad.wmnet/10.64.53.14:8020
btullis@an-master1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode get
Safe mode is OFF in an-master1003.eqiad.wmnet/10.64.36.15:8020
Safe mode is OFF in an-master1004.eqiad.wmnet/10.64.53.14:8020
btullis@an-master1003:~$
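The safe mode output above has one line per namenode, so a script could confirm that every namenode has left safe mode before jobs are resumed. The sketch below parses only the line format shown in this ticket; `safemode_by_namenode` is a hypothetical helper.

```python
import re

# Matches lines such as:
#   Safe mode is OFF in an-master1003.eqiad.wmnet/10.64.36.15:8020
SAFEMODE_RE = re.compile(r"^Safe mode is (ON|OFF) in (\S+)$")


def safemode_by_namenode(output):
    """Map each namenode address to True if safe mode is ON, False if OFF."""
    result = {}
    for line in output.splitlines():
        m = SAFEMODE_RE.match(line.strip())
        if m:
            result[m.group(2)] = m.group(1) == "ON"
    return result


SAMPLE = (
    "Safe mode is OFF in an-master1003.eqiad.wmnet/10.64.36.15:8020\n"
    "Safe mode is OFF in an-master1004.eqiad.wmnet/10.64.53.14:8020\n"
)

if __name__ == "__main__":
    status = safemode_by_namenode(SAMPLE)
    # Safe to resume work only when no namenode reports safe mode ON.
    print(not any(status.values()))
```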

Change 990627 merged by Brouberol:

[operations/deployment-charts@master] spark-history: update an-master hostnames

https://gerrit.wikimedia.org/r/990627

Mentioned in SAL (#wikimedia-analytics) [2024-01-15T11:38:00Z] <brouberol> redeploying the Spark History Server to pick up the new HDFS namenodes - T332573

Change 990637 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Set the old namenodes to be insetup

https://gerrit.wikimedia.org/r/990637

Change 990612 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Revert "Temporarily disable gobblin ingestion"

https://gerrit.wikimedia.org/r/990612

Change 990613 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Revert "Temporarily disable systemd jobs on an-launcher1002"

https://gerrit.wikimedia.org/r/990613

Change 990637 merged by Btullis:

[operations/puppet@production] Set the old namenodes to be insetup

https://gerrit.wikimedia.org/r/990637

Change 990612 merged by Btullis:

[operations/puppet@production] Revert "Temporarily disable gobblin ingestion"

https://gerrit.wikimedia.org/r/990612

Mentioned in SAL (#wikimedia-analytics) [2024-01-15T11:57:19Z] <btullis> un-pausing all previously paused DAGS on all airflow instances for T332573

Change 990613 merged by Btullis:

[operations/puppet@production] Revert "Temporarily disable systemd jobs on an-launcher1002"

https://gerrit.wikimedia.org/r/990613

Mentioned in SAL (#wikimedia-analytics) [2024-01-15T12:00:47Z] <btullis> removing all downtime for hadoop-all for T332573

Change 990643 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Enable monitoring for the new namenodes

https://gerrit.wikimedia.org/r/990643

BTullis updated the task description.

yarn.wikimedia.org is redirecting to an-master1004.eqiad.wmnet:8088 and that doesn't work. Investigating now.

Ah, it's fine now. It was just a case of this: T331448: Make YARN web interface work with both primary and standby resourcemanager
Confirmed with:

btullis@an-master1003:~$ sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -getAllServiceState
an-master1003.eqiad.wmnet:8033                     standby   
an-master1004.eqiad.wmnet:8033                     active

I restarted the hadoop-yarn-resourcemanager service on an-master1004 and then it started working.

btullis@an-master1003:~$ sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -getAllServiceState
an-master1003.eqiad.wmnet:8033                     active    
an-master1004.eqiad.wmnet:8033                     standby

I have updated https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration and several other references on Wikitech. I believe that the old namenodes are ready to be decommissioned.

Change 990643 merged by Btullis:

[operations/puppet@production] Enable monitoring for the new namenodes

https://gerrit.wikimedia.org/r/990643

Change 990665 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Use insetup::buster for the old namenodes

https://gerrit.wikimedia.org/r/990665

Mentioned in SAL (#wikimedia-analytics) [2024-01-15T16:47:58Z] <btullis> restarted the hive-server2 and hive-metastore services on an-coord100[3-4] which had been accidentally omitted earlier for T332573

We have been getting some emails from systemd timers that were inadvertently left enabled on an-master1002.
I have disabled them with:

btullis@an-master1002:~$ sudo systemctl disable hadoop-namenode-backup-fetchimage.timer hadoop-namenode-backup-prune.timer hadoop-namenode-backup-hdfs.timer
Removed /etc/systemd/system/multi-user.target.wants/hadoop-namenode-backup-fetchimage.timer.
Removed /etc/systemd/system/multi-user.target.wants/hadoop-namenode-backup-hdfs.timer.
Removed /etc/systemd/system/multi-user.target.wants/hadoop-namenode-backup-prune.timer.

Change 990665 abandoned by Btullis:

[operations/puppet@production] Use insetup::buster for the old namenodes

Reason:

Decommissioned the hosts already

https://gerrit.wikimedia.org/r/990665

Change 989901 abandoned by Btullis:

[operations/puppet@production] Add data for the new an-master100[3-4]

Reason:

Achieved in another commit

https://gerrit.wikimedia.org/r/989901