⚓ T332573 Refresh an-master100[1-2] with an-master100[3-4]

Subject	Repo	Branch	Lines +/-
Add data for the new an-master100[3-4]	operations/puppet	production	+4 -0
Use insetup::buster for the old namenodes	operations/puppet	production	+3 -1
Enable monitoring for the new namenodes	operations/puppet	production	+0 -17
Set the old namenodes to be insetup	operations/puppet	production	+7 -2
Revert "Temporarily disable systemd jobs on an-launcher1002"	operations/puppet	production	+9 -9
Revert "Temporarily disable gobblin ingestion"	operations/puppet	production	+1 -1
Temporarily disable systemd jobs on an-launcher1002	operations/puppet	production	+9 -9
Temporarily disable gobblin ingestion	operations/puppet	production	+1 -1
spark-history: update an-master hostnames	operations/deployment-charts	master	+7 -7
Update the hadoop nameservers	operations/puppet	production	+18 -17
Bump the namenode heap value for the new nameservers	operations/puppet	production	+3 -0
Bring an-master1004 into service as a hadoop::standby	operations/puppet	production	+8 -6
Allow deep merging of hadoop config overrides	operations/puppet	production	+2 -1
Bring an-master1003 into service as a hadoop::master	operations/puppet	production	+9 -2
Add dummy keytabs for new hadoop master servers	labs/private	master	+0 -0
Use python3 for the check_hdfs_active_namenode script	operations/puppet	production	+1 -1

Status	Assigned	Task
Open	None	T291916 Tracking task for Bullseye migrations in production
Resolved	BTullis	T288804 Upgrade the Data Engineering infrastructure to Debian Bullseye
Resolved	BTullis	T353775 Decom an-master100[1-2]
Resolved	BTullis	T332573 Refresh an-master100[1-2] with an-master100[3-4]

BTullis added a project: Data-Platform-SRE.Jun 9 2023, 11:56 AM

JArguello-WMF removed a project: Shared-Data-Infrastructure.Jun 29 2023, 1:44 PM

JArguello-WMF removed a project: Data-Engineering-Planning.Jun 29 2023, 9:41 PM

Change 947421 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Use python3 for the check_hdfs_active_namenode script

https://gerrit.wikimedia.org/r/947421

Change 947421 merged by Btullis:

[operations/puppet@production] Use python3 for the check_hdfs_active_namenode script

https://gerrit.wikimedia.org/r/947421

Maintenance_bot removed a project: Patch-For-Review.Aug 9 2023, 5:30 PM

Gehel moved this task from Incoming to Misc on the Data-Platform-SRE board.Aug 16 2023, 1:21 PM

Gehel moved this task from Misc to Ready for Work on the Data-Platform-SRE board.Oct 11 2023, 8:30 AM

BTullis triaged this task as High priority.Nov 15 2023, 9:45 AM

Gehel moved this task from Ready for Work to OS Upgrade on the Data-Platform-SRE board.Dec 6 2023, 1:10 PM

Gehel moved this task from OS Upgrade to 2024.01.01 - 2024.01.21 on the Data-Platform-SRE board.Dec 13 2023, 10:00 AM

Gehel edited projects, added Data-Platform-SRE (2024.01.01 - 2024.01.21); removed Data-Platform-SRE.

Gehel added a parent task: T353775: Decom an-master100[1-2].Dec 20 2023, 9:28 AM

BTullis renamed this task from Upgrade hadoop master to bullseye to Refresh an-master1001 with an-master1003.Jan 9 2024, 3:23 PM

BTullis claimed this task.

BTullis moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.01.01 - 2024.01.21) board.

BTullis updated the task description. (Show Details)

BTullis removed subscribers: Ottomata, • nfraison, Stevemunene.

BTullis renamed this task from Refresh an-master1001 with an-master1003 to Refresh an-master1001 and an-master1002 with an-master1003 and an-master1004.Jan 9 2024, 5:20 PM

BTullis updated the task description. (Show Details)

BTullis merged a task: T332578: Refresh an-master1002 with an-master1004.

Change 989213 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Bring an-master1003 into service as a hadoop::master

https://gerrit.wikimedia.org/r/989213

Change 989214 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Bring an-master1004 into service as a hadoop::standby

https://gerrit.wikimedia.org/r/989214

BTullis renamed this task from Refresh an-master1001 and an-master1002 with an-master1003 and an-master1004 to Refresh an-master100[1-2] with an-master100[3-4].Jan 9 2024, 5:30 PM

BTullis updated the task description. (Show Details)Jan 9 2024, 5:49 PM

I have created the kerberos principals and keytabs for the new hosts with the following file:

btullis@krb1001:~$ cat T332573_new_hadoop_masters.txt
an-master1003.eqiad.wmnet,create_princ,HTTP
an-master1003.eqiad.wmnet,create_princ,analytics
an-master1003.eqiad.wmnet,create_princ,hdfs
an-master1003.eqiad.wmnet,create_princ,mapred
an-master1003.eqiad.wmnet,create_princ,yarn
an-master1003.eqiad.wmnet,create_keytab,HTTP
an-master1003.eqiad.wmnet,create_keytab,analytics
an-master1003.eqiad.wmnet,create_keytab,hdfs
an-master1003.eqiad.wmnet,create_keytab,mapred
an-master1003.eqiad.wmnet,create_keytab,yarn
an-master1004.eqiad.wmnet,create_princ,HTTP
an-master1004.eqiad.wmnet,create_princ,analytics
an-master1004.eqiad.wmnet,create_princ,hdfs
an-master1004.eqiad.wmnet,create_princ,mapred
an-master1004.eqiad.wmnet,create_princ,yarn
an-master1004.eqiad.wmnet,create_keytab,HTTP
an-master1004.eqiad.wmnet,create_keytab,analytics
an-master1004.eqiad.wmnet,create_keytab,hdfs
an-master1004.eqiad.wmnet,create_keytab,mapred
an-master1004.eqiad.wmnet,create_keytab,yarn

I then ran this command: sudo generate_keytabs.py --realm WIKIMEDIA T332573_new_hadoop_masters.txt

This shows 8 new principals, each with a keytab.

I will now sync these keytabs to the private repo and create dummy versions in the labs/private repo.
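
For reference, the input file above follows a regular host,action,service pattern, so it could be generated rather than typed by hand. The following is a minimal sketch only: the function name `keytab_spec_lines` and the constants are hypothetical and are not part of `generate_keytabs.py`; the hostnames, actions and service names are copied from the file shown above.

```python
from itertools import product

# Values copied from T332573_new_hadoop_masters.txt as shown in this task.
HOSTS = ["an-master1003.eqiad.wmnet", "an-master1004.eqiad.wmnet"]
SERVICES = ["HTTP", "analytics", "hdfs", "mapred", "yarn"]
ACTIONS = ["create_princ", "create_keytab"]


def keytab_spec_lines(hosts, services, actions=ACTIONS):
    """Return 'host,action,service' lines, grouped by host and then by action.

    Hypothetical helper for illustration; it only builds the text of the
    input file and does not talk to Kerberos.
    """
    return [
        f"{host},{action},{service}"
        for host, action, service in product(hosts, actions, services)
    ]


if __name__ == "__main__":
    # Print the file content so it can be redirected into a text file.
    print("\n".join(keytab_spec_lines(HOSTS, SERVICES)))
```

With the values above this yields 20 lines in the same order as the file: all principals for a host first, then all keytabs for that host.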

Change 989222 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add dummy keytabs for new hadoop master servers

https://gerrit.wikimedia.org/r/989222

Change 989222 merged by Btullis:

[labs/private@master] Add dummy keytabs for new hadoop master servers

https://gerrit.wikimedia.org/r/989222

BTullis mentioned this in rLPRI20760a048c30: Add dummy keytabs for new hadoop master servers.Jan 9 2024, 6:16 PM

Stevemunene subscribed.Jan 10 2024, 10:03 AM

Change 989213 merged by Btullis:

[operations/puppet@production] Bring an-master1003 into service as a hadoop::master

https://gerrit.wikimedia.org/r/989213

Change 989490 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Allow deep merging of hadoop config overrides

https://gerrit.wikimedia.org/r/989490

Change 989490 merged by Btullis:

[operations/puppet@production] Allow deep merging of hadoop config overrides

https://gerrit.wikimedia.org/r/989490

This is going quite well so far.

I have brought up an-master1003 in the analytics_cluster::hadoop::master role and bootstrapped the namenode process.
Currently the hadoop-yarn-resourcemanager and hadoop-mapreduce-historyserver services are failing to start, but I think that these are minor issues.

I can see that the namenode processes are syncing the edits from the journalnodes and keeping the fsimage up-to-date.

I will now work on an-master1004 and try to get it to the same state.

BTullis updated the task description. (Show Details)Jan 10 2024, 3:13 PM

Icinga downtime and Alertmanager silence (ID=e7c141a8-f6e4-4385-b205-cac59bd12d90) set by btullis@cumin1002 for 7 days, 0:00:00 on 2 host(s) and their services with reason: Bringing new nameservers into service

an-master[1003-1004].eqiad.wmnet

Change 989214 merged by Btullis:

[operations/puppet@production] Bring an-master1004 into service as a hadoop::standby

https://gerrit.wikimedia.org/r/989214

Maintenance_bot removed a project: Patch-For-Review.Jan 10 2024, 5:30 PM

BTullis updated the task description. (Show Details)Jan 10 2024, 6:09 PM

xcollazo subscribed.Jan 11 2024, 2:09 PM

Change 989901 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add data for the new an-master100[3-4]

https://gerrit.wikimedia.org/r/989901

gerritbot added a project: Patch-For-Review.Jan 11 2024, 4:24 PM

Change 990598 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Bump the namenode heap value for the new nameservers

https://gerrit.wikimedia.org/r/990598

Change 990600 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the hadoop nameservers

https://gerrit.wikimedia.org/r/990600

Change 990598 merged by Btullis:

[operations/puppet@production] Bump the namenode heap value for the new nameservers

https://gerrit.wikimedia.org/r/990598

Change 990605 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Temporarily disable gobblin ingestion

https://gerrit.wikimedia.org/r/990605

Change 990627 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] spark-history: update an-master hostnames

https://gerrit.wikimedia.org/r/990627

Change 990629 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Temporarily disable systemd jobs on an-launcher1002

https://gerrit.wikimedia.org/r/990629

Change 990605 merged by Btullis:

[operations/puppet@production] Temporarily disable gobblin ingestion

https://gerrit.wikimedia.org/r/990605

Change 990629 merged by Btullis:

[operations/puppet@production] Temporarily disable systemd jobs on an-launcher1002

https://gerrit.wikimedia.org/r/990629

BTullis updated the task description. (Show Details)Jan 15 2024, 10:28 AM

BTullis updated the task description. (Show Details)

BTullis updated the task description. (Show Details)Jan 15 2024, 10:47 AM

All currently running production pipelines have completed. There are some user-submitted jobs still running, but we are well within the maintenance window, so we will proceed to put HDFS into safe mode.

Mentioned in SAL (#wikimedia-analytics) [2024-01-15T10:54:43Z] <btullis> putting HDFS into safe mode for T332573

Icinga downtime and Alertmanager silence (ID=e51f2667-db1a-4c38-b125-d23138404c64) set by btullis@cumin1002 for 7 days, 0:00:00 on 97 host(s) and their services with reason: Bringing new nameservers into service

an-worker[1078-1095,1097-1175].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=8958a01b-be93-491e-8968-a18db034f488) set by btullis@cumin1002 for 7 days, 0:00:00 on 8 host(s) and their services with reason: Bringing new nameservers into service

analytics[1070-1077].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=d1c834b7-213d-439c-9ec0-27ed5a825a70) set by btullis@cumin1002 for 7 days, 0:00:00 on 4 host(s) and their services with reason: Bringing new nameservers into service

an-master[1001-1004].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=a1f1c00f-92c4-4d89-8bf3-0cb5bc2f3d7d) set by btullis@cumin1002 for 7 days, 0:00:00 on 4 host(s) and their services with reason: Bringing new nameservers into service

an-coord[1001-1004].eqiad.wmnet
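
As a sanity check on the host counts in the downtime messages above, the bracketed range notation can be expanded and counted. This is a minimal sketch under stated assumptions: `expand_hosts` is a hypothetical helper, not the expansion logic used by cumin, and it assumes a single bracket group with plain integer ranges and no zero padding.

```python
import re

# One optional prefix, one bracket group, one optional suffix.
HOST_PATTERN = re.compile(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)")


def expand_hosts(expression):
    """Expand 'prefix[a-b,c-d]suffix' into a list of hostnames.

    Hypothetical helper for illustration. An expression without brackets
    is returned unchanged as a single-element list.
    """
    match = HOST_PATTERN.fullmatch(expression)
    if match is None:
        return [expression]
    prefix, ranges, suffix = match.groups()
    hosts = []
    for part in ranges.split(","):
        # "1078-1095" is an inclusive range; a bare "1096" is a single host.
        start, _, end = part.partition("-")
        end = end or start
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number}{suffix}")
    return hosts


if __name__ == "__main__":
    for expr in (
        "an-worker[1078-1095,1097-1175].eqiad.wmnet",
        "analytics[1070-1077].eqiad.wmnet",
        "an-master[1001-1004].eqiad.wmnet",
        "an-coord[1001-1004].eqiad.wmnet",
    ):
        print(expr, len(expand_hosts(expr)))
```

The worker expression covers 18 hosts (1078 to 1095) plus 79 hosts (1097 to 1175), which matches the 97 hosts reported in the downtime message.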

Change 990600 merged by Btullis:

[operations/puppet@production] Update the hadoop nameservers

https://gerrit.wikimedia.org/r/990600

Mentioned in SAL (#wikimedia-analytics) [2024-01-15T11:16:04Z] <btullis> running puppet on journal nodes first for T332573

BTullis updated the task description. (Show Details)Jan 15 2024, 11:16 AM

Mentioned in SAL (#wikimedia-analytics) [2024-01-15T11:20:46Z] <btullis> running puppet on an-master1003 to set it to active for T332573

We have now got to a state where the two new nameservers are up and running.

btullis@an-master1003:~$ sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getAllServiceState
an-master1003.eqiad.wmnet:8040                     active    
an-master1004.eqiad.wmnet:8040                     standby
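
The check above (one active namenode, one standby) could also be made from a script. The following is a minimal sketch, assuming the output keeps the two-column "host:port state" layout shown in the transcript; `parse_service_states` and `has_single_active` are hypothetical names, and any line that does not have exactly two columns (for example a connection error) is simply ignored.

```python
# Sample text copied from the haadmin -getAllServiceState transcript above.
SAMPLE_OUTPUT = """\
an-master1003.eqiad.wmnet:8040                     active    
an-master1004.eqiad.wmnet:8040                     standby
"""


def parse_service_states(output):
    """Map each 'host:port' to the state reported next to it."""
    states = {}
    for line in output.splitlines():
        parts = line.split()
        # Assumption: valid lines have exactly two whitespace-separated columns.
        if len(parts) == 2:
            states[parts[0]] = parts[1]
    return states


def has_single_active(states):
    """True when exactly one node is active and all others are standby."""
    values = list(states.values())
    return (
        values.count("active") == 1
        and values.count("standby") == len(values) - 1
    )


if __name__ == "__main__":
    states = parse_service_states(SAMPLE_OUTPUT)
    print(states, has_single_active(states))
```

The `yarn rmadmin -getAllServiceState` output later in this task uses the same two-column layout, so the same parsing would apply there.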

We will exit safe mode.

btullis@an-master1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode get
Safe mode is ON in an-master1003.eqiad.wmnet/10.64.36.15:8020
Safe mode is ON in an-master1004.eqiad.wmnet/10.64.53.14:8020
btullis@an-master1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave
Safe mode is OFF in an-master1003.eqiad.wmnet/10.64.36.15:8020
Safe mode is OFF in an-master1004.eqiad.wmnet/10.64.53.14:8020
btullis@an-master1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode get
Safe mode is OFF in an-master1003.eqiad.wmnet/10.64.36.15:8020
Safe mode is OFF in an-master1004.eqiad.wmnet/10.64.53.14:8020
btullis@an-master1003:~$
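
The safe mode state reported above can be verified in the same way. This is a minimal sketch, assuming each line has the form "Safe mode is ON|OFF in host/ip:port" as in the transcript; the function names are hypothetical and the sample text is copied from the output above.

```python
import re

# Matches lines such as:
#   Safe mode is OFF in an-master1003.eqiad.wmnet/10.64.36.15:8020
SAFEMODE_LINE = re.compile(r"^Safe mode is (ON|OFF) in (\S+)$")

SAMPLE_AFTER_LEAVE = """\
Safe mode is OFF in an-master1003.eqiad.wmnet/10.64.36.15:8020
Safe mode is OFF in an-master1004.eqiad.wmnet/10.64.53.14:8020
"""


def parse_safemode(output):
    """Map each namenode address to True if it reports safe mode ON."""
    result = {}
    for line in output.splitlines():
        match = SAFEMODE_LINE.match(line.strip())
        if match:
            state, namenode = match.groups()
            result[namenode] = state == "ON"
    return result


def all_out_of_safemode(output):
    """True only if at least one namenode reported and none is in safe mode."""
    flags = parse_safemode(output)
    return bool(flags) and not any(flags.values())


if __name__ == "__main__":
    print(all_out_of_safemode(SAMPLE_AFTER_LEAVE))
```

Treating empty output as a failure is a deliberate choice here: if no namenode answers, the check should not report success.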

Change 990627 merged by Brouberol:

[operations/deployment-charts@master] spark-history: update an-master hostnames

https://gerrit.wikimedia.org/r/990627

Mentioned in SAL (#wikimedia-analytics) [2024-01-15T11:38:00Z] <brouberol> redeploying the Spark History Server to pick up the new HDFS namenodes - T332573

BTullis updated the task description. (Show Details)Jan 15 2024, 11:41 AM

Change 990637 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Set the old namenodes to be insetup

https://gerrit.wikimedia.org/r/990637

BTullis updated the task description. (Show Details)Jan 15 2024, 11:48 AM

Change 990612 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Revert "Temporarily disable gobblin ingestion"

https://gerrit.wikimedia.org/r/990612

Change 990613 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Revert "Temporarily disable systemd jobs on an-launcher1002"

https://gerrit.wikimedia.org/r/990613

Change 990637 merged by Btullis:

[operations/puppet@production] Set the old namenodes to be insetup

https://gerrit.wikimedia.org/r/990637

Change 990612 merged by Btullis:

[operations/puppet@production] Revert "Temporarily disable gobblin ingestion"

https://gerrit.wikimedia.org/r/990612

Mentioned in SAL (#wikimedia-analytics) [2024-01-15T11:57:19Z] <btullis> un-pausing all previously paused DAGS on all airflow instances for T332573

BTullis updated the task description. (Show Details)Jan 15 2024, 11:57 AM

Change 990613 merged by Btullis:

[operations/puppet@production] Revert "Temporarily disable systemd jobs on an-launcher1002"

https://gerrit.wikimedia.org/r/990613

Mentioned in SAL (#wikimedia-analytics) [2024-01-15T12:00:47Z] <btullis> removing all downtime for hadoop-all for T332573

Change 990643 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Enable monitoring for the new namenodes

https://gerrit.wikimedia.org/r/990643

BTullis updated the task description. (Show Details)Jan 15 2024, 12:08 PM

BTullis updated the task description. (Show Details)

yarn.wikimedia.org is redirecting to an-master1004.eqiad.wmnet:8088 and that doesn't work. Investigating now.

Ah, it's fine now. It was just a case of this: T331448: Make YARN web interface work with both primary and standby resourcemanager
Confirmed with:

btullis@an-master1003:~$ sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -getAllServiceState
an-master1003.eqiad.wmnet:8033                     standby   
an-master1004.eqiad.wmnet:8033                     active

I restarted the hadoop-yarn-resourcemanager service on an-master1004 and then it started working.

btullis@an-master1003:~$ sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -getAllServiceState
an-master1003.eqiad.wmnet:8033                     active    
an-master1004.eqiad.wmnet:8033                     standby

I have updated https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration and several other references on Wikitech. I believe that the old namenodes are ready to be decommissioned.

BTullis updated the task description. (Show Details)Jan 15 2024, 12:31 PM

Change 990643 merged by Btullis:

[operations/puppet@production] Enable monitoring for the new namenodes

https://gerrit.wikimedia.org/r/990643

Change 990665 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Use insetup::buster for the old namenodes

https://gerrit.wikimedia.org/r/990665

Mentioned in SAL (#wikimedia-analytics) [2024-01-15T16:47:58Z] <btullis> restarted the hive-server2 and hive-metastore services on an-coord100[3-4] which had been accidentally omitted earlier for T332573

BTullis closed this task as Resolved.Jan 16 2024, 3:39 PM

We have been getting some emails from systemd timers that were inadvertently left enabled on an-master1002.
I have disabled them with:

btullis@an-master1002:~$ sudo systemctl disable hadoop-namenode-backup-fetchimage.timer hadoop-namenode-backup-prune.timer hadoop-namenode-backup-hdfs.timer
Removed /etc/systemd/system/multi-user.target.wants/hadoop-namenode-backup-fetchimage.timer.
Removed /etc/systemd/system/multi-user.target.wants/hadoop-namenode-backup-hdfs.timer.
Removed /etc/systemd/system/multi-user.target.wants/hadoop-namenode-backup-prune.timer.

Change 990665 abandoned by Btullis:

[operations/puppet@production] Use insetup::buster for the old namenodes

Reason:

Decommissioned the hosts already

https://gerrit.wikimedia.org/r/990665

Change 989901 abandoned by Btullis:

[operations/puppet@production] Add data for the new an-master100[3-4]

Reason:

Achieved in another commit

https://gerrit.wikimedia.org/r/989901

Maintenance_bot removed a project: Patch-For-Review.Jan 30 2024, 11:31 AM

Refresh an-master100[1-2] with an-master100[3-4]
Closed, ResolvedPublic
Actions

Description

Procedure

Acceptance Criteria

Details

Related Objects

Event Timeline

	BTullis
	Mar 20 2023, 12:41 PM
