Maniphest T203635

Replace the Analytics HDFS/Yarn masters (hardware refresh)
Closed, Resolved · Public · 13 Estimated Story Points

Assigned To

Authored By

	elukey
	Sep 6 2018, 7:17 AM

Description

The HDFS and Yarn masters (analytics100[12]) need to be replaced as part of the usual hardware refresh workflow. The new hardware has already been ordered and is currently being racked in T201939.

As with all delicate/risky Hadoop procedures, upstream offers little documentation about the safest way to do things, apart from the occasional brave user who reports their story:

https://stackoverflow.com/questions/40216709/moving-hadoop-master-node-in-another-box-how-to-handle-hdfs?rq=1

After a chat with Andrew and Joseph we reached the same conclusion as the user in the above thread, namely that it is much safer and less error-prone to shut down the cluster completely for this maintenance.

The (high level) idea is the following:

1. Stop all the regular Analytics processing jobs, and alert people in advance about the maintenance.
2. Stop Hive, Hue, etc.
3. Enter HDFS Safe Mode (reads only, no writes allowed).
4. Shut down the cluster.
5. Replace the master node hostnames in puppet and make sure that the resulting configuration is updated everywhere.
6. Copy the HDFS state from the "current" masters over to the new masters (an rsync is sufficient).
7. Start the cluster.
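The steps above can be sketched as an ordered, dry-run plan. This is only an illustration, not the procedure that was actually run. The `Step` and `build_plan` names and the `name_dir` path are hypothetical, and the exact rsync flags are an assumption. Only `hdfs dfsadmin -safemode enter` and `rsync -a` are real CLI invocations. Nothing is executed; the plan is only printed for review.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """One maintenance step; `command` is empty when the step is manual."""
    description: str
    command: str = ""


def build_plan(old_master: str, name_dir: str) -> List[Step]:
    """Return the high-level steps in the order they must be performed.

    `old_master` is the host to copy the HDFS state from and `name_dir`
    is a placeholder for the NameNode metadata directory (hypothetical).
    """
    return [
        Step("Stop regular Analytics jobs and alert users"),
        Step("Stop Hive, Hue, etc."),
        Step("Enter HDFS Safe Mode (reads only)", "hdfs dfsadmin -safemode enter"),
        Step("Shut down the cluster"),
        Step("Replace the master hostnames in puppet"),
        # Assumed to be run on the new master, pulling from the old one.
        Step("Copy the HDFS state to the new master",
             f"rsync -a {old_master}:{name_dir}/ {name_dir}/"),
        Step("Start the cluster"),
    ]


def render(plan: List[Step]) -> str:
    """Format the plan as numbered lines, showing commands where present."""
    lines = []
    for number, step in enumerate(plan, start=1):
        suffix = f"  [{step.command}]" if step.command else ""
        lines.append(f"{number}. {step.description}{suffix}")
    return "\n".join(lines)


if __name__ == "__main__":
    # "/path/to/name" is a placeholder, not the real directory.
    print(render(build_plan("analytics1001", "/path/to/name")))
```

Keeping the plan as data rather than a script makes it easy to review the ordering (safe mode before shutdown, copy before restart) ahead of the maintenance window.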

This of course needs to be tested in labs, but the procedure seems sound from a quick review.

Useful doc to reference: https://etherpad.wikimedia.org/p/analytics-hadoop-java8 (the last Java upgrade, which also involved shutting down the cluster).

The target date for the switch in production (if testing goes well) is Sept 25th (to be announced/scheduled). We expect it to require shutting down the Hadoop cluster for only a couple of hours.

Procedure to follow: https://etherpad.wikimedia.org/p/analytics-swap-masters (still WIP)

Details

Subject	Repo	Branch	Lines +/-
Replace analytics1003's occurrences with an-coord1001	operations/puppet	production	+10 -18
druid: replace analytics1003 with an-coord1001	operations/puppet	production	+2 -2
Set analytics100[1,2] to role spare system	operations/puppet	production	+4 -8
Swap occurrences of analytics1002 with an-master1002	operations/puppet	production	+3 -3
Swap analytics100[1,2] with an-master100[1,2]	operations/puppet	production	+20 -9
Guard analytics cron jobs to ease the hw refresh of an1003	operations/puppet	production	+350 -326


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		elukey	T203635 Replace the Analytics HDFS/Yarn masters (hardware refresh)
		Resolved		Jclark-ctr	T205507 Decommission analytics100[1,2]

Event Timeline

elukey triaged this task as Medium priority. Sep 6 2018, 7:17 AM

elukey created this task.

Restricted Application added a subscriber: Aklapper. Sep 6 2018, 7:17 AM

elukey updated the task description. Sep 6 2018, 7:50 AM

elukey updated the task description. Sep 6 2018, 8:46 AM

elukey updated the task description. Sep 6 2018, 12:00 PM

So I tested the procedure in labs and it worked fine, with one caveat: hadoop-hdfs-zkfc seems to store the names of the master nodes in zookeeper, under /hadoop-ha/analytics-hadoop-labs/ActiveBreadCrumb (the cluster name in the path of course varies depending on which cluster you are in), so I had to rmr it via zkCli.sh to make a new leader election work with the new hosts.
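As a small illustration of the caveat above, the znode path can be derived from the cluster name. The helper names below are hypothetical. The path layout and the `rmr` command are taken from the comment itself. The snippet only builds the string to type into zkCli.sh and does not talk to ZooKeeper.

```python
def breadcrumb_znode(cluster_name: str) -> str:
    """Path of the znode where zkfc appears to record the active master."""
    return f"/hadoop-ha/{cluster_name}/ActiveBreadCrumb"


def zkcli_rmr_command(cluster_name: str) -> str:
    """Command to enter at the zkCli.sh prompt to remove that znode."""
    return f"rmr {breadcrumb_znode(cluster_name)}"


if __name__ == "__main__":
    # Cluster name as used in labs; other clusters use a different name.
    print(zkcli_rmr_command("analytics-hadoop-labs"))
```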

greaaaattttttttt

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board. Sep 6 2018, 1:29 PM

fdans moved this task from Incoming to Operational Excellence on the Analytics board. Sep 6 2018, 4:48 PM

elukey added a project: User-Elukey. Sep 7 2018, 12:17 PM

elukey moved this task from In Progress to Ready to Deploy on the Analytics-Kanban board. Sep 14 2018, 6:47 AM

Change 461979 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Swap analytics100[1,2] with an-master100[1,2]

https://gerrit.wikimedia.org/r/461979

gerritbot added a project: Patch-For-Review. Sep 21 2018, 3:29 PM

Change 461988 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Guard analytics cron jobs to ease the hw refresh of an1003

https://gerrit.wikimedia.org/r/461988

Change 461997 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Replace analytics1003's occurrences with an-coord1001

https://gerrit.wikimedia.org/r/461997

Change 461988 merged by Elukey:
[operations/puppet@production] Guard analytics cron jobs to ease the hw refresh of an1003

https://gerrit.wikimedia.org/r/461988

elukey moved this task from Backlog to In Progress on the User-Elukey board. Sep 24 2018, 11:39 AM

Mentioned in SAL (#wikimedia-operations) [2018-09-24T11:39:36Z] <elukey> reboot an-master100[1,2] as part of the pre-checks before the hadoop master daemons swap - T203635

Mentioned in SAL (#wikimedia-operations) [2018-09-25T08:03:27Z] <elukey> start of the maintenance to swap Hadoop masters from analytics100[1,2] to an-master100[1,2] - T203635

Change 461979 merged by Elukey:
[operations/puppet@production] Swap analytics100[1,2] with an-master100[1,2]

https://gerrit.wikimedia.org/r/461979

Change 462684 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Swap occurrences of analytics1002 with an-master1002

https://gerrit.wikimedia.org/r/462684

Change 462684 merged by Elukey:
[operations/puppet@production] Swap occurrences of analytics1002 with an-master1002

https://gerrit.wikimedia.org/r/462684

Mentioned in SAL (#wikimedia-operations) [2018-09-25T12:01:04Z] <elukey> end of the maintenance to swap Hadoop masters from analytics100[1,2] to an-master100[1,2] - T203635

Change 462720 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set analytics100[1,2] to role spare system

https://gerrit.wikimedia.org/r/462720

Change 462720 merged by Elukey:
[operations/puppet@production] Set analytics100[1,2] to role spare system

https://gerrit.wikimedia.org/r/462720

elukey mentioned this in T192642: Upgrade Analytics infrastructure to Debian Stretch. Sep 26 2018, 6:43 AM

elukey moved this task from Ready to Deploy to Done on the Analytics-Kanban board. Sep 26 2018, 6:58 AM

elukey set the point value for this task to 13.

elukey moved this task from In Progress to Done on the User-Elukey board. Sep 26 2018, 7:03 AM

Nuria closed this task as Resolved. Sep 26 2018, 7:12 PM

Change 465130 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] druid: replace analytics1003 with an-coord1001

https://gerrit.wikimedia.org/r/465130

Change 465130 merged by Elukey:
[operations/puppet@production] druid: replace analytics1003 with an-coord1001

https://gerrit.wikimedia.org/r/465130

Change 461997 merged by Elukey:
[operations/puppet@production] Replace analytics1003's occurrences with an-coord1001

https://gerrit.wikimedia.org/r/461997

elukey changed the status of subtask T205507: Decommission analytics100[1,2] from Open to Stalled. Jan 7 2019, 3:25 PM

elukey changed the status of subtask T205507: Decommission analytics100[1,2] from Stalled to Open. Jan 7 2019, 3:53 PM

Cmjohnson closed subtask T205507: Decommission analytics100[1,2] as Resolved. May 13 2020, 5:56 PM

Maintenance_bot removed a project: Patch-For-Review. May 13 2020, 6:12 PM
