
Replace the Analytics HDFS/Yarn masters (hardware refresh)
Closed, Resolved · Public · 13 Estimated Story Points

Description

The HDFS and Yarn masters (analytics100[12]) need to be replaced as part of the usual hardware refresh workflow. The new hardware has already been ordered and is currently being racked in T201939.

As with all delicate/risky Hadoop procedures, upstream offers little documentation about the safest way to do things, apart from the occasional brave user who reports their story:

https://stackoverflow.com/questions/40216709/moving-hadoop-master-node-in-another-box-how-to-handle-hdfs?rq=1

After a chat with Andrew and Joseph we reached the same conclusion as the user in the above thread, namely that it is much safer and less error prone to shut down the cluster completely for this maintenance.

The (high level) idea is the following:

  • Stop all the regular Analytics processing jobs, and alert people in advance about the maintenance.
  • Stop Hive, Hue, etc.
  • Enter HDFS Safe Mode (only reads, no writes allowed).
  • Shut down the cluster.
  • Replace the master node hostnames in puppet and make sure that the change is applied everywhere.
  • Copy the HDFS state from the "current" masters over to the new masters (an rsync is sufficient).
  • Start the cluster.
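
The HDFS-specific steps above (Safe Mode, then the state copy) could be sketched roughly as follows. This is only an illustration, not the procedure that was actually run: the function name `build_hdfs_swap_commands`, the NameNode directory `/var/lib/hadoop/name`, and the choice to pull the data with rsync from the new master are all assumptions. The sketch only builds the commands and does not execute anything.

```python
# Sketch only: builds the HDFS-related commands from the list above
# without running them. The name directory and the pull direction of
# the rsync are assumptions, not values taken from this task.


def build_hdfs_swap_commands(old_master, name_dir):
    """Return argv lists for the Safe Mode and state-copy steps, in order."""
    base = name_dir.rstrip("/")
    return [
        # Enter Safe Mode so the namespace becomes read-only.
        ["hdfs", "dfsadmin", "-safemode", "enter"],
        # Report the Safe Mode status before going any further.
        ["hdfs", "dfsadmin", "-safemode", "get"],
        # Once the cluster is shut down: copy the NameNode state from
        # the old master. Trailing slashes make rsync copy the directory
        # contents rather than the directory itself.
        ["rsync", "-a", f"{old_master}:{base}/", f"{base}/"],
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    for argv in build_hdfs_swap_commands("analytics1001", "/var/lib/hadoop/name"):
        print(" ".join(argv))
```

Each returned list can be reviewed by hand or passed to `subprocess.run` on the appropriate host. The cluster shutdown between the Safe Mode check and the rsync is deliberately left out, since the service names involved are not given here.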

This of course needs to be tested in labs, but the procedure seems sound from a quick review.

Useful doc to reference: https://etherpad.wikimedia.org/p/analytics-hadoop-java8 (the last Java upgrade, which also involved shutting down the cluster).

The tentative date for the switch in production (if testing goes well) is Sept 25th (to be announced/scheduled). We expect the Hadoop cluster to be down for only a couple of hours.

Procedure to follow: https://etherpad.wikimedia.org/p/analytics-swap-masters (still WIP)

Event Timeline

elukey triaged this task as Medium priority. Sep 6 2018, 7:17 AM
elukey created this task.
Restricted Application added a subscriber: Aklapper. Sep 6 2018, 7:17 AM
elukey updated the task description. Sep 6 2018, 7:50 AM
elukey updated the task description. Sep 6 2018, 8:46 AM
elukey updated the task description. Sep 6 2018, 12:00 PM

I tested the procedure in labs and it worked fine, with one caveat: hadoop-hdfs-zkfc seems to store the names of the master nodes in ZooKeeper, under /hadoop-ha/analytics-hadoop-labs/ActiveBreadCrumb (the cluster name in the path varies depending on which cluster you are on), so I had to rmr it via zkCli.sh to make a new leader election work with the new hosts.
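
As a rough illustration of the cleanup described above, the znode path and the matching zkCli.sh invocation can be derived from the cluster name. The helper names and the `localhost:2181` server address are assumptions made for this sketch; the exact command that was run in labs is not recorded here. Nothing is executed, the command is only built.

```python
# Sketch only: derives the ZKFC breadcrumb znode for a given cluster
# name and the zkCli.sh command that would remove it. Helper names and
# the server address are hypothetical.


def zkfc_breadcrumb_path(cluster_name):
    """Return the znode where ZKFC appears to keep the last active master."""
    return f"/hadoop-ha/{cluster_name}/ActiveBreadCrumb"


def zkcli_rmr_command(zk_server, cluster_name):
    """Return the argv list that removes the breadcrumb znode via zkCli.sh."""
    return ["zkCli.sh", "-server", zk_server, "rmr", zkfc_breadcrumb_path(cluster_name)]


if __name__ == "__main__":
    # The labs cluster name comes from the comment above; the server
    # address is an assumed default.
    print(" ".join(zkcli_rmr_command("localhost:2181", "analytics-hadoop-labs")))
```

Building the path from the cluster name avoids removing the breadcrumb of the wrong cluster when the same ZooKeeper ensemble serves more than one.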

greaaaattttttttt

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board. Sep 6 2018, 1:29 PM

Change 461979 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Swap analytics100[1,2] with an-master100[1,2]

https://gerrit.wikimedia.org/r/461979

Change 461988 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Guard analytics cron jobs to ease the hw refresh of an1003

https://gerrit.wikimedia.org/r/461988

Change 461997 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Replace analytics1003's occurrences with an-coord1001

https://gerrit.wikimedia.org/r/461997

Change 461988 merged by Elukey:
[operations/puppet@production] Guard analytics cron jobs to ease the hw refresh of an1003

https://gerrit.wikimedia.org/r/461988

elukey moved this task from Backlog to In Progress on the User-Elukey board. Sep 24 2018, 11:39 AM

Mentioned in SAL (#wikimedia-operations) [2018-09-24T11:39:36Z] <elukey> reboot an-master100[1,2] as part of the pre-checks before the hadoop master daemons swap - T203635

Mentioned in SAL (#wikimedia-operations) [2018-09-25T08:03:27Z] <elukey> start of the maintenance to swap Hadoop masters from analytics100[1,2] to an-master100[1,2] - T203635

Change 461979 merged by Elukey:
[operations/puppet@production] Swap analytics100[1,2] with an-master100[1,2]

https://gerrit.wikimedia.org/r/461979

Change 462684 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Swap occurrences of analytics1002 with an-master1002

https://gerrit.wikimedia.org/r/462684

Change 462684 merged by Elukey:
[operations/puppet@production] Swap occurrences of analytics1002 with an-master1002

https://gerrit.wikimedia.org/r/462684

Mentioned in SAL (#wikimedia-operations) [2018-09-25T12:01:04Z] <elukey> end of the maintenance to swap Hadoop masters from analytics100[1,2] to an-master100[1,2] - T203635

Change 462720 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set analytics100[1,2] to role spare system

https://gerrit.wikimedia.org/r/462720

Change 462720 merged by Elukey:
[operations/puppet@production] Set analytics100[1,2] to role spare system

https://gerrit.wikimedia.org/r/462720

elukey moved this task from Ready to Deploy to Done on the Analytics-Kanban board. Sep 26 2018, 6:58 AM
elukey set the point value for this task to 13.
elukey moved this task from In Progress to Done on the User-Elukey board. Sep 26 2018, 7:03 AM
Nuria closed this task as Resolved. Sep 26 2018, 7:12 PM

Change 465130 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] druid: replace analytics1003 with an-coord1001

https://gerrit.wikimedia.org/r/465130

Change 465130 merged by Elukey:
[operations/puppet@production] druid: replace analytics1003 with an-coord1001

https://gerrit.wikimedia.org/r/465130

Change 461997 merged by Elukey:
[operations/puppet@production] Replace analytics1003's occurrences with an-coord1001

https://gerrit.wikimedia.org/r/461997

elukey changed the status of subtask T205507: Decommission analytics100[1,2] from Open to Stalled. Jan 7 2019, 3:25 PM
elukey changed the status of subtask T205507: Decommission analytics100[1,2] from Stalled to Open. Jan 7 2019, 3:53 PM