Replace the Analytics HDFS/Yarn masters (hardware refresh)
Closed, Resolved | Public | 13 Estimated Story Points

Description

The HDFS and Yarn masters (analytics100[12]) need to be replaced as part of the usual hardware refresh workflow. The new hardware has already been ordered and is currently being racked in T201939.

As with all delicate/risky Hadoop procedures, upstream offers little documentation about the safest way to do this, apart from the occasional brave user reporting their own story:

https://stackoverflow.com/questions/40216709/moving-hadoop-master-node-in-another-box-how-to-handle-hdfs?rq=1

After a chat with Andrew and Joseph we reached the same conclusion as the user in the above thread, namely that it is much safer and less error prone to shut down the cluster completely for this maintenance.

The (high level) idea is the following:

  • Stop all the regular Analytics processing jobs, and alert people in advance about the maintenance.
  • Stop Hive, Hue, etc.
  • Enter HDFS Safe Mode (only reads, no writes allowed).
  • Shut down the cluster.
  • Replace the master node domain names in puppet and make sure that the resulting configuration file is updated everywhere.
  • Copy the HDFS state from the "current" masters over to the new masters (an rsync is sufficient).
  • Start the cluster.
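
The core of the steps above (safe mode, stopping the master daemons, rsync, restart) can be sketched as a dry-run plan that only builds the commands in order without executing anything. This is a minimal sketch, not the procedure from the etherpad: the systemd service names, the NameNode directory path and the helper name build_swap_plan are assumptions, and the worker daemons, Hive/Hue and the puppet change are left out.

```python
import shlex
from typing import List, Tuple

Step = Tuple[str, str]  # (host to run on, shell command)

# Assumption: service names as shipped by common Hadoop packaging; verify locally.
MASTER_SERVICES = [
    "hadoop-yarn-resourcemanager",
    "hadoop-hdfs-zkfc",
    "hadoop-hdfs-namenode",
]


def build_swap_plan(old_master: str, new_master: str, namenode_dir: str) -> List[Step]:
    """Return the ordered commands for the master swap without running them."""
    # Trailing slash so rsync copies the directory contents, not the directory itself.
    src = namenode_dir.rstrip("/") + "/"
    dest = f"{new_master}:{src}"

    # 1. Block writes while keeping reads available.
    plan: List[Step] = [(old_master, "sudo -u hdfs hdfs dfsadmin -safemode enter")]
    # 2. Stop the master daemons on the old host.
    plan += [(old_master, f"sudo systemctl stop {svc}") for svc in MASTER_SERVICES]
    # 3. Copy the NameNode state to the new host.
    plan.append((old_master, f"rsync -a {shlex.quote(src)} {shlex.quote(dest)}"))
    # 4. Start the daemons on the new host, in reverse order.
    plan += [(new_master, f"sudo systemctl start {svc}") for svc in reversed(MASTER_SERVICES)]
    return plan


if __name__ == "__main__":
    # "/var/lib/hadoop/name" is a hypothetical path; use the real dfs.namenode.name.dir.
    for host, command in build_swap_plan("analytics1001", "an-master1001", "/var/lib/hadoop/name"):
        print(f"[{host}] {command}")
```

Printing the plan first makes it easy to review the ordering with other people before anything is touched in production.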

This of course needs to be tested in labs, but the procedure seems sound from a quick review.

Useful doc to reference: https://etherpad.wikimedia.org/p/analytics-hadoop-java8 (the last Java upgrade, which also involved shutting down the cluster).

The tentative date for the switch in production, provided that testing goes well, is Sept 25th (to be announced/scheduled). We expect the Hadoop cluster to be down for only a couple of hours.

Procedure to follow: https://etherpad.wikimedia.org/p/analytics-swap-masters (still WIP)

Event Timeline

elukey triaged this task as Medium priority. Sep 6 2018, 7:17 AM
elukey created this task.

I tested the procedure in labs and it worked fine, with one caveat: hadoop-hdfs-zkfc seems to store the names of the master nodes in zookeeper, under /hadoop-ha/analytics-hadoop-labs/ActiveBreadCrumb (the cluster name in the path of course depends on which cluster you are in). I had to rmr that znode via zkCli.sh to make a new leader election work with the new hosts.
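
As an illustration of the caveat above, the znode path and the matching zkCli.sh invocation can be derived from the cluster name. This is a small sketch under assumptions: the helper names and the ZooKeeper address localhost:2181 are hypothetical, and nothing is executed. Newer ZooKeeper releases replace rmr with deleteall, so the command name may need adjusting.

```python
from typing import List


def breadcrumb_znode(cluster_name: str) -> str:
    """Return the znode where zkfc records the active master for a cluster."""
    if not cluster_name or "/" in cluster_name:
        raise ValueError(f"invalid cluster name: {cluster_name!r}")
    return f"/hadoop-ha/{cluster_name}/ActiveBreadCrumb"


def zkcli_rmr_command(zk_server: str, cluster_name: str) -> List[str]:
    """Build (but do not run) the zkCli.sh call that removes the stale znode."""
    return ["zkCli.sh", "-server", zk_server, "rmr", breadcrumb_znode(cluster_name)]


if __name__ == "__main__":
    # localhost:2181 is a placeholder for a real ZooKeeper host and port.
    print(" ".join(zkcli_rmr_command("localhost:2181", "analytics-hadoop-labs")))
```

Keeping the cluster name as a parameter avoids removing the znode of the wrong cluster when the same steps are repeated outside labs.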

Change 461979 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Swap analytics100[1,2] with an-master100[1,2]

https://gerrit.wikimedia.org/r/461979

Change 461988 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Guard analytics cron jobs to ease the hw refresh of an1003

https://gerrit.wikimedia.org/r/461988

Change 461997 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Replace analytics1003's occurrences with an-coord1001

https://gerrit.wikimedia.org/r/461997

Change 461988 merged by Elukey:
[operations/puppet@production] Guard analytics cron jobs to ease the hw refresh of an1003

https://gerrit.wikimedia.org/r/461988

Mentioned in SAL (#wikimedia-operations) [2018-09-24T11:39:36Z] <elukey> reboot an-master100[1,2] as part of the pre-checks before the hadoop master daemons swap - T203635

Mentioned in SAL (#wikimedia-operations) [2018-09-25T08:03:27Z] <elukey> start of the maintenance to swap Hadoop masters from analytics100[1,2] to an-master100[1,2] - T203635

Change 461979 merged by Elukey:
[operations/puppet@production] Swap analytics100[1,2] with an-master100[1,2]

https://gerrit.wikimedia.org/r/461979

Change 462684 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Swap occurrences of analytics1002 with an-master1002

https://gerrit.wikimedia.org/r/462684

Change 462684 merged by Elukey:
[operations/puppet@production] Swap occurrences of analytics1002 with an-master1002

https://gerrit.wikimedia.org/r/462684

Mentioned in SAL (#wikimedia-operations) [2018-09-25T12:01:04Z] <elukey> end of the maintenance to swap Hadoop masters from analytics100[1,2] to an-master100[1,2] - T203635

Change 462720 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set analytics100[1,2] to role spare system

https://gerrit.wikimedia.org/r/462720

Change 462720 merged by Elukey:
[operations/puppet@production] Set analytics100[1,2] to role spare system

https://gerrit.wikimedia.org/r/462720

elukey set the point value for this task to 13.

Change 465130 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] druid: replace analytics1003 with an-coord1001

https://gerrit.wikimedia.org/r/465130

Change 465130 merged by Elukey:
[operations/puppet@production] druid: replace analytics1003 with an-coord1001

https://gerrit.wikimedia.org/r/465130

Change 461997 merged by Elukey:
[operations/puppet@production] Replace analytics1003's occurrences with an-coord1001

https://gerrit.wikimedia.org/r/461997