
Improve speed and reliability of Yarn's Resource Manager failover
Closed, Resolved · Public · 5 Story Points

Description

After T216952 the Yarn Resource Manager store (basically where Application Ids and their config/parameters are stored) moved from Zookeeper to HDFS. Due to network maintenance that happened yesterday we performed a failover of the Yarn/HDFS master daemons on an-master100[1,2], and we ended up with a Yarn outage (namely the master not being available and, as a consequence, all the nodemanagers on the workers shutting down after a timeout).

After a long investigation with @JAllemandou we found some interesting things:

  • A complete load of all the 10000 Application Ids currently stored in HDFS takes 7~10 minutes. From the code it is easy to see that each of the 10k directories/files is loaded one at a time, and HDFS is not meant to perform well for this access pattern. The Zookeeper equivalent looks faster, since it seems that a single call can fetch all the znodes.
  • The new HDFS based storage ships with a questionable default for yarn.resourcemanager.fs.state-store.retry-policy-spec: 2000,500 (retry every 2000 ms, up to 500 times). A very good explanation is in https://community.hortonworks.com/articles/108734/yarn-resource-manager-retry-policy.html. A quick check sketch follows right after this list.
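A quick way to double-check the size of the HDFS rmstore and the retry policy actually in use (a rough sketch; it assumes the rmstore path shown later in this task and the usual /etc/hadoop/conf/yarn-site.xml location):

  # Count the per-application directories the RM has to load one at a time on recovery
  hdfs dfs -count /user/yarn/rmstore/FSRMStateRoot/RMAppRoot

  # Retry policy format is "sleep-ms,max-retries": the 2000,500 default means up to
  # ~17 minutes spent retrying a single Name Node. If nothing is printed, the default applies.
  grep -A 1 'fs.state-store.retry-policy-spec' /etc/hadoop/conf/yarn-site.xml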

The outage that happened followed this chain of events:

  • The Yarn RM was restarted on an-master1001, followed by the HDFS Name Node (both the active leaders at the time). The expected result was that both daemons on an-master1002 would become leaders without impact.
  • The HDFS Name Node on an-master1002 became the leader immediately (thanks to zookeeper, which we still rely on to decide who's the leader), but the Yarn Resource Manager appeared to be stuck on an-master1002 (it was simply taking a huge amount of time, but that was not clear to me while I was doing the failover).
  • To add more joy, restarting the HDFS Name Node after the Yarn Resource Manager triggered the behavior of yarn.resourcemanager.fs.state-store.retry-policy-spec, namely trying 500 times (with a 2s pause between tries) to contact the HDFS Name Node on an-master1001 (still bootstrapping since it had just been restarted) rather than picking up the newly elected master on an-master1002.
  • After restarting the Yarn Resource Manager on an-master1002 again, the new HDFS Namenode master was picked up, and loading the Yarn Resource Manager state store took several minutes.
  • At this point a rolling restart of all the Yarn Node Managers on the Hadoop workers fixed the issue.

Interesting note: in the Hadoop testing cluster we tried a failover and it took ~30s for Yarn to come back up to speed. The number of application ids stored there is not 10000 but similar (~8500). We think that the HDFS Namenode on the prod cluster is more prone to slowdowns than the testing one due to the different load they have to support; this might explain the difference.

We also discovered that Yarn can be configured to store fewer than 10000 application ids via the following parameters (see the sketch after the list):

  • yarn.resourcemanager.state-store.max-completed-applications: The maximum number of completed applications RM state store keeps, less than or equals to ${yarn.resourcemanager.max-completed-applications}. By default, it equals to ${yarn.resourcemanager.max-completed-applications}. This ensures that the applications kept in the state store are consistent with the applications remembered in RM memory. Any values larger than ${yarn.resourcemanager.max-completed-applications} will be reset to ${yarn.resourcemanager.max-completed-applications}. Note that this value impacts the RM recovery performance. Typically, a smaller value indicates better performance on RM recovery.
  • yarn.resourcemanager.max-completed-applications: The maximum number of completed applications RM keeps. Default: 10000
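As an illustration only (the target value below is hypothetical; the real values are managed via puppet), lowering both properties trims the state store, and the effective settings can be double-checked on the RM host after a restart:

  # Hypothetical target: keep roughly one day of completed apps instead of ~4 days
  #   yarn.resourcemanager.max-completed-applications             -> 2500
  #   yarn.resourcemanager.state-store.max-completed-applications -> 2500
  # Verify what the daemon actually picked up (assumes the usual CDH config path):
  grep -B 1 -A 1 'max-completed-applications' /etc/hadoop/conf/yarn-site.xml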

Last but not least, the zookeeper Yarn storage adds an interesting feature: it adds ACLs to the znodes of a given Yarn Resource Manager to make sure that only one master at a time gets the privilege to write to zookeeper. In theory this shouldn't be needed in our case, since we already rely on zookeeper for the Yarn failover/active-leader election, but it is surely another layer of defense against split-brain scenarios.
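For the record, those fencing ACLs can be inspected from any zookeeper client. A minimal sketch, assuming the CDH zookeeper-client wrapper, the default parent path /rmstore with root znode ZKRMStateRoot, and a placeholder ZK_HOST (our actual ensemble and path may differ):

  # List the state store root and show which RM currently holds write access to it
  printf 'ls /rmstore/ZKRMStateRoot\ngetAcl /rmstore/ZKRMStateRoot\n' | \
      zookeeper-client -server ZK_HOST:2181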

To wrap up, this task should focus on answering the following:

  • Do we need to keep a history of 10k application ids in our Yarn state?
  • Do we want to move back to the Zookeeper based storage (given the above, and maybe with fewer znodes stored)?

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mar 20 2019, 8:03 AM
elukey triaged this task as High priority. · Mar 20 2019, 8:04 AM
elukey updated the task description.
elukey updated the task description.
elukey updated the task description.
elukey added a comment. (Edited) · Mar 20 2019, 8:16 AM

Quick check on the current HDFS based Yarn app ids store, counting the app-id directories by creation date:

elukey@an-master1001:~$ hdfs dfs -ls /user/yarn/rmstore/FSRMStateRoot/RMAppRoot | awk '{print $6}' | sort | uniq -c

   2124 2019-03-17
   3036 2019-03-18
   2895 2019-03-19
   1958 2019-03-20

So we have a history of 4 days, more or less, at any given time. Not sure if this also translates into which logs are retained and for how long (edit: seems to be 3h per yarn.nodemanager.log.retain-seconds's default).

Change 497702 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] yarn: allow the configuration of maximum app ids retained in HDFS/Zookeeper

https://gerrit.wikimedia.org/r/497702

Do we need to keep a history of 10k application ids in our Yarn state?

No I don't think so! 10K (4ish days) is nice to have in memory I suppose, but I see no reason to bog down start-up times with 4 days of completed app ids. 1 or 2 days would be plenty.

Do we want to move back to the Zookeeper based storage

Maybe ya! Let's buy an analytics ZK cluster next FY and use it for Hadoop and Druid clusters. Actually...we might have some funds this FY to do it.

BTW great finds y'all! Thanks so much for figuring this out and the great write up!

So we have a history of 4 days, more or less, at any given time. Not sure if this also translates in what logs are retained and for how long (edit: seems to be 3h for yarn.nodemanager.log.retain-seconds's default).

@elukey : The default definition says: Only applicable if log aggregation is disabled :)

We keep 3 months of logs:

sudo -u hdfs hdfs dfs -ls /var/log/hadoop-yarn/apps/*/logs | awk '{print $6}' | sort | uniq -c
   Jobs Day
   1395 2018-12-19
   3979 2018-12-20
   2794 2018-12-21
   2734 2018-12-22
   2733 2018-12-23
   2740 2018-12-24
   2759 2018-12-25
   2705 2018-12-26
   3603 2018-12-27
   2741 2018-12-28
   2847 2018-12-29
   2718 2018-12-30
   2796 2018-12-31
   3650 2019-01-01
   3834 2019-01-02
   2928 2019-01-03
   2741 2019-01-04
   6174 2019-01-05
   5952 2019-01-06
   3713 2019-01-07
   3327 2019-01-08
   2921 2019-01-09
   2929 2019-01-10
   2780 2019-01-11
   2866 2019-01-12
   2751 2019-01-13
   4279 2019-01-14
   4626 2019-01-15
   2901 2019-01-16
   2929 2019-01-17
   3050 2019-01-18
   2948 2019-01-19
   4033 2019-01-20
   3063 2019-01-21
   3205 2019-01-22
   2894 2019-01-23
   2868 2019-01-24
   2855 2019-01-25
   2814 2019-01-26
   3603 2019-01-27
   2805 2019-01-28
   3001 2019-01-29
   2831 2019-01-30
   2804 2019-01-31
   3752 2019-02-01
   3536 2019-02-02
   2780 2019-02-03
   2714 2019-02-04
   5254 2019-02-05
   7724 2019-02-06
   6984 2019-02-07
   2970 2019-02-08
   2669 2019-02-09
   2890 2019-02-10
   2965 2019-02-11
   2893 2019-02-12
   2732 2019-02-13
   3832 2019-02-14
   4447 2019-02-15
   2816 2019-02-16
   2811 2019-02-17
   2830 2019-02-18
   2876 2019-02-19
   3722 2019-02-20
   3016 2019-02-21
   2909 2019-02-22
   2807 2019-02-23
   2791 2019-02-24
   2928 2019-02-25
   2945 2019-02-26
   3658 2019-02-27
   3732 2019-02-28
   3985 2019-03-01
   3556 2019-03-02
   2811 2019-03-03
   3137 2019-03-04
   3542 2019-03-05
   9573 2019-03-06
   6253 2019-03-07
  10081 2019-03-08
   5169 2019-03-09
   2789 2019-03-10
  13062 2019-03-11
  13166 2019-03-12
   4370 2019-03-13
   7669 2019-03-14
  12054 2019-03-15
  11595 2019-03-16
   2797 2019-03-17
   3038 2019-03-18
   2901 2019-03-19
   2530 2019-03-20

And there are days with more jobs than others, with a noticeable difference when we sqoop (1 job per table per wiki - A LOT of them).
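Side note: if the 3 months comes from the standard aggregated-log retention setting, it should show up as roughly 7776000 seconds (90 days); a quick sketch to confirm, assuming the usual CDH config path:

  # Aggregated-log retention (only meaningful when log aggregation is enabled)
  grep -A 1 'log-aggregation.retain-seconds' /etc/hadoop/conf/yarn-site.xml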

So we have a history of 4 days, more or less, at any given time. Not sure if this also translates in what logs are retained and for how long (edit: seems to be 3h for yarn.nodemanager.log.retain-seconds's default).

@elukey : The default definition says: Only applicable if log aggregation is disabled :)

We keep 3 month of logs:

TIL, thanks! If we keep those logs I don't really see the purpose of keeping around 10k app ids' metadata; this is the part that I am still missing. Why is the default 10k? What is the advantage?

Change 497702 merged by Elukey:
[operations/puppet/cdh@master] yarn: allow the configuration of maximum app ids retained in HDFS/Zookeeper

https://gerrit.wikimedia.org/r/497702

Change 498044 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::common: add new Yarn RM properties

https://gerrit.wikimedia.org/r/498044

Change 498044 merged by Elukey:
[operations/puppet@production] profile::hadoop::common: add new Yarn RM properties

https://gerrit.wikimedia.org/r/498044

elukey@an-master1001:~$ hdfs dfs -ls /user/yarn/rmstore/FSRMStateRoot/RMAppRoot | wc -l
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
5013

Recovery time during failover is now down to 6/7 minutes, with a decent retry policy in case HDFS goes down at the same time. This still doesn't prevent some node managers from shutting down after failing to contact the RM for too long though, so the long term fix is to move back to zookeeper.
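For the next failover test, a rough way to measure recovery time from the outside (a sketch; rm1/rm2 are placeholders for whatever yarn.resourcemanager.ha.rm-ids defines on our masters):

  # Poll both RMs and print a timestamped state line every 10s; the recovery time
  # can then be read off as the gap during which neither RM reports "active".
  while true; do
      for rm in rm1 rm2; do
          state=$(yarn rmadmin -getServiceState "$rm" 2>/dev/null)
          echo "$(date -u +%H:%M:%S) $rm ${state:-unreachable}"
      done
      sleep 10
  done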

Change 499715 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Switch the Yarn rmstore config back to zookeeper for Analytics Hadoop test

https://gerrit.wikimedia.org/r/499715

Change 499716 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Switch the Yarn rmstore config back to zookeeper for Analytics Hadoop

https://gerrit.wikimedia.org/r/499716

Change 499715 merged by Elukey:
[operations/puppet@production] Switch the Yarn rmstore config back to zookeeper for Analytics Hadoop test

https://gerrit.wikimedia.org/r/499715

Mentioned in SAL (#wikimedia-operations) [2019-03-28T08:33:46Z] <elukey> move hadoop yarn configuration from hdfs back to zookeeper - T218758

Change 499716 merged by Elukey:
[operations/puppet@production] Switch the Yarn rmstore config back to zookeeper for Analytics Hadoop

https://gerrit.wikimedia.org/r/499716

elukey set the point value for this task to 8. · Mar 28 2019, 9:14 AM
elukey changed the point value for this task from 8 to 5.

Decided to switch back to zookeeper as a precautionary measure; I wasn't comfortable using the hdfs rmstore anymore. The next step will be to move all the zookeeper configuration to a new cluster when we have the hosts, but that is not in the scope of this task :)
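For completeness, after the switch the zookeeper side can be sanity-checked with something like the following (a sketch, assuming the default /rmstore parent path, the CDH zookeeper-client wrapper and a placeholder ZK_HOST):

  # Count the application znodes now kept in zookeeper instead of HDFS
  echo 'ls /rmstore/ZKRMStateRoot/RMAppRoot' | zookeeper-client -server ZK_HOST:2181 \
      | grep -o 'application_[0-9_]*' | wc -l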

elukey moved this task from In Progress to Done on the Analytics-Kanban board. · Mar 28 2019, 9:16 AM
Nuria closed this task as Resolved. · Apr 4 2019, 8:16 PM