After T216952 the Yarn Resource Manager state store (basically where Application Ids and their config/parameters are stored) moved from Zookeeper to HDFS. Due to a network maintenance that happened yesterday we performed a failover of the Yarn/HDFS master daemons on an-master100[1,2], and we ended up with a Yarn outage (namely the master not being available and, as a consequence, all the nodemanagers on the workers shutting down after a timeout).
After a long investigation with @JAllemandou we found some interesting things:
- A complete load of all the 10000 Application Ids currently stored in HDFS takes 7~10 minutes. From the code it is easy to see that each of the 10k directories/files is loaded one at a time, and HDFS is not meant to perform as we'd need in this case (see the first sketch after this list). The Zookeeper equivalent looks faster, since it seems that you can ask Zookeeper for all the znodes in a single call.
- The new HDFS based storage relies on a property with a questionable default, yarn.resourcemanager.fs.state-store.retry-policy-spec: 2000,500 (see the second sketch after this list). There is a very nice explanation in https://community.hortonworks.com/articles/108734/yarn-resource-manager-retry-policy.html.
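To give an idea of where the 7~10 minutes go, here is a minimal sketch of the sequential access pattern (the state-store path /rmstore/FSRMStateRoot/RMAppRoot and the loop are illustrative assumptions, not the actual FileSystemRMStateStore code): every entry costs a few NameNode/DataNode round trips, and with ~10k entries even a few tens of milliseconds each adds up to several minutes, while Zookeeper can return all the children of a znode in one call.

```
// Sketch only: simulate a recovery that reads every state-store entry one by one.
// The path below is an assumption about the layout, not necessarily the real one.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.ByteArrayOutputStream;

public class SequentialStateStoreLoad {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path root = new Path("/rmstore/FSRMStateRoot/RMAppRoot"); // assumed layout
        long start = System.currentTimeMillis();
        int entries = 0;
        // One RPC to list the per-application directories...
        for (FileStatus appDir : fs.listStatus(root)) {
            // ...then, for every application, more RPCs to list and read its files,
            // strictly one after another: no batching, no parallelism.
            for (FileStatus file : fs.listStatus(appDir.getPath())) {
                ByteArrayOutputStream data = new ByteArrayOutputStream();
                IOUtils.copyBytes(fs.open(file.getPath()), data, 4096, true);
                entries++;
            }
        }
        System.out.printf("loaded %d entries in %d ms%n",
                entries, System.currentTimeMillis() - start);
    }
}
```

Back-of-envelope: 10000 entries over 7~10 minutes means roughly 40-60 ms per entry, which is plausible for a couple of serial HDFS RPCs each.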
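And here is a sketch of the retry policy knob (the property name and the 2000,500 default are the ones mentioned above; the alternative value is only an illustrative assumption, not a recommendation). The spec is a comma-separated list of sleep-ms,max-retries pairs, so 2000,500 means up to 500 attempts with a 2s pause each, i.e. up to ~16-17 minutes spent trying to reach a NameNode that may no longer be active:

```
import org.apache.hadoop.conf.Configuration;

public class RetryPolicySpecExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Current default: sleep 2000 ms between attempts, up to 500 retries
        // (2000 ms * 500 = ~16.7 minutes worst case before giving up).
        conf.set("yarn.resourcemanager.fs.state-store.retry-policy-spec", "2000,500");
        // Hypothetical, less patient value: give up after ~1 minute (30 retries, 2s apart).
        conf.set("yarn.resourcemanager.fs.state-store.retry-policy-spec", "2000,30");
        System.out.println(conf.get("yarn.resourcemanager.fs.state-store.retry-policy-spec"));
    }
}
```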
The outage followed this chain of events:
- The Yarn RM was restarted on an-master1001, followed by the HDFS Name Node (both being the current active leaders). The expected result was that both daemons on an-master1002 would become leaders without impact.
- The HDFS Name Node on an-master1002 became the leader immediately (thanks to zookeeper, which we still rely on to decide who's the leader), but the Yarn Resource Manager appeared to be stuck on an-master1002 (it was actually just taking a huge amount of time, but that was not clear to me while I was doing the failover).
- To add more joy, restarting the HDFS Name Node after the Yarn Resource Manager triggered the behavior of yarn.resourcemanager.fs.state-store.retry-policy-spec, namely trying 500 times (with a 2s pause between tries) to contact the HDFS Name Node on an-master1001 (still bootstrapping since it had just been restarted) rather than picking up the newly elected master on an-master1002.
- After restarting the Yarn Resource Manager on an-master1002 again, it picked up the new HDFS Namenode master, and loading the Yarn Resource Manager state store took several minutes.
- At this point a rolling restart of all the Yarn Node Managers on the hadoop workers fixed the issue.
Interesting note: in the Hadoop testing cluster we tried a failover and it took ~30s for Yarn to come up to speed. The number of application ids stored there is not 10000 but similar (~8500). We think that the HDFS Namenode on the Prod cluster is more prone to slowdowns than the testing one due to the different load that they have to support; this might explain the difference.
We also discovered that Yarn can be configured to store fewer than 10000 application ids via the following parameters (see the sketch after this list):
- yarn.resourcemanager.state-store.max-completed-applications: The maximum number of completed applications the RM state store keeps, less than or equal to ${yarn.resourcemanager.max-completed-applications}. By default, it equals ${yarn.resourcemanager.max-completed-applications}. This ensures that the applications kept in the state store are consistent with the applications remembered in RM memory. Any value larger than ${yarn.resourcemanager.max-completed-applications} will be reset to ${yarn.resourcemanager.max-completed-applications}. Note that this value impacts the RM recovery performance: typically, a smaller value means better performance on RM recovery.
- yarn.resourcemanager.max-completed-applications: The maximum number of completed applications RM keeps. Default: 10000
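A minimal sketch of how the two parameters relate (the values are illustrative assumptions, not a proposal): the in-memory cap can stay at its default while the state store persists far fewer applications, so that a recovery has less to load.

```
import org.apache.hadoop.conf.Configuration;

public class StateStoreSizeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Completed applications the RM remembers in memory (default 10000).
        conf.setInt("yarn.resourcemanager.max-completed-applications", 10000);
        // Completed applications persisted in the state store; anything larger than the
        // in-memory cap above is reset to it, and a smaller value means fewer entries to
        // load on recovery, i.e. a faster RM failover.
        conf.setInt("yarn.resourcemanager.state-store.max-completed-applications", 1000);
        System.out.println(conf.get("yarn.resourcemanager.state-store.max-completed-applications"));
    }
}
```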
Last but not least, the Zookeeper-based Yarn storage adds an interesting feature: it adds ACLs to the znodes corresponding to a certain Yarn Resource Manager, to be sure that only one master at a time gets the privilege to write to Zookeeper. In theory in our case this shouldn't be needed, since we rely on Zookeeper for the Yarn failover/active-leader election, but it is surely another layer of defense against split-brain scenarios.
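For reference, a toy sketch of that fencing idea (the znode path /rmstore-demo and the rm1:secret credentials are hypothetical, and this is not the actual ZKRMStateStore implementation, which uses a more elaborate shared + exclusive ACL scheme): the znode is created with an ACL that grants full permissions only to one RM's digest identity, so any other writer is rejected.

```
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Id;
import org.apache.zookeeper.server.auth.DigestAuthenticationProvider;

import java.util.Collections;
import java.util.List;

public class RmZkFencingSketch {
    public static void main(String[] args) throws Exception {
        // Assumptions: a local Zookeeper and hypothetical credentials "rm1:secret".
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
        zk.addAuthInfo("digest", "rm1:secret".getBytes());

        // ACL granting ALL permissions to rm1 only.
        String digest = DigestAuthenticationProvider.generateDigest("rm1:secret");
        List<ACL> onlyRm1 = Collections.singletonList(
                new ACL(ZooDefs.Perms.ALL, new Id("digest", digest)));

        zk.create("/rmstore-demo", new byte[0], onlyRm1, CreateMode.PERSISTENT);
        // A second RM without rm1's credentials would now get a NoAuth error when trying
        // to write under /rmstore-demo, which is the extra protection against split brain.
        zk.close();
    }
}
```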
To wrap up, this task should focus on answering the following:
- Do we need to keep a history of 10k application ids in our Yarn state?
- Do we want to move back to the Zookeeper based storage (given what was said above, and maybe with fewer znodes stored)?