
Hadoop Yarn stores a ton of znodes related to running/old applications
Closed, Resolved · Public · 5 Estimated Story Points

Description

Hadoop Yarn uses zookeeper to store data related to its running/old applications:

[zk: localhost:2181(CONNECTED) 3] ls /rmstore/ZKRMStateRoot
[AMRMTokenSecretManagerRoot, RMAppRoot, EpochNode, RMDTSecretManagerRoot, RMVersionNode]
[zk: localhost:2181(CONNECTED) 6] stat /rmstore/ZKRMStateRoot/RMAppRoot
cZxid = 0xc01884a67
ctime = Wed May 06 14:21:45 UTC 2015
mZxid = 0xc01884a67
mtime = Wed May 06 14:21:45 UTC 2015
pZxid = 0x30004a738f
cversion = 8835145
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 10011     <======================= each of these znodes also has children related to app attempts
[zk: localhost:2181(CONNECTED) 12] get /rmstore/ZKRMStateRoot/RMAppRoot/application_1550134620574_26075
[binary application data; readable fragments include the submitted query "SELECT client_ip, COUNT(1) A..." and the staging path /user/analytics-search/.staging/job_1550134620574_26075/job.splitmetainfo]
cZxid = 0x300049610f
ctime = Fri Feb 22 12:55:35 UTC 2019
mZxid = 0x3000496121
mtime = Fri Feb 22 12:57:33 UTC 2019
pZxid = 0x3000496110
cversion = 1
dataVersion = 1
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 2597
numChildren = 1

The recently created Hadoop test cluster already has a lot of znodes:

[zk: localhost:2181(CONNECTED) 14] stat /rmstore-analytics-test-hadoop/ZKRMStateRoot/RMAppRoot
cZxid = 0x3000457e04
ctime = Fri Feb 15 09:18:19 UTC 2019
mZxid = 0x3000457e04
mtime = Fri Feb 15 09:18:19 UTC 2019
pZxid = 0x30004a74af
cversion = 2545
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 2545

After reading https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_yarn-resource-management/content/ref-c2ececdf-c68e-4095-99b5-15b4c31701ba.1.html I found that, in theory, the above state can be stored in HDFS via yarn.resourcemanager.fs.state-store.uri. My idea would be something like:

yarn.resourcemanager.fs.state-store.uri = hdfs://${hadoop-cluster-name}/user/yarn/rmstore/etc..

So we'd have a sort of "chroot" on HDFS, and we'd avoid relying on Zookeeper to store all that state (a ton of znodes). There is already a tight dependency between HDFS and Yarn, so I wouldn't be worried about adding another one.

To test this we should:

  • change yarn.resourcemanager.store.class to org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore (this should be the default; we explicitly set the Zookeeper one).
  • set yarn.resourcemanager.fs.state-store.uri (see the yarn-site.xml sketch below).
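
For reference, this is how the two settings would look in yarn-site.xml (a sketch: the URI below is illustrative, reusing the cluster name and the /user/yarn/rmstore path that appear elsewhere in this task):

  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
  </property>
  <property>
    <name>yarn.resourcemanager.fs.state-store.uri</name>
    <!-- illustrative URI; the patches below settled the final value -->
    <value>hdfs://analytics-hadoop/user/yarn/rmstore</value>
  </property>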

I'd be in favor of testing this setting ASAP on the Hadoop test cluster (before any Kerberos testing, etc.) and possibly applying it to the main one if we are happy with the result.

Event Timeline

elukey triaged this task as Medium priority. · Feb 24 2019, 8:13 AM
elukey created this task.

Hm, +1 in general, but doesn't YARN use ZK for the RM automatic failover? Does HDFS work just as well for that?

Ah, so this is not for RM failover, but for AppMaster restarts? Sounds good.

Yes, exactly: this is only to store the AppMaster state; all the HA znodes are properly handled from what I can see!
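
To be explicit about the split, in yarn-site.xml terms (a sketch; host list illustrative, based on the conf100[4-6] Zookeeper nodes mentioned later): RM leader election keeps using Zookeeper via the HA settings, while only the application state store moves to HDFS.

  <!-- HA / leader election: unchanged, still on Zookeeper -->
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>conf1004:2181,conf1005:2181,conf1006:2181</value>
  </property>
  <!-- application state store: the only part that moves to HDFS -->
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
  </property>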

Change 492697 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] hadoop: allow yarn rmstore to be stored on HDFS

https://gerrit.wikimedia.org/r/492697

Change 492697 merged by Elukey:
[operations/puppet/cdh@master] hadoop: allow yarn rmstore to be stored on HDFS

https://gerrit.wikimedia.org/r/492697

Change 492957 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hadoop: store Yarn rmstore state on HDFS

https://gerrit.wikimedia.org/r/492957

Change 492960 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] hadoop: set the default yarn rmstore zk path to undef

https://gerrit.wikimedia.org/r/492960

Change 492960 merged by Elukey:
[operations/puppet/cdh@master] hadoop: set the default yarn rmstore zk path to undef

https://gerrit.wikimedia.org/r/492960

Change 492957 merged by Elukey:
[operations/puppet@production] hadoop: store Yarn rmstore state on HDFS

https://gerrit.wikimedia.org/r/492957

Change 492961 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hadoop: set proper hdfs path for the Yarn rmstore (test cluster)

https://gerrit.wikimedia.org/r/492961

Change 492961 merged by Elukey:
[operations/puppet@production] hadoop: set proper hdfs path for the Yarn rmstore (test cluster)

https://gerrit.wikimedia.org/r/492961

Seems to be working fine!

elukey@analytics1028:~$ sudo -u hdfs hdfs dfs -ls /user/yarn/rmstore
Found 1 items
drwxr-xr-x   - yarn hadoop          0 2019-02-26 07:32 /user/yarn/rmstore/FSRMStateRoot
elukey@analytics1028:~$ sudo -u hdfs hdfs dfs -ls /user/yarn/rmstore/FSRMStateRoot
Found 5 items
drwxr-xr-x   - yarn hadoop          0 2019-02-26 07:32 /user/yarn/rmstore/FSRMStateRoot/AMRMTokenSecretManagerRoot
-rw-r--r--   3 yarn hadoop          2 2019-02-26 07:32 /user/yarn/rmstore/FSRMStateRoot/EpochNode
drwxr-xr-x   - yarn hadoop          0 2019-02-26 07:39 /user/yarn/rmstore/FSRMStateRoot/RMAppRoot
drwxr-xr-x   - yarn hadoop          0 2019-02-26 07:39 /user/yarn/rmstore/FSRMStateRoot/RMDTSecretManagerRoot
-rw-r--r--   3 yarn hadoop          4 2019-02-26 07:32 /user/yarn/rmstore/FSRMStateRoot/RMVersionNode

After restarting the Yarn RMs I didn't see any interference with the prod cluster! I have also checked that Yarn apps are executed correctly; everything seems good.
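
The checks were along these lines (a sketch of the commands, not a paste from the actual session):

  # apps still run and get tracked in the new store
  yarn application -list -appStates RUNNING
  hdfs dfs -ls /user/yarn/rmstore/FSRMStateRoot/RMAppRoot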

Mentioned in SAL (#wikimedia-operations) [2019-02-26T07:54:47Z] <elukey> removed /rmstore-analytics-test-hadoop from zookeeper main-eqiad - T216952

I had a chat with Joseph about the pros and cons of this solution:

  • Zookeeper might give more guarantees in split-brain scenarios, namely when multiple masters believe they are the current active one. HDFS can be written to without any lock/exclusion, but in theory the HA znodes should prevent this corner case from happening.
  • If HDFS is overloaded we might end up slowing down Yarn too, probably reducing its availability. Yarn retries HDFS state-store operations with backoff by default (see the sketch after this list). This use case is probably not going to be a problem for us, but it's better to keep it in mind.
  • Having less Zookeeper state is an overall win for the availability of other important systems, so it makes sense to proceed.
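
The retry/backoff behavior mentioned above is tunable if it ever becomes a problem. The relevant yarn-site.xml knob (a sketch showing the upstream default; we did not change it):

  <property>
    <name>yarn.resourcemanager.fs.state-store.retry-policy-spec</name>
    <!-- pairs of "sleep-time-ms, max-retries"; "2000, 500" is the upstream default -->
    <value>2000, 500</value>
  </property>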

We'll migrate the Prod cluster to the new scheme on Thursday EU morning if nothing else comes up.

Change 493158 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hadoop analytics: move Yarn rmstore from zookeeper to HDFS

https://gerrit.wikimedia.org/r/493158

Change 493159 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] yarn-site.xml: automatically prepend hdfs:// details to HDFSURIs

https://gerrit.wikimedia.org/r/493159

Change 493159 merged by Elukey:
[operations/puppet/cdh@master] yarn-site.xml: automatically prepend hdfs:// details to HDFSURIs

https://gerrit.wikimedia.org/r/493159

Change 493158 merged by Elukey:
[operations/puppet@production] hadoop test analytics: simplify Yarn rmstore URI

https://gerrit.wikimedia.org/r/493158

Change 493164 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hadoop analytics: move Yarn rmstore from zookeeper to HDFS

https://gerrit.wikimedia.org/r/493164

Change 493164 merged by Elukey:
[operations/puppet@production] hadoop analytics: move Yarn rmstore from zookeeper to HDFS

https://gerrit.wikimedia.org/r/493164

Roll restart of RMs done. Sanity check:

application_1551342556468_0162 is currently running (camus-webrequest):

elukey@an-master1001:~$ hdfs dfs -ls /user/yarn/rmstore/FSRMStateRoot/RMAppRoot/application_1551342556468_0162
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 2 items
-rw-r--r--   3 yarn hadoop        318 2019-02-28 09:10 /user/yarn/rmstore/FSRMStateRoot/RMAppRoot/application_1551342556468_0162/appattempt_1551342556468_0162_000001
-rw-r--r--   3 yarn hadoop       1449 2019-02-28 09:10 /user/yarn/rmstore/FSRMStateRoot/RMAppRoot/application_1551342556468_0162/application_1551342556468_0162


[zk: localhost:2181(CONNECTED) 0] stat /rmstore/ZKRMStateRoot/RMAppRoot/application_1551342556468_0162
Node does not exist: /rmstore/ZKRMStateRoot/RMAppRoot/application_1551342556468_0162

Mentioned in SAL (#wikimedia-operations) [2019-02-28T09:30:42Z] <elukey> start cleanup of 20k+ zookeeper nodes on conf100[4-6] (old Hadoop Yarn state) - T216952

Mentioned in SAL (#wikimedia-operations) [2019-02-28T11:28:33Z] <elukey> pause cleanup of 20k+ zookeeper nodes on conf100[4-6] (old Hadoop Yarn state) - T216952

Mentioned in SAL (#wikimedia-operations) [2019-02-28T13:56:47Z] <elukey> re-start cleanup of 20k+ zookeeper nodes on conf100[4-6] (old Hadoop Yarn state) - T216952
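
For the record, this kind of cleanup can be scripted roughly as follows (a sketch assuming zkCli.sh is available on the host; the exact commands used are not recorded in this task):

  # List the leftover application znodes, then delete each one recursively.
  # zkCli's "ls" prints the children as "[a, b, c]" on its last output line;
  # "rmr" is the recursive delete. Pacing the loop avoids hammering the quorum.
  for app in $(echo 'ls /rmstore/ZKRMStateRoot/RMAppRoot' | zkCli.sh -server localhost:2181 2>/dev/null | tail -n1 | tr -d '[],'); do
      echo "rmr /rmstore/ZKRMStateRoot/RMAppRoot/${app}" | zkCli.sh -server localhost:2181 >/dev/null
      sleep 0.2  # pace the deletions
  done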

Final znode count: 18.9k (we started from 43.5k)

elukey set the point value for this task to 5. · Feb 28 2019, 4:39 PM