
Check if HDFS offers a way to prevent/limit/throttle users from overwhelming the HDFS Namenode
Closed, Resolved · Public · 8 Story Points

Description

The last HDFS outage was caused by a Spark job that created more than 20M temporary files in a short time frame (less than a couple of hours), which slowed down the HDFS Namenode and ended up in a real outage.

Event Timeline

elukey created this task.Apr 11 2019, 2:17 PM
Restricted Application added a subscriber: Aklapper.Apr 11 2019, 2:17 PM
fdans triaged this task as Normal priority.Apr 11 2019, 4:18 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

Change 507250 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] Add the dfs.namenode.handler.count HDFS option

https://gerrit.wikimedia.org/r/507250

Change 507250 merged by Elukey:
[operations/puppet/cdh@master] Add the dfs.namenode.handler.count HDFS option

https://gerrit.wikimedia.org/r/507250

Change 507257 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] Make dfs_namenode_handler_count optional

https://gerrit.wikimedia.org/r/507257

Change 507257 merged by Elukey:
[operations/puppet/cdh@master] Make dfs_namenode_handler_count optional

https://gerrit.wikimedia.org/r/507257

Change 507259 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hadoop: raise dfs.namenode.handler.count from 10 to 80

https://gerrit.wikimedia.org/r/507259

Change 507259 merged by Elukey:
[operations/puppet@production] hadoop: raise dfs.namenode.handler.count from 10 to 80

https://gerrit.wikimedia.org/r/507259

Mentioned in SAL (#wikimedia-operations) [2019-04-30T09:02:50Z] <elukey> roll restart hdfs namenodes on an-master100[1,2] to pick up new settings - T220702

Rationale of the change listed above. During the execution of the Spark job the following happened:


Usually the RPC call queue (even before the change) sits at zero, or at a value below 10 for a few seconds at a time; in the graph above it is way above that, and steady. The 10 default handler threads are not enough for a 54-node cluster, so we bumped the value to 115 (following the suggestion in the docs). This should make the NameNode more resilient. We should also alert on the RPC call queue length, since a value above zero for more than a few seconds indicates that something wrong is happening.
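
For reference, a minimal sketch of the sizing heuristic mentioned above ("the docs"): the commonly cited guidance (e.g. in the Cloudera tuning documentation, to the best of my knowledge) is dfs.namenode.handler.count ≈ 20 × ln(number of DataNodes); a log2-based variant of the same rule of thumb also circulates. For a 54-node cluster the two give roughly 80 and 115, the values discussed in this task.

```
import math

datanodes = 54  # size of the Analytics Hadoop cluster at the time

# Natural-log heuristic: 20 * ln(number of DataNodes) -> ~79, i.e. the 80
# used in the patch above.
handler_count_ln = int(math.log(datanodes) * 20)

# log2 variant of the same rule of thumb -> ~115, the value mentioned here.
handler_count_log2 = int(math.log(datanodes, 2) * 20)

print(f"dfs.namenode.handler.count (ln heuristic):   {handler_count_ln}")
print(f"dfs.namenode.handler.count (log2 heuristic): {handler_count_log2}")
```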

Change 507333 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Update cdh module to its latest version

https://gerrit.wikimedia.org/r/507333

Change 507333 merged by Elukey:
[operations/puppet@production] Update cdh module to its latest version

https://gerrit.wikimedia.org/r/507333

Mentioned in SAL (#wikimedia-operations) [2019-04-30T15:45:36Z] <elukey> restart hadoop hdfs namenodes on an-master100[1,2] to pick up new logging settings - T220702

Change 507630 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] Fix default log4j logging for Hadoop Namenode

https://gerrit.wikimedia.org/r/507630

Change 507630 merged by Elukey:
[operations/puppet/cdh@master] Fix default log4j logging for Hadoop Namenode

https://gerrit.wikimedia.org/r/507630
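
For context on what the audit log buys us: each audited NameNode operation is logged as a FSNamesystem.audit line with ugi=, ip=, cmd= and src= fields, so a runaway job like the one that caused the outage can be spotted by tallying create operations per user. A minimal sketch (log path and regex are illustrative, not the exact setup deployed here):

```
from collections import Counter
import re

# HDFS audit log lines look roughly like (fields are tab-separated):
#   ... INFO FSNamesystem.audit: allowed=true  ugi=someuser (auth:KERBEROS)
#   ip=/10.x.y.z  cmd=create  src=/tmp/some/path  dst=null  proto=rpc
AUDIT_RE = re.compile(r"ugi=(?P<ugi>\S+).*?cmd=(?P<cmd>\S+)")

def top_file_creators(audit_log_path, top_n=10):
    """Count 'create' operations per user in an HDFS audit log."""
    creates = Counter()
    with open(audit_log_path) as log:
        for line in log:
            match = AUDIT_RE.search(line)
            if match and match.group("cmd") == "create":
                creates[match.group("ugi")] += 1
    return creates.most_common(top_n)

# Hypothetical usage:
# print(top_file_creators("/var/log/hadoop-hdfs/hdfs-audit.log"))
```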

Change 507754 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::master: add alert for HDFS NN RCP queue length

https://gerrit.wikimedia.org/r/507754

Change 507754 merged by Elukey:
[operations/puppet@production] profile::hadoop::master: add alert for HDFS NN RCP queue length

https://gerrit.wikimedia.org/r/507754
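
For context, the metric behind this alert is the CallQueueLength value that the NameNode exposes through its JMX HTTP servlet. A minimal sketch of how it can be polled (hostname, ports and threshold are assumptions for illustration, not the actual check added by the patch above):

```
import json
from urllib.request import urlopen

# Assumptions: NameNode HTTP UI on port 50070 (Hadoop 2.x default) and client
# RPC on port 8020, so the relevant JMX bean is RpcActivityForPort8020.
JMX_URL = ("http://an-master1001.example:50070/jmx"
           "?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020")

WARNING_THRESHOLD = 10  # hypothetical: the queue normally sits at or near zero

def rpc_call_queue_length(url=JMX_URL):
    beans = json.load(urlopen(url))["beans"]
    return beans[0]["CallQueueLength"]

if __name__ == "__main__":
    length = rpc_call_queue_length()
    status = "OK" if length < WARNING_THRESHOLD else "WARNING"
    print(f"{status}: HDFS NameNode RPC call queue length = {length}")
```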

Up to now I have added:

  • proper HDFS auditing logs
  • more capacity to drain the RPC call queue (more NameNode handler threads)
  • an alarm on RPC call queue length

The next step is to figure out whether there is a way to limit the RPC calls that users can make (so not only the files they create), and possibly to throttle them.
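
Purely as an illustration of the kind of mechanism upstream Hadoop offers in this area (not something evaluated or deployed as part of this task), the FairCallQueue schedules NameNode RPCs per caller and can push back on heavy users; whether its knobs are available depends on the Hadoop/CDH version. A sketch of the candidate core-site.xml properties (port 8020 assumed):

```
# Illustration only: per-caller RPC fairness settings that upstream Hadoop
# exposes for the NameNode client RPC port (8020 assumed). They belong in
# core-site.xml; availability depends on the Hadoop/CDH version in use.
FAIR_CALL_QUEUE_PROPERTIES = {
    # Replace the default FIFO call queue with a multi-level fair queue, so a
    # single heavy user cannot monopolize the NameNode handler threads.
    "ipc.8020.callqueue.impl": "org.apache.hadoop.ipc.FairCallQueue",
    # Ask overloaded handlers to push back on heavy callers (clients retry
    # later) instead of letting the call queue grow without bound.
    "ipc.8020.backoff.enable": "true",
}

for name, value in FAIR_CALL_QUEUE_PROPERTIES.items():
    print(f"{name} = {value}")
```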

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.
Milimetric assigned this task to elukey.May 2 2019, 1:29 PM
elukey moved this task from Backlog to In Progress on the User-Elukey board.May 9 2019, 7:35 AM

Change 508989 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet/cdh@master] Improve HDFS accounting

https://gerrit.wikimedia.org/r/508989

Change 508989 merged by Elukey:
[operations/puppet/cdh@master] Improve HDFS accounting

https://gerrit.wikimedia.org/r/508989

elukey moved this task from In Progress to Done on the Analytics-Kanban board.Jun 3 2019, 6:17 AM
elukey moved this task from Done to Paused on the Analytics-Kanban board.Jun 3 2019, 2:59 PM
elukey set the point value for this task to 8.Jun 27 2019, 9:02 AM

We didn't find a way to properly limit users (more specifically, their HDFS usage), but we added logging and monitoring so that problems can be diagnosed and found quickly. One possible solution would be to test new Yarn scheduler policies, but that is a much bigger project and, in my opinion, we currently don't have the bandwidth for it. Let's keep it in mind for the future, of course. Last but not least, we have had sporadic incidents like the one described in this task, and they cause damage because we aren't alerted until the very last moment. The new alert has already been triggered, and we were able to fix the problem within a few minutes (not hours), which seems like a big win to me.

elukey moved this task from Paused to Done on the Analytics-Kanban board.Jun 27 2019, 9:05 AM
elukey moved this task from In Progress to Done on the User-Elukey board.Jul 4 2019, 2:50 PM
Nuria closed this task as Resolved.Jul 10 2019, 4:01 PM