The last HDFS outage was caused by a Spark job that created 20M+ temporary files in a short time frame (less than a couple of hours); the resulting load slowed down the HDFS Namenode to the point of a real outage.
- operations/puppet/cdh (master): Improve HDFS accounting
- operations/puppet (production): profile::hadoop::master: add alert for HDFS NN RPC queue length
- operations/puppet/cdh (master): Fix default log4j logging for Hadoop Namenode
- operations/puppet (production): Update cdh module to its latest version
- operations/puppet/cdh (master): Add the dfs.namenode.handler.count HDFS option
- operations/puppet (production): hadoop: raise dfs.namenode.handler.count from 10 to 80
- operations/puppet/cdh (master): Make dfs_namenode_handler_count optional
Found a good series of performance tuning suggestions for the Namenode's RPC handling:
Rationale for the changes listed above. During the execution of the Spark job, the following happened:
Usually the RPC call queue length (even before the change) is zero, or under 10 for a few seconds at most; in the graph above, instead, it is way higher (and steady). The default of 10 handler threads is not enough for a 54-node cluster, so we bumped it to 115 (following the suggestions in the docs). This should make the NameNode more resilient. We should also alarm on the length of the RPC call queue, since a value above zero for more than a few seconds indicates that something is going wrong.
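For reference, a minimal sketch of the sizing rule of thumb. Which exact formula the docs use is an assumption on my side (Cloudera's tuning guide suggests 20 * ln(datanodes), other guides use 20 * log2(datanodes)), but both land close to the values used in the patches above:

```python
import math

# Rule-of-thumb sizing for dfs.namenode.handler.count; both formulas are
# assumptions about which guide was followed, not values pulled from our
# puppet code.
def handler_count_ln(datanodes: int) -> int:
    # Cloudera-style rule: 20 * natural log of the DataNode count.
    return int(20 * math.log(datanodes))

def handler_count_log2(datanodes: int) -> int:
    # Variant found in other tuning guides: 20 * log2 of the DataNode count.
    return int(20 * math.log2(datanodes))

print(handler_count_ln(54))    # 79  -> matches the "10 to 80" patch above
print(handler_count_log2(54))  # 115 -> matches the value we settled on
```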
Up to now I have added:
- proper HDFS audit logging
- more capacity for handling RPC calls (more Namenode handler threads)
- an alarm on the RPC call queue length (see the sketch below)
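To illustrate what the alarm checks: the NameNode exposes the RPC call queue length via its JMX HTTP servlet. A minimal sketch follows; the hostname is hypothetical and the ports are the Hadoop defaults (8020 RPC, 50070 HTTP), while the bean and metric names are standard Hadoop RPC metrics:

```python
import json
import urllib.request

# Hypothetical NameNode host; 50070 is the default NameNode HTTP port and
# 8020 the default RPC port referenced in the bean name.
JMX_URL = ("http://namenode.example.org:50070/jmx"
           "?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020")

def rpc_call_queue_length() -> int:
    with urllib.request.urlopen(JMX_URL) as resp:
        beans = json.load(resp)["beans"]
    return beans[0]["CallQueueLength"]

# Per the reasoning above, a queue that stays above zero for more than a
# few seconds is the condition worth alarming on.
if rpc_call_queue_length() > 0:
    print("WARNING: HDFS NameNode RPC call queue is backing up")
```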
The next step is to figure out whether there is a way to limit users' RPC calls (so not only file creation), possibly by throttling them.
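In the meantime, the new audit logs at least make it easy to spot who is hammering the Namenode with metadata operations. A minimal sketch, assuming the standard tab-separated key=value HDFS audit log format; the log path is hypothetical:

```python
import re
from collections import Counter

AUDIT_LOG = "/var/log/hadoop-hdfs/hdfs-audit.log"  # hypothetical path
FIELD = re.compile(r"(\w+)=([^\t]*)")  # audit lines are tab-separated key=value

ops_per_user = Counter()
with open(AUDIT_LOG) as log:
    for line in log:
        fields = dict(FIELD.findall(line))
        if "ugi" in fields and "cmd" in fields:
            # ugi looks like "someuser (auth:SIMPLE)"; keep the user part.
            ops_per_user[fields["ugi"].split()[0]] += 1

# The heaviest metadata-op producers are the candidates for throttling.
for user, ops in ops_per_user.most_common(10):
    print(f"{user}\t{ops}")
```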
We didn't find a way to properly limit users (more specifically, their HDFS usage), but we added logging and monitoring to quickly diagnose and find problems. One solution could be to test new Yarn scheduler policies, but that is a much bigger project that, in my opinion, we currently don't have the bandwidth for. Let's keep it in mind for the future, of course! Last but not least, we have had sporadic incidents like the one described in this task that cause damage because we weren't properly alerted until the very last moment. The new alarm has already been triggered, and we were able to fix the problem after a few minutes (not hours), which seems like a big win to me.