Change Details

Email from Christian:
---
Hi Andrew,

just a quick heads up that the name nodes had heap issues last week [1].

The first two times it occurred, a plain failover and a restart of the borked name node helped. The third time, failover did not work: both nodes were down, and restarting them did not help, as they failed again with the same error during startup.

From reading up on the how and why, it seems that the number of files in HDFS (not their size) exhausted the name nodes' heap. That would explain why restarting helped the first two times but no longer worked the third time. It would also explain the HDFS issues you saw over the last few weeks.

Hence, I gave increasing the heap a shot. That allowed the name nodes to come back up, and HDFS recovered.

The puppetization [2] of the heap increase is wrong. I know. This should be turned into a parameter :-) I saw that there is already a hadoop_heapsize parameter, but that would also get picked up by other services. Hence, I went with "HADOOP_NAMENODE_OPTS" to get the name nodes up again.

Have fun,
Christian

P.S.: The 2GB heap fixes the issue for now. But the number of files will keep increasing, and the name nodes will run into the same issue again. In Ganglia,
http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Analytics%20cluster%20eqiad&h=analytics1001.eqiad.wmnet&r=month&z=default&jr=&js=&st=1423653252&v=1170.7725&m=Hadoop.NameNode.JvmMetrics.MemHeapUsedM&ti=Hadoop.NameNode.JvmMetrics.MemHeapUsedM&z=large
seems to be a graph that could be used for monitoring/alerting.

[1] Each time, the active name node failed with a Java heap space error and HDFS was unavailable. With that, load on stat1002 increased.
[2] https://gerrit.wikimedia.org/r/#/c/189143/
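
For reference, the kind of setting the email describes can be sketched as a hadoop-env.sh fragment. This is only an illustration, not the content of the linked change: the 2048m value is taken from the "2GB heap" in the P.S., and the exact line used in the puppet template is an assumption. Setting the heap through HADOOP_NAMENODE_OPTS affects only the NameNode JVM, whereas a global heap size setting would also be picked up by other Hadoop services, which is the concern raised above.

```shell
# Sketch of a hadoop-env.sh fragment (assumed layout, not the actual change).
# Raise the maximum JVM heap for the NameNode only; other daemons keep
# whatever the global heap size setting gives them.
# The 2048m value mirrors the "2GB heap" mentioned in the email.
export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS:-} -Xmx2048m"
```

Hard-coding the value like this is what the email calls wrong; a parameterized template would substitute the heap size instead of the literal 2048m.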