Email from Christian:
---
Hi Andrew,
just a quick heads up that the name nodes had heap issues last week
[1]. The first two times it occurred, a plain failover and restarting
the borked name node helped.
The third time it happened, failover did not work. Both nodes were
down. Restarting them did not help, as they again failed with the same
error during startup. So they would not come up again.
Reading up a bit on the how and why, it seems that the number of files
in HDFS (not their size) exhausted the name nodes' heap. That would
explain why restarting helped the first two times but no longer worked
the third time.
It would also explain the HDFS issues you saw over the last few weeks.
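To get a rough feel for how file count (rather than file size) turns
into heap, here is a back-of-the-envelope sketch. The 150 bytes per
namespace object is a commonly quoted rule of thumb, not something
measured on our cluster, and the function name and the sample counts
are made up for illustration.

```python
# Rough estimate of the name node heap consumed by namespace metadata.
# ASSUMPTION: ~150 bytes of heap per namespace object (file, directory,
# or block). This is a commonly quoted rule of thumb, not a measurement.
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_mb(files, directories, blocks,
                              bytes_per_object=BYTES_PER_OBJECT):
    """Return the approximate metadata footprint in MiB.

    Only object counts matter here; the size of the files does not enter.
    """
    objects = files + directories + blocks
    return objects * bytes_per_object / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    estimate = estimate_namenode_heap_mb(
        files=5_000_000, directories=500_000, blocks=5_500_000)
    print(f"estimated metadata footprint: {estimate:.0f} MiB")
```

The real heap requirement is higher than this figure, since the JVM
needs headroom beyond the raw metadata, so treat the result as a lower
bound.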
Hence, I gave increasing the heap a shot, and that allowed the name
nodes to come back up, and HDFS recovered.
The puppetization [2] of the heap increase is wrong. I know.
This should be turned into a parameter :-)
I saw that there is already a hadoop_heapsize parameter, but that
would also get picked up by other services. Hence, I went with
"HADOOP_NAMENODE_OPTS" to get the name nodes up again.
Have fun,
Christian
P.S.: The 2GB heap fixes the issue for now. But the number of files
will keep increasing, and the name nodes will eventually run into the
same issue again.
In ganglia:
http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Analytics%20cluster%20eqiad&h=analytics1001.eqiad.wmnet&r=month&z=default&jr=&js=&st=1423653252&v=1170.7725&m=Hadoop.NameNode.JvmMetrics.MemHeapUsedM&ti=Hadoop.NameNode.JvmMetrics.MemHeapUsedM&z=large
seems to be a graph that could be used for monitoring/alerting.
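As a sketch of what such an alert could look like: compare the
MemHeapUsedM value against the configured heap and flag it once it
gets close. The 80%/90% thresholds and the function name are
placeholders I made up, not something we have agreed on.

```python
# Classify name node heap usage for alerting.
# ASSUMPTION: the 80% / 90% thresholds are illustrative placeholders.
def heap_alert_level(used_mb, max_mb, warn_ratio=0.8, crit_ratio=0.9):
    """Return "ok", "warning", or "critical" for the given heap usage.

    used_mb: current heap usage in MB (e.g. the MemHeapUsedM metric)
    max_mb:  configured maximum heap in MB (e.g. 2048 for a 2GB heap)
    """
    if max_mb <= 0:
        raise ValueError("max_mb must be positive")
    ratio = used_mb / max_mb
    if ratio >= crit_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical reading against the 2GB heap.
    print(heap_alert_level(used_mb=1700, max_mb=2048))
```

Alerting on the ratio rather than on an absolute value means the
check does not need to change when the heap gets bumped again.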
[1] Each time, the active name node failed with
Java heap space
error and HDFS was unavailable.
With that, load on stat1002 increased.
[2] https://gerrit.wikimedia.org/r/#/c/189143/