Increase and monitor Hadoop NameNode heapsize
Closed, ResolvedPublic

Description

Email from Christian:


Hi Andrew,

just a quick heads up that the name nodes had heap issues last week
[1]. The first two times it occurred, a plain failover and restarting
the borked name node helped.

The third time it happened, failover did not work. Both nodes were
down. Restarting them did not help, as they again failed with the same
error during startup. So they would not come up again.

Reading up a bit on the how and why, it seems that the number of files
in HDFS (not their size) exhausted the name nodes' heap. That would
explain why restarting helped the first two times, but no longer worked
the third time.
It would also explain the HDFS issues you saw over the last few weeks.

Hence, I gave increasing the heap a shot, and that allowed the name
nodes to come back up, and HDFS recovered.

The puppetization [2] of the heap increase is wrong. I know.
This should be turned into a parameter :-)

I saw that there is already a hadoop_heapsize parameter, but that
would also get picked up by other services. Hence, I went with
"HADOOP_NAMENODE_OPTS" to get the name nodes up again.

Have fun,
Christian

P.S.: The 2GB heap fixes the issue for now. But the number of files
will keep increasing, and the name nodes will eventually run into the
same issue again.

In ganglia:
http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Analytics%20cluster%20eqiad&h=analytics1001.eqiad.wmnet&r=month&z=default&jr=&js=&st=1423653252&v=1170.7725&m=Hadoop.NameNode.JvmMetrics.MemHeapUsedM&ti=Hadoop.NameNode.JvmMetrics.MemHeapUsedM&z=large
seems to be a graph that could be used for monitoring/alerting.

[1] Each time, the active name node failed with

Java heap space

error and HDFS was unavailable.
With that, load on stat1002 increased.

[2] https://gerrit.wikimedia.org/r/#/c/189143/
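
For the monitoring half of this task, a minimal sketch of a threshold check on the heap metric mentioned above might look like the following. It is only an illustration, not the check that was deployed: it assumes the metrics arrive as a JMX-style JSON document containing `MemHeapUsedM` (the metric named in the ganglia URL) alongside a `MemHeapMaxM` value, which is an assumption about the deployed Hadoop version. The function names and the 80%/90% thresholds are hypothetical.

```python
import json
from typing import Tuple


def parse_heap_metrics(jmx_json: str) -> Tuple[float, float]:
    """Return (used_mb, max_mb) from a JMX-style JSON document.

    Assumes a top-level "beans" list in which one bean carries both
    MemHeapUsedM and MemHeapMaxM (the latter is an assumption).
    """
    for bean in json.loads(jmx_json)["beans"]:
        if "MemHeapUsedM" in bean and "MemHeapMaxM" in bean:
            return float(bean["MemHeapUsedM"]), float(bean["MemHeapMaxM"])
    raise ValueError("no bean with MemHeapUsedM and MemHeapMaxM found")


def heap_status(used_mb: float, max_mb: float,
                warn: float = 0.80, crit: float = 0.90) -> str:
    """Classify heap usage; the thresholds are hypothetical defaults."""
    if max_mb <= 0:
        raise ValueError("max_mb must be positive")
    ratio = used_mb / max_mb
    if ratio >= crit:
        return "CRITICAL"
    if ratio >= warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Illustrative payload: the used value is taken from the ganglia URL
    # above, the max value corresponds to the 2GB heap from the email.
    sample = json.dumps({"beans": [{
        "name": "Hadoop:service=NameNode,name=JvmMetrics",
        "MemHeapUsedM": 1170.7725,
        "MemHeapMaxM": 2048.0,
    }]})
    used, maximum = parse_heap_metrics(sample)
    print(heap_status(used, maximum))
```

Alerting on the ratio rather than on an absolute number of megabytes means the check does not need to change when the heap size is raised again.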

Event Timeline

Ottomata created this task. Feb 11 2015, 2:50 PM
Ottomata claimed this task.
Ottomata raised the priority of this task from to Needs Triage.
Ottomata updated the task description.
Ottomata added subscribers: Ottomata, QChris.
Restricted Application added a subscriber: Aklapper. Feb 11 2015, 2:50 PM
Ottomata set Security to None.
kevinator triaged this task as High priority. Feb 12 2015, 2:10 AM
gerritbot added a subscriber: gerritbot.

Change 190471 had a related patch set uploaded (by Ottomata):
Update CDH module and set hadoop namenode heapsize to 4G

https://gerrit.wikimedia.org/r/190471

Patch-For-Review

Change 190471 merged by Ottomata:
Update CDH module and set hadoop namenode heapsize to 4G

https://gerrit.wikimedia.org/r/190471

Ottomata closed this task as Resolved. Feb 13 2015, 4:42 PM