
Hadoop HDFS Namenode shutdown on 26/04/2018
Closed, Resolved · Public

Description

On Thursday, April 26th, we got paged because the active Namenode process on analytics1001 was not present. The cause was the following:

2018-04-26 17:37:58,103 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 20579 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2018-04-26 17:37:58,103 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 20557ms
2018-04-26 17:37:58,103 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 20579ms to send a batch of 2 edits (423 bytes) to remote journal 10.64.36.128:8485
2018-04-26 17:37:58,103 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 20579ms to send a batch of 2 edits (423 bytes) to remote journal 10.64.53.14:8485
2018-04-26 17:37:58,103 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to
                [10.64.36.128:8485, 10.64.53.14:8485, 10.64.5.15:8485], stream=QuorumOutputStream starting at txid 1886903961))

More logs in:
https://etherpad.wikimedia.org/p/analytics-namenode-down-26042018

The JournalNode metrics don't show any sign of slowness, so the problem does not seem related to persisting the edit log on them. There is one log line on analytics1001 that is interesting:

2018-04-26 17:37:58,103 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 20557ms

The JvmPauseMonitor is described as:

Class which sets up a simple thread which runs in a loop sleeping for a short interval of time. If the sleep takes significantly longer than its target time, it implies that the JVM or host machine has paused processing, which may cause other problems. If such a pause is detected, the thread logs a message.

@JAllemandou and I noticed that the GC collection timings increase a lot when some worker nodes come back online after a period of downtime (for example after a reimage), and four of them came back shortly before the Namenode shutdown (in batches of two, not all in one go). Our belief is that a Namenode GC pause exceeded the edit log persistence timeout at the worst possible moment, triggering the shutdown.

More generally, the Namenode's GC timings look really bad, and interestingly we don't apply any of Cloudera's recommended settings:

https://www.cloudera.com/documentation/enterprise/5-10-x/topics/admin_nn_memory_config.html

-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
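
These flags would typically end up in HADOOP_NAMENODE_OPTS in hadoop-env.sh; a minimal sketch of what that could look like (purely illustrative, in our case the actual change is managed via the Puppet patch linked below):

# hadoop-env.sh (illustrative only; our settings are managed by Puppet)
export HADOOP_NAMENODE_OPTS="-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled ${HADOOP_NAMENODE_OPTS}"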

The default collector for both the young and the old generation is Parallel Scavenge, which works with big, long stop-the-world pauses when it runs, during which all application threads are completely stopped. Concurrent Mark and Sweep (CMS) instead favors short stop-the-world pauses, doing most of its work concurrently while the application threads keep running, which makes it better suited to latency-critical applications like the Namenode.
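
As a sanity check, something like the following should confirm which collectors the Namenode JVM is actually using (the pid is a placeholder, and jcmd needs to run as the same user as the Namenode process):

# flags of the running Namenode JVM
sudo -u hdfs jcmd <namenode-pid> VM.flags | tr ' ' '\n' | grep GC
# defaults of the JVM installed on the host
java -XX:+PrintFlagsFinal -version | grep -E 'UseParallelGC|UseParNewGC|UseConcMarkSweepGC'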

Last but not least, the following document suggests that our current heap size (6G) might not be adequate for the number of files that we store in HDFS (~21M):

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_command-line-installation/content/configuring-namenode-heap-size.html
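
The file count can be double checked from the Namenode itself; for example (hostname and HTTP port 50070 are assumptions based on the Hadoop 2.x defaults, adjust as needed):

# FilesTotal as exposed by the Namenode JMX endpoint
curl -s 'http://analytics1001:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' | grep FilesTotal
# or count directories/files directly (slow on a big tree)
hdfs dfs -count /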

So the plan would be to:

  1. apply the CDH-suggested settings by merging https://gerrit.wikimedia.org/r/#/c/429429 (possibly trying them first on the standby Namenode for a couple of days);
  2. increase the heap size to 8G.

Event Timeline

elukey triaged this task as Medium priority. Apr 27 2018, 5:01 PM
elukey created this task.
Restricted Application added a subscriber: Aklapper. Apr 27 2018, 5:01 PM

Change 429429 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::hadoop::master: change the namenode's GC settings

https://gerrit.wikimedia.org/r/429429

elukey updated the task description. (Show Details) Apr 27 2018, 5:04 PM
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

+1 to all, thanks Luca!

Change 429429 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::hadoop::master: change the namenode's GC settings

https://gerrit.wikimedia.org/r/429429

Mentioned in SAL (#wikimedia-operations) [2018-04-30T07:31:11Z] <elukey> restart HDFS namenode on analytics1002 (standby master) to pick up new JVM settings - T193257

Mentioned in SAL (#wikimedia-operations) [2018-04-30T08:16:04Z] <elukey> force a manual failover of the HDFS Namenode from analytics1001 to analytics1002 to test new GC Settings - T193257

Mentioned in SAL (#wikimedia-operations) [2018-04-30T08:38:15Z] <elukey> restart HDFS namenode on analytics1001 (standby master) to pick up new JVM settings - T193257

Change 429756 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::hadoop::master|standby: set NN heap to 8G

https://gerrit.wikimedia.org/r/429756

Change 429756 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::hadoop::master|standby: set NN heap to 8G

https://gerrit.wikimedia.org/r/429756

The CMS settings alone were not sufficient: when we restarted the Namenode daemon on analytics1001 (which was the standby at the time), it ended up doing a ton of old-gen GC runs, clearly not a good state. After a chat with Joseph we decided that the culprit was probably the -XX:CMSInitiatingOccupancyFraction=70 setting, which triggers an old-gen GC run once that area is 70% occupied. We increased the heap size to 8G and everything settled into a state even better than the original one, namely very few and short old-gen runs and frequent small young-gen ones.
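
For the record, a sketch of the resulting Namenode JVM options after this task (assuming the occupancy fraction stayed at 70 and Xms was set equal to Xmx; the authoritative values live in Puppet):

-Xms8g -Xmx8g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled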

All the metrics have been stable for more than two hours; these settings seem really good. The Namenode needs more heap when the overall number of files crosses a certain threshold, so let's keep that in mind for the future.

elukey moved this task from In Progress to Done on the Analytics-Kanban board. Apr 30 2018, 1:03 PM
Nuria closed this task as Resolved. May 8 2018, 10:44 PM
Vvjjkkii renamed this task from Hadoop HDFS Namenode shutdown on 26/04/2018 to 22daaaaaaa. Jul 1 2018, 1:13 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed elukey as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: gerritbot, Aklapper.
elukey renamed this task from 22daaaaaaa to Hadoop HDFS Namenode shutdown on 26/04/2018. Jul 2 2018, 6:30 AM
elukey closed this task as Resolved.
elukey claimed this task.
elukey lowered the priority of this task from High to Medium.
elukey updated the task description. (Show Details)