On Thu April 26th we got paged due to the analytics1001's namenode (active) process not present. The cause was this one:
```
2018-04-26 17:37:58,103 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 20579 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2018-04-26 17:37:58,103 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 20557ms
2018-04-26 17:37:58,103 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 20579ms to send a batch of 2 edits (423 bytes) to remote journal 10.64.36.128:8485
2018-04-26 17:37:58,103 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 20579ms to send a batch of 2 edits (423 bytes) to remote journal 10.64.53.14:8485
2018-04-26 17:37:58,103 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to
[10.64.36.128:8485, 10.64.53.14:8485, 10.64.5.15:8485], stream=QuorumOutputStream starting at txid 1886903961))
```
More logs in:
https://etherpad.wikimedia.org/p/analytics-namenode-down-26042018
The journalnodes metrics don't show any sign of slowness, so it does not seems something related to persisting the edit log on them. There is one log on analytics1001 that is interesting:
```
2018-04-26 17:37:58,103 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 20557ms
```
The JvmPauseMonitor is:
> Class which sets up a simple thread which runs in a loop sleeping for a short interval of time. If the sleep takes significantly longer than its target time, it implies that the JVM or host machine has paused processing, which may cause other problems. If such a pause is detected, the thread logs a message.
Me and @JAllemandou noticed that the GC collection timings increase a lot when some worker nodes go back online after a down period (like for a reimage), and there were 4 of them before the Name Node shutdown (batches of two, not all in one go). What we believe is that the Namenode's GC run took more than the edit log persistence timeout when it was not appropriate, triggering the shutdown.
More generally, it seems that the Namenode's GC timings are really horrible, and one interesting thing is that we don't use any recommendation from Cloudera:
https://www.cloudera.com/documentation/enterprise/5-10-x/topics/admin_nn_memory_config.html
```
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
```
The default GC for young and old gen is Parallel Scavenge, that works by big/long stop-of-the-world GC pauses when it runs, so all the application threads are completely stopped. The Concurrent Mark And Sweep prefers instead small GC stop-of-the-world pauses, doing most of the work concurrently while the application threads are running (more indicated to latency critical applications, like the Namenode).
Last but not the least, the following document suggests that our current heap size (6G) might not be adequate for the number of files that we store in HDFS (~21M):
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_command-line-installation/content/configuring-namenode-heap-size.html
So the plan would be to:
1) use the CDH suggested settings merging https://gerrit.wikimedia.org/r/#/c/429429 (maybe first on the standby NameNode for a couple of days).
2) increase the heap size to 8G.