While trying to run hdfs dfsadmin -saveNamespace before the reimage of an-master1002, we got the following chain of events two times in a row (so definitely not a one-off sporadic event):
- saveNamespace starting on both Namenodes, succeeding on 1002 but taking a much longer time on 1001 (the active master).
HDFS Namenode on 1001
2021-05-25 15:52:06,199 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Save namespace ...
2021-05-25 15:52:06,199 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 6402054524, 6402056546
2021-05-25 15:52:06,269 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2024 Total time for transactions(ms): 51 Number of transactions batched in Syncs: 612 Number of syncs: 1412 SyncTimes(ms): 1802 502
2021-05-25 15:52:06,281 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /srv/hadoop/name/current/edits_inprogress_0000000006402054524 -> /srv/hadoop/name/current/edits_0000000006402054524-0000000006402056547
2021-05-25 15:52:06,311 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Saving image file /srv/hadoop/name/current/fsimage.ckpt_0000000006402056547 using no compression
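For reference, the manual checkpoint was triggered roughly with the following commands (a minimal sketch: saveNamespace requires safe mode to be entered first, and running it as the hdfs superuser via sudo is an assumption about our setup):

sudo -u hdfs hdfs dfsadmin -safemode enter    # saveNamespace refuses to run outside safe mode
sudo -u hdfs hdfs dfsadmin -saveNamespace     # write a fresh fsimage on the Namenodes
sudo -u hdfs hdfs dfsadmin -safemode leave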
- Too much time passed (15:52:06 -> 15:52:55): the ZKFC hdfs daemon declared 1002 the active Namenode and forced a thread dump. From the logs it seems that 1001 was able to make progress with the fsimage, but it took so long that in the meantime it was not responding to ZKFC's health checks.
HDFS ZKFC on 1001
2021-05-25 15:52:54,857 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:36597 remote=an-master1001.eqiad.wmnet/10.64.5.26:8040] Call From an-master1001/10.64.5.26 to an-master1001.eqiad.wmnet:8040 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:36597 remote=an-master1001.eqiad.wmnet/10.64.5.26:8040]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
2021-05-25 15:52:54,858 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
2021-05-25 15:52:54,858 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 entered state: SERVICE_NOT_RESPONDING

HDFS Namenode on 1001
2021-05-25 15:52:55,736 INFO org.apache.hadoop.http.HttpServer2: Process Thread Dump: jsp requested
248 active threads
[..long list of threads..]
2021-05-25 15:53:47,948 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Image file /srv/hadoop/name/current/fsimage.ckpt_0000000006402056547 of size 5438946925 bytes saved in 101 seconds.
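The 45000 millis in the exception is the ZKFC health-monitor RPC timeout (ha.health-monitor.rpc-timeout.ms, whose default is 45000 ms). A quick way to confirm the value the cluster is actually running with (a sketch, assuming the key is not overridden somewhere outside the standard config files):

hdfs getconf -confKey ha.health-monitor.rpc-timeout.ms    # expected to print 45000 unless tuned

Raising this timeout would likely have avoided this particular failover (the fsimage save took ~101 seconds), at the cost of detecting a genuinely stuck Namenode more slowly.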
- Critical error with the JournalNode cluster. The epoch number is advanced (by the ZKFC daemon) every time the active Namenode changes, to prevent both Namenodes from trying to write to the edit log at the same time. This seems to be what happened later on in the logs:
HDFS Namenode on 1001
2021-05-25 15:53:48,833 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 6402056548 failed for required journal (JournalAndStream(mgr=QJM to [10.64.5.27:8485, 10.64.5.29:8485, 10.64.36.113:8485, 10.64.53.29:8485, 10.64.21.116:8485], stream=null))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 3/5. 4 exceptions thrown:
10.64.5.27:8485: IPC's epoch 229 is less than the last promised epoch 230 ; journal id: analytics-hadoop
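If we need to dig further into the fencing, the JournalNodes expose the last promised and last writer epochs via their JMX endpoint; something along these lines should show them (a sketch: the default JournalNode HTTP port 8480 and the metric names are assumptions to verify against our hosts):

curl -s 'http://10.64.5.27:8480/jmx' | grep -iE 'LastPromisedEpoch|LastWriterEpoch'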
- At this point the Namenode on 1001 shuts down for safety.
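After the shutdown, the HA state can be double-checked against the surviving Namenode, for example (a sketch; the Namenode IDs below are placeholders for the IDs defined in dfs.ha.namenodes.analytics-hadoop):

sudo -u hdfs hdfs haadmin -getServiceState an-master1001-nn-id    # placeholder Namenode ID
sudo -u hdfs hdfs haadmin -getServiceState an-master1002-nn-id    # placeholder Namenode ID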
Why the Namenode on 1001 takes so long to create the FSImage is still not clear; maybe it is due to the current heap size and/or GC settings, or maybe something else.
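To follow up on the heap/GC hypothesis, a first step could be to confirm the flags that the Namenode on 1001 is actually running with, for example (a sketch assuming standard Linux tooling on the host):

NN_PID=$(pgrep -f 'org.apache.hadoop.hdfs.server.namenode.NameNode')
ps -o args= -p "$NN_PID" | tr ' ' '\n' | grep -E '^-Xm[sx]|^-XX:'    # heap sizing and GC flags

GC logs (if enabled) and the fsimage size (~5.4 GB per the log above) would then tell us whether the slow save is GC-bound or I/O-bound.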