
hdfs dfsadmin saveNamespace fails on an-master1001
Closed, Resolved · Public

Description

While trying to run hdfs dfsadmin -saveNamespace before the reimage of an-master1002, we got the following chain of events two times in a row (so it is definitely not a one-off, sporadic event):

  • saveNamespace started on both Namenodes, succeeding on 1002 but taking a long time on 1001 (the active master).
HDFS Namenode on 1001

2021-05-25 15:52:06,199 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Save namespace ...
2021-05-25 15:52:06,199 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 6402054524, 6402056546
2021-05-25 15:52:06,269 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2024 Total time for transactions(ms): 51 Number of transactions batched in Syncs: 612 Number of syncs: 1412 SyncTimes(ms): 1802 502 
2021-05-25 15:52:06,281 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /srv/hadoop/name/current/edits_inprogress_0000000006402054524 -> /srv/hadoop/name/current/edits_0000000006402054524-0000000006402056547
2021-05-25 15:52:06,311 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Saving image file /srv/hadoop/name/current/fsimage.ckpt_0000000006402056547 using no compression
  • Too much time passed (15:52:06 -> 15:52:55): the HDFS ZKFC daemon declared 1002 the active Namenode and forced a thread dump. From the logs it seems that 1001 was still making progress on the fsimage, but while doing so it stopped responding to ZKFC's health checks.
HDFS ZKFC on 1001

2021-05-25 15:52:54,857 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:36597 remote=an-master1001.eqiad.wmnet/10.64.5.26:8040] Call From an-master1001/10.64.5.26 to an-master1001.eqiad.wmnet:8040 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:36597 remote=an-master1001.eqiad.wmnet/10.64.5.26:8040]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
2021-05-25 15:52:54,858 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
2021-05-25 15:52:54,858 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 entered state: SERVICE_NOT_RESPONDING

HDFS Namenode on 1001

2021-05-25 15:52:55,736 INFO org.apache.hadoop.http.HttpServer2: Process Thread Dump: jsp requested
248 active threads
[..long list of threads..]
2021-05-25 15:53:47,948 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Image file /srv/hadoop/name/current/fsimage.ckpt_0000000006402056547 of size 5438946925 bytes saved in 101 seconds .
  • Critical error with the JournalNode cluster. The epoch number is advanced every time the active Namenode changes (driven by the ZKFC daemon), to prevent both Namenodes from writing to the edit log at the same time. This seems to be what happened later on in the logs:
HDFS Namenode on 1001

2021-05-25 15:53:48,833 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 6402056548 failed for required journal (JournalAndStream(mgr=QJM to [10.64.5.27:8485, 10.64.5.29:8485, 10.64.36.113:8485, 10.64.53.29:8485, 10.64.21.116:8485], stream=null))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 3/5. 4 exceptions thrown:
10.64.5.27:8485: IPC's epoch 229 is less than the last promised epoch 230 ; journal id: analytics-hadoop
  • At this point the Namenode on 1001 shut itself down for safety.

Why the Namenode on 1001 takes so long to create the FSImage is still not clear; it may be due to the current heap size and/or GC settings, or to something else.
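
To dig into the heap/GC hypothesis (and the 45000 ms health-check timeout seen above), something like the following can be checked on 1001. This is only a sketch: the exact JVM flags and config values depend on how the Namenode is deployed on our hosts.

# JVM heap/GC flags the Namenode process is currently running with
# (assumes a single NameNode java process on the host)
NN_PID=$(pgrep -f 'namenode.NameNode' | head -1)
sudo ps -o args= -p "$NN_PID" | tr ' ' '\n' | grep -E '^-Xm|^-XX'

# Timeout applied by the ZKFC health monitor to its RPCs (45000 ms in the log above)
hdfs getconf -confKey ha.health-monitor.rpc-timeout.ms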

Event Timeline

elukey triaged this task as High priority.May 26 2021, 4:00 PM
elukey created this task.

One thing that is not clear to me is why the issue presents itself only with saveNamespace, whereas when the standby Namenode runs its timer to fetchImage from 1001 everything works fine. For example:

elukey@an-master1002:~$ sudo ls -lht /srv/backup/hadoop/namenode
total 108G
-rw-r--r-- 1 hdfs hdfs 5.1G May 27 00:06 fsimage_0000000006407481422
-rw-r--r-- 1 hdfs hdfs 5.1G May 26 00:06 fsimage_0000000006403598226
-rw-r--r-- 1 hdfs hdfs 5.1G May 25 00:06 fsimage_0000000006400004154
-rw-r--r-- 1 hdfs hdfs 5.1G May 24 00:06 fsimage_0000000006396754323
-rw-r--r-- 1 hdfs hdfs 5.1G May 23 00:06 fsimage_0000000006394202692
-rw-r--r-- 1 hdfs hdfs 5.1G May 22 00:07 fsimage_0000000006391737248
-rw-r--r-- 1 hdfs hdfs 5.1G May 21 00:06 fsimage_0000000006388450475
[..]

The above is the list of fsimages fetched from 1001 and saved locally, and as far as I can see the process has been running fine for the past few days. The timer is also happy:

May 27 00:00:02 an-master1002 kerberos-run-command[5121]: User hdfs executes as user hdfs the command ['/usr/bin/hdfs', 'dfsadmin', '-fetchImage', '/srv/backup/hadoop/namenode']
May 27 00:00:04 an-master1002 kerberos-run-command[5121]: 21/05/27 00:00:04 INFO namenode.TransferFsImage: Opening connection to https://an-master1001.eqiad.wmnet:50470/imagetransfer?getimage=1&txid=latest
May 27 00:06:57 an-master1002 kerberos-run-command[5121]: 21/05/27 00:06:57 INFO common.Util: Combined time for fsimage download and fsync to all disks took 411.30s. The fsimage download took 411.30s at 12947.43 KB/s. Synchronous (fsync) write to disk of /srv/backup/hadoop/namenode/fsimage_0000000006407481422 took 0.00s.

It takes minutes, but the active Namenode on 1001 doesn't block while doing its checkpoint. On the active Namenode itself I see that the fsimage files are up to date:

root@an-master1001:/home/elukey# ls -l /srv/hadoop/name/current/fsimage_*
-rw-r--r-- 1 hdfs hdfs 5457793515 May 27 04:29 /srv/hadoop/name/current/fsimage_0000000006408528618
-rw-r--r-- 1 hdfs hdfs         62 May 27 04:29 /srv/hadoop/name/current/fsimage_0000000006408528618.md5
-rw-r--r-- 1 hdfs hdfs 5458775070 May 27 05:30 /srv/hadoop/name/current/fsimage_0000000006408680651
-rw-r--r-- 1 hdfs hdfs         62 May 27 05:30 /srv/hadoop/name/current/fsimage_0000000006408680651.md5

I feared that for some reason the checkpoints were not happening and too many edits needed to be folded into a new fsimage (taking a long time), but that doesn't seem to be the case.
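
For the record, a quick way to double-check this is to compare the checkpoint cadence with what is on disk (a sketch using the paths above; getconf just prints whatever is in the local hdfs-site.xml):

# Checkpoint cadence (Hadoop defaults: 3600s / 1000000 txns; our values may differ)
hdfs getconf -confKey dfs.namenode.checkpoint.period
hdfs getconf -confKey dfs.namenode.checkpoint.txns

# Newest fsimage vs newest finalized edits segment: a large txid gap would mean
# a lot of edits to fold into the next image.
ls -t /srv/hadoop/name/current/fsimage_* | grep -v '\.md5$' | head -1
ls -t /srv/hadoop/name/current/edits_0* | head -1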

Change 695933 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] hadoop: tune some HDFS Namenode GC settings

https://gerrit.wikimedia.org/r/695933

Change 695933 merged by Elukey:

[operations/puppet@production] hadoop: tune some HDFS Namenode GC settings

https://gerrit.wikimedia.org/r/695933

Very interesting graphs for the timeframes of the saveNamespace issue: https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=56&orgId=1&from=1621957037469&to=1621962440236

There is a clear slowdown of RPC handling on port 8040, which is the "Service handler" for HDFS Datanode block/health reports + ZKFC health checks. The service handler has roughly half the threads of the main handler (60 vs 127 respectively), but maybe that number is too low.
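
For reference, both pools can be read back from the config on the master (assuming the client-side hdfs-site.xml matches what the Namenode was started with):

# Main RPC handler (port 8020) and service RPC handler (port 8040) thread counts
hdfs getconf -confKey dfs.namenode.handler.count          # expected: 127
hdfs getconf -confKey dfs.namenode.service.handler.count  # expected: 60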

odimitrijevic subscribed.

Verify whether the memory changes fixed the issue during the cluster upgrade. If saveNamespace fails, use the downtime to fix the issue; if it works, proceed with the upgrade.

I think that more research is needed into why the service handler queue (60 threads total) wasn't able to process ZKFC health checks. It may be the number of threads itself (too few, even though we follow what upstream suggests) or something else, but the heap/GC changes alone might not be enough. I can keep working on this if the team doesn't have time.
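
One way to keep an eye on this while testing is the Namenode's JMX endpoint, which exposes per-port RPC metrics. A sketch; it assumes the web UI is reachable over HTTPS on 50470 with SPNEGO auth, as the fetchImage log above suggests:

# CallQueueLength > 0 and a rising RpcQueueTimeAvgTime for port 8040 mean requests
# are queueing behind busy service handler threads.
curl -s --negotiate -u : \
  'https://an-master1001.eqiad.wmnet:50470/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8040'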

Discussed in grooming today. Plan:

  • Schedule another downtime with the main intention of solving this saveNamespace problem.
    • If Luca's recent patches work and saveNamespace just works, then we will proceed with the Hadoop master OS upgrade as originally planned in T278423: Upgrade the Hadoop masters to Debian Buster
    • If saveNamespace still fails, we will use the scheduled downtime to troubleshoot and try to figure out why. We will not do the Hadoop master OS upgrade at that time.

If we aren't able to do the upgrade during that downtime, Razzi will make a new plan in T278423: Upgrade the Hadoop masters to Debian Buster for an online upgrade. We think this should be totally safe to do, as we can take one NameNode down at a time and reimage.

I took a look at the thread dump in the HDFS Namenode logs related to the saveNamespace issue. After a conversation with Joseph we were wondering about the number of service handler threads running at that time (the ones serving HDFS Datanode block/health reports and ZKFC health checks).

Historical note: we followed what is indicated in https://community.cloudera.com/t5/Community-Articles/Scaling-the-HDFS-NameNode-part-2/ta-p/246681, basically half of the log2(#datanodes) * 20 figure, when setting the current value, 60.

Joseph wondered if those were enough, so I checked their status in the thread dump to look for clues.

All the service handler threads were in this state (port 8040 is the one assigned to the service handler):

Thread 296 (IPC Server handler 59 on default port 8040):
  State: WAITING
  Blocked count: 634
  Waited count: 68826
  Waiting on java.util.concurrent.locks.ReentrantReadWriteLock$FairSync@20fe72e7
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
    java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
    org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.doLock(FSNamesystemLock.java:299)
    org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.readLock(FSNamesystemLock.java:147)
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem.readLock(FSNamesystem.java:1566)
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleHeartbeat(FSNamesystem.java:3736)
    org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendHeartbeat(NameNodeRpcServer.java:1480)
    org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.sendHeartbeat(DatanodeProtocolServerSideTranslatorPB.java:119)
    org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:30585)
    org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507)
    org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
    org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
    org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:931)
    java.security.AccessController.doPrivileged(Native Method)
    javax.security.auth.Subject.doAs(Subject.java:422)
    org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)

All the threads are blocked in org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.sendHeartbeat, waiting on a fair lock, which basically follows a FIFO order (see https://engineering.linkedin.com/blog/2021/the-exabyte-club--linkedin-s-journey-of-scaling-the-hadoop-distr). What is maybe happening is the following:

  • an admin issues an hdfs dfsadmin -saveNamespace
  • HDFS locks up to save the fsImage (not sure why this doesn't happen during regular checkpoints, still a mystery to me)
  • the Datanode heartbeats start to pile up, consuming all the service handler threads
  • the ZKFC daemon tries to send an RPC to port 8040 to get the Namenode's state but ends up timing out (no threads are available to serve the request).

If this is true, we should bump the number of service handler threads to something more like the number of datanodes plus a safe margin.
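
A quick way to get the figure to size against (hedged: -report needs HDFS superuser privileges and a valid Kerberos ticket):

# Datanodes currently heartbeating to the Namenode; the service handler pool
# would then be sized as this number plus a safe margin (vs. the 60 threads today).
hdfs dfsadmin -report -live | grep -c '^Name:'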

Historical note: we followed what is indicated in https://community.cloudera.com/t5/Community-Articles/Scaling-the-HDFS-NameNode-part-2/ta-p/246681, basically half of the log2(#datanodes) * 20 figure, when setting the current value, 60.

In the code I see this set to 127 since February 2021. I'm not sure what the half was there for in the calculation, but it doesn't seem to be reflected in the current setting. In theory 127 should be sufficiently higher than our 78 worker nodes each running a datanode.

There are two separate RPC queues and related settings:

  • The regular RPC handler listens on port 8020 and, by default (if you don't specify any other setting), handles both HDFS RPCs (basically file system commands) and HDFS maintenance/consistency tasks (block and health reports from datanodes, ZKFC health checks, etc.). The setting that you pointed out (127, more or less log2(#datanodes)*20) is related to this queue.
  • The service RPC handler runs on port 8040 (it is optional but we configure it; see the next two settings after the one you pointed out). When configured, this queue takes over the HDFS maintenance/consistency tasks mentioned above, so that the two kinds of traffic are split and handled with different priorities. In this case the setting is 60, more or less half of the 127 figure.

You can see the traffic for both at: https://grafana-rw.wikimedia.org/d/000000585/hadoop?viewPanel=57&orgId=1

During the saveNamespace attempts the call queue length for port 8040 (the service handler) was the only one that showed a slowdown; any value above 0 means that some queuing is happening:

https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=54&orgId=1&from=1621957037469&to=1621962440236
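
To make the "half of log2(#datanodes) * 20" concrete, here is the arithmetic for our ~78 datanodes (just a worked example of the upstream rule of thumb):

awk 'BEGIN { n = 78; h = log(n)/log(2) * 20;
             printf "handler.count ~ %.0f, service.handler.count (half) ~ %.0f\n", h, h/2 }'
# -> handler.count ~ 126, service.handler.count (half) ~ 63   (we run 127 and 60)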

Change 698194 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] hadoop: increase the HDFS Namenode's service handler threads

https://gerrit.wikimedia.org/r/698194

Change 698194 merged by Razzi:

[operations/puppet@production] hadoop: increase the HDFS Namenode's service handler threads

https://gerrit.wikimedia.org/r/698194

elukey claimed this task.

After the changes to heap size and service handler workers we were able to run saveNamespace successfully!
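
For future reference, the standard sequence for this operation is roughly the following (a minimal sketch; the exact Kerberos wrapper used on our hosts may differ):

# saveNamespace refuses to run unless the Namenode is in safe mode.
# Run as the hdfs superuser with a valid Kerberos ticket.
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave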