Update April 2024
I am reopening this ticket, since we continue to see this behaviour frequently when running the sre.hadoop.roll-restart-masters cookbook.
It is particularly prevalent on the fail-back operation.
Since the ticket was originally created we have migrated the namenode services to new hosts (an-master100[3-4]) and increased the Java heap available to the namenode process.
Original ticket content below
We have seen an incident in which the operation to fail back from an-master1002 to an-master1001 did not work.
It was first seen when using the sre.hadoop.roll-restart-masters cookbook, but it has since been observed when running the commands manually.
The error displayed at the CLI is as follows:
btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
Operation failed: Call From an-master1001/10.64.5.26 to an-master1001.eqiad.wmnet:8019 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:37979 remote=an-master1001.eqiad.wmnet/10.64.5.26:8019]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
It appears that all of the NameNode's RPC handler threads are exhausted, so the ZKFC health checks are timing out.
Looking at the log file an-master1001:/var/log/hadoop-hdfs/hadoop-hdfs-zkfc-an-master1001.log, we can see the following warning at 45 seconds, followed by a failure at 60 seconds.
2022-06-09 14:41:49,639 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 active...
2022-06-09 14:42:35,590 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:34867 remote=an-master1001.eqiad.wmnet/10.64.5.26:8040]
Call From an-master1001/10.64.5.26 to an-master1001.eqiad.wmnet:8040 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:34867 remote=an-master1001.eqiad.wmnet/10.64.5.26:8040]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
2022-06-09 14:42:35,590 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
2022-06-09 14:42:49,711 ERROR org.apache.hadoop.ha.ZKFailoverController: Couldn't make NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 active
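For context, the 45-second and 60-second intervals above appear to line up with two standard Hadoop HA timeouts: the ZKFC health-monitor RPC timeout (default 45000 ms) and the timeout for the ZKFC's transition-to-active call (default 60000 ms). A possible mitigation sketch is below; the raised values are illustrative guesses, not tested recommendations, and whether raising timeouts merely masks the underlying handler exhaustion would need investigation.

```xml
<!-- core-site.xml: ZKFC health-monitor RPC timeout
     (default 45000 ms, matching the 45-second warning above) -->
<property>
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <value>90000</value> <!-- illustrative value only -->
</property>

<!-- core-site.xml: timeout for the ZKFC's call to make a NameNode active
     (default 60000 ms, matching the 60-second failure above) -->
<property>
  <name>ha.failover-controller.new-active.rpc-timeout.ms</name>
  <value>120000</value> <!-- illustrative value only -->
</property>

<!-- hdfs-site.xml: size of the NameNode RPC handler pool; increasing it
     may relieve handler exhaustion at the cost of extra memory/contention -->
<property>
  <name>dfs.namenode.handler.count</name>
  <value>64</value> <!-- illustrative value only -->
</property>
```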
Followed shortly afterwards by:
2022-06-09 14:42:54,735 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on default port 8019, call Call#0 Retry#0 org.apache.hadoop.ha.ZKFCProtocol.gracefulFailover from 10.64.5.26:37979
org.apache.hadoop.ha.ServiceFailedException: Unable to become active. Service became unhealthy while trying to failover
The namenode process on an-master1001 is gracefully terminated after the failover attempt.
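When the graceful failover fails like this, it can be useful to confirm which NameNode (if either) ended up active before retrying, and to check whether the IPC handler threads really are blocked. A minimal sketch using the standard haadmin and JDK tooling; the service IDs are the ones from the failover command above, and the pid lookup is a placeholder to be filled in by hand:

```shell
# Check the current HA state of each NameNode.
sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet

# Dump the NameNode's thread stacks to see whether the IPC Server handler
# threads are all blocked (supporting the handler-exhaustion hypothesis).
# Replace <namenode-pid> with the actual NameNode process id.
sudo -u hdfs jstack <namenode-pid> | grep -A 5 'IPC Server handler'
```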