## Update April 2024
I am reopening this ticket, since we continue to see this behaviour frequently when running the `sre.hadoop.roll-restart-masters` cookbook.
It is particularly prevalent on the **fail-back** operation.
Since the ticket was originally created, we have migrated the namenode services to new hosts (`an-master100[3-4]`) and increased the Java heap available to the namenode process.
## Original ticket content below
----
We have seen an incident whereby the operation to fail back from an-master1002 to an-master1001 does not work.
It was first seen when using the `sre.hadoop.roll-restart-masters` cookbook, but it has since been observed when running the commands manually.
The error displayed at the CLI is as follows:
```
btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
Operation failed: Call From an-master1001/10.64.5.26 to an-master1001.eqiad.wmnet:8019 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:37979 remote=an-master1001.eqiad.wmnet/10.64.5.26:8019]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
```
It appears that all of the namenode's RPC handler threads are exhausted, so the ZKFC health checks time out.
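One way to check the handler-exhaustion hypothesis is to look at the namenode's RPC metrics via its `/jmx` endpoint (the `RpcActivityForPort*` beans expose `CallQueueLength` and `NumOpenConnections`). A minimal sketch, assuming the default Hadoop 3 namenode HTTP port 9870; the host, port, and threshold here are illustrative, not taken from our configuration:

```python
import json
from urllib.request import urlopen

# Hypothetical endpoint: 9870 is the Hadoop 3 default namenode HTTP port.
JMX_URL = ("http://an-master1001.eqiad.wmnet:9870/jmx"
           "?qry=Hadoop:service=NameNode,name=RpcActivity*")


def rpc_pressure(jmx: dict) -> dict:
    """Extract call-queue and connection metrics from a /jmx payload,
    keyed by RPC port."""
    out = {}
    for bean in jmx.get("beans", []):
        name = bean.get("name", "")
        if "RpcActivityForPort" in name:
            port = name.rsplit("Port", 1)[-1]
            out[port] = {
                "call_queue_length": bean.get("CallQueueLength"),
                "open_connections": bean.get("NumOpenConnections"),
            }
    return out


if __name__ == "__main__":
    with urlopen(JMX_URL, timeout=10) as resp:
        print(rpc_pressure(json.load(resp)))
```

A sustained non-zero `CallQueueLength` while the ZKFC health checks are timing out would be consistent with handler exhaustion rather than a network problem.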
In the log file `an-master1001:/var/log/hadoop-hdfs/hadoop-hdfs-zkfc-an-master1001.log` we can see a warning at 45 seconds, followed by a failure at 60 seconds.
```
2022-06-09 14:41:49,639 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 active...
2022-06-09 14:42:35,590 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:34867 remote=an-master1001.eqiad.wmnet/10.64.5.26:8040] Call From an-master1001/10.64.5.26 to an-master1001.eqiad.wmnet:8040 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:34867 remote=an-master1001.eqiad.wmnet/10.64.5.26:8040]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
2022-06-09 14:42:35,590 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
2022-06-09 14:42:49,711 ERROR org.apache.hadoop.ha.ZKFailoverController: Couldn't make NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 active
```
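The two intervals in the log appear to correspond to Hadoop's default ZKFC timeouts: the 45 000 ms HealthMonitor warning matches `ha.health-monitor.rpc-timeout.ms` (default 45000), and the 60 seconds between "Trying to make NameNode ... active" and "Couldn't make NameNode ... active" matches `ha.failover-controller.new-active.rpc-timeout.ms` (default 60000). A sketch of raising them in `core-site.xml`; the values chosen are illustrative, and note this only masks a slow namenode rather than fixing it:

```xml
<!-- Sketch only: raises the ZKFC timeouts; values are examples. -->
<property>
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <value>90000</value> <!-- default 45000: the 45 s HealthMonitor warning -->
</property>
<property>
  <name>ha.failover-controller.new-active.rpc-timeout.ms</name>
  <value>120000</value> <!-- default 60000: the 60 s transition-to-active failure -->
</property>
```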
This is followed shortly afterwards by:
```
2022-06-09 14:42:54,735 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on default port 8019, call Call#0 Retry#0 org.apache.hadoop.ha.ZKFCProtocol.gracefulFailover from 10.64.5.26:37979
org.apache.hadoop.ha.ServiceFailedException: Unable to become active. Service became unhealthy while trying to failover
```
The namenode process on an-master1001 is gracefully terminated after the failover attempt.