
HDFS Namenode fail-back failure
Open, High, Public

Assigned To
None
Authored By
BTullis
Jun 9 2022, 3:18 PM
Referenced Files
Restricted File
Jun 21 2022, 1:50 PM
Restricted File
Jun 21 2022, 12:10 PM
F35224344: image.png
Jun 10 2022, 9:24 AM
F35222963: image.png
Jun 9 2022, 5:14 PM

Description

Update April 2024

I am reopening this ticket, since we continue to see this behaviour frequently when running the sre.hadoop.roll-restart-masters cookbook.

It is particularly prevalent on the fail-back operation.

Since the ticket was originally created we have migrated the namenode services to new hosts (an-master100[3-4]) and increased the Java heap available to the namenode process.

Original ticket content below


We have seen an incident whereby the operation to fail back from an-master1002 to an-master1001 does not work.

It was first seen when using the sre.hadoop.roll-restart-masters cookbook, but it has since been observed when running the commands manually.

The error displayed at the CLI is as follows:

btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
Operation failed: Call From an-master1001/10.64.5.26 to an-master1001.eqiad.wmnet:8019 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:37979 remote=an-master1001.eqiad.wmnet/10.64.5.26:8019]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout

It appears that all of the namenode's service handler threads are exhausted and that the ZKFC health checks are timing out as a result.
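A quick way to check whether the handler threads really are saturated while this is happening is to look at them directly on the host. This is only a sketch: it assumes the matching JDK tools (jstack) are installed, and that the handler threads use the standard Hadoop "IPC Server handler N on <port>" naming.

# Find the namenode process (its command line contains -Dproc_namenode).
NN_PID=$(pgrep -u hdfs -f proc_namenode)

# List the IPC server handler threads and their current states.
sudo -u hdfs jstack "$NN_PID" | grep -A1 'IPC Server handler'

# The configured number of namenode service handlers, for comparison.
hdfs getconf -confKey dfs.namenode.service.handler.count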

Looking at the log file an-master1001:/var/log/hadoop-hdfs/hadoop-hdfs-zkfc-an-master1001.log, we can see the following warning at 45 seconds, followed by a failure at 60 seconds.

2022-06-09 14:41:49,639 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 active...
2022-06-09 14:42:35,590 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:34867 remote=an-master1001.eqiad.wmnet/10.64.5.26:8040] Call From an-master1001/10.64.5.26 to an-master1001.eqiad.wmnet:8040 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:34867 remote=an-master1001.eqiad.wmnet/10.64.5.26:8040]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
2022-06-09 14:42:35,590 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
2022-06-09 14:42:49,711 ERROR org.apache.hadoop.ha.ZKFailoverController: Couldn't make NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 active

Followed shortly afterwards by:

2022-06-09 14:42:54,735 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on default port 8019, call Call#0 Retry#0 org.apache.hadoop.ha.ZKFCProtocol.gracefulFailover from 10.64.5.26:37979
org.apache.hadoop.ha.ServiceFailedException: Unable to become active. Service became unhealthy while trying to failover

The namenode process on an-master1001 is gracefully terminated after the failover attempt.

Event Timeline

BTullis triaged this task as High priority. Jun 9 2022, 3:19 PM
BTullis moved this task from Incoming (new tickets) to Ops Week on the Data-Engineering board.
BTullis moved this task from Next Up to In Progress on the Data-Engineering-Kanban board.

I think that this is similar in nature to the incident reported here: T283733: hdfs dfsadmin saveNamespace fails on an-master1001

One of the steps taken to resolve that incident was an increase to the value of dfs.namenode.service.handler.count from 60 to 90.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/698194/

I suspect that these service handlers are once again becoming exhausted during this failback operation, so I am minded to increase the value again.
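For reference, the deployed value can be confirmed on the namenode host before and after any change (a sketch; in our case the setting is managed via Puppet rather than by editing hdfs-site.xml directly):

# Confirm the currently deployed service handler count.
hdfs getconf -confKey dfs.namenode.service.handler.count

# The equivalent hdfs-site.xml property that the Puppet change manages, for illustration:
#   <property>
#     <name>dfs.namenode.service.handler.count</name>
#     <value>90</value>
#   </property>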

I tried starting the namenode on an-master1001 with 120 service handler threads instead of 100.

It seemed to make a small difference, but it ultimately failed in the same way.

I've reset the handler threads back to 100 and now I'm trying again with the JVM heap increased from 64 GB to 72 GB.

2022-06-09 15:54:05,101 INFO org.apache.hadoop.util.GSet: 0.029999999329447746% max memory 72 GB = 22.1 MB
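To double-check that the new heap size has actually been picked up by the running process, rather than relying only on the GSet log line above, something like the following can be used (a sketch; it assumes jcmd from the matching JDK is installed on the host):

# Print the effective maximum heap size of the running namenode.
NN_PID=$(pgrep -u hdfs -f proc_namenode)
sudo -u hdfs jcmd "$NN_PID" VM.flags | tr ' ' '\n' | grep MaxHeapSize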

It has worked with a heap of 72 GB, however it had also had about an hour to settle before I tried the failover.

btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
Failover to NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 successful

This is from the hadoop-hdfs-zkfc-an-master1001.log file.

2022-06-09 17:00:14,919 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 to active state
2022-06-09 17:00:14,937 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully became active. Successfully transitioned NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 to active state

I will monitor for stability for a day or two, before considering whether to increase the heap size permanently.

Interestingly, all that restarting of the namenode process and repeatedly loading the fsimage made a noticeable difference to the temperature of the server.

image.png (341×920 px, 69 KB)

Grafana link

Change 804551 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Increase the JVM heap for the Hadoop namenode servers

https://gerrit.wikimedia.org/r/804551

I have created a CR to increase the JVM heap on the namenodes. https://gerrit.wikimedia.org/r/c/operations/puppet/+/804551

Checking the memory utilization graph on the server, we can see that even with a 72 GB heap a good proportion of the host's RAM remains unused, even accounting for the O/S page cache. Therefore I think that this increase is safe to do.

image.png (925×1 px, 208 KB)
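The same check can be made quickly from the host itself, alongside the Grafana graph (a sketch):

# Overall memory picture: used, free, and what is held in the page cache.
free -h

# The same figures straight from the kernel, if more precision is wanted.
grep -E 'MemTotal|MemFree|^Cached' /proc/meminfo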

Change 804551 merged by Btullis:

[operations/puppet@production] Increase the JVM heap for the Hadoop namenode servers

https://gerrit.wikimedia.org/r/804551

I've merged that patch, but I will wait until next week before running the cookbook again to test it.

Mentioned in SAL (#wikimedia-analytics) [2022-06-14T08:48:41Z] <btullis> roll-restarting hadoop masters T310293

Change 805323 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove stray letters from the JVM heap config

https://gerrit.wikimedia.org/r/805323

Change 805323 merged by Btullis:

[operations/puppet@production] Remove stray letters from the JVM heap config

https://gerrit.wikimedia.org/r/805323

The cookbook failed again at the same place.
{F35259786,width=60%}

I will have to look at this again to see if there is anything more we can do to increase the stability of this failback process.

@JAllemandou has suggested that increasing the current 10-minute waiting period, which allows the namenode on an-master1001 to settle after its restart, might be sufficient.
I'm happy to make that change, but I'd like to see if there is anything else we can do as well.

This time I left it for 38 minutes between starting the namenode on an-master1001 and attempting the failback. It still failed even with almost four times as long to settle.

btullis@an-master1001:~$ sudo systemctl status hadoop-hdfs-namenode.service
● hadoop-hdfs-namenode.service - LSB: Hadoop namenode
   Loaded: loaded (/etc/init.d/hadoop-hdfs-namenode; generated)
   Active: active (running) since Tue 2022-06-21 12:06:26 UTC; 38min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 12271 ExecStart=/etc/init.d/hadoop-hdfs-namenode start (code=exited, status=0/SUCCESS)
 Main PID: 12312 (java)
    Tasks: 347 (limit: 4915)
   Memory: 81.1G
   CGroup: /system.slice/hadoop-hdfs-namenode.service
           └─12312 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Dproc_namenode -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-hdfs -Dhadoop.log.file=hadoop-hdfs-namenode-an-master1001.log -Dhadoop.home.dir=/usr

Jun 21 12:06:17 an-master1001 systemd[1]: Starting LSB: Hadoop namenode...
Jun 21 12:06:17 an-master1001 runuser[12292]: pam_unix(runuser:session): session opened for user hdfs by (uid=0)
Jun 21 12:06:17 an-master1001 hadoop-hdfs-namenode[12271]: starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.out
Jun 21 12:06:21 an-master1001 runuser[12292]: pam_unix(runuser:session): session closed for user hdfs
Jun 21 12:06:26 an-master1001 hadoop-hdfs-namenode[12271]: Started Hadoop namenode:.
Jun 21 12:06:26 an-master1001 systemd[1]: Started LSB: Hadoop namenode.
btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
standby
btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
Operation failed: Call From an-master1001/10.64.5.26 to an-master1001.eqiad.wmnet:8019 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:35775 remote=an-master1001.eqiad.wmnet/10.64.5.26:8019]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout

Trying once again, this time having left it for an hour after starting the namenode service on an-master1001. This is the GC count over the last three hours.
{F35260058}

btullis@an-master1001:~$ sudo systemctl status hadoop-hdfs-namenode.service
● hadoop-hdfs-namenode.service - LSB: Hadoop namenode
   Loaded: loaded (/etc/init.d/hadoop-hdfs-namenode; generated)
   Active: active (running) since Tue 2022-06-21 12:48:04 UTC; 1h 1min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 28722 ExecStart=/etc/init.d/hadoop-hdfs-namenode start (code=exited, status=0/SUCCESS)
 Main PID: 28816 (java)
    Tasks: 344 (limit: 4915)
   Memory: 81.0G
   CGroup: /system.slice/hadoop-hdfs-namenode.service
           └─28816 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Dproc_namenode -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-hdfs -Dhadoop.log.file=hadoop-hdfs-namenode-an-master1001.log -Dhadoop.home.dir=/usr

Jun 21 12:47:55 an-master1001 systemd[1]: Starting LSB: Hadoop namenode...
Jun 21 12:47:55 an-master1001 runuser[28796]: pam_unix(runuser:session): session opened for user hdfs by (uid=0)
Jun 21 12:47:55 an-master1001 hadoop-hdfs-namenode[28722]: starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.out
Jun 21 12:47:59 an-master1001 runuser[28796]: pam_unix(runuser:session): session closed for user hdfs
Jun 21 12:48:04 an-master1001 hadoop-hdfs-namenode[28722]: Started Hadoop namenode:.
Jun 21 12:48:04 an-master1001 systemd[1]: Started LSB: Hadoop namenode.
btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
standby
btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet

Negative. It did not work. It failed with a slightly different error message this time though.

btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
Operation failed: Unable to become active. Service became unhealthy while trying to failover.
        at org.apache.hadoop.ha.ZKFailoverController.doGracefulFailover(ZKFailoverController.java:697)
        at org.apache.hadoop.ha.ZKFailoverController.access$400(ZKFailoverController.java:62)
        at org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:619)
        at org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:616)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
        at org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:616)
        at org.apache.hadoop.ha.ZKFCRpcServer.gracefulFailover(ZKFCRpcServer.java:94)
        at org.apache.hadoop.ha.protocolPB.ZKFCProtocolServerSideTranslatorPB.gracefulFailover(ZKFCProtocolServerSideTranslatorPB.java:61)
        at org.apache.hadoop.ha.proto.ZKFCProtocolProtos$ZKFCProtocolService$2.callBlockingMethod(ZKFCProtocolProtos.java:1548)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:931)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2854)

Change 807925 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Increase the java heap for the Hadoop namenodes again

https://gerrit.wikimedia.org/r/807925

I have created a patch to increase the heap once again, from 72 GB to 84 GB.

I believe that this is still safe to do because the host has around 16 GB of unused RAM and around 18 GB of RAM that is cached by the O/S.

If this doesn't alleviate the problem, then we either have to think about reducing the sensitivity of the health checks or increasing the amount of RAM in the servers, because we're not likely to want to reduce the number of files in HDFS.
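On the health check sensitivity option: based on the 45000 ms and 60000 ms values in the logs, the relevant knobs are, I believe, the standard Hadoop HA timeouts shown below. This is only a sketch of confirming their current values; the property names would need verifying against our deployed config before changing anything.

# Health monitor RPC timeout: the 45000 ms value seen in the HealthMonitor warnings.
hdfs getconf -confKey ha.health-monitor.rpc-timeout.ms

# Timeout for the ZKFC asking the local namenode to become active: the 60000 ms value
# seen when "Trying to make NameNode ... active" fails.
hdfs getconf -confKey ha.failover-controller.new-active.rpc-timeout.ms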

Change 807925 merged by Btullis:

[operations/puppet@production] Increase the java heap for the Hadoop namenodes again

https://gerrit.wikimedia.org/r/807925

I've applied the patch. I will restart an-master1001 to pick up the new settings, then wait at least 10 minutes before trying another failback operation from an-master1002.
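The full sequence is roughly as follows (a sketch consolidating commands already used earlier in this task):

# Deploy the new heap settings, restart the standby namenode, let it settle, then fail back.
sudo run-puppet-agent
sudo systemctl restart hadoop-hdfs-namenode.service
sleep 600   # wait at least 10 minutes

# Sanity check: an-master1001 should report as standby before the failback.
sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet

sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet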

btullis@an-master1001:~$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for an-master1001.eqiad.wmnet
Info: Unable to serialize catalog to json, retrying with pson
Info: Applying configuration version '(1541fcc64b) Btullis - Increase the java heap for the Hadoop namenodes again'
Notice: /Stage[main]/Bigtop::Hadoop/File[/etc/hadoop/conf.analytics-hadoop/hadoop-env.sh]/content:
--- /etc/hadoop/conf.analytics-hadoop/hadoop-env.sh     2022-06-14 09:18:46.948866101 +0000
+++ /tmp/puppet-file20220623-31627-1auw9ha      2022-06-23 13:01:58.534785928 +0000
@@ -27,7 +27,7 @@


 # Any additionally supplied HADOOP_NAMENODE_OPTS
-HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS -Xms73728m -Xmx73728m -Djava.net.preferIPv4Stack=false -XX:+UseG1GC -XX:MaxGCPauseMillis=4000 -XX:ParallelGCThreads=20 -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=[::]:10080:/etc/prometheus/hdfs_namenode_jmx_exporter.yaml"
+HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS -Xms86016m -Xmx86016m -Djava.net.preferIPv4Stack=false -XX:+UseG1GC -XX:MaxGCPauseMillis=4000 -XX:ParallelGCThreads=20 -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=[::]:10080:/etc/prometheus/hdfs_namenode_jmx_exporter.yaml"

 # Any additionally supplied HADOOP_DATANODE_OPTS
 HADOOP_DATANODE_OPTS="$HADOOP_DATANODE_OPTS -Xms8192m -Xmx8192m -Djava.net.preferIPv4Stack=false -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=[::]:51010:/etc/prometheus/hdfs_datanode_jmx_exporter.yaml"

Info: Computing checksum on file /etc/hadoop/conf.analytics-hadoop/hadoop-env.sh
Info: /Stage[main]/Bigtop::Hadoop/File[/etc/hadoop/conf.analytics-hadoop/hadoop-env.sh]: Filebucketed /etc/hadoop/conf.analytics-hadoop/hadoop-env.sh to puppet with sum dad2475868c33d5c483ebc139a3dfc1f
Notice: /Stage[main]/Bigtop::Hadoop/File[/etc/hadoop/conf.analytics-hadoop/hadoop-env.sh]/content: content changed '{md5}dad2475868c33d5c483ebc139a3dfc1f' to '{md5}608f199e13ae5bf7e4ecaa1be14819ef'
Notice: Applied catalog in 81.87 seconds

Nope, even with an 84 GB heap and having waited over 30 minutes it still failed.

btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
standby
btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
Operation failed: Call From an-master1001/10.64.5.26 to an-master1001.eqiad.wmnet:8019 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:34355 remote=an-master1001.eqiad.wmnet/10.64.5.26:8019]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
btullis@an-master1001:~$

I will investigate the health check logs and the options for tuning their sensitivity.

These are some entries from the health check log at the time: an-master1001:/var/log/hadoop-hdfs/hadoop-hdfs-zkfc-an-master1001.log (stacktraces trimmed)

2022-06-23 13:39:57,868 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/an-master1001.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2022-06-23 13:39:57,893 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Allowed RPC access from hdfs/an-master1001.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS) at 10.64.5.26
2022-06-23 13:39:57,898 INFO org.apache.hadoop.ha.ZKFailoverController: Asking NameNode at an-master1002.eqiad.wmnet/10.64.21.110:8040 to cede its active state for 10000ms
2022-06-23 13:40:00,243 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2022-06-23 13:40:00,244 INFO org.apache.hadoop.ha.ActiveStandbyElector: No old node to fence
2022-06-23 13:40:00,244 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/analytics-hadoop/ActiveBreadCrumb to indicate that the local node is the most recent active...
2022-06-23 13:40:00,246 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 active...
2022-06-23 13:41:00,318 ERROR org.apache.hadoop.ha.ZKFailoverController: Couldn't make NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 active
java.net.SocketTimeoutException: Call From an-master1001/10.64.5.26 to an-master1001.eqiad.wmnet:8040 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.64.5.26:32907 remote=an-master1001.eqiad.wmnet/10.64.5.26:8040]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout

2022-06-23 13:41:10,339 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on default port 8019, call Call#0 Retry#0 org.apache.hadoop.ha.ZKFCProtocol.gracefulFailover from 10.64.5.26:34355: output error
2022-06-23 13:41:10,339 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on default port 8019 caught an exception

2022-06-23 13:41:33,769 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040: java.io.EOFException End of File Exception between local host is: "an-master1001/10.64.5.26"; destination host is: "an-master1001.eqiad.wmnet":8040; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException
2022-06-23 13:41:33,769 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
2022-06-23 13:41:33,770 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 entered state: SERVICE_NOT_RESPONDING
2022-06-23 13:41:33,772 WARN org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Can't get local NN thread dump due to Connection refused (Connection refused)
2022-06-23 13:41:33,772 INFO org.apache.hadoop.ha.ZKFailoverController: Quitting master election for NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 and marking that fencing is necessary
2022-06-23 13:41:33,772 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2022-06-23 13:41:33,775 INFO org.apache.zookeeper.ZooKeeper: Session: 0xeb00dd22c9c503fe closed
2022-06-23 13:41:33,775 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down for session: 0xeb00dd22c9c503fe
2022-06-23 13:41:35,777 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: an-master1001.eqiad.wmnet/10.64.5.26:8040. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)

So it keeps hitting a 60 second timeout.

We decided to attempt another failback from an-master1002 to an-master1001. This time we wanted to time it to coincide with a quiet period of cluster activity.

By looking at this graph of combined datanode activity, we reasoned that about 45-50 minutes past each hour is the quietest time.
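As a rough sanity check that the cluster really is quiet before attempting the failback, the namenode's Prometheus JMX exporter can also be queried directly. A sketch only: the port 10080 comes from the hadoop-env.sh shown earlier in this task, but the exact metric names depend on our exporter configuration and are an assumption here.

# Peek at the RPC call queue length and processing time on the currently active
# namenode (an-master1002) just before attempting the failback.
curl -s http://an-master1002.eqiad.wmnet:10080/metrics \
  | grep -iE 'callqueuelength|rpcprocessingtime' \
  | head -n 20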

This time the failback operation was successful. This is the command output.

btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
Failover to NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 successful

This is the relevant snippet from the ZKFC log.

2022-06-29 11:52:02,322 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/an-master1001.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2022-06-29 11:52:02,357 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Allowed RPC access from hdfs/an-master1001.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS) at 10.64.5.26
2022-06-29 11:52:02,371 INFO org.apache.hadoop.ha.ZKFailoverController: Asking NameNode at an-master1002.eqiad.wmnet/10.64.21.110:8040 to cede its active state for 10000ms
2022-06-29 11:52:04,628 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2022-06-29 11:52:04,630 INFO org.apache.hadoop.ha.ActiveStandbyElector: No old node to fence
2022-06-29 11:52:04,630 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/analytics-hadoop/ActiveBreadCrumb to indicate that the local node is the most recent active...
2022-06-29 11:52:04,632 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 active...
2022-06-29 11:52:15,699 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 to active state

So this is a great result and gives weight to the hypothesis that this type of failover/failback operation is currently at greater risk of failure when the cluster activity is higher. It doesn't exactly give us a smoking gun, but we're getting closer to the answer.

NICE! That is good news. It will be interesting to test a failover during high cluster activity after we icebergify too!

I'm tempted to resolve this task for now, given that we know there is a workaround: avoiding failover operations during the busiest times for the cluster.

Change 895127 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/puppet@production] hadoop: set quota init threads to speed up failover

https://gerrit.wikimedia.org/r/895127

Change 895127 merged by Nicolas Fraison:

[operations/puppet@production] hadoop: set quota init threads to speed up failover

https://gerrit.wikimedia.org/r/895127
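For context on the patch above: when a namenode transitions to active it has to initialise quota usage for the namespace, and with a namespace this large that step is, I believe, slow enough to contribute to the failover latency. The dfs.namenode.quota.init-threads property parallelises that work. A sketch of checking the deployed value (the value chosen is in the linked Gerrit change):

# Number of threads used for quota initialisation when a namenode becomes active;
# a low value can make the transition-to-active step slow on a large namespace.
hdfs getconf -confKey dfs.namenode.quota.init-threads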

BTullis renamed this task from HDFS Namenode failover failure to HDFS Namenode fail-back failure. Apr 15 2024, 11:14 AM
BTullis reopened this task as Open.
BTullis edited projects, added Data-Platform-SRE; removed Data-Engineering.
BTullis updated the task description. (Show Details)
BTullis updated the task description. (Show Details)
BTullis removed subscribers: xcollazo, EChetty, Ottomata, elukey.
BTullis moved this task from Incoming to Bugs / Issues on the Data-Platform-SRE board.