
Investigate failure of Hadoop namenode coinciding with krb1001 reboot
Closed, Resolved (Public)

Description

On the morning of 2023-09-12 we experienced a failure of the Hadoop namenode on an-master1001.
The standby namenode an-master1002 took over gracefully, so the HDFS services weren't adversely affected, but we need to investigate why the failure occurred.
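
For reference, a quick way to confirm which namenode is active versus standby is hdfs haadmin; a rough sketch is below (the namenode service IDs are placeholders, not values taken from our hdfs-site.xml):

  # Sketch only: run on either master with a valid Kerberos ticket for the hdfs principal.
  # The service IDs below are assumed, not copied from our configuration.
  sudo -u hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
  sudo -u hdfs hdfs haadmin -getServiceState an-master1002-eqiad-wmnet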

The incident happened at the same time as a scheduled reboot of krb1001, which is our primary Kerberos KDC.
We have a standby KDC in krb2002, so we need to understand why the namenode services didn't fail over transparently to this standby KDC.
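
For context, client-side KDC failover relies on the realm listing more than one kdc entry in krb5.conf, roughly as in the sketch below (the krb2002 FQDN and the exact stanza are assumptions, not copied from our puppetised config):

  # Sketch of the failover-relevant part of /etc/krb5.conf (hostnames partly assumed).
  [realms]
      WIKIMEDIA = {
          kdc = krb1001.eqiad.wmnet
          kdc = krb2002.codfw.wmnet
          admin_server = krb1001.eqiad.wmnet
      }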

It could be related to the journalnode configuration.

Event Timeline

I've done a bit of investigation, but don't have many useful clues yet.
I manually checked the journalnode logs on all five hosts (an-worker[1078,1080,1090,1142].eqiad.wmnet and analytics1072.eqiad.wmnet)

...and scanned them for errors and warnings with something like: grep -v INFO /var/log/hadoop-hdfs/hadoop-hdfs-journalnode-an-worker1078.log | less
There was nothing logged close to the time of the incident, at approximately 07:05 on 2023-09-12.
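
A slightly fuller sketch of that per-host scan, narrowed to the incident window (the $(hostname -s) substitution assumes the log filename matches the short hostname):

  # Run on each journalnode host; keep only non-INFO lines from around the incident.
  grep -v INFO /var/log/hadoop-hdfs/hadoop-hdfs-journalnode-$(hostname -s).log \
    | grep '^2023-09-12 0[67]:'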

I also checked for any unusual Kerberos messages, but there didn't seem to be anything out of the ordinary.
We have normal INFO messages about Kerberos authentication at the time, e.g.:

2023-09-12 07:02:11,725 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/an-master1002.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2023-09-12 07:03:12,281 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/an-master1002.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2023-09-12 07:04:12,434 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/an-master1002.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2023-09-12 07:05:42,817 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/an-master1002.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2023-09-12 07:14:03,488 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/an-master1001.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2023-09-12 07:19:01,502 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/an-master1001.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
2023-09-12 07:20:01,654 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/an-master1001.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)

The namenode on an-master1001 had a 20-second delay while waiting for a quorum of journalnodes to respond, but there is nothing in the journalnode logs to explain why this delay occurred.
It looks like this was a period of quite busy write activity for the cluster: https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=1694488282106&to=1694509495688
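
One note on the 20 seconds: if I remember correctly, it matches the default quorum-journal write timeout (dfs.qjournal.write-txns.timeout.ms = 20000 ms), after which the namenode aborts rather than keep waiting, so it may be worth confirming whether we override that value anywhere. A sketch, assuming the usual config path:

  # Check whether dfs.qjournal.write-txns.timeout.ms is set explicitly (default is 20000 ms).
  grep -A1 'dfs.qjournal.write-txns.timeout.ms' /etc/hadoop/conf/hdfs-site.xml \
    || echo "not set explicitly; the 20000 ms default applies"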

There was some GC activity correlated with this too, but only on the order of milliseconds, so it still doesn't fully explain the delay.

The only way forward that I can think of at the moment is to schedule some downtime for krb1001 and either:

  • powering it off for a while

or

  • stopping the krb5-kdc service

This would force the Kerberos clients to use the standby server and might surface any problems with our failover configuration while the Hadoop cluster is under load.
@MoritzMuehlenhoff, do you think that this option is worth exploring further?

Sounds worth a try. I don't believe we need to actually shut the server down; we can just stop/mask the KDC and kadmin services for a test.

OK, I will schedule some time to do that next week.

I haven't been able to find time to work on this so far, given a number of other incidents we observed this week and the overall workload. Hopefully we will have a quieter week soon and I will be able to give it enough focus to run the test.

I've scheduled some time to work on this on Monday 16th of October at approximately 09:00 UTC.
Is that OK with you @MoritzMuehlenhoff?

My plan is to:

  • Log what I'm doing in SAL and #wikimedia-analytics
  • Disable puppet on krb1001
  • systemctl stop krb5-kdc.service krb5-admin-server.service
  • Watch the namenode logs on an-master100[1-2]
  • Watch the KDC logs etc.
  • Check for slowdowns or issues elsewhere, e.g. Presto, Superset, Hive
  • Restart the services when finished
  • Re-enable puppet
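
Roughly the commands I have in mind, as a sketch only (I'd use whatever puppet disable wrapper we normally prefer if it differs from the raw agent flag):

  # On krb1001:
  sudo puppet agent --disable "T346135: Kerberos failover test"
  sudo systemctl stop krb5-kdc.service krb5-admin-server.service

  # On an-master1001 (and the an-master1002 equivalent), in separate terminals:
  tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log

  # When the test window is over, back on krb1001:
  sudo systemctl start krb5-kdc.service krb5-admin-server.service
  sudo puppet agent --enable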

Is this OK as far as you are concerned? Is there anything outside of the Hadoop/Analytics space that I need to keep a careful watch on while this is going on?

I'm off next week, can we do it a week later?

Yep, sure thing. I'll pencil it in for 09:00 UTC on Monday October 23rd.

Mentioned in SAL (#wikimedia-analytics) [2023-10-23T09:09:02Z] <btullis> disabling puppet on krb1001 for T346135

Mentioned in SAL (#wikimedia-analytics) [2023-10-23T09:47:39Z] <btullis> restarting krb5-kdc.service and krb5-admin-server.service on krb1001 and re-enabling puppet for T346135

BTullis moved this task from In Progress to Done on the Data-Platform-SRE board.

We stopped the krb5-kdc and krb5-admin-server services on krb1001 in order to monitor for any unusual behaviour from the HDFS namenodes, or from any other Kerberos-enabled services.

The cluster ran this way for about 40 minutes, during a relatively busy period. We also stressed Presto by loading lots of Superset dashboards.

Everything appeared to function exactly as intended, with the secondary KDC (krb2002) picking up all of the load seamlessly. I checked the log file /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log to see whether we got any similar journalnode timeouts, or anything like that, but there was nothing.
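
The check was along these lines (the search terms are my guesses at the relevant messages rather than exact log strings):

  # Look for non-INFO lines mentioning quorum/journal problems during the test window.
  grep -v ' INFO ' /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log \
    | grep -iE 'quorum|timed out|journal' \
    | grep '^2023-10-23 09:'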

That's an extremely successful test of the failover capabilities and very reassuring, but it doesn't help us to find the root cause of this particular namenode failure.
I think that we will just have to resolve this ticket and be on the lookout for any similar behaviour in future.