On the morning of 12/09/2023 we experienced a failure of the Hadoop namenode on an-master1001.
The standby namenode an-master1002 took over gracfeully, so the HDFS services weren't adversely affected, but we need to investigate why the failure occurred.
The incident happened at the same time as a scheduled reboot of krb1001, which is our primary Kerberos KDC.
We have a standby KDC in krb2002, so we need to understand why the namenode services didn't fail over transparently to this standby KDC.
It could be related to the journalnode configuration.