Incident report: https://wikitech.wikimedia.org/wiki/Incidents/2022-05-31_Analytics_Data_Lake_-_Hadoop_Namenode_failure
WIP
Action Items:
- [] Make sure old journalnode edits files are cleaned up properly now that the namenodes are back online and saving fs image snapshots.
- [] Reduce `profile::hadoop::backup::namenode::fsimage_retention_days`, 20 is too many
- [] Possibly separate image backup storage from namenode data storage partitions
- [] `hdfs dfsadmin -fetchImage` should have kept failing and not recovered.
- [] Gobblin did not fail with proper error codes while the NameNodes were offline
- [] Make sure journalnodes alert sooner when the journalnode disk partition is filling up
- [] Check that bacula backups of fs image snapshots are available and usable
- [] Check that disk space alerting is correct on an-master hosts, since we do not seem to have been alerted when `/srv/` became full on an-master1002
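
To help reason about the retention item above, here is a rough sketch of how a retention window in days decides which backup files become eligible for deletion. It is not the actual cleanup job behind `profile::hadoop::backup::namenode::fsimage_retention_days`. The function name `expired_backups`, the directory layout, and the use of file mtime as the backup age are all assumptions made for illustration.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def expired_backups(backup_dir, retention_days, now=None):
    """Return files in backup_dir older than retention_days, sorted by path.

    Hypothetical helper: age is taken from each file's mtime, which is an
    assumption and not necessarily how the real backup job decides.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * SECONDS_PER_DAY
    return sorted(
        p
        for p in Path(backup_dir).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def dry_run(backup_dir, retention_days):
    """Print what would be removed and how much space it would free."""
    expired = expired_backups(backup_dir, retention_days)
    freed = sum(p.stat().st_size for p in expired)
    for p in expired:
        print(f"would remove {p}")
    print(f"{len(expired)} files, {freed} bytes would be freed")
```

Running `dry_run` against a copy of the backup directory with a few candidate values (the directory path is whatever the backup profile actually writes to, which is not stated here) gives a quick view of how much space each retention setting would reclaim before the parameter is changed.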