Incident report: https://wikitech.wikimedia.org/wiki/Incidents/2022-05-31_Analytics_Data_Lake_-_Hadoop_Namenode_failure
WIP
Action Items:
- Make sure old journalnode edits files are cleaned up properly now that namenodes are back online and saving fs image snapshots.
- Reduce profile::hadoop::backup::namenode::fsimage_retention_days, 20 is too many
- Possibly separate image backup storage from namenode data storage partitions (won't do yet)
- hdfs dfsadmin -fetchImage should have kept failing and not recovered.
- Gobblin did not fail with proper error codes while NameNodes were offline (covered by other alerts)
- Make sure journalnodes alert sooner about disk usage on the journalnode partition
- Check that bacula backups of fs image snapshots are available and usable
- Check that the alerting for disk space is correct on an-master hosts - since we seem not to have been alerted to /srv/ becoming full on an-master1002
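
As a rough illustration of the disk-space checks mentioned in the items above, here is a minimal sketch of a threshold check on a partition. It is not the alerting actually deployed on these hosts; the function names `classify` and `disk_usage_status` and the 80%/90% thresholds are hypothetical and chosen only for the example.

```python
import shutil


def classify(used, total, warn_pct=80.0, crit_pct=90.0):
    """Return (status, percent_used) for the given byte counts.

    The thresholds are illustrative assumptions, not production values.
    """
    pct = 100.0 * used / total
    if pct >= crit_pct:
        return "CRITICAL", pct
    if pct >= warn_pct:
        return "WARNING", pct
    return "OK", pct


def disk_usage_status(path, warn_pct=80.0, crit_pct=90.0):
    """Check how full the filesystem containing `path` is."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return classify(usage.used, usage.total, warn_pct, crit_pct)


if __name__ == "__main__":
    # "/srv" is the partition named in the action item; adjust as needed.
    status, pct = disk_usage_status("/srv")
    print(f"/srv is {pct:.1f}% full: {status}")
```

Running a check like this by hand on a host is one way to compare what the monitoring reports against the actual state of the partition.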
