Change Details

Incident report (WIP): https://wikitech.wikimedia.org/wiki/Incidents/2022-05-31_Analytics_Data_Lake_-_Hadoop_Namenode_failure

Action Items:
- [x] Make sure old journalnode edits files are cleaned up properly now that the namenodes are back online and saving fs image snapshots.
- [x] Reduce `profile::hadoop::backup::namenode::fsimage_retention_days`; 20 is too many.
- [ ] Possibly separate image backup storage from namenode data storage partitions. **won't do yet**
- [x] `hdfs dfsadmin -fetchImage` should have kept failing and not recovered.
- [ ] gobblin did not fail with proper error codes while the NameNodes were offline. **covered by other alerts**
- [x] Make sure journalnodes alert sooner about the journalnode disk partition.
- [x] Check that bacula backups of fs image snapshots are available and usable.
- [x] Check that the disk space alerting is correct on an-master hosts, since we seem not to have been alerted to `/srv/` becoming full on an-master1002.