Backup freshness, successful execution, and recovery testing are not properly monitored. Evaluate what kind of monitoring is necessary and set up some short-term actionables to get proper general infrastructure backup monitoring, so that alerts fire when there is an anomalous state.
The deliverables are vague on purpose: as part of this ticket, there will first be some work on understanding what the bare minimum setup is, how it should work, and who it should alert regarding the backup and restore service. Potential candidates:
- Alert on failure to take a backup
- Alert on running out of space or on a reduction of the retention period
- Alert on backup staleness (latest successful backup older than X days)
- Some kind of dashboard/graph to understand the disk utilization for each project
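
As an illustration of the staleness candidate above, here is a minimal sketch of the check itself. The function name `backup_is_stale`, its parameters, and the idea that the timestamp of the latest successful backup is available from some metadata source are all assumptions made for this example, not a description of the existing backup tooling; the threshold X is left as a parameter.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_stale(
    latest_success: Optional[datetime],
    now: datetime,
    max_age: timedelta,
) -> bool:
    """Return True if the latest successful backup is older than max_age.

    Hypothetical helper: latest_success is assumed to come from whatever
    metadata store records backup runs. None means that no successful
    backup is known at all, which is treated as stale.
    """
    if latest_success is None:
        return True
    return now - latest_success > max_age


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Example threshold only; the real value of X is still to be decided.
    threshold = timedelta(days=3)
    print(backup_is_stale(now - timedelta(days=5), now, threshold))
    print(backup_is_stale(now - timedelta(hours=12), now, threshold))
```

Treating a missing timestamp as stale is a deliberate choice in this sketch: a project that has never had a successful backup should alert rather than pass silently.
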
CC @fgiunchedi, not because we will ask him for help setting this up, but because we may ask for advice on the monitoring philosophy part.