Set up some kind of automation that checks backups periodically (e.g. every week), detects ongoing or finished backups that ran for longer than a configurable amount of time (e.g. more than 12 hours), and sends an email listing them, similar to how data checks work.
This will serve two purposes:
- Detect processes "stuck" for some unknown reason (this was already covered by the backup monitoring, but it is also useful on its own)
- Give some hints about when ES clusters should be split to reduce the backup time and, more importantly, the recovery time
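
As a starting point for discussion, the check could look roughly like the sketch below. It is only an illustration: the `BackupRun` record, its fields, and the function names are hypothetical and not taken from the existing backup tooling, and the real backup metadata would have to be loaded from wherever it is stored. The sketch only builds the email message; sending it and scheduling the weekly run (e.g. from cron or a timer) are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from email.message import EmailMessage
from typing import List, Optional


@dataclass
class BackupRun:
    # Hypothetical record; the real backup metadata schema may differ.
    section: str
    start: datetime
    end: Optional[datetime]  # None while the backup is still ongoing


def duration(run: BackupRun, now: datetime) -> timedelta:
    # Ongoing backups are measured up to the time of the check.
    return (run.end or now) - run.start


def find_long_backups(
    runs: List[BackupRun], max_duration: timedelta, now: datetime
) -> List[BackupRun]:
    # Keep ongoing and finished runs that exceed the configurable threshold.
    return [r for r in runs if duration(r, now) > max_duration]


def build_report(
    runs: List[BackupRun],
    max_duration: timedelta,
    now: datetime,
    recipient: str,
) -> Optional[EmailMessage]:
    # Return an email listing the slow backups, or None if there are none.
    slow = find_long_backups(runs, max_duration, now)
    if not slow:
        return None
    lines = []
    for r in slow:
        state = "finished" if r.end else "ongoing"
        hours = duration(r, now).total_seconds() / 3600
        lines.append(f"{r.section}: {state}, {hours:.1f} h")
    limit_hours = max_duration.total_seconds() / 3600
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = f"{len(slow)} backup(s) ran longer than {limit_hours:.0f} h"
    msg.set_content("\n".join(lines))
    return msg
```

Keeping the threshold as a parameter makes it possible to use different limits per backup type, and reporting the measured duration for each section gives the data needed to judge when a cluster is getting too large.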