With the migration to MW-on-K8s and kubernetes CronJobs, we have to find a way to alert correctly when a job fails.
On the kubernetes side:
kube-state-metrics exports the kube_job_failed metric, so we can create a Prometheus alert that fires when a job fails.
This metric is at the Job level, not the CronJob level, which means we can't aggregate on the number of failures of the CronJob itself. It would be nice to add the following Kubernetes labels to the information gathered by kube-state-metrics: script, cronjob and team. We should also pass the comment annotation down to the Job (it should probably be a label instead).
Then we would need to aggregate failures on the cronjob label.
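As a rough illustration of that aggregation, here is a minimal Python sketch. It assumes the kube_job_failed samples and the per-Job labels have already been fetched; the input shapes (a list of (job_name, value) pairs and a dict of labels keyed by Job name) and the function name are hypothetical, chosen only for the example.

```python
from collections import Counter


def failures_per_cronjob(failed_samples, job_labels):
    """Count failed Jobs per CronJob.

    failed_samples: iterable of (job_name, value) pairs, where value is 1
        when the Job is reported as failed (hypothetical input shape).
    job_labels: dict mapping job_name to that Job's labels (hypothetical).
    """
    counts = Counter()
    for job_name, value in failed_samples:
        if value != 1:
            continue  # only count Jobs currently reported as failed
        # Jobs without a cronjob label are grouped under "unknown".
        cronjob = job_labels.get(job_name, {}).get("cronjob", "unknown")
        counts[cronjob] += 1
    return dict(counts)


if __name__ == "__main__":
    samples = [("foo-1", 1), ("foo-2", 1), ("bar-1", 0)]
    labels = {
        "foo-1": {"cronjob": "foo"},
        "foo-2": {"cronjob": "foo"},
        "bar-1": {"cronjob": "bar"},
    }
    print(failures_per_cronjob(samples, labels))
```

In practice the same grouping would more likely be expressed in the alert rule itself; the sketch only shows why the cronjob label has to be present on every Job.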
The team label would ideally be a Phabricator project tag, which alertmanager can then use to open a task in the right project when a CronJob fails.
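To make the routing idea concrete, here is a small Python sketch that groups firing alerts by their team label. It assumes a payload shaped like an alertmanager webhook notification (an "alerts" list whose entries carry "status" and "labels"). The function name and the fallback project are hypothetical, and actually filing the task is left out.

```python
def cronjobs_by_project(webhook_payload, default_project="unassigned"):
    """Group failing CronJobs by the project named in the team label.

    default_project is a hypothetical fallback for alerts without a team label.
    """
    grouped = {}
    for alert in webhook_payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts should not open new tasks
        labels = alert.get("labels", {})
        project = labels.get("team", default_project)
        grouped.setdefault(project, []).append(labels.get("cronjob", "unknown"))
    return grouped


if __name__ == "__main__":
    payload = {
        "alerts": [
            {"status": "firing", "labels": {"team": "sre", "cronjob": "foo"}},
            {"status": "resolved", "labels": {"team": "sre", "cronjob": "bar"}},
        ]
    }
    print(cronjobs_by_project(payload))
```

Each key of the result would correspond to one project tag, with the list holding the CronJobs to mention in that project's task.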