We have fairly regular upstream failures in our image suggestions pipelines (some dataset we depend on not getting generated). Sometimes something will also stop working because some part of the infrastructure has changed (e.g. recently we had to switch our data checking script to get kicked off via skein rather than via spark).
So far our alerting has worked ok: we've been notified when something is wrong, and we have some instructions for what to do in case of a failure.
If, however, no engineer is available to act on a failure quickly (i.e. before the next pipeline run), it's unclear what should happen. This ticket is to give us a place to make and record decisions about that.