@dpifke and I accidentally broke performance.wikimedia.org for ~30 minutes, and no alarms fired besides one for the apache2 systemd unit having failed. I would expect us to have HTTP checks verifying that the various sites are working properly.
We have Icinga checks for most of the backends (XHGui, ArcLamp), but not for Apache on webperf1001 itself.
Ideally, we would monitor error rates (possibly at the Varnish layer, as it tries to reach webperf1001) and not just up/down status. For most of the downtime, the site was returning 500 errors when serving SVGs from Swift due to certificate errors; a simple check of a static Apache URL would not have detected this.
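As a minimal sketch of the kind of content-level check that would have caught this: an Icinga/Nagios-style plugin that fetches a URL exercising the full backend path (an asset served via Swift) rather than a static Apache page, and returns CRITICAL on a 5xx. The URL and thresholds here are hypothetical, not the actual check definition.

```python
"""Sketch of a Nagios/Icinga-style content check (hypothetical, not the
real plugin). Fetching an asset that goes through the whole backend path
means 5xx errors like the ones during this outage are detected, unlike a
check that only confirms Apache can serve a static page."""
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL = 0, 1, 2


def classify(status: int) -> int:
    """Map an HTTP status code to a Nagios exit code."""
    if 200 <= status < 300:
        return OK
    if 300 <= status < 500:
        return WARNING  # redirects/client errors: suspicious, not down
    return CRITICAL     # 5xx: the backend path is failing


def check(url: str, timeout: float = 10.0) -> int:
    """Fetch url and return a Nagios exit code for its outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as e:
        return classify(e.code)  # urllib raises on 4xx/5xx
    except OSError:
        return CRITICAL  # connection refused, timeout, TLS failure, ...
```

A wrapper script would call `check()` on a hypothetical URL such as an SVG under performance.wikimedia.org and `sys.exit()` with the result, so Icinga picks up the state from the exit code.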
In https://gerrit.wikimedia.org/r/c/operations/puppet/+/608973, I had hoped that the Prometheus Apache exporter would give us cumulative error counts to alert on, but it only reports instantaneous status from mod_status, such as the number of workers.