Service_checker does a good job of verifying the functionality of a service on a single machine. During the restbase/parsoid/ve outage we received no pages, because we have no functional test of the cluster as a whole and we (rightly) don't page on single-machine outages.
Ideas on how to implement functional testing:
- Create a cluster check in Icinga (using check_cluster, https://www.monitoring-plugins.org/doc/man/check_cluster.html) that aggregates the per-host service_checker status
- Run service_checker (or a reduced version, like performing the first GET in the spec) on the LVS IP
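A rough sketch of the second idea (the "reduced version"), to make the discussion concrete. This is not service_checker's actual code. It assumes the service exposes a Swagger/OpenAPI-style spec that has already been loaded into a dict, it skips templated paths instead of filling them in from examples, and it treats anything other than HTTP 200 as a failure. The function names and the `base_url` argument (standing in for the LVS IP) are hypothetical.

```python
import urllib.request

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def first_get_path(spec):
    """Return the first non-templated path in the spec that defines a GET."""
    for path, operations in spec.get("paths", {}).items():
        # Assumption: skip paths such as /{domain}/v1/... because this
        # sketch does not substitute example parameters.
        if "get" in operations and "{" not in path:
            return path
    return None


def check_endpoint(base_url, spec, timeout=5):
    """Perform one GET against base_url and return (exit_code, message)."""
    path = first_get_path(spec)
    if path is None:
        return UNKNOWN, "UNKNOWN: spec has no usable GET path"
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except Exception as exc:  # timeouts, connection errors, HTTP 4xx/5xx
        return CRITICAL, f"CRITICAL: {url} failed: {exc}"
    if status == 200:
        return OK, f"OK: {url} returned 200"
    return CRITICAL, f"CRITICAL: {url} returned {status}"
```

An Icinga command wrapper would call `check_endpoint` with the LVS address and the service's spec, print the message, and exit with the returned code.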
Incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160505-ChangeProp_RESTBase_Parsoid