Create functional cluster checks for all services (and have them page!)
Open, Medium, Public

Description

Service_checker does a good job of verifying the functionality of a service on a single machine. During the restbase/parsoid/ve outage we received no pages, because we have no functional test of the cluster as a whole, and we don't page on single-machine outages (rightly so).

Ideas on how to implement functional testing:

https://wikitech.wikimedia.org/wiki/Incident_documentation/20160505-ChangeProp_RESTBase_Parsoid

Event Timeline

Change 287907 had a related patch set uploaded (by Giuseppe Lavagetto):
nagios_common: Add command for using service_checker

https://gerrit.wikimedia.org/r/287907

Change 287908 had a related patch set uploaded (by Giuseppe Lavagetto):
monitoring: use service_checker for mobileapps LVS

https://gerrit.wikimedia.org/r/287908

Change 287907 merged by Giuseppe Lavagetto:
nagios_common: Add command for using service_checker

https://gerrit.wikimedia.org/r/287907

Change 288589 had a related patch set uploaded (by Giuseppe Lavagetto):
mobileapps: add experimental cluster check

https://gerrit.wikimedia.org/r/288589

Change 288589 merged by Giuseppe Lavagetto:
mobileapps: add experimental cluster check

https://gerrit.wikimedia.org/r/288589

Mobileapps is now monitored via service_checker on the LVS IPs, and will send emails to the service team and alert on IRC in #-operations

Change 289151 had a related patch set uploaded (by Giuseppe Lavagetto):
lvs::monitor: monitor all services via service_checker

https://gerrit.wikimedia.org/r/289151

Change 289151 merged by Giuseppe Lavagetto:
lvs::monitor: monitor all services via service_checker

https://gerrit.wikimedia.org/r/289151

All active services are now monitored the same way as mobileapps; in a week we can see how many emails the services team received and decide whether it's OK to page for these.

I think we can consider this resolved now?

I have not gotten any service-global alerts so far, and would expect them to be very rare in any case. I do get occasional alerts for individual service nodes, typically connected to ongoing work.

Given the low probability & severity of a service-wide outage, I think we should make those alerts paging. Per-node alerts are more frequent and more of an early warning sign, and do not warrant paging in my opinion.
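To make the proposed policy concrete, here is a rough sketch of the routing rule: cluster-wide failures page, per-node failures only go to email and IRC. The `Alert` type, the `scope` values, and the channel names are all hypothetical and only illustrate the idea; none of this reflects how the actual monitoring configuration is written.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record, used only for this illustration."""
    service: str
    scope: str   # assumed values: "cluster" (check on the LVS IP) or "node"
    ok: bool     # True when the check passed


def notification_channels(alert: Alert) -> List[str]:
    """Return the channels to notify for an alert under the proposed policy."""
    if alert.ok:
        # A passing check notifies nobody.
        return []
    if alert.scope == "cluster":
        # Service-wide failures are rare and severe, so they page.
        return ["page", "email", "irc"]
    # Per-node failures are an early warning sign: no paging.
    return ["email", "irc"]


if __name__ == "__main__":
    print(notification_channels(Alert("mobileapps", "cluster", ok=False)))
    print(notification_channels(Alert("mobileapps", "node", ok=False)))
```

The point of keeping the rule this small is that the paging decision depends only on the scope of the check, not on which service failed.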

We have recently seen some service-wide alerts after deployment issues, and they were all accurate. @Joe, I think we can go ahead & make those paging.

This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If it is still valid, please prioritize it appropriately relative to your other work. If you have any questions, feel free to ask me (Greg Grossmeier).

Joe removed Joe as the assignee of this task. Oct 5 2016, 7:54 AM
Joe added a project: User-Joe.

@Joe, has this been resolved by the external checks (including Varnish) that you set up recently?

Change 287908 abandoned by Giuseppe Lavagetto:
monitoring: use service_checker for mobileapps LVS

https://gerrit.wikimedia.org/r/287908