
add icinga and watchmouse https checks for content on commons or other wikimedia.org sites
Closed, ResolvedPublic

Description

During today's outage (T124804) we did not get any Icinga alerts, and nothing changed on http://status.wikimedia.org, because only wikis under .wikimedia.org were affected and we were still serving the generic portal page.

Because we check only the HTTP status and not for any specific content, this went unnoticed by monitoring.

Can we check for specific strings that never change under normal circumstances but are specific to a project and would have caught this?

Event Timeline

Dzahn created this task.Jan 26 2016, 7:14 PM
Dzahn raised the priority of this task from to Needs Triage.
Dzahn updated the task description.
Dzahn added projects: Operations, observability.
Dzahn added a subscriber: Dzahn.
Restricted Application added subscribers: StudiesWorld, Steinsplitter, Aklapper.Jan 26 2016, 7:14 PM
Pine awarded a token.Jan 26 2016, 7:15 PM
Dzahn updated the task description.Jan 26 2016, 7:18 PM
Dzahn set Security to None.

Checking for specific strings would make sense - standard HTTP tokens or headers perhaps? But beyond that, the user expectation of status.wikimedia.org is that it will report the status of the Wikimedia projects, regardless of whether that is automatically or manually detected. If it's not already possible for operations staff to manually trigger a service outage report on status.wikimedia.org (maybe through a big red button in the WMF offices? ;-) ), then that would be a good thing to add.

Additionally, are there presubmit/integration checks that would have caught this? The builds looked green on push.

jayvdb updated the task description.Jan 26 2016, 7:39 PM
jayvdb added a subscriber: jayvdb.
Dzahn added a comment.Jan 26 2016, 7:40 PM

There is a script called apache-fast-test (modules/apache/files/apache-fast-test), but it is not run automatically by integration. It relies on a human creating a file with URLs to test. There are also T72068 and T45266.

KTC added a subscriber: KTC.Jan 26 2016, 9:35 PM
Dzahn added a comment.Mar 11 2016, 9:54 PM

How about checking for "Picture of the day" on the Main_Page of Commons?

Andrew triaged this task as High priority.Apr 14 2016, 7:52 PM

Change 290606 had a related patch set uploaded (by Dzahn):
add icinga monitoring for content on commons

https://gerrit.wikimedia.org/r/290606

Change 290606 merged by Dzahn:
add icinga monitoring for content on commons

https://gerrit.wikimedia.org/r/290606

Dzahn added a comment.May 25 2016, 1:13 AM

So the Icinga part is there now. What I don't know is: should it be paging now? The remaining to-do is watchmouse.

Dzahn added a comment.May 27 2016, 8:14 PM

I added the same type of check to "watchmouse" too:

http://status.wikimedia.org/8777/438553/

It's an HTTPS check on https://commons.wikimedia.org/wiki/Main_Page that additionally checks for the string "Picture of the Day", and it's green. All the settings are the same as for the other checks in core services.

It shows up as "commons.wikimedia.org main page content" on http://status.wikimedia.org/ right now, but once refreshed it will be "https content - commons" for more consistent naming.
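For illustration only, the kind of check discussed here (HTTP status plus a required string in the body) could be sketched in Python as below. This is not the merged Icinga patch or the watchmouse configuration; the function names `evaluate` and `check_url`, the User-Agent value, and the timeout are assumptions made for the sketch. The exit codes follow the usual Nagios/Icinga plugin convention (0 = OK, 2 = CRITICAL), and the string match is case-sensitive.

```python
import urllib.error
import urllib.request

OK, CRITICAL = 0, 2  # conventional Nagios/Icinga plugin exit codes


def evaluate(status, body, expected):
    """Return (exit_code, message) for an already-fetched response."""
    if status != 200:
        return CRITICAL, f"CRITICAL - HTTP status {status}"
    if expected not in body:
        # This is the case a status-only check misses: a 200 response
        # that serves the wrong page (e.g. a generic portal page).
        return CRITICAL, f"CRITICAL - string '{expected}' not found"
    return OK, f"OK - HTTP 200 and string '{expected}' found"


def check_url(url, expected, timeout=10):
    """Fetch url and evaluate it; any request failure counts as CRITICAL."""
    # Hypothetical User-Agent, chosen only for this sketch.
    request = urllib.request.Request(
        url, headers={"User-Agent": "content-check-sketch/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate(resp.status, body, expected)
    except urllib.error.HTTPError as exc:
        # HTTPError must be caught before URLError, its parent class.
        return CRITICAL, f"CRITICAL - HTTP status {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        return CRITICAL, f"CRITICAL - request failed: {exc}"
```

Calling `check_url("https://commons.wikimedia.org/wiki/Main_Page", "Picture of the day")` would return the exit code and status line for the Commons main page. Keeping `evaluate` separate from the fetch makes the decision logic testable without network access.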

Change 291347 had a related patch set uploaded (by Dzahn):
icinga: make commons content check critical (paging)

https://gerrit.wikimedia.org/r/291347

Change 291347 merged by Dzahn:
icinga: make commons content check critical (paging)

https://gerrit.wikimedia.org/r/291347

Dzahn closed this task as Resolved.May 27 2016, 8:58 PM
Dzahn claimed this task.
Dzahn removed a project: Patch-For-Review.