
add icinga and watchmouse https checks for content on commons or other sites
Closed, ResolvedPublic


During today's outage (T124804) we did not get any Icinga alerts, and nothing changed on because only wikis in were affected and we were still serving the generic portal page.

Since we check the HTTP status but do not check for any specific content, this went unnoticed by monitoring.

Can we check for specific strings that never change under normal circumstances but are specific to a project and would have caught this?

Event Timeline

Dzahn raised the priority of this task from to Needs Triage.
Dzahn updated the task description. (Show Details)
Dzahn added projects: SRE, observability.
Dzahn subscribed.
Dzahn set Security to None.

Checking for specific strings would make sense - standard HTTP tokens or headers perhaps? But beyond that, the user expectation of is that it will report the status of the Wikimedia projects, regardless of whether that is automatically or manually detected. If it's not already possible for operations staff to manually trigger a service outage report on (maybe through a big red button in the WMF offices? ;-) ), then that would be a good thing to add.

Additionally, are there presubmit/integration checks that would have caught this? The builds looked green on push.

There is a script called apache-fast-test (modules/apache/files/apache-fast-test), but it's not run automatically by integration. It relies on a human creating a file with URLs to test. There is also T72068 and T45266.

How about checking for "Picture of the day" on the Main_Page of commons?
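A rough sketch of what such a content check could look like, in Python. This is not the Icinga change that was later merged, only an illustration of the idea: an HTTP 200 alone is not enough, the body also has to contain the expected string. The function names `evaluate` and `check_url` are made up for this example, and the exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import urllib.error
import urllib.request

# Conventional Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def evaluate(status, body, expected):
    """Return (exit_code, message) for an already fetched response.

    Hypothetical helper: a 200 without the expected string is CRITICAL,
    which is the case a status-only check misses.
    """
    if status != 200:
        return CRITICAL, f"CRITICAL: HTTP {status}"
    if expected not in body:  # case-sensitive substring match
        return CRITICAL, f"CRITICAL: HTTP 200 but {expected!r} not found"
    return OK, f"OK: HTTP 200 and {expected!r} found"


def check_url(url, expected, timeout=10):
    """Fetch url over HTTP(S) and evaluate it; any fetch error is CRITICAL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate(resp.status, body, expected)
    except urllib.error.HTTPError as err:
        # Raised by urlopen for 4xx/5xx responses.
        return CRITICAL, f"CRITICAL: HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, TLS error, timeout, ...
        return CRITICAL, f"CRITICAL: {err}"
```

Something like `check_url("https://commons.wikimedia.org/wiki/Main_Page", "Picture of the day")` would then go CRITICAL if the generic portal page were served with a 200 instead of the Commons main page. The match is case-sensitive, so the exact capitalization of the string matters.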

Andrew triaged this task as High priority. Apr 14 2016, 7:52 PM

Change 290606 had a related patch set uploaded (by Dzahn):
add icinga monitoring for content on commons

Change 290606 merged by Dzahn:
add icinga monitoring for content on commons

So the Icinga part is there now. What I don't know is: should that be paging now? The remaining todo is watchmouse.

I added the same type of check to "watchmouse" too:

It's an HTTPS check on but it additionally checks for the string "Picture of the Day", and it's green. All the settings are like the other things in core services.

It shows up as " main page content" on right now, but once refreshed it will be "https content - commons" for more consistent naming.

Change 291347 had a related patch set uploaded (by Dzahn):
icinga: make commons content check critical (paging)

Change 291347 merged by Dzahn:
icinga: make commons content check critical (paging)

Dzahn claimed this task.
Dzahn removed a project: Patch-For-Review.