Setup (simple) catchpoint monitoring and metrics for enwiki betacluster just like production
Closed, Declined · Public

Description

Right now, availability metrics for the beta cluster come from a diamond collector that shells out to varnishtop to see 5xx rates.

I think having a simple Catchpoint web check (super cheap!) hit enwiki to see if it's listening would be another good step. In an ideal world we'd duplicate all the prod checks for beta, but in a less ideal one we could set up *some* cheap checks so we have more reliable metrics to work with.

So I'd say, to start with, just a simple check for enwiki on betacluster's main page? Ideally we'd do one with debug=false, and then check for a string to be present in there (this is what shinken does atm).
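
For illustration only, here is a rough Python sketch of what such a check does: fetch the page, require HTTP 200, and look for a known string in the body. This is not how Catchpoint or shinken implement it. The URL scheme and path, the expected string, and the names `check_page` and `fetch` are all assumptions made up for this example; only the hostname and the `debug=false` idea come from this task.

```python
import http.client
from typing import Callable, Tuple
from urllib.request import urlopen

# Assumed values for illustration; not copied from any existing check.
BETA_MAIN_PAGE = "http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page?debug=false"
EXPECTED_TEXT = "Wikipedia"


def fetch(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Return (HTTP status, decoded body) for the given URL."""
    with urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body


def check_page(
    url: str,
    expected: str,
    fetcher: Callable[[str], Tuple[int, str]] = fetch,
) -> bool:
    """Return True if the page answers 200 and contains the expected string.

    Network failures, timeouts and HTTP error responses all count as a
    failed check rather than raising.
    """
    try:
        status, body = fetcher(url)
    except (OSError, http.client.HTTPException):
        return False
    return status == 200 and expected in body
```

Calling `check_page(BETA_MAIN_PAGE, EXPECTED_TEXT)` on a schedule and recording the boolean result would give a basic up/down signal. The `fetcher` parameter is there only so the logic can be exercised without touching the network.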

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description.
yuvipanda added a subscriber: yuvipanda.
Restricted Application added a subscriber: Aklapper. · May 1 2015, 10:17 PM
greg set Security to None. · May 1 2015, 10:22 PM
greg added a subscriber: mmodell.
hashar added a subscriber: hashar.

@yuvipanda can you handle replicating one of the Catchpoint probes to hit en.wikipedia.beta.wmflabs.org? Whatever is done for the production enwiki would be a good first step.

hashar renamed this task from Setup (simple) catchpoint monitoring for betacluster to Setup (simple) catchpoint monitoring for enwiki betacluster just like production. · Jun 22 2015, 7:43 PM
hashar moved this task from To Triage to Externally Blocked on the Beta-Cluster-Infrastructure board.

Poked our internal ops mailing list.

Restricted Application added subscribers: Luke081515, Matanya. · Jul 20 2015, 7:36 PM
greg renamed this task from Setup (simple) catchpoint monitoring for enwiki betacluster just like production to Setup (simple) catchpoint monitoring and metrics for enwiki betacluster just like production. · Jul 21 2015, 3:52 PM

We talked about this on ops list:
https://lists.wikimedia.org/mailman/private/ops/2015-July/049244.html

And during our RelEng weekly meeting:
https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Checkin_archive/20150721#Beta_Cluster

Quote:

I just created T106421 to track that. If we can get to this (not yet prioritized etc) item before this is closed, we're good.

hashar closed this task as Declined. · Jul 22 2015, 8:18 AM
hashar claimed this task.

From a reply I made to ops-l:

I thought Catchpoint was super cheap. At our RelEng weekly meeting yesterday, we agreed to dismiss it and brew our own solution.

We will probably go with a small Selenium-based smoke test that exercises the beta cluster every x minutes, and have Jenkins build the graphs and send notifications. We already have all we need; it just needs a bit of gluing.

Minutes from our RelEng weekly meeting:

So Catchpoint comes with a cost and we have all the bricks to fulfill our needs.

Restricted Application added a subscriber: StudiesWorld. · Jan 11 2016, 10:57 PM