
[Quarterly Success Metric] Stable uptime metrics of the Staging cluster
Closed, Resolved, Public

Description

This is now live on the beta cluster, and it will be available on staging as soon as the staging-cache-* machines come online.

You can check it out on the wmflabs.org graphite dashboard.

Individual Graphs:

So far we are averaging about 99.5% availability, and this counts 404s as non-availability, so that is fairly good IMO.
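For anyone checking the arithmetic, here is a minimal sketch of how a figure like that could be derived from per-status request counts. The `availability` helper and the sample counts are hypothetical, and treating every 4xx/5xx (not only 404) as non-availability is an assumption on my part, not necessarily what the dashboard does.

```python
def availability(status_counts):
    """Return the fraction of requests counted as available.

    Assumption: 2xx and 3xx responses count as available; every 4xx
    (including 404) and 5xx response counts as non-availability.
    """
    total = sum(status_counts.values())
    if total == 0:
        return None  # no requests at all, so the ratio is undefined
    ok = sum(n for code, n in status_counts.items() if 200 <= code < 400)
    return ok / total


# Hypothetical counts, chosen only to illustrate the calculation.
sample = {200: 9800, 304: 150, 404: 30, 503: 20}
print(availability(sample))  # 9950 of 10000 requests -> 0.995
```

Returning `None` for zero traffic keeps "no requests" distinct from "0% available", which matters for the gap discussion further down.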

Event Timeline

greg raised the priority of this task to Medium.
greg updated the task description. (Show Details)
greg added a project: Staging.
greg added subscribers: greg, yuvipanda, mmodell.
greg renamed this task from [Quarterly Success Metric] - Stable uptime metrics of the Staging cluster to [Quarterly Success Metric] Stable uptime metrics of the Staging cluster. Feb 5 2015, 7:07 PM
greg set Security to None.

The success:error ratio, along with the volume of requests and the average response times, would be really good metrics for the health of the staging cluster.

Perhaps, but I'll note that this also has flaws, e.g. if the nginx / varnish machines themselves are down, or DNS is down and people can't reach this, etc.

Did someone talk with @chasemp to see what he's doing wrt prod? They also want stability metrics...

@yuvipanda: yes, I talked to Chase about this a while back, but he didn't seem to think he had anything we could immediately apply to our stuff, at least not yet.

For overall availability, I would take any gaps in the graph to mean 100% errors during those times (since we weren't handling any requests, it must have been down).
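A rough sketch of that rule, assuming the series is a list of per-interval error ratios with `None` marking a gap. The `availability_with_gaps` name and the sample series are hypothetical. Note this weights every interval equally (time-weighted), not by request volume.

```python
def availability_with_gaps(error_ratios):
    """Time-weighted availability over a series of per-interval error ratios.

    Each entry is the error fraction (0.0 to 1.0) for one interval, or
    None when no data was recorded. Assumption: a gap means the service
    was down, so it is counted as 100% errors.
    """
    if not error_ratios:
        return None  # empty series, nothing to average
    filled = [1.0 if r is None else r for r in error_ratios]
    return 1.0 - sum(filled) / len(filled)


# Hypothetical series: two of five intervals have no data at all.
series = [0.0, 0.01, None, None, 0.0]
print(availability_with_gaps(series))  # roughly 0.598
```

The rule only works if a gap really arrives as `None`; an interval reported as `0.0` reads as "no errors", which is the opposite conclusion.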

So our current txstatsd implementation interprets 'no new data' to be 'repeat last data point' (yes, stupid) rather than 'no new data'. I'm sure we can overcome this properly with some diamond code, but needs to be evaluated properly.

Still seems a bit fragile. 0 vs null would be an issue, perhaps.

I do not have any easy alternatives to offer, however. My personal preference would be to just wait for Chase to do the prod stuff and just use exactly the same metrics...

> So our current txstatsd implementation interprets 'no new data' to be 'repeat last data point' (yes, stupid) rather than 'no new data'. I'm sure we can overcome this properly with some diamond code, but needs to be evaluated properly.

I think txstatsd is going to be replaced soon (T89857).

> My personal preference would be to just wait for Chase to do the prod stuff and just use exactly the same metrics...

Can't, it's way too expensive (TM). cc @chasemp

@yuvipanda: the number is cumulative, so repeating the last data point would be the same as no new data.
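To illustrate the point about cumulative numbers, a small sketch, assuming the metric is a monotonically increasing request counter sampled once per interval. The `per_interval` helper and the sample values are hypothetical, and counter resets are not handled.

```python
def per_interval(cumulative):
    """Convert a cumulative counter series into per-interval increments."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


# Hypothetical samples: the value 130 is repeated twice, as a backend
# that re-sends the last data point would do during a quiet period.
samples = [100, 130, 130, 130, 170]
print(per_interval(samples))  # [30, 0, 0, 40]
```

A repeated cumulative value yields an increment of 0, so those intervals read as "no new requests" either way. It also means that from the counter alone, a real outage cannot be told apart from a quiet period.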

I'm in ping central today, gentlemen. My 2 cents.

Yes, txstatsd is not great and will die. Yes, it seems to confuse no metrics with "submit something so we don't have to care about xfactor", which is awful. As stability or uptime metrics, this seems sane to me for staging. I would say make a diamond collector that only does a subset of metrics, to reduce load and stay labs-friendly, and manage it via puppet. The prod stuff I am doing currently is mostly unrelated, and I think three concepts are being a bit conflated: stability, availability, and reachability, i.e. is the grocery store going to collapse, is it open, and do we have a ride to get there. You could get stability and availability type things from diamond/statsd/graphite doing local polling, and have diamond submit noise values to offset txstatsd for now, though it pains me to suggest that. Then when we move away from txstatsd, fix that, or do some other thing if you want. In general, filippo is doing statsd stuff in prod and no one has a super defined grand plan yet.

Change 199302 had a related patch set uploaded (by 20after4):
VarnishStatusCollector for diamond.

https://gerrit.wikimedia.org/r/199302

Change 199302 merged by Filippo Giunchedi:
VarnishStatusCollector for diamond.

https://gerrit.wikimedia.org/r/199302

Although this isn't a perfect solution, we don't have the time to work on this any further right now, and the graphs (linked above) are a decent start. We will try to improve things a bit when we roll out the staging cluster in the coming months.

Can you update the urls (maybe in the description)? I'd love to use these for a few things coming me/our way, but they're broken now :/

OK, I updated the dashboard and improved the graphs just a little.