[Quarterly Success Metric] Stable uptime metrics of the Staging cluster
Closed, Resolved · Public

Assigned To

Authored By

	greg
	Feb 5 2015, 6:25 PM

Description

This is now live on the beta cluster, and it will be available on staging as soon as the staging-cache-* machines come online.

You can check it out on the wmflabs.org graphite dashboard.

Individual Graphs:

So far we are averaging about 99.5% availability, and that counts 404s as non-availability, so that is fairly good IMO.
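As a rough illustration of how a figure like this could be derived from per-status-code request counts, here is a minimal sketch. The rule that only 2xx and 3xx responses count as available is an assumption (the description only says that 404s count against availability), and the counts are made-up numbers, not measurements from the cluster.

```python
def availability(status_counts):
    """Return the fraction of requests counted as available.

    status_counts maps an HTTP status code to a request count.
    Assumption: 2xx and 3xx are "available"; everything else,
    including 404, counts as non-availability.
    Returns None when there were no requests at all.
    """
    total = sum(status_counts.values())
    if total == 0:
        return None
    ok = sum(n for code, n in status_counts.items() if 200 <= code < 400)
    return ok / total


# Illustrative counts only (hypothetical, not real data).
counts = {200: 9900, 304: 50, 404: 40, 503: 10}
print(f"{availability(counts):.2%}")  # prints 99.50%
```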

Details

	Subject	Repo	Branch	Lines +/-
	VarnishStatusCollector for diamond.	operations/puppet	production	+101 -0

Related Objects

Mentioned In: T97865: Setup (simple) catchpoint monitoring and metrics for enwiki betacluster just like production
rOPUPc50803130e37: VarnishStatusCollector for diamond.
T1000: Update Beta Cluster status documentation (re Q3 intradepartamental priority)
Mentioned Here: T89857: scale statsd reporting/aggregation (plan)
P357 http status code stats from varnish

Event Timeline

greg created this task. Feb 5 2015, 6:25 PM

greg raised the priority of this task to Medium.

greg updated the task description.

greg added a project: Staging.

greg added subscribers: greg, yuvipanda, • mmodell.

Restricted Application added a subscriber: Aklapper. Feb 5 2015, 6:25 PM

greg renamed this task from [Quarterly Success Metric] - Stable uptime metrics of the Staging cluster to [Quarterly Success Metric] Stable uptime metrics of the Staging cluster. Feb 5 2015, 7:07 PM

greg set Security to None.

greg mentioned this in T1000: Update Beta Cluster status documentation (re Q3 intradepartamental priority).

greg added a project: releng-201415-Q3. Feb 24 2015, 1:03 AM

thcipriani subscribed. Mar 3 2015, 4:58 PM

The success:error ratio, along with the volume of requests and the average response times would be really good metrics for the health of the staging cluster.

• mmodell claimed this task. Mar 4 2015, 10:23 PM

• mmodell updated the task description.

Perhaps, but I'll note that this also has flaws: it won't show anything if the nginx / varnish machines themselves are down, or if DNS is down and people can't reach this, etc.

Did someone talk with @chasemp to see what he's doing wrt prod? They also want stability metrics...

@yuvipanda: yes I talked to chase about this a while back but he didn't seem to think he had anything we could immediately apply to our stuff, at least not yet.

For overall availability, I would take any gaps in the graph to mean 100% errors during those times (since we weren't handling any requests, it must have been down).
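The "gaps mean 100% errors" rule can be sketched as below. The `[value, timestamp]` pair layout follows Graphite's JSON render output, where missing points come back as null (`None` in Python). That the metric value is a percentage between 0 and 100 is an assumption, and the sample data is invented.

```python
def overall_availability(datapoints, gaps_are_outages=True):
    """Average an availability series of [value, timestamp] pairs.

    value is assumed to be a percentage (0..100) or None for a gap.
    With gaps_are_outages=True, each gap is counted as 0% available;
    otherwise gaps are skipped, which is what a plain average does.
    Returns None if there is nothing to average.
    """
    if gaps_are_outages:
        values = [0.0 if v is None else v for v, _ts in datapoints]
    else:
        values = [v for v, _ts in datapoints if v is not None]
    if not values:
        return None
    return sum(values) / len(values)


# Hypothetical series with two gaps out of four intervals.
series = [[100.0, 0], [None, 60], [99.0, 120], [None, 180]]
print(overall_availability(series, gaps_are_outages=False))  # 99.5
print(overall_availability(series))                          # 49.75
```

The difference between the two results shows why the choice matters: skipping gaps hides exactly the periods when nothing was being served.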

So our current txstatsd implementation interprets 'no new data' to be 'repeat last data point' (yes, stupid) rather than 'no new data'. I'm sure we can overcome this properly with some diamond code, but needs to be evaluated properly.

Still seems a bit fragile. 0 vs null would be an issue, perhaps.

I do not have any easy alternatives to offer, however. My personal preference would be to just wait for Chase to do the prod stuff and just use exactly the same metrics...

In T88705#1090443, @yuvipanda wrote:

So our current txstatsd implementation interprets 'no new data' to be 'repeat last data point' (yes, stupid) rather than 'no new data'. I'm sure we can overcome this properly with some diamond code, but needs to be evaluated properly.

I think txstatsd is going to be replaced soon (T89857).

In T88705#1090443, @yuvipanda wrote:

My personal preference would be to just wait for Chase to do the prod stuff and just use exactly the same metrics...

Can't, it's way too expensive (TM). cc @chasemp

@yuvipanda: the number is cumulative, so repeating the last data point would be the same as no new data.
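To make the cumulative-counter point concrete, here is a small sketch: when the stored values are running totals, a repeated last data point produces a zero delta, which is the same as having seen no requests in that interval. The function name and the sample numbers are hypothetical, and counter resets are deliberately not handled.

```python
def per_interval_availability(ok_total, req_total):
    """Turn two cumulative counters into per-interval availability.

    ok_total and req_total are running totals sampled at the same times.
    An interval in which the request counter did not move yields None,
    because 0 successes out of 0 requests is undefined rather than 0%.
    Counter resets are not handled in this sketch.
    """
    result = []
    for i in range(1, len(req_total)):
        d_req = req_total[i] - req_total[i - 1]
        d_ok = ok_total[i] - ok_total[i - 1]
        result.append(None if d_req == 0 else d_ok / d_req)
    return result


# The third sample repeats the second one, as a "repeat last data point"
# backend would store it when no new data arrived.
req = [1000, 2000, 2000, 3000]
ok = [990, 1985, 1985, 2980]
print(per_interval_availability(ok, req))  # [0.995, None, 0.995]
```

Whether that `None` interval should then be treated as an outage or simply skipped is the "0 vs null" question raised earlier in the thread.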

I'm in ping central today gentlemen. My 2 cents.

Yes, txstatsd is not great and will die. Yes, it seems to confuse "no metrics" with "submit something so we don't have to care about xfactor", which is awful. For stability or uptime metrics, this seems sane to me for staging. I would say make a diamond collector that only does a subset of metrics, to reduce load for labs friendliness, and manage it via puppet.

The prod stuff I am doing currently is mostly unrelated, and I think maybe three concepts are being a bit conflated: stability, availability, and reachability, i.e. is the grocery store going to collapse, is it open, and do we have a ride to get there. You could get stability- and availability-type things from diamond/statsd/graphite doing local polling, and have diamond submit noise values to offset txstatsd for now, though it pains me to suggest that. Then when we move away from txstatsd, fix that, or do some other thing if you want. In general, though, filippo is doing statsd stuff in prod and no one has a super defined grand plan yet.
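The "only a subset of metrics" idea could look roughly like the sketch below. It is not the VarnishStatusCollector that was later merged; the metric names are hypothetical, and `publish` is passed in as a plain callable so the example runs without Diamond installed. In a real Diamond collector the same loop would sit in `collect()` and call `self.publish(name, value)`.

```python
def publish_subset(raw_metrics, allowed, publish):
    """Publish only the metrics whose names are in `allowed`.

    raw_metrics: dict of metric name -> numeric value gathered locally.
    allowed: collection of metric names worth sending (keeps load low).
    publish: callable taking (name, value); stands in for the
             collector's own publish method in this sketch.
    Returns the number of metrics published.
    """
    published = 0
    for name in sorted(raw_metrics):
        if name in allowed:
            publish(name, raw_metrics[name])
            published += 1
    return published


# Hypothetical metric names and values, for illustration only.
raw = {"status.2xx": 9900, "status.4xx": 40, "status.5xx": 10, "cache.hits": 123456}
sent = []
count = publish_subset(raw, {"status.2xx", "status.5xx"},
                       lambda name, value: sent.append((name, value)))
print(count, sent)
```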

• mmodell moved this task from Backlog to In Progress on the Staging board. Mar 12 2015, 4:57 PM

Change 199302 had a related patch set uploaded (by 20after4):
VarnishStatusCollector for diamond.

https://gerrit.wikimedia.org/r/199302

gerritbot added a project: Patch-For-Review. Mar 24 2015, 5:57 PM

https://graphite.wmflabs.org/render?from=-2days&until=now&width=500&height=350&target=deployment-prep.deployment-cache-text02.availability.availability.value&title=Beta%20Availability&_uniq=0.6155465365580979

Beta Cluster Availability Graph | Dashboard
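For anyone who wants the raw numbers rather than the rendered image, the same render endpoint can be asked for JSON. The sketch below only builds the URL and does not fetch anything. The host and target are copied from the graph link above; `format=json` is a standard Graphite render parameter, but whether this particular instance is still reachable is not something this example checks.

```python
from urllib.parse import urlencode


def render_url(base, target, frm="-2days", until="now", fmt="json"):
    """Build a Graphite render URL for one target.

    base: scheme and host of the Graphite instance, without a trailing slash.
    target: metric path to render.
    frm/until: time range, in Graphite's relative or absolute syntax.
    fmt: output format; "json" returns datapoints instead of an image.
    """
    query = urlencode({"target": target, "from": frm, "until": until, "format": fmt})
    return f"{base}/render?{query}"


url = render_url(
    "https://graphite.wmflabs.org",
    "deployment-prep.deployment-cache-text02.availability.availability.value",
)
print(url)
```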

Change 199302 merged by Filippo Giunchedi:
VarnishStatusCollector for diamond.

https://gerrit.wikimedia.org/r/199302

fgiunchedi mentioned this in rOPUPc50803130e37: VarnishStatusCollector for diamond. Mar 27 2015, 11:20 AM

• mmodell updated the task description. Mar 28 2015, 2:40 PM

• mmodell removed a project: Patch-For-Review.

• mmodell updated the task description. Mar 31 2015, 4:22 PM

• mmodell updated the task description. Mar 31 2015, 8:04 PM

Although this isn't a perfect solution, we don't have the time to work on this any further right now, and the graphs (linked above) are a decent start. We will try to improve things a bit when we roll out the staging cluster in the coming months.

In T88705#1156024, @mmodell wrote:

Beta Cluster Availability Graph | Dashboard

Can you update the URLs (maybe in the description)? I'd love to use these for a few things coming my/our way, but they're broken now :/

@greg: https://graphite.wmflabs.org/dashboard/#availability should work now

• mmodell updated the task description. Apr 24 2015, 10:20 PM

• mmodell updated the task description. Apr 24 2015, 11:15 PM

• mmodell updated the task description. Apr 25 2015, 1:14 AM

Ok, I updated the dashboard and improved the graphs just a little.

greg moved this task from In Progress to Done on the Staging board. Apr 30 2015, 3:48 PM

greg mentioned this in T97865: Setup (simple) catchpoint monitoring and metrics for enwiki betacluster just like production. May 1 2015, 10:51 PM

• Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:55 PM

Restricted Application added a project: Release-Engineering-Team (Kanban). Jun 7 2017, 6:55 PM

• Phabricator_maintenance edited projects, added RelEng-Archive-FY201718-Q1; removed Release-Engineering-Team (Kanban). Sep 26 2017, 11:48 PM
