Define a metric to track OpenStack system availability
Closed, Resolved, Public

Assigned To

Authored By

	bd808
	Jun 10 2017, 12:34 AM

Description

Define a metric that can be used to track the operational availability of the OpenStack platform managed by the cloud-services-team. This metric should be based on factors largely within the control of the team and reflect the fitness of the OpenStack system for the core use cases of its customers. Once defined, we should implement a system to make tracking and reporting changes in the metric easy and use it to gauge the effectiveness of changes made to our VPS hosting product.

Details

	Subject	Repo	Branch	Lines +/-
	fullstack: optionally clean up leaked VMs after a point	operations/puppet	production	+22 -4
	prometheus: Add additional blackbox module http_200_300_connect	operations/puppet	production	+8 -0


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		bd808	T166396 Program 1 Outcome 4: VPS hosting
		Resolved		Andrew	T167556 Define a metric to track OpenStack system availability

Event Timeline

bd808 created this task. Jun 10 2017, 12:34 AM

Restricted Application added a subscriber: Aklapper. Jun 10 2017, 12:34 AM

bd808 triaged this task as Medium priority. Jun 10 2017, 12:34 AM

bd808 added a parent task: T166396: Program 1 Outcome 4: VPS hosting.

bd808 added a project: Goal. Jun 10 2017, 12:36 AM

bd808 mentioned this in T166034: Decide on FY17/18 Q1 goals for Cloud Services. Jun 10 2017, 12:52 AM

bd808 renamed this task from "Define a metric to track OpenStack system availabilty" to "Define a metric to track OpenStack system availability". Jun 21 2017, 12:07 AM

Assigning to @Andrew as the tech lead for this initiative. He will be responsible for creating a plan for this work and helping me report on it as the quarter progresses.

bd808 mentioned this in T171618: Create a "state of the cloud" monthly report. Jul 25 2017, 5:04 PM

Here are some user-facing things that I'd like to have metrics for:

- OpenStack APIs
  - Keystone API availability
  - Nova API availability
  - Designate API availability
  - Glance API availability
- Horizon web UI availability
- Continued web service from an instance
- Access to existing instances
  - DNS resolution for existing instances
  - DNS resolution for public IPs
  - Login (LDAP/PAM)
- New instance creation
  - Scheduling
  - Spawning
  - Firstboot/puppet
- Labs puppetmaster serving properly
- LDAP/sudo policies working properly
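
One way to turn a list like this into numbers is to probe each item periodically and report the fraction of probes that succeeded. The sketch below is only an illustration of that arithmetic: the check names and the sample data are hypothetical, and it does not reflect how the probes are actually collected.

```python
from collections import defaultdict


def availability_percent(samples):
    """Return {check_name: percentage of successful probes}.

    `samples` is an iterable of (check_name, ok) pairs, where `ok` is True
    when a single probe of that check succeeded.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for name, ok in samples:
        totals[name] += 1
        if ok:
            successes[name] += 1
    return {name: 100.0 * successes[name] / totals[name] for name in totals}


# Hypothetical probe results, not real measurements.
samples = [
    ("keystone_api", True),
    ("keystone_api", True),
    ("keystone_api", False),
    ("keystone_api", True),
    ("nova_api", True),
    ("nova_api", True),
]
report = availability_percent(samples)
```

Keeping one percentage per check, rather than a single combined figure, makes it easier to see which user-facing item caused a dip.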

fgiunchedi subscribed. Sep 11 2017, 2:53 PM

Change 377288 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] prometheus: Add additional blackbox module http_tolerant_connect

https://gerrit.wikimedia.org/r/377288

gerritbot added a project: Patch-For-Review. Sep 11 2017, 3:34 PM

Change 377288 merged by Andrew Bogott:
[operations/puppet@production] prometheus: Add additional blackbox module http_200_300_connect

https://gerrit.wikimedia.org/r/377288

I have some API uptime stats at https://grafana.wikimedia.org/dashboard/db/wmcs-api-uptimes?orgId=1

Obviously that dashboard needs a lot of fancying up.

I'm still convinced that fullstack success is the real metric of interest here... I'm currently working on getting an uptime % out of that. To get better numbers we'll probably need to modify the fullstack test so that it cleans up failed VMs periodically rather than bailing out after six failures.

I've added fullstack success % to the above graph. We still need to add some auto-cleanup functions to the fullstack test so that the numbers stay accurate.
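
To illustrate why cleanup matters for the percentage, here is a rough simulation, not the actual fullstack test. It assumes each failed run leaks one VM and that the test stops once six leaked VMs have accumulated; `run_fullstack`, `max_leaked`, and the outcome data are all hypothetical.

```python
def run_fullstack(outcomes, max_leaked=6, cleanup=False):
    """Simulate repeated fullstack runs and return (runs, success_percent).

    outcomes: list of booleans, True meaning a VM was created and verified.
    Each failed run is assumed to leave one leaked VM behind. Once
    `max_leaked` VMs have piled up, the test either stops (cleanup=False)
    or discards the leaked VMs and keeps going (cleanup=True).
    """
    leaked = 0
    runs = 0
    successes = 0
    for ok in outcomes:
        if leaked >= max_leaked:
            if not cleanup:
                break  # bail out: no more samples are recorded
            leaked = 0  # stand-in for deleting the leaked VMs
        runs += 1
        if ok:
            successes += 1
        else:
            leaked += 1
    percent = 100.0 * successes / runs if runs else 0.0
    return runs, percent


# Hypothetical history: a short outage followed by a recovery.
outcomes = [True] * 6 + [False] * 6 + [True] * 12
without_cleanup = run_fullstack(outcomes)
with_cleanup = run_fullstack(outcomes, cleanup=True)
```

In this made-up history the run without cleanup stops sampling during the outage and never sees the recovery, so its percentage stays lower than the run that cleans up and continues.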

Change 379388 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] fullstack: optionally clean up leaked VMs after a point

https://gerrit.wikimedia.org/r/379388

Change 379388 merged by Andrew Bogott:
[operations/puppet@production] fullstack: optionally clean up leaked VMs after a point

https://gerrit.wikimedia.org/r/379388

Andrew closed this task as Resolved. Sep 29 2017, 3:25 PM
