Define a metric that can be used to track the operational availability of the OpenStack platform managed by the cloud-services-team. This metric should be based on factors largely within the control of the team and reflect the fitness of the OpenStack system for the core use cases of its customers. Once defined, we should implement a system to make tracking and reporting changes in the metric easy and use it to gauge the effectiveness of changes made to our VPS hosting product.
|Status|Assigned|Task|
|Resolved|bd808|T166396 Program 1 Outcome 4: VPS hosting|
|Resolved|Andrew|T167556 Define a metric to track OpenStack system availability|
Here are some user-facing things that I'd like to have metrics for:
- OpenStack APIs
  - Keystone API availability
  - Nova API availability
  - Designate API availability
  - Glance API availability
- Horizon web UI availability
- Continued web service from an instance
- Access to existing instances
- DNS resolution for existing instances
- DNS resolution for public IPs
- Login (LDAP/PAM)
- New instance creation
- Labs puppetmaster serving properly
- LDAP/sudo policies working properly
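A rough sketch of how probe results for the checks above could be reduced to one availability percentage per check. The check names (`keystone_api`, `instance_dns`) and the `(name, succeeded)` sample format are made up for illustration; nothing here reflects how the real probes are implemented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical sample format: (check name, did the probe succeed?)
Sample = Tuple[str, bool]


def availability_by_check(samples: Iterable[Sample]) -> Dict[str, float]:
    """Return the percentage of successful probes for each check."""
    ok = defaultdict(int)
    total = defaultdict(int)
    for name, success in samples:
        total[name] += 1
        if success:
            ok[name] += 1
    return {name: 100.0 * ok[name] / total[name] for name in total}


# Illustrative data only, not real measurements.
samples = [
    ("keystone_api", True),
    ("keystone_api", True),
    ("keystone_api", False),
    ("keystone_api", True),
    ("instance_dns", True),
    ("instance_dns", True),
]
report = availability_by_check(samples)
for name, pct in sorted(report.items()):
    print(f"{name}: {pct:.1f}%")
```

Keeping one number per check, rather than a single blended score, makes it easier to see which user-facing feature caused a dip.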
I have some api uptime stats at https://grafana.wikimedia.org/dashboard/db/wmcs-api-uptimes?orgId=1
Obviously that dashboard needs a lot of fancying up.
I'm still convinced that fullstack success is the real metric of interest here... I'm currently working on getting an uptime % out of that. To get better numbers we'll probably need to modify the fullstack test so that it cleans up failed VMs periodically rather than bailing out after six failures.
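To make the idea concrete, here is a minimal sketch of a test loop that keeps running and cleans up after failures instead of aborting, then reports an uptime percentage. The `attempt` and `cleanup_failed` callables are hypothetical stand-ins for the real fullstack test steps, and triggering cleanup after every six accumulated failures is only one possible reading of "periodically".

```python
from typing import Callable, List


def run_fullstack_series(
    attempt: Callable[[], bool],
    cleanup_failed: Callable[[], None],
    runs: int,
    cleanup_every: int = 6,
) -> List[bool]:
    """Run `runs` attempts; clean up leaked VMs instead of bailing out."""
    outcomes: List[bool] = []
    failures_since_cleanup = 0
    for _ in range(runs):
        ok = attempt()
        outcomes.append(ok)
        if not ok:
            failures_since_cleanup += 1
        # Assumption: clean up once enough failures pile up, then continue.
        if failures_since_cleanup >= cleanup_every:
            cleanup_failed()
            failures_since_cleanup = 0
    return outcomes


def uptime_percent(outcomes: List[bool]) -> float:
    """Share of successful attempts, as a percentage."""
    if not outcomes:
        raise ValueError("no outcomes recorded")
    return 100.0 * sum(outcomes) / len(outcomes)


# Simulated run: 12 successes followed by 8 failures (illustrative only).
scripted = iter([True] * 12 + [False] * 8)
cleanups: List[int] = []
outcomes = run_fullstack_series(
    attempt=lambda: next(scripted),
    cleanup_failed=lambda: cleanups.append(1),
    runs=20,
)
print(f"uptime: {uptime_percent(outcomes):.1f}%, cleanups: {len(cleanups)}")
```

Because the loop never exits early, every scheduled attempt contributes a sample, so a long outage shows up as a run of failures rather than a gap in the data.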