
Define a metric to track OpenStack system availability
Closed, Resolved, Public

Description

Define a metric that can be used to track the operational availability of the OpenStack platform managed by the cloud-services-team. This metric should be based on factors largely within the control of the team and reflect the fitness of the OpenStack system for the core use cases of its customers. Once defined, we should implement a system to make tracking and reporting changes in the metric easy and use it to gauge the effectiveness of changes made to our VPS hosting product.

Event Timeline

bd808 renamed this task from Define a metric to track OpenStack system availabilty to Define a metric to track OpenStack system availability. Jun 21 2017, 12:07 AM
bd808 added a subscriber: Andrew.

Assigning to @Andrew as the tech lead for this initiative. He will be responsible for creating a plan for this work and helping me report on it as the quarter progresses.

Here are some user-facing things that I'd like to have metrics for:

  • OpenStack APIs
    • Keystone API availability
    • Nova API availability
    • Designate API availability
    • Glance API availability
  • Horizon web UI availability
  • Continued web service from an instance
  • Access to existing instances
    • DNS resolution for existing instances
    • DNS resolution for public IPs
    • Login (LDAP/PAM)
  • New instance creation
    • Scheduling
    • Spawning
    • Firstboot/puppet
  • Labs puppetmaster serving properly
  • LDAP/sudo policies working properly
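
As a rough illustration of how the checks listed above could be turned into numbers, here is a minimal sketch that rolls pass/fail probe samples up into a per-check availability percentage. The check names (`keystone_api`, `horizon_ui`) and the `availability_by_check` helper are hypothetical, and the samples are made-up illustration data, not real measurements.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def availability_by_check(samples: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the percentage of successful probes for each check name."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for name, ok in samples:
        totals[name] += 1
        if ok:
            successes[name] += 1
    # Every name in totals has at least one sample, so no division by zero.
    return {name: 100.0 * successes[name] / totals[name] for name in totals}


# Hypothetical probe results: (check name, did the probe succeed?)
samples = [
    ("keystone_api", True),
    ("keystone_api", True),
    ("keystone_api", False),
    ("keystone_api", True),
    ("horizon_ui", True),
    ("horizon_ui", True),
]

report = availability_by_check(samples)
for name, pct in sorted(report.items()):
    print(f"{name}: {pct:.1f}% available")
```

Keeping one percentage per check, rather than a single blended number, makes it easier to see which user-facing piece caused a dip.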

Change 377288 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] prometheus: Add additional blackbox module http_tolerant_connect

https://gerrit.wikimedia.org/r/377288

Change 377288 merged by Andrew Bogott:
[operations/puppet@production] prometheus: Add additional blackbox module http_200_300_connect

https://gerrit.wikimedia.org/r/377288

I have some API uptime stats at https://grafana.wikimedia.org/dashboard/db/wmcs-api-uptimes?orgId=1

Obviously that dashboard needs a lot of fancying up.

I'm still convinced that fullstack success is the real metric of interest here... I'm currently working on getting an uptime % out of that. To get better numbers we'll probably need to modify the fullstack test so that it cleans up failed VMs periodically rather than bailing out after six failures.

I've added fullstack success % to the above graph. We still need to add some auto-cleanup functions to the fullstack test to keep accurate numbers.
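
To make the auto-cleanup idea concrete, here is a minimal sketch of a test loop that deletes leaked VMs once a threshold is reached and keeps running, instead of bailing out. This is not the actual fullstack test code: `run_fullstack_rounds`, `create_vm`, `delete_vm`, and the `max_leaked` parameter are hypothetical names, and the stubs below only simulate failures.

```python
from typing import Callable, List, Tuple


def run_fullstack_rounds(
    create_vm: Callable[[str], bool],
    delete_vm: Callable[[str], None],
    rounds: int,
    max_leaked: int = 6,  # assumed threshold, mirroring "six failures"
) -> Tuple[float, List[str]]:
    """Run `rounds` create/delete cycles; return (success %, VMs still leaked)."""
    leaked: List[str] = []
    successes = 0
    for i in range(rounds):
        name = f"fullstack-{i}"
        if create_vm(name):
            successes += 1
            delete_vm(name)  # normal teardown after a good round
        else:
            leaked.append(name)  # failed VM is left behind for inspection
        if len(leaked) >= max_leaked:
            # Clean up instead of aborting, so the success % keeps updating.
            for vm in leaked:
                delete_vm(vm)
            leaked.clear()
    pct = 100.0 * successes / rounds if rounds else 0.0
    return pct, leaked


# Stubs that simulate the first three creations failing.
deleted: List[str] = []


def fake_create(name: str) -> bool:
    return int(name.split("-")[1]) >= 3


def fake_delete(name: str) -> None:
    deleted.append(name)


success_pct, still_leaked = run_fullstack_rounds(
    fake_create, fake_delete, rounds=10, max_leaked=2
)
print(f"fullstack success: {success_pct:.1f}%, still leaked: {still_leaked}")
```

Because the loop never exits early, every round counts toward the percentage, which is what the uptime graph needs.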

Change 379388 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] fullstack: optionally clean up leaked VMs after a point

https://gerrit.wikimedia.org/r/379388

Change 379388 merged by Andrew Bogott:
[operations/puppet@production] fullstack: optionally clean up leaked VMs after a point

https://gerrit.wikimedia.org/r/379388