cloudvps: metrics and analytics
Closed, ResolvedPublic

Description

We would like to collect metrics from our Cloud VPS deployments, especially from the new eqiad1 deployment.
We want not only server metrics but also high-level OpenStack metrics, such as the state of the neutron agents, the number of VMs and projects, and the nova scheduling state, as well as the rabbit queues.

There seems to be some work done already:

However, I can't find any Debian package for it, so we may need to create one ourselves.

While we are at it, we should fix the rabbitmq exporter on cloudcontrol1003/cloudcontrol1004, which is not working.

Details

Repo | Branch | Lines +/-
operations/puppet | production | +0 -1
operations/puppet | production | +16 -22
operations/puppet | production | +0 -7
operations/puppet | production | +22 -16
operations/puppet | production | +19 -1
operations/puppet | production | +10 -2
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +2 -2
operations/puppet | production | +46 -2
operations/puppet | production | +3 -3
operations/puppet | production | +10 -0
operations/debs/prometheus-openstack-exporter | master | +1 -0
operations/puppet | production | +2 -3
operations/puppet | production | +43 -35
operations/puppet | production | +10 -10
operations/puppet | production | +10 -10
operations/puppet | production | +1 -1
operations/puppet | production | +2 -1
operations/puppet | production | +138 -0
operations/debs/prometheus-openstack-exporter | master | +2 -3
operations/debs/prometheus-openstack-exporter | master | +98 -0
operations/puppet | production | +4 -3
operations/puppet | production | +6 -0

Event Timeline

aborrero triaged this task as Medium priority.Aug 30 2018, 4:16 PM
aborrero created this task.
aborrero updated the task description.

Change 456569 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: rabbitmq: create monitoring user

https://gerrit.wikimedia.org/r/456569

Change 456569 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvps: rabbitmq: create monitoring user

https://gerrit.wikimedia.org/r/456569

Change 456572 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: rabbitmq: monitoring user is administrator

https://gerrit.wikimedia.org/r/456572

aborrero renamed this task from cloudvps: metrics to cloudvps: metrics and analytics. Aug 31 2018, 8:20 AM

Change 456572 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvps: rabbitmq: monitoring user is administrator

https://gerrit.wikimedia.org/r/456572

I've been playing with https://github.com/CanonicalLtd/prometheus-openstack-exporter on cloudcontrol1003.wikimedia.org.

It seems to work, but we would need some patching to adapt it to our environment (especially the keystone credentials), something like https://phabricator.wikimedia.org/source/tool-keystone-browser/browse/master/keystone_browser/keystone.py$45
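
To make the credentials point concrete, here is a minimal sketch of collecting Keystone v3 password credentials in one place. It is not the exporter's actual code nor the linked keystone_browser code. The OS_* names are the standard OpenStack client environment variables; whether our deployment would feed them in this way, and the "default" domain fallback, are assumptions for illustration.

```python
import os

# Standard OpenStack client environment variable names; which ones the
# deployment actually sets is an assumption of this sketch.
REQUIRED = ("OS_AUTH_URL", "OS_USERNAME", "OS_PASSWORD", "OS_PROJECT_NAME")


def keystone_v3_kwargs(env=None):
    """Build keyword arguments for a Keystone v3 password login."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return {
        "auth_url": env["OS_AUTH_URL"],
        "username": env["OS_USERNAME"],
        "password": env["OS_PASSWORD"],
        "project_name": env["OS_PROJECT_NAME"],
        # Domain names fall back to "default" purely for illustration.
        "user_domain_name": env.get("OS_USER_DOMAIN_NAME", "default"),
        "project_domain_name": env.get("OS_PROJECT_DOMAIN_NAME", "default"),
    }
```

A dict of this shape could then be passed to keystoneauth1.identity.v3.Password(**kwargs) and wrapped in a keystoneauth1 session, which keeps the environment-specific part in a single function.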

Also, there are missing bits to integrate the script with diamond/graphite.

I created this preliminary grafana dashboard https://grafana.wikimedia.org/dashboard/db/cloudvps-eqiad1?orgId=1&from=now%2FM&to=now to try to answer the question: how is our openstack deployment eqiad1 working/performing?
But that dashboard would greatly benefit from having openstack-specific metrics.

We decided to follow @fgiunchedi's advice, which is to create a Debian package to deploy this. We will probably follow what other similar packages do (for example operations/deb/prometheus-rabbitmq-exporter.git).

@fgiunchedi also clarified that we can forget about diamond and just go with prometheus. Once we have the code deployed and running (remember, we need some code changes to adapt it to our environment) we can instruct prometheus to scrape this data.
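
For context on what prometheus would scrape: an exporter serves plain text lines of the form name{labels} value over HTTP. The following is a minimal sketch of rendering that format, not the exporter's real code. The metric names and values are hypothetical, and label-value escaping is deliberately left out.

```python
def render_metrics(samples):
    """Render (name, labels, value) tuples in Prometheus text exposition format."""
    lines = []
    for name, labels, value in samples:
        if labels:
            # Labels are rendered as key="value" pairs inside braces.
            pairs = ",".join('%s="%s"' % (k, v) for k, v in sorted(labels.items()))
            lines.append("%s{%s} %s" % (name, pairs, value))
        else:
            lines.append("%s %s" % (name, value))
    return "\n".join(lines) + "\n"


# Hypothetical metric names and values, for illustration only.
print(render_metrics([
    ("example_projects_total", {}, 3),
    ("example_agent_up", {"host": "host-a", "agent": "l3"}, 1),
]))
```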

Right now @GTirloni is creating the gerrit repo so we can start with the Debian packaging.

Change 461376 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/debs/prometheus-openstack-exporter@master] debian: initial code import (jessie)

https://gerrit.wikimedia.org/r/461376

Change 461376 merged by Arturo Borrero Gonzalez:
[operations/debs/prometheus-openstack-exporter@master] debian: initial code import (jessie)

https://gerrit.wikimedia.org/r/461376

Mentioned in SAL (#wikimedia-operations) [2018-09-19T13:03:57Z] <arturo> T203177 add initial prometheus-openstack-exporter package to reprepro (v0.0.8-1)

Change 462455 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: add prometheus-openstack-exporter

https://gerrit.wikimedia.org/r/462455

Change 463445 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/debs/prometheus-openstack-exporter@master] d/service: fix templates leftovers

https://gerrit.wikimedia.org/r/463445

Change 463445 merged by Arturo Borrero Gonzalez:
[operations/debs/prometheus-openstack-exporter@master] d/service: fix templates leftovers

https://gerrit.wikimedia.org/r/463445

Mentioned in SAL (#wikimedia-operations) [2018-09-28T11:38:33Z] <arturo> add prometheus-openstack-exporter 0.0.8-2 to reprepro (T203177)

With the current status of package and puppet patch, I see this error message from the exporter script:

Starting data gather thread
Client setup done, keystone ver 3
Error getting tenants.list, continue with projects.list
Number of projects: 172
Error getting stats: Traceback (most recent call last):
  File "/usr/bin/prometheus-openstack-exporter", line 153, in run
    prodstack['hypervisors'] = [x._info for x in nova.hypervisors.list()]
  File "/usr/lib/python2.7/dist-packages/novaclient/v2/hypervisors.py", line 43, in list
    return self._list('/os-hypervisors%s' % detail, 'hypervisors')
  File "/usr/lib/python2.7/dist-packages/novaclient/base.py", line 242, in _list
    resp, body = self.api.client.get(url)
  File "/usr/lib/python2.7/dist-packages/keystoneauth1/adapter.py", line 173, in get
    return self.request(url, 'GET', **kwargs)
  File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 94, in request
    raise exceptions.from_response(resp, body, url, method)
Forbidden: Policy doesn't allow os_compute_api:os-hypervisors to be performed. (HTTP 403) (Request-ID: req-452a983d-c93e-4920-8112-73cf5b41f51a)

I wonder whether we can tune the API permissions to let this pass, or whether we should just avoid running that particular code path in the script (and not collect those metrics).
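
The second option (skipping the failing code path) could look roughly like the sketch below, where each data source is gathered independently so that a 403 on one of them does not abort the rest. This is not how the exporter is currently structured; the collector functions are hypothetical stand-ins.

```python
import logging

log = logging.getLogger("exporter-sketch")


def gather(collectors):
    """Run each collector independently; skip the ones that raise."""
    results, skipped = {}, []
    for name, collect in collectors.items():
        try:
            results[name] = collect()
        except Exception as exc:  # e.g. an HTTP 403 from the compute API
            log.warning("skipping %s: %s", name, exc)
            skipped.append(name)
    return results, skipped


def list_projects():
    return ["project-a", "project-b"]  # stand-in data


def list_hypervisors():
    # Stand-in for the call that currently fails with HTTP 403.
    raise PermissionError("Policy doesn't allow os_compute_api:os-hypervisors")


results, skipped = gather({"projects": list_projects, "hypervisors": list_hypervisors})
```

The trade-off is that the hypervisor metrics would simply be absent, so tuning permissions or using different credentials remains the way to get complete data.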

Mentioned in SAL (#wikimedia-operations) [2018-09-28T12:28:42Z] <arturo> downtime cloudcontrol1004.wikimedia.org for 2H (tests related to T203177)

Change 462455 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvps: add prometheus-openstack-exporter

https://gerrit.wikimedia.org/r/462455

Mentioned in SAL (#wikimedia-operations) [2018-09-28T12:42:13Z] <arturo> downtime cloudcontrol1003.wikimedia.org for 2H (tests related to T203177)

Change 463465 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus-openstack-exporter: fix ferm syntax

https://gerrit.wikimedia.org/r/463465

Change 463465 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus-openstack-exporter: fix ferm syntax

https://gerrit.wikimedia.org/r/463465

Change 463466 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus-openstack-exporter: fix comma in ferm array

https://gerrit.wikimedia.org/r/463466

Change 463466 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus-openstack-exporter: fix comma in ferm array

https://gerrit.wikimedia.org/r/463466

Change 463467 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus-openstack-exporter: typo in config template

https://gerrit.wikimedia.org/r/463467

Change 463467 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus-openstack-exporter: typo in config template

https://gerrit.wikimedia.org/r/463467

Change 463470 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus-openstack-exporter: fix syntax in variables used in template

https://gerrit.wikimedia.org/r/463470

Change 463470 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus-openstack-exporter: fix syntax in variables used in template

https://gerrit.wikimedia.org/r/463470

Change 463472 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus-openstack-exporter: fix subvars usage

https://gerrit.wikimedia.org/r/463472

Change 463472 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus-openstack-exporter: fix subvars usage

https://gerrit.wikimedia.org/r/463472

Status of this: code has been deployed (after a lot of back and forth with puppet) and we get the Forbidden: Policy doesn't allow os_compute_api:os-hypervisors to be performed. (HTTP 403) error.
Also, not sure if we need something else to instruct prometheus to read from this exporter.

Yes. Once you can curl the exporter successfully from the labmon hosts, you have to add a "job" to role::labs::prometheus so that Prometheus polls openstack-exporter.
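
As a rough equivalent of the curl check, the sketch below fetches a metrics page and lists the metric names it exposes. The exporter's host, port, and path are not stated in this task, so the URL must be supplied by the caller. The parser assumes the plain text exposition format.

```python
from urllib.request import urlopen


def fetch_metrics(url, timeout=10):
    """Fetch the exporter's metrics page, roughly what curl would do."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_names(text):
    """Return the distinct metric names found in exposition-format text."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        # The name ends at the first "{" (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return sorted(names)
```

Running metric_names(fetch_metrics(url)) from a labmon host and getting a non-empty list would suggest the firewall rules and the exporter are both in place before the job is added.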

Change 463732 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: eqiad1 metrics: use labmon instead of main prod monitoring hosts

https://gerrit.wikimedia.org/r/463732

Change 463732 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvps: eqiad1 metrics: use labmon instead of main prod monitoring hosts

https://gerrit.wikimedia.org/r/463732

Mentioned in SAL (#wikimedia-operations) [2018-10-01T11:52:22Z] <arturo> install prometheus-openstack-exporte 0.0.8-3 in reprepro T203177

Change 463736 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/debs/prometheus-openstack-exporter@master] d/dirs: include /var/cache/prometheus-openstack-exporter

https://gerrit.wikimedia.org/r/463736

Change 463736 merged by Arturo Borrero Gonzalez:
[operations/debs/prometheus-openstack-exporter@master] d/dirs: include /var/cache/prometheus-openstack-exporter

https://gerrit.wikimedia.org/r/463736

Change 463790 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: allow queries to the nova API by novaobserver

https://gerrit.wikimedia.org/r/463790

Change 463795 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus-openstack-exporter: ensure permissions

https://gerrit.wikimedia.org/r/463795

Change 463795 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus-openstack-exporter: ensure permissions

https://gerrit.wikimedia.org/r/463795

Change 463790 abandoned by Arturo Borrero Gonzalez:
cloudvps: unprotect some nova API queries

Reason:
Let's use novaadmin creds.

https://gerrit.wikimedia.org/r/463790

Change 464124 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus-openstack-exporter: use novaadmin credentials

https://gerrit.wikimedia.org/r/464124

Change 464124 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus-openstack-exporter: use novaadmin credentials

https://gerrit.wikimedia.org/r/464124

Mentioned in SAL (#wikimedia-operations) [2018-10-03T11:38:38Z] <arturo> downtime cloudcontrol1003,1004 for 2h for T203177

Change 464130 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus-openstack-exporter: typo in puppet URL: missing /

https://gerrit.wikimedia.org/r/464130

Change 464130 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus-openstack-exporter: typo in puppet URL: missing /

https://gerrit.wikimedia.org/r/464130

Change 464131 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus-openstack-exporter: s/content/source/ in systemd::service

https://gerrit.wikimedia.org/r/464131

Change 464131 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus-openstack-exporter: s/content/source/ in systemd::service

https://gerrit.wikimedia.org/r/464131

Change 464134 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus-openstack-exporter: use systemd::service content parameter

https://gerrit.wikimedia.org/r/464134

Change 464134 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus-openstack-exporter: use systemd::service content parameter

https://gerrit.wikimedia.org/r/464134

Change 464137 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus-openstack-exporter: replace instead of override systemd service file

https://gerrit.wikimedia.org/r/464137

Change 464137 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus-openstack-exporter: replace instead of override systemd service file

https://gerrit.wikimedia.org/r/464137

Change 464144 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] prometheus-openstack-exporter: Add monitored endpoints

https://gerrit.wikimedia.org/r/464144

Change 464144 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus-openstack-exporter: add monitored endpoints

https://gerrit.wikimedia.org/r/464144

The Prometheus instance running on labmon1001 is scraping data from cloudcontrol1003 but I can't find the series in https://grafana-labs.wikimedia.org.

If I connect to Prometheus directly on port 9900, the series are there but I've failed to find a suitable data source in grafana-labs that would allow me to see that.

Also, I'm confused about grafana-labs vs grafana vs tools-prometheus. It seems we'd want to store OpenStack metrics in the prod prometheus since they're one level below our tools stack.
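
One way to confirm the series exist independently of any grafana data source is to ask Prometheus directly through its HTTP API (/api/v1/query). The sketch below builds such a query URL and unpacks an instant-query response. The base URL, including any path prefix the labmon instance may use, is an assumption to be filled in by the caller.

```python
import json
from urllib.parse import urlencode


def query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return "%s/api/v1/query?%s" % (base_url.rstrip("/"), urlencode({"query": expr}))


def series_from_response(body):
    """Extract (labels, value) pairs from an instant-query JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error"))
    # Each result carries its labels under "metric" and [timestamp, "value"].
    return [(r["metric"], float(r["value"][1])) for r in payload["data"]["result"]]
```

If this returns series while grafana shows nothing, the problem is on the data source side and not in the scrape itself.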

I don't know about grafana-labs but https://grafana.wikimedia.org does have a datasource for labmon's prometheus, so metrics should be available at https://grafana.wikimedia.org. AIUI grafana-labs historically is used for metrics from VMs themselves, though nothing stops you from adding labmon's prometheus there as a data source too. WRT tools-prometheus that's "tools' prometheus installation" where said prometheus only polls tools' metrics from the tools project, IOW completely independent/autonomous from other prometheus (ditto for deployment-prep's), hope that helps!

Change 464493 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus-openstack-exporter: only makes sense in active control node

https://gerrit.wikimedia.org/r/464493

Change 464493 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus-openstack-exporter: only makes sense in active control node

https://gerrit.wikimedia.org/r/464493

Mentioned in SAL (#wikimedia-operations) [2018-10-04T09:13:21Z] <arturo> T203177 schedule 8h icinga downtime for cloudcontrol1003,1004 and labmon1001

Change 464495 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: metrics: cleanup unused hiera datafile

https://gerrit.wikimedia.org/r/464495

Change 464495 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvps: metrics: cleanup unused hiera datafile

https://gerrit.wikimedia.org/r/464495

Change 464496 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] Revert "prometheus-openstack-exporter: only makes sense in active control node"

https://gerrit.wikimedia.org/r/464496

Change 464496 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] Revert "prometheus-openstack-exporter: only makes sense in active control node"

https://gerrit.wikimedia.org/r/464496

Change 464500 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: metrics: adjust depedency on novaenv

https://gerrit.wikimedia.org/r/464500

Change 464500 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvps: metrics: we don't require observerenv

https://gerrit.wikimedia.org/r/464500