
Deploy a proof of concept prometheus server in cloudvps to replace shinken
Closed, Resolved · Public

Description

Create a new OpenStack project and Prometheus server to scrape metrics from the existing node-exporter running on virtual machines in the tools and cloud-infra projects. Once these projects are configured, WMCS will evaluate adding more CloudVPS projects to this configuration.
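
For context, a minimal Prometheus scrape job using OpenStack service discovery might look roughly like the sketch below; the job name, region, endpoint, and credentials are placeholders, not the actual configuration.

scrape_configs:
  - job_name: node                    # placeholder job name
    openstack_sd_configs:
      - role: instance                # discover project VMs rather than hypervisors
        region: eqiad1-r              # placeholder region name
        identity_endpoint: https://keystone.example.org:5000/v3   # placeholder Keystone URL
        username: prometheus          # placeholder read-only credentials
        password: secret
        domain_name: default
        project_name: tools           # one entry per monitored project
        port: 9100                    # node-exporter port opened in the security groups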

Initial steps to deploy the new monitoring stack:

Once we have an idea on data retention and usage:

  • Update OpenStack service discovery to monitor either all projects or a specific list of projects, and update existing security groups as appropriate
  • Configure new project template with updated security group rules (if we decide to do all projects above)

Related Objects

Event Timeline


The new VM is created with a dedicated network port. Having a dedicated port reserves an IP address, making future architecture changes or rebuilds more flexible.

$ OS_PROJECT_ID=metricsinfra openstack port create --network 7425e328-560c-4f00-8e99-706f3fb90bb4 --description "reserved address for monitoring" prometheus01.metricsinfra.eqiad.wikimedia.cloud

$ OS_PROJECT_ID=metricsinfra openstack server create --image debian-10.0-buster --flavor bigdisk2 --nic port-id=2e67a486-e840-4800-b974-d9220f5e107a prometheus01

Doesn't it also make it impossible to fully replace the instance without ops intervention? I'd prefer we had a special DNS name that stayed the same.

> Doesn't it also make it impossible to fully replace the instance without ops intervention? I'd prefer we had a special DNS name that stayed the same.

It does, we'd need to detach and reattach the port to the new instance. But I think either way we'll still require ops intervention. If we fully replaced an instance we'd have to update each project's security groups with the new IP address.

Maybe a new port with a reserved address and DNS name in addition to the local VM's network interface would be better?
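
For reference, moving the reserved port to a rebuilt instance would be roughly the following; the replacement server name here (prometheus02) is purely illustrative.

# Detach the reserved port from the old instance, then attach it to the replacement.
OS_PROJECT_ID=metricsinfra openstack server remove port prometheus01 prometheus01.metricsinfra.eqiad.wikimedia.cloud
OS_PROJECT_ID=metricsinfra openstack server add port prometheus02 prometheus01.metricsinfra.eqiad.wikimedia.cloud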

>> Doesn't it also make it impossible to fully replace the instance without ops intervention? I'd prefer we had a special DNS name that stayed the same.

> It does, we'd need to detach and reattach the port to the new instance. But I think either way we'll still require ops intervention. If we fully replaced an instance we'd have to update each project's security groups with the new IP address.

Ugh, yeah, you're right re: IP and security groups I think. This is probably the right solution, and it doesn't necessarily make sense to try to fix the problem of special IPs like this for ordinary tenants as this is really specific to cloudinfra-type systems.

> Maybe a new port with a reserved address and DNS name in addition to the local VM's network interface would be better?

I'm not sure I understand the distinction beyond the addition of a particular DNS name?

>> Maybe a new port with a reserved address and DNS name in addition to the local VM's network interface would be better?

> I'm not sure I understand the distinction beyond the addition of a particular DNS name?

I was thinking of it more like a dedicated service address and name pair. Something that we could detach from the underlying host without losing full network connectivity. We're probably getting in the weeds here though :)

Yeah, I think what you've got so far sounds like the right thing for now at least.

Change 588803 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: Add metricsinfra prometheus server

https://gerrit.wikimedia.org/r/588803

For the record, we have a Grafana dashboard per project (https://grafana-labs.wikimedia.org/d/000000059/cloud-vps-project-board?orgId=1&var-project=project-proxy&var-server=All&from=now-2d&to=now), which I think is using data from Graphite instead of Prometheus.

Ah, is the plan to use the existing grafana-labs in prod?

> Ah, is the plan to use the existing grafana-labs in prod?

Yeah, we can add this Prometheus server as a datasource to Grafana running on cloudmetrics. (Similar setup to the current tools Prometheus server)
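
As a rough illustration only (not the actual cloudmetrics provisioning), a Grafana datasource entry for this server could look something like:

apiVersion: 1
datasources:
  - name: metricsinfra                # placeholder datasource name
    type: prometheus
    access: proxy
    url: http://prometheus01.metricsinfra.eqiad.wmflabs/cloud   # /cloud is the path this server answers on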

Change 588803 merged by Jhedden:
[operations/puppet@production] cloudvps: Add metricsinfra prometheus server

https://gerrit.wikimedia.org/r/588803

Mentioned in SAL (#wikimedia-cloud) [2020-04-15T20:09:08Z] <jeh> update default security group to allow prometheus01.metricsinfra.eqiad.wmflabs TCP 9100 T250206

Mentioned in SAL (#wikimedia-cloud) [2020-04-15T20:10:10Z] <jeh> update default security group to allow prometheus01.metricsinfra.eqiad.wmflabs TCP 9100 T250206

Command used to update security groups for tools and cloudinfra

OS_PROJECT_ID=tools openstack security group rule create default --protocol tcp --dst-port 9100:9100 --remote-ip 172.16.0.229/32
OS_PROJECT_ID=cloudinfra openstack security group rule create default --protocol tcp --dst-port 9100:9100 --remote-ip 172.16.0.229/32
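
To sanity-check that the rules landed, something like the following should show the new 9100 entries (exact output depends on the client version):

OS_PROJECT_ID=tools openstack security group rule list default | grep 9100
OS_PROJECT_ID=cloudinfra openstack security group rule list default | grep 9100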

The new Prometheus server is up and scraping node-exporter metrics from all the VMs in tools and cloudinfra:

jeh@prometheus01:~$ curl -s http://$(hostname -f)/cloud/api/v1/targets | jq '.data.activeTargets | .[] | .labels.instance+" "+.health' | grep -c up
171
jeh@prometheus01:~$ curl -s http://$(hostname -f)/cloud/api/v1/targets | jq '.data.activeTargets | .[] | .labels.instance+" "+.health' | grep -v up
"tools-sgeexec-0912 down"
Krenair updated the task description.

Change 589398 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: update project service discovery prometheus config

https://gerrit.wikimedia.org/r/589398

Change 589398 merged by Jhedden:
[operations/puppet@production] cloudvps: update project service discovery prometheus config

https://gerrit.wikimedia.org/r/589398

Change 589716 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: add prometheus alert rules for project instances

https://gerrit.wikimedia.org/r/589716

Change 589716 merged by Jhedden:
[operations/puppet@production] cloudvps: add prometheus alert rules for project instances

https://gerrit.wikimedia.org/r/589716

Change 589864 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: update prometheus rule annotations

https://gerrit.wikimedia.org/r/589864

Change 589864 merged by Jhedden:
[operations/puppet@production] cloudvps: update prometheus rule annotations

https://gerrit.wikimedia.org/r/589864

Change 591053 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: Add new role for metricsinfra

https://gerrit.wikimedia.org/r/591053

Change 591053 merged by Jhedden:
[operations/puppet@production] cloudvps: Add new role for metricsinfra

https://gerrit.wikimedia.org/r/591053

Change 591202 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: metricsinfra add prometheus alert manager and email notifications

https://gerrit.wikimedia.org/r/591202

Change 591202 merged by Jhedden:
[operations/puppet@production] cloudvps: metricsinfra add prometheus alert manager and email notifications

https://gerrit.wikimedia.org/r/591202

Email-based alert notifications are now enabled for the tools and cloudinfra projects.
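
For reference, the email notification side of the Alertmanager configuration is roughly of this shape; the receiver name, addresses, and smarthost below are placeholders rather than the deployed values.

route:
  receiver: default-email
  group_by: ['alertname', 'instance']
receivers:
  - name: default-email
    email_configs:
      - to: cloud-admins@example.org                 # placeholder recipient
        from: alertmanager@metricsinfra.example.org  # placeholder sender
        smarthost: mail.example.org:25               # placeholder relay
        require_tls: false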

Note that the OpenStack service discovery has support for monitoring all projects, but unfortunately our current packaged version of Prometheus drops the instance's project name, which breaks multi-tenancy. Once our version of Prometheus includes this upstream patch [0], we have the option to enable monitoring for every project.

[0] https://github.com/prometheus/prometheus/commit/9c5370fdfe7cf51fd5d58151bb745ac10f6c2dac
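
Once that patch is available, attaching the project to every target should just be a small relabeling step, roughly along these lines (the exact meta label name depends on the patched service discovery):

relabel_configs:
  - source_labels: [__meta_openstack_project_id]
    target_label: project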

I poked at tools-sgeexec-0901 just out of curiosity, and the space was being eaten by the apt cache. After running sudo apt clean:

/dev/vda3                                                           19G   12G  6.2G  66% /

That's down from 80%. That's pretty common in Toolforge (I've seen it before), but I don't have a strong opinion on how to fix it. Just noting the condition in case it's useful to you.

Added a Grafana dashboard for detailed instance metrics using the metricsinfra prometheus server: https://grafana-labs.wikimedia.org/d/000000590/metricsinfra-cloudvps-instance-details

Change 593042 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: metricsinfra add project label and default alert rules

https://gerrit.wikimedia.org/r/593042

Change 593042 merged by Jhedden:
[operations/puppet@production] cloudvps: metricsinfra add project label and default alert rules

https://gerrit.wikimedia.org/r/593042

Change 593048 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: enable project monitoring for the metricsinfra project

https://gerrit.wikimedia.org/r/593048

Change 593048 merged by Jhedden:
[operations/puppet@production] cloudvps: enable project monitoring for the metricsinfra project

https://gerrit.wikimedia.org/r/593048

Change 593054 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: enable monitoring for projects using shinken

https://gerrit.wikimedia.org/r/593054

Change 593054 merged by Jhedden:
[operations/puppet@production] cloudvps: enable monitoring for projects using shinken

https://gerrit.wikimedia.org/r/593054

Change 593342 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: metricsinfra alert on puppet agent disabled state

https://gerrit.wikimedia.org/r/593342

Change 593342 merged by Jhedden:
[operations/puppet@production] cloudvps: metricsinfra alert on puppet agent disabled state

https://gerrit.wikimedia.org/r/593342

Bstorm triaged this task as High priority. (Jun 2 2020, 4:16 PM)

Since implementing apt cache autocleaning for T127374: Avoid indefinite growing of apt caches and old kernel images, I think we probably should enable the disk size monitor.

Rerunning https://prometheus.wmflabs.org/cloud/graph?g0.range_input=1h&g0.expr=100%20-%20(node_filesystem_avail_bytes%7Bfstype%3D%22ext4%22%7D%2Fnode_filesystem_size_bytes%20*%20100)%20%3E%3D%2080&g0.tab=1 shows that the only things in tools that would alert are the docker registries (which we should be concerned about and need to clean up). I'm not sure we want to alert on deployment-prep's disks; they seem quite full.
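
If we do enable it, a rule built from that query could look roughly like the following; the threshold, duration, and names are illustrative, not the deployed rule.

groups:
  - name: instance_disk
    rules:
      - alert: InstanceDiskAlmostFull
        expr: >
          100 - (node_filesystem_avail_bytes{fstype="ext4"} / node_filesystem_size_bytes * 100) >= 80
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: 'Filesystem on {{ $labels.instance }} is over 80% full'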

Things that I think we could do next here:

  • set up a project puppetmaster so that we can have secret storage for things like volunteers' email addresses
  • refactor the puppet module so that the alertmanager config merges public and private hiera hashes, so that email addresses etc. can be kept non-public (rough sketch after this list)
  • set up an IRC relay on the metricsinfra node
  • change reverse proxy restrictions so that the vpsalertmanager.toolforge.org deployment can silence alerts. We had an email discussion and decided that abuse of random silences
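
For the hiera merge idea, here is a rough sketch of what the module could do; the lookup keys are invented for illustration, not the real hiera layout.

# Sketch only: hypothetical keys, not the actual profile.
$public_contacts  = lookup('profile::metricsinfra::alertmanager_contacts', Hash, 'hash', {})
$private_contacts = lookup('profile::metricsinfra::alertmanager_private_contacts', Hash, 'hash', {})
# stdlib's deep_merge() lets the private hash supply the email addresses
# while the public repo keeps only the non-secret structure.
$contacts = deep_merge($public_contacts, $private_contacts)
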
bd808 assigned this task to JHedden.

Let's call this {{done}} as far as a POC goes. See T266050: Build Prometheus service for use by all Cloud VPS projects and their instances for a larger follow up project.