Deploy a proof of concept prometheus server in cloudvps to replace shinken
Open, Needs Triage, Public

Description

Create a new OpenStack project and Prometheus server to scrape metrics from the existing node-exporter running on virtual machines in the tools and cloudinfra projects. Once these projects are configured, WMCS will evaluate adding more CloudVPS projects to this configuration.

Initial steps to deploy the new monitoring stack:

Once we have an idea of data retention and usage:

  • Update OpenStack service discovery to monitor either all projects or a specific list of projects, and update existing security groups as appropriate
  • Configure the new project template with updated security group rules (if we decide to monitor all projects, as above)

Event Timeline

JHedden created this task. Apr 14 2020, 5:14 PM
JHedden updated the task description. Apr 14 2020, 5:19 PM
JHedden updated the task description.
JHedden updated the task description. Apr 14 2020, 5:34 PM

(Will also need some ferm rules, but if we're reusing prod manifests that is probably already taken care of)

The new VM is created with a dedicated network port. Having a dedicated port reserves an IP address, making future architecture changes or rebuilds more flexible.

$ OS_PROJECT_ID=metricsinfra openstack port create --network 7425e328-560c-4f00-8e99-706f3fb90bb4 --description "reserved address for monitoring" prometheus01.metricsinfra.eqiad.wikimedia.cloud

$ OS_PROJECT_ID=metricsinfra openstack server create --image debian-10.0-buster --flavor bigdisk2 --nic port-id=2e67a486-e840-4800-b974-d9220f5e107a prometheus01
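
The two commands are tied together by the port ID: the value passed as `--nic port-id=...` to `server create` is the ID that `port create` returned. A rough sketch of that hand-off in Python follows. The helper names and the injectable `run` parameter are hypothetical, and `-f value -c id` (the CLI's generic output options) is added here only so the port ID can be captured; none of this is taken from the thread itself.

```python
import os
import subprocess


def run_openstack(args, project):
    """Run one openstack CLI command scoped to a project; return its stdout."""
    env = dict(os.environ, OS_PROJECT_ID=project)
    result = subprocess.run(
        ["openstack", *args],
        env=env, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def create_vm_with_reserved_port(project, network_id, fqdn, image, flavor,
                                 run=run_openstack):
    """Create a named port first, then boot a VM attached to that port."""
    # Step 1: reserve the address. "-f value -c id" prints only the port ID.
    port_id = run(
        ["port", "create", "--network", network_id,
         "--description", "reserved address for monitoring",
         "-f", "value", "-c", "id", fqdn],
        project,
    )
    # Step 2: boot the server on that port, using the short hostname.
    hostname = fqdn.split(".")[0]
    run(
        ["server", "create", "--image", image, "--flavor", flavor,
         "--nic", f"port-id={port_id}", hostname],
        project,
    )
    return port_id
```

Passing a fake `run` function makes it possible to check the generated arguments without touching a real cloud.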

Doesn't it also make it impossible to fully replace the instance without ops intervention? I'd prefer we had a special DNS name that stayed the same.

It does; we'd need to detach the port and reattach it to the new instance. But I think we'll need ops intervention either way: if we fully replaced an instance, we'd have to update each project's security groups with the new IP address.

Maybe a new port with a reserved address and DNS name in addition to the local VM's network interface would be better?

Ugh, yeah, you're right re: IP and security groups I think. This is probably the right solution, and it doesn't necessarily make sense to try to fix the problem of special IPs like this for ordinary tenants as this is really specific to cloudinfra-type systems.

I'm not sure I understand the distinction beyond the addition of a particular DNS name?

I was thinking of it more like a dedicated service address and name pair. Something that we could detach from the underlying host without losing full network connectivity. We're probably getting in the weeds here though :)

Yeah, I think what you've got so far sounds like the right thing for now at least.

Change 588803 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: Add metricsinfra prometheus server

https://gerrit.wikimedia.org/r/588803

Krenair updated the task description. Apr 15 2020, 2:05 AM

For the record, we have a Grafana dashboard per project, https://grafana-labs.wikimedia.org/d/000000059/cloud-vps-project-board?orgId=1&var-project=project-proxy&var-server=All&from=now-2d&to=now , which I think is using data from Graphite instead of Prometheus.

Ah, is the plan to use the existing grafana-labs in prod?

Yeah, we can add this Prometheus server as a datasource to Grafana running on cloudmetrics. (Similar setup to the current tools Prometheus server)

Change 588803 merged by Jhedden:
[operations/puppet@production] cloudvps: Add metricsinfra prometheus server

https://gerrit.wikimedia.org/r/588803

Mentioned in SAL (#wikimedia-cloud) [2020-04-15T20:09:08Z] <jeh> update default security group to allow prometheus01.metricsinfra.eqiad.wmflabs TCP 9100 T250206

Mentioned in SAL (#wikimedia-cloud) [2020-04-15T20:10:10Z] <jeh> update default security group to allow prometheus01.metricsinfra.eqiad.wmflabs TCP 9100 T250206

Commands used to update the security groups for tools and cloudinfra:

OS_PROJECT_ID=tools openstack security group rule create default --protocol tcp --dst-port 9100:9100 --remote-ip 172.16.0.229/32
OS_PROJECT_ID=cloudinfra openstack security group rule create default --protocol tcp --dst-port 9100:9100 --remote-ip 172.16.0.229/32
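
If more projects are added later, the same rule has to be repeated once per project. A small sketch that generates those command lines is below; the function name is hypothetical, and it only prints the commands rather than running them.

```python
import shlex


def node_exporter_rule_commands(projects, prometheus_ip, port=9100):
    """Return one 'security group rule create' command line per project."""
    lines = []
    for project in projects:
        argv = [
            "openstack", "security", "group", "rule", "create", "default",
            "--protocol", "tcp",
            "--dst-port", f"{port}:{port}",
            # /32 restricts the rule to the single Prometheus address.
            "--remote-ip", f"{prometheus_ip}/32",
        ]
        lines.append(f"OS_PROJECT_ID={shlex.quote(project)} {shlex.join(argv)}")
    return lines


for line in node_exporter_rule_commands(["tools", "cloudinfra"], "172.16.0.229"):
    print(line)
```

For the two projects and the address used here, this prints the same two commands shown above.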

JHedden updated the task description. Apr 15 2020, 8:14 PM

The new Prometheus server is up and scraping node-exporter metrics from all the VMs in tools and cloudinfra:

jeh@prometheus01:~$ curl -s http://$(hostname -f)/cloud/api/v1/targets | jq '.data.activeTargets | .[] | .labels.instance+" "+.health' | grep -c up
171
jeh@prometheus01:~$ curl -s http://$(hostname -f)/cloud/api/v1/targets | jq '.data.activeTargets | .[] | .labels.instance+" "+.health' | grep -v up
"tools-sgeexec-0912 down"
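
One caveat with the grep approach: `grep -c up` matches "up" anywhere on the line, so an instance whose name happens to contain "up" would be counted as healthy even when it is down. A hedged sketch of a stricter count in Python is below. It reads only the `health` and `labels.instance` fields that the jq filter above uses; the sample payload is made up for illustration, and the two `example-*` instance names are hypothetical.

```python
from collections import Counter


def summarize_targets(payload):
    """Count targets by exact health value and list the ones that are not up."""
    targets = payload["data"]["activeTargets"]
    counts = Counter(target["health"] for target in targets)
    not_up = sorted(
        target["labels"]["instance"]
        for target in targets
        if target["health"] != "up"
    )
    return counts, not_up


# Made-up payload in the shape returned by /api/v1/targets.
sample = {
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"instance": "tools-sgeexec-0912"}, "health": "down"},
        # Hypothetical name containing "up"; grep would count it as up.
        {"labels": {"instance": "example-backup-01"}, "health": "down"},
        {"labels": {"instance": "example-web-01"}, "health": "up"},
    ]},
}

counts, not_up = summarize_targets(sample)
print(counts["up"], not_up)
```

Comparing the `health` field exactly keeps the count independent of what the instances are called.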

JHedden updated the task description. Apr 15 2020, 9:22 PM
JHedden updated the task description.
Krenair updated the task description. Apr 16 2020, 12:33 PM
Krenair updated the task description.

Change 589398 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: update project service discovery prometheus config

https://gerrit.wikimedia.org/r/589398

Change 589398 merged by Jhedden:
[operations/puppet@production] cloudvps: update project service discovery prometheus config

https://gerrit.wikimedia.org/r/589398

Change 589716 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: add prometheus alert rules for project instances

https://gerrit.wikimedia.org/r/589716

JHedden updated the task description. Apr 17 2020, 9:57 PM
JHedden updated the task description. Apr 17 2020, 10:11 PM

Change 589716 merged by Jhedden:
[operations/puppet@production] cloudvps: add prometheus alert rules for project instances

https://gerrit.wikimedia.org/r/589716

Krenair updated the task description. Apr 18 2020, 3:16 PM

Change 589864 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: update prometheus rule annotations

https://gerrit.wikimedia.org/r/589864

Change 589864 merged by Jhedden:
[operations/puppet@production] cloudvps: update prometheus rule annotations

https://gerrit.wikimedia.org/r/589864

Change 591053 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: Add new role for metricsinfra

https://gerrit.wikimedia.org/r/591053

Change 591053 merged by Jhedden:
[operations/puppet@production] cloudvps: Add new role for metricsinfra

https://gerrit.wikimedia.org/r/591053

Change 591202 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: metricsinfra add prometheus alert manager and email notifications

https://gerrit.wikimedia.org/r/591202

Change 591202 merged by Jhedden:
[operations/puppet@production] cloudvps: metricsinfra add prometheus alert manager and email notifications

https://gerrit.wikimedia.org/r/591202

Email-based alert notifications are now enabled for the tools and cloudinfra projects.

Note that the OpenStack service discovery supports monitoring all projects, but unfortunately our current packaged version of Prometheus drops the instance's project name, which breaks multi-tenancy. Once our version of Prometheus includes this upstream patch [0], we have the option to enable monitoring for every project.

[0] https://github.com/prometheus/prometheus/commit/9c5370fdfe7cf51fd5d58151bb745ac10f6c2dac

JHedden updated the task description. Apr 27 2020, 7:11 PM

It looks like things will be noisy if we add the disk space alert rules right now.
https://prometheus.wmflabs.org/cloud/graph?g0.range_input=1h&g0.expr=100%20-%20(node_filesystem_avail_bytes%7Bfstype%3D%22ext4%22%7D%2Fnode_filesystem_size_bytes%20*%20100)%20%3E%3D%2080&g0.tab=1

Going to hold off adding that and see what I can clean up first.
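
For anyone who would rather not decode the graph URL by hand, the PromQL expression can be pulled out of its query string. A minimal sketch using only the Python standard library is below; the function name is hypothetical, and the `g0.expr` parameter name is simply read off the URL above.

```python
from urllib.parse import parse_qs, urlsplit

GRAPH_URL = (
    "https://prometheus.wmflabs.org/cloud/graph?g0.range_input=1h"
    "&g0.expr=100%20-%20(node_filesystem_avail_bytes%7Bfstype%3D%22ext4%22%7D"
    "%2Fnode_filesystem_size_bytes%20*%20100)%20%3E%3D%2080&g0.tab=1"
)


def graph_expr(url, panel="g0"):
    """Return the percent-decoded expression for one panel of a graph URL."""
    query = parse_qs(urlsplit(url).query)
    return query[f"{panel}.expr"][0]


print(graph_expr(GRAPH_URL))
# 100 - (node_filesystem_avail_bytes{fstype="ext4"}/node_filesystem_size_bytes * 100) >= 80
```

In other words, the graph lists every ext4 filesystem that is at least 80% full.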

I poked at tools-sgeexec-0901 just out of curiosity, and the culprit was the apt cache. After running sudo apt clean:

/dev/vda3                                                           19G   12G  6.2G  66% /

That's down from 80%. That's pretty common in Toolforge (seen it before), but I don't have a strong opinion on how to fix it. Just noting that the condition was out there in case that is useful to you.
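
A side note on comparing these numbers: df and the graph expression linked earlier do not compute quite the same percentage. As far as I understand GNU df, its Use% is used / (used + available), rounded up, whereas the expression is 100 - available / size * 100. The sketch below applies both to the rounded figures from the df line above (19G size, 12G used, 6.2G available), so the results are approximate and the function names are hypothetical.

```python
import math


def promql_used_percent(size, avail):
    """Percent used as the graph expression computes it: 100 - avail/size*100."""
    return 100 - (avail / size * 100)


def df_use_percent(used, avail):
    """Percent used as df reports it (assumed): used/(used+avail), rounded up."""
    return math.ceil(used / (used + avail) * 100)


# Rounded values from the df output, in GiB.
size, used, avail = 19, 12, 6.2

print(round(promql_used_percent(size, avail), 1))  # 67.4
print(df_use_percent(used, avail))                 # 66
```

The gap comes from used + available (18.2G) being smaller than the total size (19G), so an alert threshold on the expression will trip slightly before df shows the same figure.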

Added a Grafana dashboard for detailed instance metrics using the metricsinfra prometheus server: https://grafana-labs.wikimedia.org/d/000000590/metricsinfra-cloudvps-instance-details

Change 593042 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: metricsinfra add project label and default alert rules

https://gerrit.wikimedia.org/r/593042

Change 593042 merged by Jhedden:
[operations/puppet@production] cloudvps: metricsinfra add project label and default alert rules

https://gerrit.wikimedia.org/r/593042

Change 593048 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: enable project monitoring for the metricsinfra project

https://gerrit.wikimedia.org/r/593048

Change 593048 merged by Jhedden:
[operations/puppet@production] cloudvps: enable project monitoring for the metricsinfra project

https://gerrit.wikimedia.org/r/593048

Change 593054 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: enable monitoring for projects using shinken

https://gerrit.wikimedia.org/r/593054

Change 593054 merged by Jhedden:
[operations/puppet@production] cloudvps: enable monitoring for projects using shinken

https://gerrit.wikimedia.org/r/593054

Change 593342 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: metricsinfra alert on puppet agent disabled state

https://gerrit.wikimedia.org/r/593342

Change 593342 merged by Jhedden:
[operations/puppet@production] cloudvps: metricsinfra alert on puppet agent disabled state

https://gerrit.wikimedia.org/r/593342

bd808 removed JHedden as the assignee of this task. Sun, May 31, 8:20 PM
bd808 edited projects, added cloud-services-team (Kanban); removed Patch-For-Review.