[prometheus,haproxy] Move to haproxy internal metrics from haproxy_exporter
Open, High, Public

Description

Currently we have an haproxy exporter that we scrape with prometheus to extract haproxy stats, but since version 2.0.0 haproxy ships with its own internal metrics endpoint:

See https://github.com/prometheus/haproxy_exporter for details

We should move to that internal endpoint instead and avoid the extra process.

As the stats are named differently, a proposed action plan is:

  • move all the current users of haproxy_exporter to also collect haproxy stats
    • Thumbor is moving itself soon, so no patches needed there
    • Toolforge elasticsearch uses haproxy<2, so it has no internal stats; we might want to upgrade it first, but that needs config changes, as the current config does not work with 2.1
  • notify teams that the stats changed
  • remove the scraping of all the haproxy_exporter instances and set profile::haproxy_exporter::enable to absent to stop and uninstall them
  • clean up the haproxy_exporter entries once puppet has run everywhere

Note that we would be collecting duplicated stats for some time until we move everyone to the haproxy internal metrics.
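
Since the stats are named differently, one way to prepare the notification to teams would be to diff the metric names exposed by both endpoints while they are scraped side by side. The following is only a minimal sketch: the function names and the sample metric names are illustrative assumptions, not taken from the real exporters, and it only reads sample lines of the Prometheus text format.

```python
import re

# Matches the metric name at the start of a sample line in the
# Prometheus text exposition format.
_SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def metric_names(exposition_text):
    """Return the set of metric names found in a text-format scrape."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match:
            names.add(match.group(1))
    return names


def diff_metric_names(old_text, new_text):
    """Compare two scrapes and report which names exist on each side."""
    old, new = metric_names(old_text), metric_names(new_text)
    return {
        "only_old": sorted(old - new),
        "only_new": sorted(new - old),
        "common": sorted(old & new),
    }


# Illustrative samples only; these are NOT real haproxy metric names.
OLD_SAMPLE = """\
# HELP example_backend_up Illustrative metric from the old exporter
example_backend_up{backend="a"} 1
example_sessions_total{backend="a"} 10
"""

NEW_SAMPLE = """\
# HELP example_backend_status Illustrative metric from the internal endpoint
example_backend_status{proxy="a"} 1
example_sessions_total{proxy="a"} 10
"""

RESULT = diff_metric_names(OLD_SAMPLE, NEW_SAMPLE)
```

The `only_old` list would then be the set of names that dashboards and alerts need to stop using, and `only_new` the candidates to replace them with.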

Event Timeline

dcaro triaged this task as High priority.Aug 9 2023, 12:58 PM
dcaro created this task.

Change 947353 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] haproxy_exporter: allow setting as absent

https://gerrit.wikimedia.org/r/947353

Change 947354 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] prometheus: gather stats from haproxy for openstack and cloudlb

https://gerrit.wikimedia.org/r/947354

dcaro updated the task description. (Show Details)
dcaro added a subscriber: fgiunchedi.

Change 947353 merged by David Caro:

[operations/puppet@production] haproxy_exporter: allow setting as absent

https://gerrit.wikimedia.org/r/947353

@dcaro that last comment should be moved to T343872 :)

yep xd

Change 947354 merged by David Caro:

[operations/puppet@production] prometheus: gather stats from haproxy for openstack and cloudlb

https://gerrit.wikimedia.org/r/947354

We'll also need to allow tcp port 9900 between prometheus and cloudlb/openstack hosts, I just noticed prometheus can't talk to those:

prometheus1005:~$ curl cloudlb1001.eqiad.wmnet:9900/metrics -v
* Uses proxy env variable no_proxy == '.wmnet'
*   Trying 2620:0:861:11f:10:64:151:2:9900...
*   Trying 10.64.151.2:9900...
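
The hanging curl above can also be reproduced with a short timeout, which makes it easier to check several hosts or ports in a row. This is a minimal sketch using only the Python standard library; `can_connect` is a hypothetical helper name, and the 3 second default timeout is an arbitrary assumption.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Timeouts, refused connections and name resolution failures are all
    subclasses of OSError, so they are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, running `can_connect("cloudlb1001.eqiad.wmnet", 9900)` from a prometheus host would tell whether the port is reachable, without having to interrupt the command by hand.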

Change 948083 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] prometheus: fix typo in job name

https://gerrit.wikimedia.org/r/948083

Change 948084 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] cloudlb: allow access to haproxy stats from prometheus

https://gerrit.wikimedia.org/r/948084

Change 948087 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] dns::dotls: expose and gather haproxy internal metrics

https://gerrit.wikimedia.org/r/948087

Change 948092 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] thumbor: expose and fetch metrics from haproxy internal endpoint

https://gerrit.wikimedia.org/r/948092

Change 948092 abandoned by David Caro:

[operations/puppet@production] thumbor: expose and fetch metrics from haproxy internal endpoint

Reason:

superseded by https://gerrit.wikimedia.org/r/c/operations/puppet/+/946951

https://gerrit.wikimedia.org/r/948092

Change 948098 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/alerts@master] openstack: use the haproxy internal stat for alerts

https://gerrit.wikimedia.org/r/948098

Change 948083 merged by David Caro:

[operations/puppet@production] prometheus: fix typo in job name

https://gerrit.wikimedia.org/r/948083

Change 948084 merged by David Caro:

[operations/puppet@production] cloudlb: allow access to haproxy stats from prometheus

https://gerrit.wikimedia.org/r/948084

Change 948104 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] cloudlb: move to wmcs prometheus

https://gerrit.wikimedia.org/r/948104

We'll also need to allow tcp port 9900 between prometheus and cloudlb/openstack hosts, I just noticed prometheus can't talk to those:

I wasn't clear enough on this point: we need to open this port on the production/cloud firewall on cr devices

Change 948098 merged by David Caro:

[operations/alerts@master] openstack: use the haproxy internal stat for alerts

https://gerrit.wikimedia.org/r/948098

FTR I just noticed other metric collections are affected in this case (e.g. bird):

prometheus1005:~$ curl -v cloudlb1001.eqiad.wmnet:9324/metrics
* Uses proxy env variable no_proxy == '.wmnet'
*   Trying 2620:0:861:11f:10:64:151:2:9324...
*   Trying 10.64.151.2:9324...
^C

I am guessing we can allow all ports from the prometheus hosts in the ACLs on the cr devices, so we don't have to allow ports individually; this matches the production ferm configuration, where prometheus hosts can access any port.

FTR I've moved the haproxy jobs to the cloud prometheus instance in https://gerrit.wikimedia.org/r/c/operations/puppet/+/954001 so that solves the specific needs for this task (although we need to fix the cr firewall issue as a part of T336854 anyways).

Change 948104 abandoned by David Caro:

[operations/puppet@production] cloudlb: move to wmcs prometheus

Reason:

https://gerrit.wikimedia.org/r/948104