
Replace Torrus with Prometheus snmp_exporter for PDUs monitoring
Open, High, Public

Description

At the moment we're using Torrus (https://torrus.wikimedia.org) only for PDU aggregates, to report and track power usage. All other SNMP polling, for example for network devices, is handled by librenms instead. I took a look at https://github.com/prometheus/snmp_exporter, which was recently rewritten in Go, and it might suit the "PDU metrics" use case too.

Implementation would look like this:

  • snmp_exporter deployed on the host(s) that will do SNMP polling
  • Configure snmp_exporter with snmp community and a list of interesting OIDs to poll
  • The exporter above exposes a /snmp endpoint over HTTP that will poll a specified "target" when asked
  • Configure Prometheus to call the above endpoint for each PDU to monitor
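As a rough illustration of the last two steps, this is how a per-PDU probe URL could be built. It is only a sketch: the function name, the PDU list and the exporter address are made up for the example; the /snmp path and the "target" parameter are the ones described above.

```python
from urllib.parse import urlencode


def snmp_probe_url(exporter, target):
    """Build the URL that asks snmp_exporter to poll one PDU.

    `exporter` is the host:port where snmp_exporter listens (hypothetical
    value in the example below); `target` is the PDU to poll over SNMP.
    """
    return "http://{}/snmp?{}".format(exporter, urlencode({"target": target}))


# Hypothetical list of PDUs; in practice this would come from configuration.
pdus = ["ps1-a1-codfw.mgmt.codfw.wmnet", "ps1-a2-codfw.mgmt.codfw.wmnet"]
urls = [snmp_probe_url("localhost:9116", pdu) for pdu in pdus]
```

Prometheus would do the equivalent through its own scrape configuration rather than a script; the point here is only that each PDU becomes one HTTP request to the same exporter.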

TODO:

  • Integrate servertech4 MIB too, for newer PDUs (all of ulsfo, part of eqiad as of Jul 2019)
  • Namespace snmp_exporter metrics with e.g. snmp or pdu instead of the bare OID name (e.g. infeed)
  • (TBD how hard/complex it is to do) join infeed IDs with line IDs to have XYZ in metric labels instead of numeric IDs
  • Aggregate said metrics into the Prometheus global instance
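To make the "join infeed IDs with line IDs" item more concrete, here is a minimal sketch of the idea done offline on scraped text. It assumes the metric and label names from the default sentry3 output (infeedLineID, infeedIndex, towerIndex); the helper names and the regex-based parsing are purely illustrative and not how Prometheus would do it (there it would more likely be a label join at query time).

```python
import re

# A few sample lines in the format snmp_exporter returns for sentry3.
SAMPLE = """\
infeedLineID{infeedIndex="1",infeedLineID="A:X",towerIndex="1"} 1
infeedLineID{infeedIndex="2",infeedLineID="A:Y",towerIndex="1"} 1
infeedLineID{infeedIndex="3",infeedLineID="A:Z",towerIndex="1"} 1
infeedPower{infeedIndex="1",towerIndex="1"} 261
infeedPower{infeedIndex="2",towerIndex="1"} 112
infeedPower{infeedIndex="3",towerIndex="1"} 1012
"""

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Return (name, labels, value) tuples; comment lines are skipped."""
    samples = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, dict(LABEL_RE.findall(labels)), float(value)))
    return samples


def index_key(labels):
    """The numeric (tower, infeed) pair shared by all infeed metrics."""
    return (labels["towerIndex"], labels["infeedIndex"])


def join_line_ids(samples, metric):
    """Map each value of `metric` to its line ID (e.g. A:X) instead of numeric IDs."""
    line_ids = {
        index_key(labels): labels["infeedLineID"]
        for name, labels, _ in samples
        if name == "infeedLineID"
    }
    return {
        line_ids[index_key(labels)]: value
        for name, labels, value in samples
        if name == metric
    }


power_by_line = join_line_ids(parse(SAMPLE), "infeedPower")
```

Here power_by_line ends up keyed by A:X, A:Y and A:Z rather than by index pairs.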

Event Timeline

Restricted Application added a subscriber: Aklapper. Oct 18 2016, 4:07 PM
fgiunchedi triaged this task as Normal priority. Oct 18 2016, 4:07 PM

I've tried a sample configuration with OIDs from Sentry3.mib (ftp://ftp.servertech.com/Pub/SNMP/sentry3), e.g. for infeedPower. This configuration:

- name: infeedPower
  oid: 1.3.6.1.4.1.1718.3.2.2.1.12
  indexes:
    - labelname: towerID
      type: Integer32 
    - labelname: infeedID
      type: Integer32
  lookups:    
    - labels: [towerID]
      labelname: towerID
      oid: 1.3.6.1.4.1.1718.3.2.1.1.2
    - labels: [infeedID]
      labelname: infeedID
      oid: 1.3.6.1.4.1.1718.3.2.2.1.2

results in the following metrics when scraping e.g. ps1-a1-codfw

netmon1001:~$ curl -s 'localhost:9116/snmp?target=ps1-a1-codfw.mgmt.codfw.wmnet'  | grep -i infeedPower
# HELP infeedPower 
# TYPE infeedPower untyped
infeedPower{infeedID="1",towerID="A"} 525
infeedPower{infeedID="1",towerID="B"} 514
infeedPower{infeedID="2",towerID="A"} 389
infeedPower{infeedID="2",towerID="B"} 261
infeedPower{infeedID="3",towerID="A"} 676
infeedPower{infeedID="3",towerID="B"} 604

Full metrics from ps1-a2-codfw, using the default config shipped by snmp_exporter for servertech sentry3:

# HELP infeedApparentPower
# TYPE infeedApparentPower gauge
infeedApparentPower{infeedIndex="1",towerIndex="1"} 266
infeedApparentPower{infeedIndex="1",towerIndex="2"} 259
infeedApparentPower{infeedIndex="2",towerIndex="1"} 143
infeedApparentPower{infeedIndex="2",towerIndex="2"} 80
infeedApparentPower{infeedIndex="3",towerIndex="1"} 1050
infeedApparentPower{infeedIndex="3",towerIndex="2"} 672
# HELP infeedCapabilities
# TYPE infeedCapabilities gauge
infeedCapabilities{infeedCapabilities="?",infeedIndex="1",towerIndex="1"} 1
infeedCapabilities{infeedCapabilities="?",infeedIndex="1",towerIndex="2"} 1
infeedCapabilities{infeedCapabilities="?",infeedIndex="2",towerIndex="1"} 1
infeedCapabilities{infeedCapabilities="?",infeedIndex="2",towerIndex="2"} 1
infeedCapabilities{infeedCapabilities="?",infeedIndex="3",towerIndex="1"} 1
infeedCapabilities{infeedCapabilities="?",infeedIndex="3",towerIndex="2"} 1
# HELP infeedCapacity
# TYPE infeedCapacity gauge
infeedCapacity{infeedIndex="1",towerIndex="1"} 30
infeedCapacity{infeedIndex="1",towerIndex="2"} 30
infeedCapacity{infeedIndex="2",towerIndex="1"} 30
infeedCapacity{infeedIndex="2",towerIndex="2"} 30
infeedCapacity{infeedIndex="3",towerIndex="1"} 30
infeedCapacity{infeedIndex="3",towerIndex="2"} 30
# HELP infeedCapacityUsed
# TYPE infeedCapacityUsed gauge
infeedCapacityUsed{infeedIndex="1",towerIndex="1"} 199
infeedCapacityUsed{infeedIndex="1",towerIndex="2"} 138
infeedCapacityUsed{infeedIndex="2",towerIndex="1"} 51
infeedCapacityUsed{infeedIndex="2",towerIndex="2"} 44
infeedCapacityUsed{infeedIndex="3",towerIndex="1"} 187
infeedCapacityUsed{infeedIndex="3",towerIndex="2"} 117
# HELP infeedCrestFactor
# TYPE infeedCrestFactor gauge
infeedCrestFactor{infeedIndex="1",towerIndex="1"} 15
infeedCrestFactor{infeedIndex="1",towerIndex="2"} 15
infeedCrestFactor{infeedIndex="2",towerIndex="1"} 24
infeedCrestFactor{infeedIndex="2",towerIndex="2"} 17
infeedCrestFactor{infeedIndex="3",towerIndex="1"} 14
infeedCrestFactor{infeedIndex="3",towerIndex="2"} 15
# HELP infeedEnergy
# TYPE infeedEnergy gauge
infeedEnergy{infeedIndex="1",towerIndex="1"} 46076
infeedEnergy{infeedIndex="1",towerIndex="2"} 45153
infeedEnergy{infeedIndex="2",towerIndex="1"} 24830
infeedEnergy{infeedIndex="2",towerIndex="2"} 14380
infeedEnergy{infeedIndex="3",towerIndex="1"} 130293
infeedEnergy{infeedIndex="3",towerIndex="2"} 89785
# HELP infeedID
# TYPE infeedID gauge
infeedID{infeedID="AA",infeedIndex="1",towerIndex="1"} 1
infeedID{infeedID="AB",infeedIndex="2",towerIndex="1"} 1
infeedID{infeedID="AC",infeedIndex="3",towerIndex="1"} 1
infeedID{infeedID="BA",infeedIndex="1",towerIndex="2"} 1
infeedID{infeedID="BB",infeedIndex="2",towerIndex="2"} 1
infeedID{infeedID="BC",infeedIndex="3",towerIndex="2"} 1
# HELP infeedLineID
# TYPE infeedLineID gauge
infeedLineID{infeedIndex="1",infeedLineID="A:X",towerIndex="1"} 1
infeedLineID{infeedIndex="1",infeedLineID="B:X",towerIndex="2"} 1
infeedLineID{infeedIndex="2",infeedLineID="A:Y",towerIndex="1"} 1
infeedLineID{infeedIndex="2",infeedLineID="B:Y",towerIndex="2"} 1
infeedLineID{infeedIndex="3",infeedLineID="A:Z",towerIndex="1"} 1
infeedLineID{infeedIndex="3",infeedLineID="B:Z",towerIndex="2"} 1
# HELP infeedLineToLineID
# TYPE infeedLineToLineID gauge
infeedLineToLineID{infeedIndex="1",infeedLineToLineID="A:X-Y",towerIndex="1"} 1
infeedLineToLineID{infeedIndex="1",infeedLineToLineID="B:X-Y",towerIndex="2"} 1
infeedLineToLineID{infeedIndex="2",infeedLineToLineID="A:Y-Z",towerIndex="1"} 1
infeedLineToLineID{infeedIndex="2",infeedLineToLineID="B:Y-Z",towerIndex="2"} 1
infeedLineToLineID{infeedIndex="3",infeedLineToLineID="A:Z-X",towerIndex="1"} 1
infeedLineToLineID{infeedIndex="3",infeedLineToLineID="B:Z-X",towerIndex="2"} 1
# HELP infeedLoadHighThresh
# TYPE infeedLoadHighThresh gauge
infeedLoadHighThresh{infeedIndex="1",towerIndex="1"} 24
infeedLoadHighThresh{infeedIndex="1",towerIndex="2"} 24
infeedLoadHighThresh{infeedIndex="2",towerIndex="1"} 24
infeedLoadHighThresh{infeedIndex="2",towerIndex="2"} 24
infeedLoadHighThresh{infeedIndex="3",towerIndex="1"} 24
infeedLoadHighThresh{infeedIndex="3",towerIndex="2"} 24
# HELP infeedLoadStatus
# TYPE infeedLoadStatus gauge
infeedLoadStatus{infeedIndex="1",towerIndex="1"} 0
infeedLoadStatus{infeedIndex="1",towerIndex="2"} 0
infeedLoadStatus{infeedIndex="2",towerIndex="1"} 0
infeedLoadStatus{infeedIndex="2",towerIndex="2"} 0
infeedLoadStatus{infeedIndex="3",towerIndex="1"} 0
infeedLoadStatus{infeedIndex="3",towerIndex="2"} 0
# HELP infeedLoadValue
# TYPE infeedLoadValue gauge
infeedLoadValue{infeedIndex="1",towerIndex="1"} 598
infeedLoadValue{infeedIndex="1",towerIndex="2"} 414
infeedLoadValue{infeedIndex="2",towerIndex="1"} 152
infeedLoadValue{infeedIndex="2",towerIndex="2"} 136
infeedLoadValue{infeedIndex="3",towerIndex="1"} 562
infeedLoadValue{infeedIndex="3",towerIndex="2"} 352
# HELP infeedName
# TYPE infeedName gauge
infeedName{infeedIndex="1",infeedName="Link_X",towerIndex="2"} 1
infeedName{infeedIndex="1",infeedName="Master_X",towerIndex="1"} 1
infeedName{infeedIndex="2",infeedName="Link_Y",towerIndex="2"} 1
infeedName{infeedIndex="2",infeedName="Master_Y",towerIndex="1"} 1
infeedName{infeedIndex="3",infeedName="Link_Z",towerIndex="2"} 1
infeedName{infeedIndex="3",infeedName="Master_Z",towerIndex="1"} 1
# HELP infeedOutletCount
# TYPE infeedOutletCount gauge
infeedOutletCount{infeedIndex="1",towerIndex="1"} 0
infeedOutletCount{infeedIndex="1",towerIndex="2"} 0
infeedOutletCount{infeedIndex="2",towerIndex="1"} 0
infeedOutletCount{infeedIndex="2",towerIndex="2"} 0
infeedOutletCount{infeedIndex="3",towerIndex="1"} 0
infeedOutletCount{infeedIndex="3",towerIndex="2"} 0
# HELP infeedPhaseCurrent
# TYPE infeedPhaseCurrent gauge
infeedPhaseCurrent{infeedIndex="1",towerIndex="1"} 130
infeedPhaseCurrent{infeedIndex="1",towerIndex="2"} 126
infeedPhaseCurrent{infeedIndex="2",towerIndex="1"} 70
infeedPhaseCurrent{infeedIndex="2",towerIndex="2"} 39
infeedPhaseCurrent{infeedIndex="3",towerIndex="1"} 510
infeedPhaseCurrent{infeedIndex="3",towerIndex="2"} 324
# HELP infeedPhaseID
# TYPE infeedPhaseID gauge
infeedPhaseID{infeedIndex="1",infeedPhaseID="A:X-Y",towerIndex="1"} 1
infeedPhaseID{infeedIndex="1",infeedPhaseID="B:X-Y",towerIndex="2"} 1
infeedPhaseID{infeedIndex="2",infeedPhaseID="A:Y-Z",towerIndex="1"} 1
infeedPhaseID{infeedIndex="2",infeedPhaseID="B:Y-Z",towerIndex="2"} 1
infeedPhaseID{infeedIndex="3",infeedPhaseID="A:Z-X",towerIndex="1"} 1
infeedPhaseID{infeedIndex="3",infeedPhaseID="B:Z-X",towerIndex="2"} 1
# HELP infeedPhaseVoltage
# TYPE infeedPhaseVoltage gauge
infeedPhaseVoltage{infeedIndex="1",towerIndex="1"} 2058
infeedPhaseVoltage{infeedIndex="1",towerIndex="2"} 2062
infeedPhaseVoltage{infeedIndex="2",towerIndex="1"} 2069
infeedPhaseVoltage{infeedIndex="2",towerIndex="2"} 2078
infeedPhaseVoltage{infeedIndex="3",towerIndex="1"} 2061
infeedPhaseVoltage{infeedIndex="3",towerIndex="2"} 2073
# HELP infeedPower
# TYPE infeedPower gauge
infeedPower{infeedIndex="1",towerIndex="1"} 261
infeedPower{infeedIndex="1",towerIndex="2"} 254
infeedPower{infeedIndex="2",towerIndex="1"} 112
infeedPower{infeedIndex="2",towerIndex="2"} 63
infeedPower{infeedIndex="3",towerIndex="1"} 1012
infeedPower{infeedIndex="3",towerIndex="2"} 604
# HELP infeedPowerFactor
# TYPE infeedPowerFactor gauge
infeedPowerFactor{infeedIndex="1",towerIndex="1"} 98
infeedPowerFactor{infeedIndex="1",towerIndex="2"} 98
infeedPowerFactor{infeedIndex="2",towerIndex="1"} 78
infeedPowerFactor{infeedIndex="2",towerIndex="2"} 79
infeedPowerFactor{infeedIndex="3",towerIndex="1"} 96
infeedPowerFactor{infeedIndex="3",towerIndex="2"} 90
# HELP infeedReactance
# TYPE infeedReactance gauge
infeedReactance{infeedIndex="1",towerIndex="1"} 1
infeedReactance{infeedIndex="1",towerIndex="2"} 1
infeedReactance{infeedIndex="2",towerIndex="1"} 1
infeedReactance{infeedIndex="2",towerIndex="2"} 1
infeedReactance{infeedIndex="3",towerIndex="1"} 1
infeedReactance{infeedIndex="3",towerIndex="2"} 1
# HELP infeedStatus
# TYPE infeedStatus gauge
infeedStatus{infeedIndex="1",towerIndex="1"} 1
infeedStatus{infeedIndex="1",towerIndex="2"} 1
infeedStatus{infeedIndex="2",towerIndex="1"} 1
infeedStatus{infeedIndex="2",towerIndex="2"} 1
infeedStatus{infeedIndex="3",towerIndex="1"} 1
infeedStatus{infeedIndex="3",towerIndex="2"} 1
# HELP infeedVoltage
# TYPE infeedVoltage gauge
infeedVoltage{infeedIndex="1",towerIndex="1"} 2058
infeedVoltage{infeedIndex="1",towerIndex="2"} 2062
infeedVoltage{infeedIndex="2",towerIndex="1"} 2069
infeedVoltage{infeedIndex="2",towerIndex="2"} 2078
infeedVoltage{infeedIndex="3",towerIndex="1"} 2061
infeedVoltage{infeedIndex="3",towerIndex="2"} 2073
# HELP snmp_scrape_duration_seconds Total SNMP time scrape took (walk and processing).
# TYPE snmp_scrape_duration_seconds gauge
snmp_scrape_duration_seconds 1.329710242
# HELP snmp_scrape_pdus_returned PDUs returned from walk.
# TYPE snmp_scrape_pdus_returned gauge
snmp_scrape_pdus_returned 134
# HELP snmp_scrape_walk_duration_seconds Time SNMP walk/bulkwalk took.
# TYPE snmp_scrape_walk_duration_seconds gauge
snmp_scrape_walk_duration_seconds 1.328168187
# HELP sysUpTime
# TYPE sysUpTime gauge
sysUpTime 1.760325407e+09

Change 341005 had a related patch set uploaded (by filippo):
[operations/puppet] [WIP] prometheus: add snmp_exporter module and role

https://gerrit.wikimedia.org/r/341005

Change 341533 had a related patch set uploaded (by filippo):
[operations/puppet] facilities: add row and site parameters for pdus

https://gerrit.wikimedia.org/r/341533

Change 341534 had a related patch set uploaded (by filippo):
[operations/puppet] facilities: add codfw PDUs

https://gerrit.wikimedia.org/r/341534

Change 341535 had a related patch set uploaded (by filippo):
[operations/puppet] [WIP] add PDUs jobs to prometheus

https://gerrit.wikimedia.org/r/341535

Change 341533 merged by Filippo Giunchedi:
[operations/puppet] facilities: add row and site parameters for pdus

https://gerrit.wikimedia.org/r/341533

Change 341534 merged by Filippo Giunchedi:
[operations/puppet] facilities: add codfw PDUs

https://gerrit.wikimedia.org/r/341534

Change 342648 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet] Add network::monitor role

https://gerrit.wikimedia.org/r/342648

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board.

Change 341005 merged by Filippo Giunchedi:
[operations/puppet] prometheus: add snmp_exporter module and profile

https://gerrit.wikimedia.org/r/341005

Change 342648 merged by Filippo Giunchedi:
[operations/puppet] Add network::monitor role

https://gerrit.wikimedia.org/r/342648

Change 342862 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet] prometheus: fix file permissions and servertech template

https://gerrit.wikimedia.org/r/342862

Change 342862 merged by Filippo Giunchedi:
[operations/puppet] prometheus: fix file permissions and servertech template

https://gerrit.wikimedia.org/r/342862

Change 341535 merged by Filippo Giunchedi:
[operations/puppet@production] add PDUs jobs to prometheus

https://gerrit.wikimedia.org/r/341535

Change 344953 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet@production] prometheus: fix PDU detection and snmp_exporter config

https://gerrit.wikimedia.org/r/344953

Change 344953 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: fix PDU detection and snmp_exporter config

https://gerrit.wikimedia.org/r/344953

fgiunchedi added a comment. Edited Mar 27 2017, 3:07 PM

Most pieces are in place now, left to do:

  • allow prometheus in codfw to talk to netmon1001
  • report non-accepted values for "capabilities" OIDs to snmp_exporter upstream
  • aggregate and collect metrics globally too
  • grafana dashboards

Change 347622 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet@production] hieradata: allow codfw prometheus to talk to netmon eqiad

https://gerrit.wikimedia.org/r/347622

Change 347622 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: allow codfw prometheus to talk to netmon eqiad

https://gerrit.wikimedia.org/r/347622

fgiunchedi renamed this task from Evaluate prometheus snmp_exporter for Torrus PDUs metrics use case to Replace Torrus with Prometheus snmp_exporter for PDUs monitoring. May 2 2017, 8:30 AM
fgiunchedi claimed this task.

Change 352800 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add aggregated PDU stats

https://gerrit.wikimedia.org/r/352800

Change 352800 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add aggregated PDU stats

https://gerrit.wikimedia.org/r/352800

faidon moved this task from Backlog to In progress on the observability board. Jul 10 2017, 12:38 PM
faidon moved this task from In progress to Up next on the observability board. Jul 24 2017, 3:10 PM
fgiunchedi closed this task as Declined. Jul 27 2017, 8:43 AM

We've ultimately gone with pushing librenms data into graphite in T171167: Evaluate LibreNMS' Graphite backend

mark added a subscriber: mark. Sep 6 2017, 2:48 PM

@fgiunchedi: Could you elaborate why the SNMP exporter to prometheus didn't work for this in the end?

mark raised the priority of this task from Normal to High. Jan 11 2019, 3:13 PM

Right now I can only find a single graph with eqiad/codfw total (aggregated) power usage, but proper per-rack power usage data is still entirely missing. This makes it currently very hard to determine the total amount of power used per rack (across all phases) and to monitor things like phase imbalance.

I see LibreNMS seems to be exporting its PDU data (from SNMP) into Graphite, so these graphs can probably be created in Graphite as well. Please set up graphs in Graphite that present power usage readings in a way that is useful and meaningful for managing data center operations.

mark reopened this task as Open. Jan 11 2019, 3:13 PM
faidon added a subscriber: faidon. Edited Jan 18 2019, 3:02 PM

@fgiunchedi so could you describe in a bit more detail what is needed here and what were the challenges you faced with prometheus-snmp-exporter last time you attempted this?

> @fgiunchedi so could you describe in a bit more detail what is needed here and what were the challenges you faced with prometheus-snmp-exporter last time you attempted this?

For sure: IIRC (and by re-reading T87840 for context too) the main challenge was around retention, hence the choice to go with librenms -> graphite instead. However snmp_exporter at the moment is working as expected in the sense that we do have metrics from it in Prometheus. I'll update the task description with more details on next steps to complete snmp_exporter deployment.

> Right now I can only find a single graph with eqiad/codfw total (aggregated) power usage, but proper per-rack power usage data is still entirely missing. This makes it currently very hard to determine the total amount of power used per rack (across all phases) and to monitor things like phase imbalance.
> I see LibreNMS seems to be exporting its PDU data (from SNMP) into Graphite, so probably these graphs can be created in Graphite as well. Please set up graphs in Graphite that allow power usage readings in a way where they are useful and meaningful for managing data center operations.

Agreed. Given that this task is about snmp_exporter and we'll be using librenms data in graphite now, I've opened a new task specifically for this: T214183: Setup graphs for power usage readings in Grafana. We'll likely need more visualizations than what I've put in the task description.

fgiunchedi updated the task description. Jan 18 2019, 5:19 PM
fgiunchedi updated the task description. Jan 21 2019, 4:46 PM
fgiunchedi moved this task from Doing to Up next on the User-fgiunchedi board. Jan 24 2019, 10:06 AM
fgiunchedi updated the task description. Jul 25 2019, 3:21 PM
fgiunchedi moved this task from Up next to Doing on the User-fgiunchedi board. Jul 26 2019, 8:44 AM

Change 526615 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add sentry4 snmp_exporter config

https://gerrit.wikimedia.org/r/526615

Change 526616 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: prefix pdu metrics

https://gerrit.wikimedia.org/r/526616

Change 526615 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add sentry4 snmp_exporter config

https://gerrit.wikimedia.org/r/526615

Change 526616 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: prefix pdu metrics

https://gerrit.wikimedia.org/r/526616

Change 526619 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add sentry4 PDUs support

https://gerrit.wikimedia.org/r/526619

Change 526625 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: fix pdu_ metrics prefixing

https://gerrit.wikimedia.org/r/526625

Change 526619 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add sentry4 PDUs support

https://gerrit.wikimedia.org/r/526619

Change 526625 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: fix pdu_ metrics prefixing

https://gerrit.wikimedia.org/r/526625

Change 526633 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] facilities: add model to pdu monitoring

https://gerrit.wikimedia.org/r/526633

Change 526634 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: query pdu resources based on model

https://gerrit.wikimedia.org/r/526634

Change 526640 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: generate targets for sentry4 PDUs too

https://gerrit.wikimedia.org/r/526640

fgiunchedi updated the task description. Jul 31 2019, 1:24 PM

Change 526633 merged by Filippo Giunchedi:
[operations/puppet@production] facilities: add model to pdu monitoring

https://gerrit.wikimedia.org/r/526633

Change 526634 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: query pdu resources based on model

https://gerrit.wikimedia.org/r/526634

Change 526640 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: generate targets for sentry4 PDUs too

https://gerrit.wikimedia.org/r/526640

Change 527498 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: skip duplicates when generating pdu configuration

https://gerrit.wikimedia.org/r/527498

Change 527498 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: skip duplicates when generating pdu configuration

https://gerrit.wikimedia.org/r/527498

Change 527548 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: don't snmp-poll st4InputCordNotifications

https://gerrit.wikimedia.org/r/527548

Change 527548 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: don't snmp-poll st4InputCordNotifications

https://gerrit.wikimedia.org/r/527548

Change 528805 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: update snmp_exporter config

https://gerrit.wikimedia.org/r/528805

Change 528805 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: update snmp_exporter config

https://gerrit.wikimedia.org/r/528805

Change 528856 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: bump timeout for pdu jobs

https://gerrit.wikimedia.org/r/528856

Change 528857 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add sentry4 outlet OIDs

https://gerrit.wikimedia.org/r/528857

Change 528856 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: bump timeout for pdu jobs

https://gerrit.wikimedia.org/r/528856

Change 528857 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add sentry4 outlet OIDs

https://gerrit.wikimedia.org/r/528857

fgiunchedi updated the task description. Aug 12 2019, 9:12 AM

Change 529790 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] facilities: introduce monitor_pdu_phase for ulsfo PDUs

https://gerrit.wikimedia.org/r/529790

Change 529791 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: generate targets for single phase PDUs

https://gerrit.wikimedia.org/r/529791

Change 529790 merged by Filippo Giunchedi:
[operations/puppet@production] facilities: introduce monitor_pdu_phase for ulsfo PDUs

https://gerrit.wikimedia.org/r/529790

Change 529791 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: generate targets for single phase PDUs

https://gerrit.wikimedia.org/r/529791

Change 529797 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: let Prometheus on PoPs talk to snmp_exporter

https://gerrit.wikimedia.org/r/529797

Change 529797 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: let Prometheus on PoPs talk to snmp_exporter

https://gerrit.wikimedia.org/r/529797

Change 529800 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: don't poll st4OutletCapabilities

https://gerrit.wikimedia.org/r/529800

Change 529800 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: don't poll st4OutletCapabilities

https://gerrit.wikimedia.org/r/529800

Change 529914 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: fetch active netmon server from hiera

https://gerrit.wikimedia.org/r/529914

Change 529914 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: fetch active netmon server from hiera

https://gerrit.wikimedia.org/r/529914

We're now collecting metrics from all managed PDUs into prometheus, including environmental sensors. The names reflect what's in the snmp mib, modulo the pdu_ prefix we're using to namespace the metrics in Prometheus.

Sentry3:

pdu_envMonContactClosureCount
pdu_envMonID
pdu_envMonName
pdu_envMonStatus
pdu_envMonTempHumidSensorCount
pdu_infeedApparentPower
pdu_infeedCapacityUsed
pdu_infeedCapacity
pdu_infeedCrestFactor
pdu_infeedEnergy
pdu_infeedID
pdu_infeedLineID
pdu_infeedLineToLineID
pdu_infeedLoadHighThresh
pdu_infeedLoadStatus
pdu_infeedLoadValue
pdu_infeedName
pdu_infeedOutletCount
pdu_infeedPhaseCurrent
pdu_infeedPhaseID
pdu_infeedPhaseVoltage
pdu_infeedPowerFactor
pdu_infeedPower
pdu_infeedReactance
pdu_infeedStatus
pdu_infeedVoltage
pdu_outletApparentPower
pdu_outletCapacity
pdu_outletControlAction
pdu_outletControlState
pdu_outletCrestFactor
pdu_outletEnergy
pdu_outletID
pdu_outletLoadHighThresh
pdu_outletLoadLowThresh
pdu_outletLoadStatus
pdu_outletLoadValue
pdu_outletName
pdu_outletPostOnDelay
pdu_outletPowerFactor
pdu_outletPower
pdu_outletStatus
pdu_outletVoltage
pdu_outletWakeupState
pdu_sysUpTime
pdu_tempHumidSensorHumidHighThresh
pdu_tempHumidSensorHumidLowThresh
pdu_tempHumidSensorHumidRecDelta
pdu_tempHumidSensorHumidStatus
pdu_tempHumidSensorHumidValue
pdu_tempHumidSensorID
pdu_tempHumidSensorName
pdu_tempHumidSensorStatus
pdu_tempHumidSensorTempHighThresh
pdu_tempHumidSensorTempLowThresh
pdu_tempHumidSensorTempRecDelta
pdu_tempHumidSensorTempScale
pdu_tempHumidSensorTempStatus
pdu_tempHumidSensorTempValue
pdu_towerActivePower
pdu_towerApparentPower
pdu_towerEnergy
pdu_towerID
pdu_towerInfeedCount
pdu_towerLineFrequency
pdu_towerModelNumber
pdu_towerName
pdu_towerPowerFactor
pdu_towerProductSN
pdu_towerStatus
pdu_towerVACapacityUsed
pdu_towerVACapacity

And Sentry4:

pdu_st4BranchCurrentCapacity
pdu_st4BranchCurrentStatus
pdu_st4BranchCurrentUtilized
pdu_st4BranchCurrent
pdu_st4BranchID
pdu_st4BranchLabel
pdu_st4BranchOcpID
pdu_st4BranchOutletCount
pdu_st4BranchPhaseID
pdu_st4BranchState
pdu_st4BranchStatus
pdu_st4HumidSensorID
pdu_st4HumidSensorName
pdu_st4HumidSensorStatus
pdu_st4HumidSensorValue
pdu_st4InputCordActivePowerStatus
pdu_st4InputCordActivePower
pdu_st4InputCordApparentPowerStatus
pdu_st4InputCordApparentPower
pdu_st4InputCordBranchCount
pdu_st4InputCordCurrentCapacityMax
pdu_st4InputCordCurrentCapacity
pdu_st4InputCordEnergy
pdu_st4InputCordFrequency
pdu_st4InputCordID
pdu_st4InputCordInletType
pdu_st4InputCordLineCount
pdu_st4InputCordName
pdu_st4InputCordNominalVoltageMax
pdu_st4InputCordNominalVoltageMin
pdu_st4InputCordNominalVoltage
pdu_st4InputCordOcpCount
pdu_st4InputCordOutOfBalanceStatus
pdu_st4InputCordOutOfBalance
pdu_st4InputCordOutletCount
pdu_st4InputCordPhaseCount
pdu_st4InputCordPowerCapacity
pdu_st4InputCordPowerFactorStatus
pdu_st4InputCordPowerFactor
pdu_st4InputCordPowerUtilized
pdu_st4InputCordState
pdu_st4InputCordStatus
pdu_st4LineCurrentCapacity
pdu_st4LineCurrentStatus
pdu_st4LineCurrentUtilized
pdu_st4LineCurrent
pdu_st4LineID
pdu_st4LineLabel
pdu_st4LineState
pdu_st4LineStatus
pdu_st4OcpBranchCount
pdu_st4OcpCurrentCapacityMax
pdu_st4OcpCurrentCapacity
pdu_st4OcpID
pdu_st4OcpLabel
pdu_st4OcpOutletCount
pdu_st4OcpStatus
pdu_st4OcpType
pdu_st4OutletBranchID
pdu_st4OutletCurrentCapacity
pdu_st4OutletID
pdu_st4OutletName
pdu_st4OutletOcpID
pdu_st4OutletPhaseID
pdu_st4OutletPowerCapacity
pdu_st4OutletSocketType
pdu_st4OutletState
pdu_st4OutletStatus
pdu_st4PhaseActivePower
pdu_st4PhaseApparentPower
pdu_st4PhaseBranchCount
pdu_st4PhaseCurrentCrestFactor
pdu_st4PhaseCurrent
pdu_st4PhaseEnergy
pdu_st4PhaseID
pdu_st4PhaseLabel
pdu_st4PhaseNominalVoltage
pdu_st4PhaseOutletCount
pdu_st4PhasePowerFactorStatus
pdu_st4PhasePowerFactor
pdu_st4PhaseReactance
pdu_st4PhaseState
pdu_st4PhaseStatus
pdu_st4PhaseVoltageDeviation
pdu_st4PhaseVoltageStatus
pdu_st4PhaseVoltage
pdu_st4TempSensorID
pdu_st4TempSensorName
pdu_st4TempSensorStatus
pdu_st4TempSensorValueMax
pdu_st4TempSensorValueMin
pdu_st4TempSensorValue
pdu_sysUpTime

From a chat with @faidon it emerged that we have at least three main use cases for PDU metrics:

  1. Checking overload / availability of rack infeeds (e.g. for redundant power, if we're using over 50% of available power that means that going non-redundant will trip the breaker)
  2. Power consumption for general site monitoring (per row/rack/site)
  3. Capacity planning (e.g. for footprint expansion or shrinkage as needed) (per row/rack/site)

I'd like to get some input / review on which of the above infeed metrics we should be looking at to get the right numbers out, cc DC-Ops @wiki_willy
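For use case 1, a minimal sketch of what the check could look like. It assumes infeedLoadValue is reported in hundredths of an amp and infeedCapacity in amps; that reading of the units is an assumption inferred from the ps1-a2-codfw sample earlier in this task (load 598 with capacity 30 lines up with infeedCapacityUsed 199, i.e. 19.9%) and should be confirmed against the MIB. The function names and the 0.5 default are for illustration only.

```python
def infeed_utilization(load_centiamps, capacity_amps):
    """Fraction of an infeed's capacity currently in use.

    Assumes load is in hundredths of an amp and capacity in amps
    (inferred from sample data, not verified against the MIB).
    """
    if capacity_amps <= 0:
        raise ValueError("capacity must be positive")
    return (load_centiamps / 100.0) / capacity_amps


def over_redundancy_limit(load_centiamps, capacity_amps, limit=0.5):
    """True if losing the redundant feed could overload this infeed.

    With two redundant feeds, anything above `limit` (50% by default)
    means the surviving feed may end up above its full capacity.
    """
    return infeed_utilization(load_centiamps, capacity_amps) > limit


# Values taken from the ps1-a2-codfw sample (tower 1, infeed 1).
utilization = infeed_utilization(598, 30)
at_risk = over_redundancy_limit(598, 30)
```

In this sample the infeed sits at roughly 20% of capacity, well below the 50% mark.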

RobH added a subscriber: RobH. Aug 22 2019, 5:24 PM

> From a chat with @faidon it emerged that we have at least three main use cases for PDU metrics:
>
>   1. Checking overload / availability of rack infeeds (e.g. for redundant power, if we're using over 50% of available power that means that going non-redundant will trip the breaker)
>   2. Power consumption for general site monitoring (per row/rack/site)
>   3. Capacity planning (e.g. for footprint expansion or shrinkage as needed) (per row/rack/site)
>
> I'd like to get some input / review on which of the above infeed metrics we should be looking at to get the right numbers out, cc DC-Ops @wiki_willy

So we also would love it if the metrics showed the phase loads on the XYZ phases for our 3 phase power. Those three phases need to stay closely balanced to prevent issues like loss of power efficiency and heat buildup, or the overload of one of the 3 phases before the others causing the PDU to improperly be at capacity. Seeing all of this in an easy metric review would be excellent.

So we likely need the following metrics for each PDU tower:

  • input voltage/amps for each tower (to show we're getting proper power delivery from the provider)
  • load/amps/voltage for the overall PDU utilization (to ensure no PDU is going over 50% capacity)
    • load/amps/voltage of the overall Rack (combine ps1+ps2 total power utilization)
    • load/amps/voltage utilization for each phase in a 3 phase PDU (to ensure no phase is over 50% capacity & to keep them in sync)
      • Error reporting if these are ever more than X% out of sync. (We need to investigate what that % should be via best practices, right now we just try to get them as close as possible.)

This will allow us to do the things you outline, being:

  • checking overload/available power overhead in each rack.
  • overall power consumption on rack/site for reporting
  • capacity planning

I'll go over the above list in more detail and pick out the specific line items, but I wanted to output what my overall use of metrics is for PDUs right away.
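The "out of sync" check on the three phases could be sketched roughly as below. This uses one common definition of imbalance (largest deviation from the mean load, as a percentage of the mean); whether that is the right formula, and what the X% threshold should be, is still open as noted above. Function names are hypothetical.

```python
def phase_imbalance_pct(loads):
    """Largest deviation from the mean phase load, as a percentage of the mean.

    `loads` holds one reading per phase (X, Y, Z), all in the same unit.
    """
    if not loads:
        raise ValueError("at least one phase load is required")
    mean = sum(loads) / len(loads)
    if mean == 0:
        # No load on any phase: treat as perfectly balanced.
        return 0.0
    return max(abs(load - mean) for load in loads) / mean * 100.0


def out_of_sync(loads, max_pct):
    """True if the phases are more than `max_pct` percent out of balance.

    `max_pct` is the X% still to be decided; no default is assumed here.
    """
    return phase_imbalance_pct(loads) > max_pct


# Made-up readings for phases X, Y, Z.
example = phase_imbalance_pct([12, 10, 8])
```

With the made-up readings of 12, 10 and 8 the mean is 10 and the worst phase is 2 away from it, so the imbalance works out to about 20%.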

Thanks a lot @RobH for the explanation! Please let me know if I can help with progressing this further

Note that the data is in LibreNMS as well, but with some limitations:

  • 5min granularity
  • Not possible to stack or sum graphs (each power graph is independent)

On the plus side we do have threshold alerting.

fgiunchedi reassigned this task from fgiunchedi to RobH. Tue, Sep 17, 3:55 PM

Following up from IRC with @RobH: what's needed is the list of metrics from above to process, and how to process them (e.g. do they need to be combined, and does that depend on the model?). With that we'll be able to do alerting (e.g. on phase imbalance for three phase; for single phase we'll need to alert differently), capacity planning, and power usage reporting.