
Add PDU redundancy server/router/switch checks in Icinga
Closed, Resolved · Public

Description

T107691 tracks the atlas anchor in ulsfo going offline on August 1st after the ulsfo rack lost its redundant B-side power. Because the anchor is a single-PSU system, it simply went offline.

However, none of our other systems, routers, or switches raised any remotely visible alarm. (All of them lit the orange LEDs on their fronts, but that was it.)

We need to enable monitoring of redundant power for all servers, switches, routers, disk shelves, pfws, etc. All of our infrastructure uses redundant power EXCEPT the following: mr (management routers), atlas anchors, and rack-level mgmt switches (the core mgmt switch at each site is redundant, but the rack-level mgmt switches are not). Everything else should have redundant power, and should have checks to ensure that power is uninterrupted.

These checks will monitor both power supply health (directly) and power feed status (indirectly).

Event Timeline

RobH created this task. · Aug 21 2015, 9:39 PM
RobH raised the priority of this task from to Needs Triage.
RobH updated the task description.
RobH added projects: acl*sre-team, observability.
RobH added subscribers: RobH, faidon.
Restricted Application added subscribers: Matanya, Aklapper. · Aug 21 2015, 9:39 PM

+1, that'd be really useful. Are the PDUs in ulsfo monitorable too? I'm only seeing codfw and eqiad in LibreNMS.

akosiaris triaged this task as High priority. · Aug 27 2015, 10:34 AM
akosiaris added a subscriber: akosiaris.

Raised to high, as this might save us from outages.

Gehel added a subscriber: Gehel. · Jul 19 2016, 5:06 PM

Some additional info as discussed with @RobH:

  • We don't have direct monitoring of the PDUs in ulsfo. We do get email alerts from UnitedLayer, but those alerts do not always work and they are not integrated with our Icinga.
  • We should monitor power from the consumer side. A server with redundant power can tell us when it is losing a power input, and we can extrapolate the state of the PDU from that. Even at sites where the PDUs are under our control, it makes sense to monitor power from the consumer side as well.
Gehel removed a subscriber: Gehel. · Sep 22 2016, 2:01 PM
herron added a subscriber: herron. · Aug 24 2017, 3:11 PM
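
One way the extrapolation described above might look is sketched below in Python. This is only an illustration, not anything deployed: the function name `suspect_feed_loss`, the host-to-rack mapping, the host names, and both thresholds are assumptions made up for the example. The idea is that a single host reporting lost redundancy points at a PSU or cable, while several hosts in the same rack reporting it at once points at the feed.

```python
from collections import defaultdict


def suspect_feed_loss(host_rack, degraded_hosts, min_hosts=2, min_fraction=0.5):
    """Return racks where enough dual-PSU hosts report lost redundancy
    that a feed/PDU problem is more likely than a single bad PSU.

    host_rack:      dict of hostname -> rack name (dual-PSU hosts only)
    degraded_hosts: set of hostnames whose PSU redundancy check is failing
    min_hosts, min_fraction: hypothetical thresholds, tune to taste
    """
    total = defaultdict(int)
    degraded = defaultdict(int)
    for host, rack in host_rack.items():
        total[rack] += 1
        if host in degraded_hosts:
            degraded[rack] += 1
    return sorted(
        rack
        for rack, count in degraded.items()
        if count >= min_hosts and count / total[rack] >= min_fraction
    )


if __name__ == "__main__":
    # Hypothetical inventory: three hosts in rack r1, two in rack r2.
    racks = {"a1": "r1", "a2": "r1", "a3": "r1", "b1": "r2", "b2": "r2"}
    print(suspect_feed_loss(racks, {"a1", "a2", "b1"}))
```

Single-PSU systems (management routers, atlas anchors, rack-level mgmt switches) would need to be left out of the mapping, since they cannot report a loss of redundancy.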

For servers the ipmi sensor check used for monitoring temperature could also be used to monitor additional sensors like power supplies.

~# /usr/local/lib/nagios/plugins/check_ipmi_sensor -T Power_Supply --nosel
Sensor Type(s) Power_Supply Status: OK

To save resources within Icinga, one check could monitor multiple sensor types, at the expense of alert readability.

~# /usr/local/lib/nagios/plugins/check_ipmi_sensor -T Power_Supply -T Temperature --nosel
Sensor Type(s) Power_Supply, Temperature Status: OK | 'Inlet Temp'=23.00;3.00:42.00;-7.00:47.00 'Temp'=60.00
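
Since the combined output mixes a status summary with temperature performance data, a short Python sketch of how that line could be split apart may help with the readability concern. It assumes the standard Nagios plugin convention of `text | perfdata`, with each perfdata item shaped like `'label'=value;warn;crit`. The names `PerfDatum` and `parse_check_output` are invented for this example and are not part of the plugin.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class PerfDatum(NamedTuple):
    label: str
    value: float
    warn: Optional[str]  # warning range as printed, e.g. "3.00:42.00"
    crit: Optional[str]  # critical range as printed, e.g. "-7.00:47.00"


# One perfdata item: quoted label, numeric value, optional warn and crit ranges.
_PERF_RE = re.compile(
    r"'([^']+)'=(-?\d+(?:\.\d+)?)(?:;([^;\s]*))?(?:;([^;\s]*))?"
)


def parse_check_output(line: str) -> Tuple[str, List[PerfDatum]]:
    """Split one plugin output line into (status word, perfdata items)."""
    text, _, perf = line.partition("|")
    match = re.search(r"Status:\s*(\w+)", text)
    status = match.group(1) if match else "Unknown"
    data = [
        PerfDatum(label, float(value), warn or None, crit or None)
        for label, value, warn, crit in _PERF_RE.findall(perf)
    ]
    return status, data


if __name__ == "__main__":
    sample = (
        "Sensor Type(s) Power_Supply, Temperature Status: OK | "
        "'Inlet Temp'=23.00;3.00:42.00;-7.00:47.00 'Temp'=60.00"
    )
    status, data = parse_check_output(sample)
    print(status)
    for item in data:
        print(item)
```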

IMHO it would be worth testing power cable pulls on a few different hardware types to make sure the power supply sensor check works/alerts as expected.

This looks like a nice way forward. For what it's worth, merging the check with the temperature one is probably fine. I don't see any real benefit in splitting them up.

Volans added a subscriber: Volans. · Aug 25 2017, 9:37 AM

I agree. The only drawback I see in bundling them together is that we couldn't use stalking to tell them apart, given that the temperature will change on each check.

Change 376048 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] WIP: Change check_ipmi_temp to check_ipmi_sensor and monitor PSUs

https://gerrit.wikimedia.org/r/376048

RobH added a comment. · Sep 5 2017, 4:28 PM

So is the approach to avoid checking the PDU towers themselves directly? I can see the addition of server checks, but I still think monitoring the PDU tower feeds/phases/fuses directly is ideal.

herron added a comment. · Sep 5 2017, 5:04 PM

Monitoring both makes sense to me, but it sounds like direct PDU monitoring isn't always an option. Monitoring from the server side could also help detect bad/loose cables, failed PSUs, etc. that the PDU might not detect.

https://gerrit.wikimedia.org/r/376048 has yet to be tested against different hardware models. It would be good to pull cables and power supplies to make sure the check command behaves as expected.

faidon moved this task from Inbox to Up next on the observability board. · Sep 6 2017, 1:14 PM

If the power supply IPMI sensor approach seems worthwhile to folks, could we arrange a time to test pulling cables and power supplies? I'm happy to run with this given a test system or two connected to a switched PDU. Otherwise I'll need help from someone who can physically yank cables in the DC.

Check_ipmi_sensor is showing failures on 3 out of 4 of the Dell PowerEdge R620 class systems that UnitedLayer recently reported as having failed PSUs.

herron@cp4007:~$ /usr/local/lib/nagios/plugins/check_ipmi_sensor -T Power_Supply -T Temperature --nosel
Sensor Type(s) Power_Supply, Temperature Status: Critical [PS Redundancy = Critical] | 'Inlet Temp'=20.00;3.00:42.00;-7.00:47.00 'Exhaust Temp'=41.00;8.00:70.00;3.00:75.00 'Temp'=63.00 'Temp'=61.00

herron@cp4008:~$ /usr/local/lib/nagios/plugins/check_ipmi_sensor -T Power_Supply -T Temperature --nosel
Sensor Type(s) Power_Supply, Temperature Status: Critical [PS Redundancy = Critical, Status = Critical] | 'Inlet Temp'=20.00;3.00:42.00;-7.00:47.00 'Exhaust Temp'=39.00;8.00:70.00;3.00:75.00 'Temp'=71.00 'Temp'=72.00

herron@lvs4002:~$ /usr/local/lib/nagios/plugins/check_ipmi_sensor -T Power_Supply -T Temperature --nosel
Sensor Type(s) Power_Supply, Temperature Status: Critical [PS Redundancy = Critical, Status = Critical] | 'Inlet Temp'=19.00;3.00:42.00;-7.00:47.00 'Exhaust Temp'=37.00;8.00:70.00;3.00:75.00 'Temp'=67.00 'Temp'=68.00

cp4010 comes back OK, but it could be that this system was already addressed or had an intermittent issue.

herron@cp4010:~$ /usr/local/lib/nagios/plugins/check_ipmi_sensor -T Power_Supply -T Temperature --nosel
Sensor Type(s) Power_Supply, Temperature Status: OK | 'Inlet Temp'=20.00;3.00:42.00;-7.00:47.00 'Exhaust Temp'=39.00;8.00:70.00;3.00:75.00 'Temp'=68.00 'Temp'=69.00
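
To make it easier to see which sensor tripped in outputs like the ones above, here is a small Python sketch that pulls the bracketed sensor list out of a result line and maps the status word to the conventional Nagios exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names are hypothetical, and the "Warning" label is an assumption, since only OK and Critical appear in this thread.

```python
import re
from typing import List, Tuple

# Status word as printed by the check -> conventional Nagios exit code.
# "Warning" is assumed; only "OK" and "Critical" are seen in the outputs above.
NAGIOS_EXIT = {"OK": 0, "Warning": 1, "Critical": 2}
UNKNOWN = 3


def failing_sensors(line: str) -> List[Tuple[str, str]]:
    """Return (sensor name, state) pairs from the bracketed list, if any.

    A list of pairs is used rather than a dict because sensor names can
    repeat (for example, each PSU may report a sensor called "Status").
    """
    text = line.partition("|")[0]
    match = re.search(r"\[(.*?)\]", text)
    if not match:
        return []
    pairs = []
    for item in match.group(1).split(","):
        name, _, state = item.partition("=")
        pairs.append((name.strip(), state.strip()))
    return pairs


def exit_code(line: str) -> int:
    """Map the 'Status: <word>' summary to a Nagios-style exit code."""
    match = re.search(r"Status:\s*(\w+)", line.partition("|")[0])
    if not match:
        return UNKNOWN
    return NAGIOS_EXIT.get(match.group(1), UNKNOWN)


if __name__ == "__main__":
    sample = (
        "Sensor Type(s) Power_Supply, Temperature Status: Critical "
        "[PS Redundancy = Critical, Status = Critical] | 'Temp'=71.00"
    )
    print(exit_code(sample), failing_sensors(sample))
```

With a helper like this, an alert handler could tell a power supply problem from a temperature problem even though both share one check.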

Check_ipmi_sensor is showing failures on 3 out of 4 of the Dell PowerEdge R620 class systems that UnitedLayer recently reported as having failed PSUs.

Yup, check seems to work fine, so let's add it, I'd say :)

Change 376048 merged by Herron:
[operations/puppet@production] Change check_ipmi_temp to check_ipmi_sensor and monitor Power_Supply

https://gerrit.wikimedia.org/r/376048

herron added a comment. · Oct 2 2017, 2:04 PM

IPMI Power_Supply monitoring is looking good so far. Alerts should start trickling in over the next hour or two.

cp4007 	CRITICAL 	2017-10-02 14:01:37 	0d 0h 0m 29s 	1/3 	Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical]
faidon moved this task from Up next to In progress on the observability board. · Oct 2 2017, 2:08 PM
faidon renamed this task from add pdu redundancy checking to server/router/switch checks in icinga to Add PDU redundancy server/router/switch checks in Icinga. · Oct 2 2017, 3:47 PM
faidon assigned this task to herron.

What's the status and what's left here? @herron?

Server PSU monitoring via Icinga check_ipmi is complete, and each PSU problem has been acknowledged with a subtask to investigate the affected system(s).

It sounds like we receive alert emails for PDUs. Do we also receive alert emails (or traps, etc.) from our other non-server devices like switches, routers, and disk shelves? If so, I think we can consider this complete. We could optionally follow up with a task to aggregate and parse email alerts into monitoring events if desired.

RobH awarded a token. · Oct 16 2017, 6:05 PM
faidon closed this task as Resolved. · Oct 19 2017, 6:20 PM

For switches/routers we have alerts on Juniper's system/chassis alarms, which we know trip when a device loses PDU redundancy or hits any other kind of error. I don't think our disk shelves are connected to the network at all, so I don't see how we'd be able to monitor them. Resolving for now; if there is additional work to be done, feel free to reopen :)