
audit/rebalance power in a5-eqiad
Closed, ResolvedPublic

Description

@ayounsi noticed alerting from https://librenms.wikimedia.org/graphs/to=1582118100/id=9026/type=sensor_current/from=1582031700/

After IRC discussion, that limit is properly set.

Each circuit should only ever carry 80% of its rated load, and we run redundant power, so each circuit in a redundant pair should only be loaded to 40% of its capacity. We have 30 amps available per circuit in eqiad, which gives a hard ceiling of 10 amps per phase. That means we only use up to 8 amps per phase when non-redundant, and 4 amps when redundant. The alert at 4.23 amps is therefore a valid alert threshold.
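The arithmetic above can be sketched as a small helper. This is only an illustration of the reasoning in this description; the function name `phase_budget` and its parameters are made up for the example, and the even split of the circuit across three phases is taken from the description rather than from any PDU documentation.

```python
def phase_budget(circuit_amps: float, phases: int = 3,
                 derate: float = 0.8, redundant: bool = True) -> float:
    """Return the per-phase amperage budget described in the task.

    circuit_amps: amps available per circuit (30 in eqiad, per the task).
    phases:       number of phases the circuit is split across (assumed 3).
    derate:       fraction of the ceiling we allow ourselves to use (80%).
    redundant:    halve the budget when the rack is fed by a redundant pair.
    """
    per_phase_ceiling = circuit_amps / phases   # 30 / 3 = 10 amps
    usable = per_phase_ceiling * derate         # 80% of the ceiling
    return usable / 2 if redundant else usable  # half again when redundant


print(phase_budget(30, redundant=False))  # 8.0
print(phase_budget(30))                   # 4.0
```

With these inputs the redundant budget comes out at 4 amps, just under the 4.23 amp limit that triggered the alert.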

When viewing the overall amperage loads per phase, they are not in balance:

https://librenms.wikimedia.org/device/device=41/
https://librenms.wikimedia.org/device/device=41/tab=health/metric=current/
http://ps1-a5-eqiad.mgmt.eqiad.wmnet/

The ps2 tower is pulling less power than ps1. Unfortunately, our main sites (codfw/eqiad) do NOT have power outlet cable mapping in netbox, so @RobH cannot remotely troubleshoot this effectively.

The end result of the on-site work should be that the current/voltage readings of the phases are very close to each other, i.e. in balance. @RobH recommends the following steps, checking the power output readings with the above links between steps:

  • physically audit all power in a5-eqiad, ensuring that all systems have ALL power cords securely seated and power is being provided to every power supply.
  • balance phases on each tower, as the draw between them should be even. This may involve moving individual sets of power plugs up or down the PDU into different phases/power outlets.
  • software audit the mgmt settings on each device to ensure they are set to pull evenly from both power supplies.

Once all the above steps are done, we should see the power come into balance in a5-eqiad. If it does not, please coordinate with @RobH for further troubleshooting.

Event Timeline

RobH triaged this task as Medium priority. Feb 19 2020, 6:16 PM
RobH created this task.

Please note this imbalance is what is triggering the email alerts: Alert for device ps1-a5-eqiad.mgmt.eqiad.wmnet - Sensor over limit

I disabled alerting for that host as it has been alerting/flapping regularly.

To be turned back on when fixed: https://librenms.wikimedia.org/device/device=41/tab=edit/

I looked into this a little bit. ps1-a5-eqiad has some values set up under Health, in the High column:
https://librenms.wikimedia.org/device/device=41/tab=edit/section=health/
Line, AA:L1, Current 3.86 High 4.23
Line, AA:L2, Current 4.3 High 4.815
Line, AA:L3, Current 4.48 High 5.157
Line, BA:L1, Current 4.21 High 4.215
Line, BA:L2, Current 3.72 High 4.71
Line, BA:L3, Current 4.14 High 5.1
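To compare readings like these across PDUs, the lines can be parsed and the headroom to the configured High value computed. The sketch below is a minimal example that assumes the exact `Line, <tower>:<phase>, Current <x> High <y>` layout pasted in this comment; the names `parse_sensor`, `headroom_pct` and `tower_spread` are hypothetical helpers, not part of LibreNMS.

```python
import re

# Matches the layout pasted above, e.g. "Line, AA:L1, Current 3.86 High 4.23".
SENSOR_LINE = re.compile(
    r"Line, (?P<tower>\w+):(?P<phase>L\d), "
    r"Current (?P<current>[\d.]+) High (?P<high>[\d.]+)"
)


def parse_sensor(line):
    """Return (tower, phase, current, high), or None if the line is malformed."""
    m = SENSOR_LINE.fullmatch(line.strip())
    if m is None:
        return None
    return m["tower"], m["phase"], float(m["current"]), float(m["high"])


def headroom_pct(current, high):
    """Remaining margin before the High limit, as a percentage of the limit."""
    return (high - current) / high * 100


def tower_spread(readings):
    """Difference in amps between the most and least loaded phase."""
    currents = [current for _, _, current, _ in readings]
    return max(currents) - min(currents)


lines = [
    "Line, AA:L1, Current 3.86 High 4.23",
    "Line, AA:L2, Current 4.3 High 4.815",
    "Line, AA:L3, Current 4.48 High 5.157",
]
readings = [r for r in map(parse_sensor, lines) if r is not None]

for tower, phase, current, high in readings:
    print(f"{tower}:{phase} headroom {headroom_pct(current, high):.1f}%")
print(f"AA spread: {tower_spread(readings):.2f} A")
```

Lines that do not match the layout, for example a reading with a stray space in the number, come back as `None` and are skipped instead of being guessed at.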

ps1-a2-eqiad has no values setup
https://librenms.wikimedia.org/device/device=38/tab=edit/section=health/
and ps1-a4-eqiad has high values
https://librenms.wikimedia.org/device/device=40/tab=edit/section=health/
Line, AA:L1, Current 5.6 High 9.9
Line, AA:L2, Current 4.56 High 8.565
Line, AA:L3, Current 5.35 High 9.585
Line, BA:L1, Current 6.02 High 11.055
Line, BA:L2, Current 4.39 High 8.58
Line, BA:L3, Current 5.5 High 10.395

I think we need to focus first on why some PDUs are set up with high values, some with low values and some with none. I think that the values set up on ps1-a5-eqiad are the cause of the alerts we are getting in the first place.

Thanks

@ayounsi - from @Papaul 's comment above, it seems like an issue with the threshold being set too low. If the dotted line on this graph represents when ps1-a5-eqiad starts alerting, it seems to be set way lower compared to its neighboring PDUs:

https://librenms.wikimedia.org/graphs/to=1582118100/id=9026/type=sensor_current/from=1582031700/

Also, I'm attaching a picture @Jclark-ctr took of the rack, which also shows the amps being pulled are pretty equal on both sides.

2020-02-28.jpg (4×3 px, 3 MB)

Thanks,
Willy

The way LibreNMS works is that it will either:

  • get the alerting value from the device if it supports (exposes) it. I don't think that's the case for PDUs
  • guess an alerting value based on the sensor's current value at first setup, which usually means something that becomes incorrect when load is added to the device

We previously fixed that for phase current, by running a SQL query setting the max threshold at 12 for all the devices.

If you think we should be alerted for line current (or any other monitored sensors), I can do the same thing here and mass set a proper value to all the PDUs. Just tell me what it should be.
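As a rough illustration of what such a mass update looks like, here is a self-contained sketch against an in-memory SQLite table. Everything about the schema is an assumption made for the example: the flat `sensors` table, its column names and the `Line` description prefix are stand-ins, not the real LibreNMS database layout, and the value 12 is only reused from the phase current fix mentioned above.

```python
import sqlite3

# Stand-in table for the example only; NOT the real LibreNMS schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensors ("
    " sensor_id INTEGER PRIMARY KEY,"
    " hostname TEXT, sensor_class TEXT, sensor_descr TEXT, sensor_limit REAL)"
)
conn.executemany(
    "INSERT INTO sensors (hostname, sensor_class, sensor_descr, sensor_limit)"
    " VALUES (?, ?, ?, ?)",
    [
        ("ps1-a5-eqiad", "current", "Line, AA:L1", 4.23),
        ("ps1-a5-eqiad", "current", "Line, AA:L2", 4.815),
        ("ps1-a4-eqiad", "current", "Line, AA:L1", 9.9),
        ("ps1-a5-eqiad", "power", "Cord, AA", 2000.0),  # must stay untouched
    ],
)


def set_limit(conn, sensor_class, descr_prefix, limit):
    """Set one limit on every sensor of a class whose description matches."""
    cur = conn.execute(
        "UPDATE sensors SET sensor_limit = ?"
        " WHERE sensor_class = ? AND sensor_descr LIKE ?",
        (limit, sensor_class, descr_prefix + "%"),
    )
    conn.commit()
    return cur.rowcount  # number of sensors that were updated


changed = set_limit(conn, "current", "Line", 12.0)
print(f"updated {changed} sensors")
```

Filtering on both the sensor class and the description keeps the update from touching unrelated sensors, such as the power sensor in the sample rows.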

Hi @ayounsi - since each line pair can go up to 30 amps (for 30 amp PDUs), we should probably set ours to alert at 12 amps. That still includes the 20% buffer, divided by 2 since we're running dual PDUs (30 x 0.8 / 2 = 12). Right now, it looks like each line pair on ps1-a5-eqiad is only configured to alert at around 4 amps, which is triggering the alarm. Can you adjust it to 12 amps?

Thanks,
Willy

Thanks @ayounsi, the ones you pasted should have a threshold set to 3.44 kW (or 3440 watts) for both the master and link cord. Actually, would it be possible to set all the PDUs we have in just eqiad and codfw (not including caching sites) to the same thresholds? There seem to be a lot of discrepancies when spot checking them.

Thanks,
Willy

Mentioned in SAL (#wikimedia-operations) [2020-03-05T14:03:07Z] <XioNoX> set all eqiad/codfw PDUs, cord W thresholds to 3440 - T245655