@ayounsi noticed alerting from https://librenms.wikimedia.org/graphs/to=1582118100/id=9026/type=sensor_current/from=1582031700/
After discussion on IRC, we confirmed that the alert limit is set correctly.
Each circuit should never be loaded beyond 80% of its rating. Because we run redundant power, each circuit in a redundant pair should be loaded to no more than 40% of its rating, so that either circuit can carry the full load if the other fails. We have 30 amps available per circuit in eqiad, which gives a hard ceiling of 10 amps per phase. That means we use up to 8 amps per phase when non-redundant, and 4 amps per phase when redundant. An alert at 4.23 amps is therefore valid.
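The arithmetic above can be sketched as a small helper. This is only an illustration: the function name and parameters are hypothetical, and the three-phase split is an assumption inferred from the 30 amp circuit and 10 amp per-phase figures.

```python
def per_phase_limits(circuit_amps=30.0, phases=3, max_load=0.8, redundant=True):
    """Return (hard_ceiling, usable_limit) in amps per phase.

    phases=3 is an assumption inferred from 30 A per circuit -> 10 A per phase.
    """
    hard_ceiling = circuit_amps / phases   # 30 A / 3 phases = 10 A
    usable_limit = hard_ceiling * max_load  # 80% rule: 8 A
    if redundant:
        # Each side of a redundant pair carries half, so either side
        # can take the full load if the other fails: 4 A.
        usable_limit /= 2
    return hard_ceiling, usable_limit


ceiling, limit = per_phase_limits()
reading = 4.23  # amps, the value from the alert
print(f"ceiling={ceiling:.2f} A, limit={limit:.2f} A, over limit: {reading > limit}")
```

With the defaults, the redundant limit comes out to 4 amps per phase, so a 4.23 amp reading is over the limit.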
The overall amperage loads per phase are not in balance:
https://librenms.wikimedia.org/device/device=41/
https://librenms.wikimedia.org/device/device=41/tab=health/metric=current/
http://ps1-a5-eqiad.mgmt.eqiad.wmnet/
The ps2 tower is pulling less power than ps1. Unfortunately, our main sites (codfw/eqiad) do NOT have power outlet cable mapping in netbox, so @RobH cannot remotely troubleshoot this effectively.
The on-site work should leave the per-phase current/voltage readings very close to balanced. @RobH recommends the following steps, checking the power readings at the links above between steps:
- physically audit all power in a5-eqiad, ensuring that all systems have ALL power cords securely seated and power is being provided to every power supply.
- balance phases on each tower, as the draw between them should be even. This may involve moving individual sets of power plugs up or down the PDU into different phases/power outlets.
- software audit the mgmt settings on each device to ensure they are set to pull evenly from both power supplies.
Once all the above steps are done, we should see the power come into balance in a5-eqiad. If it does not, please coordinate with @RobH for further troubleshooting.
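When checking readings between steps, a quick calculation like the one below may help judge whether the phases are close to balanced. It is a rough sketch only: the function names, the 10% tolerance, and the sample readings are all hypothetical and are not taken from the PDU or from LibreNMS.

```python
def phase_imbalance(readings):
    """Return (spread_amps, relative_spread) for a list of per-phase amp readings."""
    if not readings:
        raise ValueError("need at least one reading")
    spread = max(readings) - min(readings)
    mean = sum(readings) / len(readings)
    relative = spread / mean if mean else 0.0
    return spread, relative


def is_balanced(readings, tolerance=0.10):
    """True if the spread between phases is within `tolerance` of the mean.

    The 10% default is an arbitrary, hypothetical threshold.
    """
    _, relative = phase_imbalance(readings)
    return relative <= tolerance


# Hypothetical per-phase readings in amps, typed in by hand from the PDU page.
sample = [4.2, 2.1, 3.0]
spread, relative = phase_imbalance(sample)
print(f"spread={spread:.2f} A ({relative:.0%} of mean), balanced: {is_balanced(sample)}")
```

The same check can be run for each tower (ps1 and ps2) separately, and on the per-tower totals to compare the two towers against each other.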