We got a few emails from LibreNMS about these two sensors going above their 3440W threshold (set in T247358):
https://librenms.wikimedia.org/graphs/to=1615184700/id=11444/type=sensor_power/from=1615098300/
https://librenms.wikimedia.org/graphs/to=1615184700/id=8980/type=sensor_power/from=1615098300/
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Ottomata | T274795 Decommisison the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster
Resolved | | BTullis | T275767 Add 6 worker nodes to the HDFS Namenode config of the Analytics Hadoop cluster
Resolved | | Cmjohnson | T276239 Try to move some new analytics worker nodes to different racks
Resolved | | Cmjohnson | T280203 decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301)
Resolved | | Jclark-ctr | T276743 ps1-a7-eqiad power over threshold alerts
Event Timeline
@Cmjohnson @elukey - just a heads up, this may throw a wrench into moving one of the an-worker servers to A7. Let me see when something in this rack is next scheduled to be decom'd.
I moved ms-be1060 to a different phase. I think we could add the server to A7 but it cannot be on the same phase as ms-be1060.
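To illustrate the placement constraint above, here is a minimal sketch of picking a phase for a new server while skipping the phase that already carries ms-be1060. Everything in it is an assumption for illustration: the `pick_phase` helper, the phase names, the per-phase wattages, and the per-phase limit are hypothetical and are not real readings from ps1-a7-eqiad.

```python
def pick_phase(phase_load_w, new_server_w, limit_w, excluded=()):
    """Return (phase, projected_load_w) for the least-loaded allowed phase.

    Phases listed in `excluded` are skipped, as are phases where adding the
    new server would push the projected load above `limit_w`.
    Returns None if no phase qualifies.
    """
    best = None
    for phase, load in sorted(phase_load_w.items()):
        if phase in excluded:
            continue  # e.g. the phase that already hosts ms-be1060
        total = load + new_server_w
        if total > limit_w:
            continue  # would exceed the (assumed) per-phase limit
        if best is None or total < best[1]:
            best = (phase, total)
    return best


# Hypothetical numbers only, not measurements from the rack.
loads = {"L1": 1100.0, "L2": 900.0, "L3": 1000.0}
choice = pick_phase(loads, new_server_w=400.0, limit_w=1500.0, excluded={"L2"})
print(choice)
```

With these made-up values, the lightest phase ("L2") is ruled out because it is the excluded one, so the next lightest phase that stays under the limit is chosen instead.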
Got another similar alert, see:
https://librenms.wikimedia.org/graphs/id=8980/type=sensor_power/from=1616136600/to=1616223000
It's barely touching the alerting threshold though.
It keeps alerting, so I disabled alerting for that device until this is fixed.
Once fixed please re-enable it in https://librenms.wikimedia.org/device/43/edit
I'm going to reassign this over to @Jclark-ctr, since he's working on refreshing some mw servers, which will give @Dzahn and the Service-Ops team the ability to decom a few of the older mw servers out of this rack. Thanks, Willy
@wiki_willy, I'd imagine that once we start decoming the mw servers in the rack, the issue will resolve itself. I do not think there is any need to keep this task open. Do you?
Hi @Cmjohnson - it's going to keep alerting until the mw servers are decommissioned, so we might as well leave it open until then. Thanks, Willy
The MW servers are out of the rack. I will make sure to balance power better when new servers are racked in A7.
Re-opening, as I noticed that alerting was still disabled for that device and the power still briefly goes above the threshold.
See https://librenms.wikimedia.org/device/device=43/tab=logs/section=eventlog/ and https://librenms.wikimedia.org/graphs/id=11444/type=sensor_power/from=1659098100/
It has been 2 weeks without any alerts, so I'm closing the ticket. Nothing else will be added to this rack until we can decom some hosts from it.