T107691 tracks the atlas anchor in ULSFO going offline on August 1st after the loss of the redundant B-side power feed in the ULSFO rack. Since it is a single-PSU system, it simply went offline.
However, none of our other systems, routers, or switches raised any kind of remotely visible alarm. (All of them lit the orange LEDs on their front panels, but that was it.)
We need to enable monitoring of redundant power for all servers, switches, routers, disk shelves, pfws, etc. All of our infrastructure uses redundant power EXCEPT the following: mr (management routers), atlas anchors, and mgmt switches (the core mgmt switch per site does have redundant power, but the rack-level mgmt switches do not). Everything else should have redundant power, and should have checks to ensure that power is uninterrupted.
These checks will monitor both power supply health (directly) and power feed status (indirectly).
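As a rough illustration of what such a check could decide, here is a minimal sketch in Python. It is not tied to any existing plugin: the function name `check_redundant_power`, the `"ok"` state string, and the input dictionary are all hypothetical, and how the PSU states are collected (IPMI, SNMP, vendor tooling) is left out entirely. The return codes assume the usual Nagios-style plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
# Hypothetical sketch: classify per-PSU states into a Nagios-style result.
# Collecting the states from the device is out of scope here.

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_redundant_power(psu_states, expected=2):
    """Return (exit_code, message) for a device's power supplies.

    psu_states: dict mapping PSU name to a state string; only "ok"
                counts as healthy (assumed vocabulary).
    expected:   number of PSUs the device should have; use 1 for the
                single-PSU exceptions such as atlas anchors.
    """
    if not psu_states:
        return UNKNOWN, "UNKNOWN: no PSU data returned"

    # Anything not reporting "ok" is treated as a lost supply or feed.
    faulty = sorted(name for name, state in psu_states.items() if state != "ok")

    if len(psu_states) < expected:
        return CRITICAL, "CRITICAL: found %d PSUs, expected %d" % (
            len(psu_states), expected)
    if faulty:
        return CRITICAL, "CRITICAL: redundancy lost, not ok: " + ", ".join(faulty)
    return OK, "OK: all %d PSUs healthy" % len(psu_states)


if __name__ == "__main__":
    code, message = check_redundant_power({"PSU1": "ok", "PSU2": "failed"})
    print(code, message)
```

A failed B-side feed would show up here as one PSU leaving the `"ok"` state, which is how a check on supply health also reports feed status indirectly.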