Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted
Closed, Resolved (Public)

Description

Got this email from LibreNMS and it reports:

Uptime 10 minutes 40 seconds

I think next steps are:

  1. check on the PDU's UI if it really rebooted
  2. check if there are any logs on why
  3. if it is not a false positive, decide whether we should replace it

Event Timeline

ayounsi triaged this task as Medium priority. Feb 13 2020, 4:03 PM
ayounsi created this task.

Still from LibreNMS:

2020-02-13 15:46:52 notice ps1-a8-codfw SENTRY3_5179AF] EVENT: System boot complete notice
2020-02-13 15:46:52 notice ps1-a8-codfw NO MATCH [Sentry3_5179af] EVENT: TCP/IP stack has started notice

2020-02-13 15:47:30 Device status changed to Up from icmp check.
2020-02-13 15:47:30 Device rebooted after 357 days 19 hours 37 minutes 2 seconds -> 43s
2020-02-13 15:37:30 Device status changed to Down from icmp check.

So it doesn't look like a false positive.
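
As a rough cross-check, the outage window implied by the two icmp status changes above can be computed with a short Python sketch. The timestamps are copied from the log; the function name is illustrative only and is not part of LibreNMS.

```python
from datetime import datetime

# Timestamps copied from the LibreNMS event log quoted above.
DOWN_AT = "2020-02-13 15:37:30"
UP_AT = "2020-02-13 15:47:30"


def outage_seconds(down_at: str, up_at: str) -> int:
    """Return the number of seconds between the Down and Up events."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(up_at, fmt) - datetime.strptime(down_at, fmt)
    return int(delta.total_seconds())


if __name__ == "__main__":
    # The management interface was unreachable for 600 seconds (10 minutes).
    print(outage_seconds(DOWN_AT, UP_AT))
```

This only measures how long the device failed the icmp check, which is an upper bound on the actual reboot time because it depends on the polling interval.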

Uptime: 0 days 0 hours 20 minutes 27 seconds

It did indeed reboot.
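
For anyone who wants to script this check, here is a minimal Python sketch that converts uptime strings in the format shown in this task into seconds and flags a reboot when the uptime goes backwards. The helper names are hypothetical, and the string format is assumed from the values pasted here rather than from any LibreNMS specification.

```python
import re

# Seconds per unit, keyed by the singular unit name.
UNIT_SECONDS = {"day": 86400, "hour": 3600, "minute": 60, "second": 1}


def uptime_to_seconds(text: str) -> int:
    """Convert e.g. '0 days 0 hours 20 minutes 27 seconds' to seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s+(day|hour|minute|second)s?", text):
        total += int(value) * UNIT_SECONDS[unit]
    return total


def rebooted(previous: str, current: str) -> bool:
    """A device rebooted if its uptime is lower than at the previous poll."""
    return uptime_to_seconds(current) < uptime_to_seconds(previous)


if __name__ == "__main__":
    before = "357 days 19 hours 37 minutes 2 seconds"
    after = "0 days 0 hours 20 minutes 27 seconds"
    print(rebooted(before, after))
```

Comparing against the previous poll avoids picking an arbitrary "low uptime" threshold, at the cost of needing to keep the last reading around.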

OK, it was running firmware

Sentry Switched CDU Version 7.1b

and is now upgraded to firmware

Sentry Switched CDU Version 7.1d

This caused the PDU interface to reboot a second time, but does NOT affect power outlets.

This does NOT clear this PDU from being suspect. As one of our two network racks, this is likely due for upgrade/refresh ahead of other racks.

Mentioned in SAL (#wikimedia-operations) [2020-02-13T16:32:44Z] <robh> ps1-a8-codfw.mgmt.codfw.wmnet firmware upgraded via T245164

So this came back after my firmware update, and I logged in, then logged out after confirming that the firmware had updated. Then Arzhel pointed out it wasn't showing as online in LibreNMS, and when I tried to log in a second time, it didn't work.

I'm not sure what is up with this PDU. We may want to look at replacing it. The next troubleshooting step could be to do the following:

Please note this work is taking place in a networking rack, and thus needs to have its maint window cleared with @ayounsi!

  1. remove the ps1-ps2 link; we don't want ps2 being affected
  2. remove and reseat the mgmt/controller module which resets it without removing power to the servers.
  3. see if it comes back online

In parallel, we may want to replace this entirely.

@wiki_willy ps1-a8 is not stable enough to stay in production. I notice that there is a clicking noise coming from the PDU, and the readings are not stable; they keep flapping.

Option 1: buy a replacement
Option 2: I have one old PDU in storage that I can use as a replacement

Please advise.

Thanks

@Papaul - if the spare one in storage is the same one, I think we can try replacing it with that first. Thanks, Willy

I opened a request ticket (TICKET NO.1578279) with CY1 to assist me with unplugging the old PDU and plugging in the new one tomorrow, the 19th, at 10:30 Dallas time.

We replaced the PDU; all good.