
Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted
Closed, Resolved · Public

Description

Got this email from LibreNMS and it reports:

Uptime 10 minutes 40 seconds

I think next steps are:

  1. check on the PDU's UI if it really rebooted
  2. check if there are any logs on why
  3. if it is not a false positive, decide whether we should replace it

Event Timeline

ayounsi triaged this task as Medium priority. Thu, Feb 13, 4:03 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper. Thu, Feb 13, 4:03 PM

Still from LibreNMS:

2020-02-13 15:46:52 notice ps1-a8-codfw SENTRY3_5179AF] EVENT: System boot complete notice
2020-02-13 15:46:52 notice ps1-a8-codfw NO MATCH [Sentry3_5179af] EVENT: TCP/IP stack has started notice

2020-02-13 15:47:30 Device status changed to Up from icmp check.
2020-02-13 15:47:30 Device rebooted after 357 days 19 hours 37 minutes 2 seconds -> 43s
2020-02-13 15:37:30 Device status changed to Down from icmp check.

So doesn't look like a false positive.
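
For reference, the check behind that conclusion can be sketched in a few lines of Python. This is only an illustration built from the event lines pasted above; the function names are made up for this example and are not part of LibreNMS.

```python
import re
from datetime import datetime

# Seconds per unit for uptime strings such as the one in the event log.
_UNIT_SECONDS = {"day": 86400, "hour": 3600, "minute": 60, "second": 1}


def uptime_to_seconds(text: str) -> int:
    """Convert e.g. '357 days 19 hours 37 minutes 2 seconds' to seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s*(day|hour|minute|second)s?", text):
        total += int(value) * _UNIT_SECONDS[unit]
    return total


def looks_like_reboot(previous_uptime: str, current_uptime_s: int) -> bool:
    """An uptime counter that went backwards suggests the device restarted."""
    return current_uptime_s < uptime_to_seconds(previous_uptime)


def outage_seconds(down_at: str, up_at: str) -> float:
    """Time between the 'Down' and 'Up' icmp events (hypothetical helper)."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(up_at, fmt) - datetime.strptime(down_at, fmt)
    return delta.total_seconds()


previous = "357 days 19 hours 37 minutes 2 seconds"
print(looks_like_reboot(previous, 43))
print(outage_seconds("2020-02-13 15:37:30", "2020-02-13 15:47:30"))
```

Taken together, an uptime that dropped from roughly 357 days to 43 seconds and a Down/Up pair ten minutes apart point to a real reboot rather than a polling glitch.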

RobH added a comment. Thu, Feb 13, 4:06 PM

Uptime: 0 days 0 hours 20 minutes 27 seconds

It did indeed reboot.

RobH added a comment. Thu, Feb 13, 4:32 PM

OK, it was running firmware

Sentry Switched CDU Version 7.1b

and is now upgraded to firmware

Sentry Switched CDU Version 7.1d

This caused the PDU interface to reboot a second time, but does NOT affect power outlets.

This does NOT clear this PDU from being suspect. As one of our two network racks, this is likely due for upgrade/refresh ahead of other racks.

Mentioned in SAL (#wikimedia-operations) [2020-02-13T16:32:44Z] <robh> ps1-a8-codfw.mgmt.codfw.wmnet firmware upgraded via T245164

RobH added a comment. Edited · Thu, Feb 13, 6:20 PM

So this came back after my firmware update and I logged in, then logged out once I had confirmed the firmware was updated. Then Arzhel pointed out it wasn't showing as online in LibreNMS, so I went to log in a second time, and it didn't work.

I'm not sure what is up with this PDU. We may want to look at replacing it. The next troubleshooting step could be to do the following:

Please note this work is taking place in a networking rack, and thus needs to have its maint window cleared with @ayounsi!

  1. remove the ps1-ps2 link; we don't want ps2 being affected
  2. remove and reseat the mgmt/controller module, which resets it without removing power to the servers
  3. see if it comes back online
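
Since the PDU already came back once and then dropped off again, "comes back online" in step 3 probably should mean several good checks in a row, not just one. A minimal sketch of that idea follows; the `probe` callable is a placeholder for whatever reachability check is actually used (ping, web UI login, LibreNMS status), and the default thresholds are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_until_online(
    probe: Callable[[], bool],
    attempts: int = 10,
    delay_s: float = 30.0,
    stable_needed: int = 3,
) -> bool:
    """Return True once `probe` succeeds `stable_needed` times in a row.

    A single failed probe resets the streak, so a device that flaps
    between up and down is not reported as being back online.
    """
    streak = 0
    for _ in range(attempts):
        streak = streak + 1 if probe() else 0
        if streak >= stable_needed:
            return True
        time.sleep(delay_s)
    return False


# Example with canned results instead of a real network check.
results = iter([False, True, True, True])
print(wait_until_online(lambda: next(results), attempts=4, delay_s=0))
```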

In parallel, we may want to replace this entirely.

Mentioned in SAL (#wikimedia-operations) [2020-02-18T17:00:37Z] <papaul> restting ps1-a8-codfw see T245164

RobH removed a subscriber: RobH. Tue, Feb 18, 6:23 PM
Papaul assigned this task to wiki_willy. Tue, Feb 18, 6:34 PM

@wiki_willy ps1-a8 is not stable enough to stay in production. I noticed a clicking noise coming from the PDU, and the readings are not stable; they keep flapping.

Option 1: buy a replacement
Option 2: I have one old unit in storage that I can replace it with

Please advise.

thanks

@Papaul - if the spare one in storage is the same one, I think we can try replacing it with that first. Thanks, Willy

wiki_willy reassigned this task from wiki_willy to Papaul. Tue, Feb 18, 9:02 PM

I opened a request ticket (TICKET NO.1578279) with CY1 to assist me with unplugging the old PDU and plugging in the new one tomorrow, the 19th, at 10:30 Dallas time.

Papaul closed this task as Resolved. Thu, Feb 20, 3:24 AM

We replaced the PDU; all good.