codfw A1 power outage
Closed, Resolved | Public

Description

Looks like part of codfw rack A1 lost power, which took msw-a1-codfw down.

Event Timeline

ayounsi triaged this task as High priority. Mar 14 2022, 8:58 AM
ayounsi created this task.

Surprisingly both msw1-codfw PSUs are ON:

msw1-codfw> show chassis environment 
Class Item                           Status     Measurement
Power FPC 0 Power Supply 0           OK        
      FPC 0 Power Supply 1           OK

But for example:

cr1-codfw> show system alarms 
6 alarms currently active
Alarm time               Class  Description
2022-03-14 08:03:58 UTC  Major  Host 1 fxp0 : Ethernet Link Down
2022-03-14 08:03:48 UTC  Major  Host 0 fxp0 : Ethernet Link Down
2022-03-14 08:03:43 UTC  Major  PEM 2 Input Failure
2022-03-14 08:03:43 UTC  Major  PEM 2 Not OK
2022-03-14 08:03:43 UTC  Major  PEM 1 Input Failure
2022-03-14 08:03:43 UTC  Major  PEM 1 Not OK

and

asw-a-codfw> show system alarms 
1 alarms currently active
Alarm time               Class  Description
2022-03-14 08:03:46 UTC  Major  FPC 1 PEM 0 is not powered

So it's maybe half a PDU that lost power?
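
For anyone wanting to repeat this comparison, here is a rough sketch that filters the power-related alarms out of the `show system alarms` output pasted above and groups them per device. The alarm text is copied from this task; the regexes, the `power_alarms` helper and the "any description mentioning a PEM counts as a power alarm" rule are assumptions made for illustration, not an existing tool.

```python
import re

# Alarm lines copied from the outputs in this task.
ALARMS = {
    "cr1-codfw": """\
2022-03-14 08:03:58 UTC  Major  Host 1 fxp0 : Ethernet Link Down
2022-03-14 08:03:48 UTC  Major  Host 0 fxp0 : Ethernet Link Down
2022-03-14 08:03:43 UTC  Major  PEM 2 Input Failure
2022-03-14 08:03:43 UTC  Major  PEM 2 Not OK
2022-03-14 08:03:43 UTC  Major  PEM 1 Input Failure
2022-03-14 08:03:43 UTC  Major  PEM 1 Not OK
""",
    "asw-a-codfw": """\
2022-03-14 08:03:46 UTC  Major  FPC 1 PEM 0 is not powered
""",
}

# Assumed line layout: "<date> <time> UTC  <class>  <description>".
ALARM_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC\s+(\w+)\s+(.+)$"
)
# Assumption: a description that names a PEM is a power alarm.
POWER_RE = re.compile(r"\bPEM \d+\b")


def power_alarms(output):
    """Return (timestamp, description) pairs for alarms that mention a PEM."""
    found = []
    for line in output.splitlines():
        match = ALARM_RE.match(line.strip())
        if match and POWER_RE.search(match.group(3)):
            found.append((match.group(1), match.group(3)))
    return found


summary = {host: power_alarms(text) for host, text in ALARMS.items()}
for host, alarms in summary.items():
    timestamps = sorted({ts for ts, _ in alarms})
    print(host, len(alarms), "power alarm(s) at", timestamps)
```

The power alarms on both devices fall within a few seconds of each other, which fits a single upstream feed dropping rather than unrelated PSU faults.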

Ticket no. 2213827 U, open with CY1.

Replaced the PDU with a spare one we had on site.

Change 808048 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new model for new PDU in rack A1

https://gerrit.wikimedia.org/r/808048

Change 808048 merged by Papaul:

[operations/puppet@production] Add new model for new PDU in rack A1

https://gerrit.wikimedia.org/r/808048