ps1-22-ulsfo & ps1-23-ulsfo
Closed, Resolved · Public

Description

This task will track the troubleshooting of ps1-22-ulsfo and ps1-23-ulsfo. These have been in alert status in icinga since the power maintenance by Digital Realty at ulsfo.

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=ps1-22-ulsfo
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=ps1-23-ulsfo

Investigation shows that they are delivering power to both sides, so it isn't an urgent enough issue to troubleshoot on a Friday. Instead, I (@RobH) will go onsite next Monday and troubleshoot.

Product Manual online

The external reset button wipes the config, which seems worse than just reseating the NIC to powercycle the mgmt interface. Either way, Traffic should be made aware of the work in advance.

Summary of work:

  • confirmed in docs that the pro2 will indeed allow hot swap of its network card (the older pro1 will not)
  • scheduled work with @BBlack for Traffic cooperation (no impact expected)
  • unplugged all data/serial/link/temp cables from the network card (which houses the mgmt interface) and unseated it
  • re-seated the NIC, repowering the mgmt interface
  • plugged back in all serial/network/data and tested all connections for both ps1-22-ulsfo and ps1-23-ulsfo

Event Timeline

RobH triaged this task as Normal priority. · Oct 18 2019, 6:14 PM
RobH created this task.
Restricted Application added a project: Operations. · Oct 18 2019, 6:14 PM
Restricted Application added a subscriber: Aklapper.
RobH added a comment. · Oct 21 2019, 3:58 PM
  1. reseat the hot-swap NIC, which should reset the mgmt interface
  2. unplug the ps1, leaving ps2 powered, to reset the NIC
  3. reset with the reset button, which means reconfiguring the entire PDU (non-ideal: these have configs and switched ports, so I'd rather not do this one)

I'd like to try them in the above order, as it should require the least amount of reprogramming. Power loss to one of the two sides (if we have to use option 2) doesn't seem too bad overall, since all but mgmt and serial are dual feed.

RobH added a subscriber: BBlack. · Edited · Oct 21 2019, 3:59 PM

Mainly I'd like @BBlack's buy-in on a date/time for me to do this work, since option 2 requires Traffic approval imo. (It would cause them work if any of the systems fail.)

Edit addition: I am fine to do this work during the day on any day this week.

RobH added a comment. · Oct 21 2019, 6:23 PM

Ok, I'm onsite and going to attempt the following on ps1-22-ulsfo:

  1. unplug all the data/serial/network connections (leave all power in place)
  2. unseat and re-seat the NIC which may powercycle the mgmt interface
  3. test and see if ps1-22-ulsfo is back online
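
For step 3, a quick scripted check along these lines could confirm that the mgmt interface answers again after the reseat. This is only a sketch: the function names, the TCP port (443) and the assumption that the short hostnames resolve from wherever it runs are all hypothetical, not taken from this task; icinga remains the real source of truth.

```python
import socket


def mgmt_reachable(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    The port is an assumption; use whatever the PDU mgmt interface listens on.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failure, connection refused, and timeouts.
        return False


def check_pdus(hosts, probe=mgmt_reachable):
    """Run the probe against each host and return {host: reachable}."""
    return {host: probe(host) for host in hosts}
```

Calling `check_pdus(["ps1-22-ulsfo", "ps1-23-ulsfo"])` after the reseat would return a dict showing which of the two answered. The `probe` argument is there so the check can be swapped for something else (SNMP, ping) without touching the loop.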

Mentioned in SAL (#wikimedia-operations) [2019-10-21T18:24:07Z] <robh> working on ps1-22-ulsfo via T235911 (it may flap but it is already ack'd as down in icinga, but not persistent)

Mentioned in SAL (#wikimedia-operations) [2019-10-21T18:30:07Z] <robh> ps1-22-ulsfo repaired (reseating its NIC rebooted its mgmt interface) Done with it and repeating on ps1-23-ulsfo via T235911

Mentioned in SAL (#wikimedia-operations) [2019-10-21T18:32:40Z] <robh> ps1-23-ulsfo back online, all pdu work in ulsfo is now complete T235911

RobH added a comment. · Oct 21 2019, 6:34 PM

Summary of work:

  • confirmed in docs that the pro2 will indeed allow hot swap of its network card (the older pro1 will not)
  • scheduled work with @BBlack for Traffic cooperation (no impact expected)
  • unplugged all data/serial/link/temp cables from the network card (which houses the mgmt interface) and unseated it
  • re-seated the NIC, repowering the mgmt interface
  • plugged back in all serial/network/data and tested all connections for both ps1-22-ulsfo and ps1-23-ulsfo
RobH closed this task as Resolved. · Oct 21 2019, 6:34 PM
RobH updated the task description.