
(Need By:TBD) rack/setup/install rows C and D new PDUs
Closed, Resolved · Public

Description

This task will track the racking, setup, and OS installation of 16 new PDU sets ordered via T249542 for installation into the racks in rows C and D. These will replace the existing PDUs and require direct coordination and scheduling by the eqiad dc opsen for work in each rack.

Hostname / Racking / Installation Details

Each rack will need to have its actual services listed, and each service then considered for depool/migration of masters/etc to mitigate any issues during the PDU replacement.

Due to the overall complexity of each rack, robh suggests that either @Jclark-ctr or @Cmjohnson handle population/management/scheduling of each rack as a sub-task off of this task. A template is provided below in the 'per host setup checklist'; please copy it into a sub-task for each rack.

Schedule

updated schedule below:

Per host setup checklist

PDU upgrades are very complex. This outline/template of checklist items may require further refinement by the @ops-eqiad folks (robh is populating the initial list and may not have covered all steps, so please review!)

<hostname#1>:

  • - receive in new PDUs on T249542
  • - create a sub-task off of T253694 to list all these steps for each rack
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox with the prefix 'new-' on the name for the initial netbox entry, for example 'new-ps1-c1-eqiad' and 'new-ps2-c1-eqiad'. (Once the old PDUs are removed and have their hostnames set to their asset tags, each new PDU can be updated to remove the 'new-' prefix from its netbox hostname.)
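The temporary naming step above can be sketched as a pair of small helpers. This is only an illustration of the 'new-' convention described in the checklist; the function names are hypothetical and nothing here talks to netbox itself.

```python
# Hypothetical helpers illustrating the temporary 'new-' naming convention.
# They only manipulate strings; they do not call the netbox API.
NEW_PREFIX = "new-"


def temporary_name(hostname: str) -> str:
    """Name for the new PDU while the old PDU still holds the real hostname."""
    if hostname.startswith(NEW_PREFIX):
        return hostname
    return NEW_PREFIX + hostname


def final_name(netbox_name: str) -> str:
    """Name for the new PDU once the old PDU has been renamed to its asset tag."""
    if netbox_name.startswith(NEW_PREFIX):
        return netbox_name[len(NEW_PREFIX):]
    return netbox_name


if __name__ == "__main__":
    for host in ("ps1-c1-eqiad", "ps2-c1-eqiad"):
        temp = temporary_name(host)
        print(f"{host}: initial entry {temp!r}, later renamed to {final_name(temp)!r}")
```

Both helpers leave a name unchanged if the step was already applied, so running them twice on the same entry should be harmless.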

The new PDUs will be mounted with one PDU per cabinet side. Also included are new offset PDU brackets. These brackets should be installed so that they push the PDU further towards the center of the rack (to avoid the horizontal adjustment bar for the vertical rails). Because of this, it is suggested that the 'link' PDU be installed first, leaving the 'primary' PDU for after (as the towers being replaced are often combined PDU towers). This may change at the discretion of the on-sites after review.

  • - list off every server, and its service and service owners on the task for each rack pdu upgrade. This list will need to be reviewed and the PDU work scheduled with the SRE department as a whole. Once the work has been scheduled and cleared, the rest of this checklist can continue.
  • - check the existing PDU and all connected cables. Ensure all are properly seated and all items are receiving power from both A and B sides before continuing. Anything not seated or not receiving dual power will lose power and reboot as this checklist continues.
  • - install new PDU brackets for the link tower in the rack (see above note on orientation of the brackets.)
  • - install link PDU into the cabinet
  • - de-power old/existing B side power, and plug in new B side link PDU
  • - migrate all B side power connections to new link PDU
  • - Note all B side power connections, input into netbox for every single power port used.
  • - When relocating power cables, please try to ensure that the A and B sides use the same port. If server bast1001 plugs into port 5 on tower B, please also have it plug into port 5 on tower A.
  • - audit all B side connections to ensure all devices are receiving full power on the B side connection (any not receiving power will be rebooted when we move the A side connections next.)
  • - BEFORE UNPLUGGING THE A SIDE ORIGINAL TOWER: Log in to the PDU via the HTTPS interface and reset it to factory defaults!
  • - Unmount existing PDU tower and set aside (if possible) to install new PDU brackets into the rack.
  • - Install new PDU tower into the rack, and route power cable for easy cut-over.
  • - de-power old/existing A side power, and plug in new A side primary PDU
  • - migrate all A side power connections to new primary PDU
  • - note mgmt ip for old pdu in netbox, remove old pdu from rack in netbox (using its asset tag name), setting to offline and removing its mgmt dns/ip info in netbox.
  • - run the mgmt dns script in netbox for the new PDU, providing the old PDU's mgmt IP in the script entry.
  • - Note all A side power connections, input into netbox for every single power port used.
  • - audit all A side connections to ensure all devices are receiving full power on the A side connection.
  • - connect serial to new PDU, ensure serial connection is functional
  • - setup network configuration of new PDU via serial
  • - setup remaining pdu configuration via https interface
  • - update puppet repo file: modules/facilities/manifests/init.pp to add the sentry4 line to the PDU entry.
  • - update librenms to reflect new PDU. (It is unclear whether you must delete the old device and add the new one, or whether the entry will update once the new PDU is wholly online; so far this has only been done by removing the old device and adding the new one.)
  • - Update IP address entries in netbox, for now just leave the ip tied to old PDU netbox entry (rob will change this to more detailed entry later)
  • - ensure all errors clear in icinga and netbox after work completes
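The dual-power and matching-port audits in the checklist above can be sketched as a simple check over a recorded port list. This is a minimal sketch, assuming the on-site engineer has noted each device's port on tower A and tower B by hand; the data structure, the function name, and every hostname other than bast1001 are hypothetical, and nothing here queries netbox or the PDU.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record of what was noted in the rack:
# device name -> (port on tower A, port on tower B); None means no feed on that side.
Feeds = Dict[str, Tuple[Optional[int], Optional[int]]]


def audit_power(feeds: Feeds) -> Tuple[List[str], List[str]]:
    """Return (single_fed, mismatched) device lists.

    single_fed: devices missing an A or B feed; these would lose power
                when the other side is de-powered.
    mismatched: devices with both feeds, but on different port numbers.
    """
    single_fed: List[str] = []
    mismatched: List[str] = []
    for device, (a_port, b_port) in sorted(feeds.items()):
        if a_port is None or b_port is None:
            single_fed.append(device)
        elif a_port != b_port:
            mismatched.append(device)
    return single_fed, mismatched


if __name__ == "__main__":
    rack = {
        "bast1001": (5, 5),         # matches the checklist example: port 5 on both towers
        "example1001": (7, None),   # hypothetical host with no B side feed
        "example1002": (9, 12),     # hypothetical host on different ports per tower
    }
    single_fed, mismatched = audit_power(rack)
    print("missing a feed:", single_fed)
    print("ports differ between towers:", mismatched)
```

Anything reported as missing a feed should be fixed before the opposite side is de-powered; port mismatches are only a tidiness issue and can be corrected while cables are being moved.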

Event Timeline

RobH created this task. May 26 2020, 9:41 PM
Restricted Application added a subscriber: Aklapper. May 26 2020, 9:41 PM
RobH updated the task description.
Restricted Application added a project: Operations. May 26 2020, 9:53 PM
Cmjohnson moved this task from Backlog to Racking Tasks on the ops-eqiad board. May 29 2020, 12:24 PM
RobH removed RobH as the assignee of this task. Jul 13 2020, 5:24 PM
RobH added a parent task: Unknown Object (Task).

I failed to unassign this from myself, as it shouldn't be mine any longer. I also neglected to link it into its parent ordering task, which I've now fixed.

This is ready for Chris or John to take over and create a sub-task for each rack. (Since this requires the onsite engineers to coordinate with the sub-teams involved in the services in each rack, there is no reason for me to play go-between.)

RobH removed a subscriber: RobH. Jul 13 2020, 5:27 PM
RobH added a subscriber: ayounsi. Jul 15 2020, 5:42 PM

Monitoring is alerting for ps1-c8-eqiad
https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ps1-c8-eqiad
And https://librenms.wikimedia.org/device/device=77/
I ACKed the alerts with that task.

Possible causes are:

  • Incorrect SNMP community (most likely as librenms is alerting too)
  • Icinga check not compatible with the new PDU model

The checklist in the task description should probably have an item to ensure that monitoring is all green as well.

RobH updated the task description. Jul 15 2020, 5:51 PM
RobH added a subscriber: RobH.

Added:

  • - update puppet repo file: modules/facilities/manifests/init.pp to add the sentry4 line to the PDU entry
  • - ensure all errors clear in icinga after work completes

If the puppet repo isn't updated to add the PDU model, the checks will return errors and icinga will be unable to read the PDU's status properly.

wiki_willy assigned this task to Cmjohnson. Edited Aug 19 2020, 6:07 PM
wiki_willy added a subscriber: wiki_willy.

@Cmjohnson to provide a proposed upgrade schedule for all affected racks on Thursday, to send out to Service Owners. Chris will also create the subtasks for tracking.

RobH removed a subscriber: RobH. Aug 19 2020, 6:10 PM
RobH added a comment. Aug 24 2020, 4:13 PM
This comment was removed by RobH.
wiki_willy updated the task description. Aug 25 2020, 4:21 PM
wiki_willy added a subscriber: RobH.
RobH updated the task description. Aug 27 2020, 10:03 PM
RobH updated the task description. Aug 27 2020, 10:08 PM
RobH updated the task description. Aug 27 2020, 10:14 PM
RobH updated the task description. Sep 11 2020, 9:00 PM
wiki_willy updated the task description. Sep 11 2020, 9:04 PM
wiki_willy updated the task description. Sep 14 2020, 7:14 PM
Cmjohnson closed this task as Resolved. Thu, Oct 1, 6:13 PM

The PDU upgrade has been completed.