ulsfo: setup ulsfo PDUs
Closed, Resolved, Public

Description

This task will track the installation, setup, and transition to our new (WMF owned) PDU towers in our racks in ulsfo.

This work is scheduled for Tuesday, December 18th, at 11:00-12:00 Pacific.

Please note that datacenter vendor policies do not allow customers to pull floor tiles to (de)energize PDU towers. All sub-floor work must be done by DigitalRealty engineers. Due to that, this will be a multi-step process.

  • - install ST B PDUs in each cabinet in advance. Doing so leaves the APC B PDUs unsecured (see photos in T209101#4811939), so only tower B of the Servertechs will be pre-installed. This ensures we don't have two unsecured PDUs banging around in the cabinet (when the doors open, they move due to all the power cables).
  • - schedule DR engineering/tech time to work with @RobH onsite for 1-2 hours and migrate entire racks at that time with the following steps:
      • - DR tech energizes ST B PDUs in each rack (de-energizing APC tower B in each rack)
      • - @RobH moves all B side connections from APC to ST B PDUs in each cabinet
      • - @RobH removes APC B PDUs from cabinets to return to DR
      • - @RobH unsecures APC A PDUs in each cabinet and mounts ST PDU tower A in each cabinet
      • - DR tech energizes ST A PDUs in each rack (de-energizing APC tower A in each rack)
      • - @RobH moves all A side connections from APC to ST A PDUs in each cabinet
      • - @RobH removes APC A PDUs from cabinets to return to DR

Remaining steps can be completed without DR support:

  • - add brackets to ps1 in each cabinet
  • - balance power between towers
  • - label and normalize each server's power ports (so if server X is in port 1 on one tower, ensure it's in the matching port 1 on the redundant tower) and update PDU software labels
  • - update netbox software port assignments to reflect pdu software
  • - add pdus into monitoring
  • - connect ps1-23 to scs

Related Objects

Status: Resolved, Assigned: RobH

Event Timeline

RobH triaged this task as Medium priority. Nov 8 2018, 8:11 PM
RobH created this task.
RobH updated the task description.
RobH updated the task description.

@faidon has suggested we instead do all the tower swaps in a single scheduled work visit where I'm onsite with a DR engineer. I'm fine with that, so I'll schedule the work with them once I confirm the right PDUs are onsite tomorrow.

Please note the captive nuts for this arrived at my place today, and the brackets are on site. I can now start the process of swapping the PDUs over.

I'll go in next week and test-fit tower B in each cabinet.

Mentioned in SAL (#wikimedia-operations) [2018-12-10T20:25:54Z] <robh> messing with ulsfo power for 103.02.23 tower b, shouldnt disrupt anything T209101

Ok, good news, the new PDUs will fit just fine in the racks, as long as we remove our cable managers.

IMG_20181210_124609.jpg (3×2 px, 2 MB)

IMG_20181210_124613.jpg (3×2 px, 2 MB)

As one can see, the deeper 1U cable managers are blocked from opening. So for now I'll just remove the fronts and leave them without front panels. We'll need to replace them with shallower cable managers.

I don't like how we have to leave the APC PDU unanchored in the rack when I install the Servertech, so I'm ONLY going to pre-stage tower B in each of these racks. This way there is only one unsecured PDU per rack (and we won't have two unsecured PDUs banging into one another). Now that I've done one of them and worked out how best to arrange the brackets, installing the rest is easy and fast.

RobH mentioned this in Unknown Object (Task). Dec 10 2018, 9:07 PM

So some bad news: I only ordered enough brackets for half the PDUs. T211632 has been created to order the other half. However, this will NOT block the migration to the new PDUs, I'll just temp affix the ST PDUs in place with zip ties rather than brackets. I can move them to brackets after the fact (when the new brackets arrive) with no downtime.

Put in ticket 00545719 with DR techs:

comment section of ticket:

Digital Realty Support,

Our two cabinets, X and Y, each have redundant APC PDUs currently energized. Those APC PDUs are not Wikimedia property, but belong to Digital Realty and were loaned to Wikimedia until our ServerTech PDU order arrived on site.

Those PDUs have arrived (and two of them are currently stored in Z). The other two are already installed into our racks, with the cables fed to the sub-floor, but have NOT been energized. We need to schedule a time for myself (Rob) to be onsite with a Digital Realty engineer to de-energize the old APC towers and energize the Wikimedia-owned Servertech PDU towers.

I'll need to move systems between towers as we migrate, as we should be able to do this without downtime. I expect this should take about an hour or less, as I'll just need the DR engineer to handle the unplugging/plugging in the PDU towers in the sub-floor, and I can handle the rest quickly. We'll energize the redundant feed tower in each cabinet, migrate to it, and then energize the primary feed in each cabinet.

I am available to do this work on Wednesday and Thursday this week from 10:00-15:00, and then the following Monday-Thursday 10:00-15:00. (We prefer not to do any work like this on a Friday.) We prefer arriving at 11AM, as it avoids rush hour, but can shift to 10AM if needed!

Can you advise which of these time slots would work for you? Please note we don't want anything done without my being on-site during the work! (We'll have to move items over between feeds and we don't want to offline anything.)

RobH mentioned this in Unknown Object (Task). Dec 12 2018, 4:12 AM
RobH added a subscriber: BBlack.

@BBlack: This work is now scheduled for Tuesday, December 18th, at 11:00-12:00 Pacific.

So either you can depool the site that AM, or I'll do so when I wake up (before heading into the datacenter.) I don't expect any issues, but since we're swapping entire PDU towers, it seems safer to do a preemptive depool.

I have the codfw switch maintenance from 8am to 11am (where codfw will be depooled). And a dentist apt at 1pm.
I think it's better to repool codfw before depooling ulsfo. I can take care of it.

Thanks! I'll make sure to sync with Arzhel on Tuesday before starting the power work to ensure codfw is back online.

RobH mentioned this in Unknown Object (Task). Dec 14 2018, 7:08 PM

Change 480568 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Depool ulsfo for PDU work

https://gerrit.wikimedia.org/r/480568

Change 480568 merged by Ayounsi:
[operations/dns@master] Depool ulsfo for PDU work

https://gerrit.wikimedia.org/r/480568

Change 480577 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] adding dns entries for ulsfo pdus

https://gerrit.wikimedia.org/r/480577

I'm keeping this open to track the additional steps of adding the brackets to the ps1 in each cabinet (only ps2 has them presently) and also then auditing and labeling every power connection for remote port administration on the PDUs.

Change 480577 merged by RobH:
[operations/dns@master] adding dns entries for ulsfo pdus

https://gerrit.wikimedia.org/r/480577

Both PDU sets are online via mgmt interfaces and can be remotely administered. I'm getting the same error code on serial, will troubleshoot it remotely. Since mgmt network works, serial working is low priority.

The ports still need an audit (post bracket installation) and setup in the PDU software.

Change 480672 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] normalizing pdu names

https://gerrit.wikimedia.org/r/480672

Change 480672 merged by RobH:
[operations/dns@master] normalizing pdu names

https://gerrit.wikimedia.org/r/480672

Oh, the firmware needs to be updated. We've done this without downtime in the past on other servertechs, but since I'll be onsite to install the other brackets, I'll flash then. (Better safe than sorry and if something went wrong I'd be in the car for an hour before arriving there due to traffic.)

All PDUs in ulsfo are now properly mounted. The temp/humidity leads are plugged in, but will not be run anywhere until AFTER we get rid of the decom systems and install blanking panels.

current installed firmware is Sentry Switched PDU Version 8.0k

newest firmware revision is Sentry Switched PDU Version 8.0n

Mentioned in SAL (#wikimedia-operations) [2019-02-07T21:22:15Z] <robh> updating firmware on ps1-22-ulsfo via T209101

Ok, while updating these, I've noticed that the power feeds in ulsfo are not balanced. Tower A is around 7 amps and tower B is around 2 amps for both racks.
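
As a rough illustration of the imbalance noted above, the sketch below computes how unevenly load is split between two towers and how much current a single tower would carry if the other failed. The function names are hypothetical, the inputs are the approximate readings quoted above, and this is not a Sentry PDU API.

```python
def tower_imbalance(amps_a, amps_b):
    """Fraction of total load by which the two towers differ (0.0 = even split)."""
    total = amps_a + amps_b
    if total == 0:
        return 0.0
    return abs(amps_a - amps_b) / total


def failover_load(amps_a, amps_b):
    """Load one tower would carry if the other tower lost power."""
    return amps_a + amps_b


# Approximate readings from the comment above: tower A ~7 A, tower B ~2 A.
print(round(tower_imbalance(7, 2), 2))
print(failover_load(7, 2))
```

Moving server power supplies between towers until the two readings are close brings the first number toward zero; the second number stays the same and is the figure to compare against a single tower's rating.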

Mentioned in SAL (#wikimedia-operations) [2019-02-07T21:38:53Z] <robh> updating firmware on ps1-23-ulsfo via T209101 ps1-22-ulsfo update completed

Ok, firmware updated and all power balanced.

RobH updated the task description.
RobH updated the task description.
RobH renamed this task from "ulsfo: install new PDUs in racks / phase out APC loaner PDU use" to "ulsfo: setup ulsfo PDUs". Feb 7 2019, 11:14 PM

Change 489113 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Icinga: add ping check for ulsfo PDUs

https://gerrit.wikimedia.org/r/489113

Change 489113 merged by Ayounsi:
[operations/puppet@production] Icinga: add ping check for ulsfo PDUs

https://gerrit.wikimedia.org/r/489113

Ok, so while we are now monitoring these PDUs, I have not yet done the following:

  • - label every single power cable with a unique serial number
  • - audit/update each server and document what port it plugs into on each tower, and what the serial numbers of its power cables are (normalize so a server that uses port 1 on tower A uses port 1 on tower B as well)
  • - update netbox software port assignments to reflect pdu software
  • - connect ps1-23 to scs
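
The port normalization audit in the list above could be sketched roughly as follows. The `(server, tower, port)` tuple layout and the server names are hypothetical, chosen only for illustration; this is not the Netbox or PDU software export format.

```python
from collections import defaultdict


def find_mismatched_ports(connections):
    """Return servers whose tower A and tower B port numbers differ or are missing.

    `connections` is an iterable of (server, tower, port) tuples (hypothetical layout).
    """
    ports = defaultdict(dict)
    for server, tower, port in connections:
        ports[server][tower] = port

    mismatched = {}
    for server in sorted(ports):
        port_a = ports[server].get("A")
        port_b = ports[server].get("B")
        # Flag servers missing a feed as well as servers on different port numbers.
        if port_a is None or port_b is None or port_a != port_b:
            mismatched[server] = (port_a, port_b)
    return mismatched


# Hypothetical sample data, not real ulsfo assignments.
sample = [
    ("server-x", "A", 1), ("server-x", "B", 1),
    ("server-y", "A", 2), ("server-y", "B", 5),
    ("server-z", "A", 3),
]
print(find_mismatched_ports(sample))
```

Servers reported by the function would be the ones to re-cable (or relabel) before updating the PDU software labels and Netbox.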

Mentioned in SAL (#wikimedia-operations) [2019-07-09T23:06:49Z] <robh> updating power ports on T209101 and disabling ports not in used (only turning off one side and awaiting any icinga alerts for 15 minutes before touching other side of power)

RobH updated the task description.

Imported all of the power connections into netbox, and the PDU towers have their ports labeled in the PDU software as well, with groups added for outlet control on network devices.