rack/setup/install ps[12]-oe1[456]-esams
Closed, ResolvedPublic

Description

This task will track the replacement of the PDUs in esams racks OE14, OE15, & OE16. All three racks have been set up for redundant power via work with Iron Mountain by @wiki_willy. Now they are ready for the installation of new PDUs ordered on T230143.

Each old PDU will have to be renamed from its current name to 'old-ps1-oe1......' Typically we rename things to their asset tag, but these PDUs pre-date asset tagging and thus have none. Each new PDU will then be the only item in netbox with the hostname ps1-rackname-site. OE14 and OE15 also need their PS2 entries updated. All setup items for each PDU are individually listed in the checklists below:

Please note ALL connections must be listed off on the 'cable mapping' tab of the gsheet esams elevation and cable mapping.

ps1-oe14-esams:

  • - rename the old ps1-oe14-esams to 'old-ps1-oe14-esams'
  • - once the new pdu is in place, unrack the old one and set its netbox entry to offline.
  • - receive in the new PDU from T230143 and add to netbox with the hostname ps1-oe14-esams
  • - wire up all serial/mgmt/network/power. Power will work without the PDU mgmt being configured.
  • - list off/update every single cable connection in netbox, this includes all serial/mgmt/network and POWER as these are switched PDU units and we'll have to configure remote power. This means that if an item plugs into port 1 on tower A, it should plug into port 1 on tower B. @ayounsi is familiar with this (so if uncertain please ask him.)
  • - setup PDU following directions on https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/ServerTech#Initial_Setup - @RobH can assist with this part once serial is established to the PDU.
  • - update PDU model in puppet per T233129. - @RobH can assist with this part once serial is established to the PDU.

ps2-oe14-esams:

  • - rename the old ps2-oe14-esams to 'old-ps2-oe14-esams'
  • - once the new pdu is in place, unrack the old one and set its netbox entry to offline.
  • - receive in the new PDU from T230143 and add to netbox with the hostname ps2-oe14-esams
  • - connect the link cable from ps1-oe14-esams to ps2-oe14-esams.
  • - list off/update every single cable connection in netbox, this includes all serial/mgmt/network and POWER as these are switched PDU units and we'll have to configure remote power. This means that if an item plugs into port 1 on tower A, it should plug into port 1 on tower B. @ayounsi is familiar with this (so if uncertain please ask him.)

ps1-oe15-esams:

  • - rename the old ps1-oe15-esams to 'old-ps1-oe15-esams'
  • - once the new pdu is in place, unrack the old one and set its netbox entry to offline.
  • - receive in the new PDU from T230143 and add to netbox with the hostname ps1-oe15-esams
  • - wire up all serial/mgmt/network/power. Power will work without the PDU mgmt being configured.
  • - list off/update every single cable connection in netbox, this includes all serial/mgmt/network and POWER as these are switched PDU units and we'll have to configure remote power. This means that if an item plugs into port 1 on tower A, it should plug into port 1 on tower B. @ayounsi is familiar with this (so if uncertain please ask him.)
  • - setup PDU following directions on https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/ServerTech#Initial_Setup - @RobH can assist with this part once serial is established to the PDU.
  • - update PDU model in puppet per T233129. - @RobH can assist with this part once serial is established to the PDU.

ps2-oe15-esams:

  • - OE15 did not have redundant power before, and has no existing ps2-oe15-esams to remove/decommission.
  • - receive in the new PDU from T230143 and add to netbox with the hostname ps2-oe15-esams
  • - connect the link cable from ps1-oe15-esams to ps2-oe15-esams.
  • - list off/update every single cable connection in netbox, this includes all serial/mgmt/network and POWER as these are switched PDU units and we'll have to configure remote power. This means that if an item plugs into port 1 on tower A, it should plug into port 1 on tower B. @ayounsi is familiar with this (so if uncertain please ask him.)

ps1-oe16-esams:

  • - there is no ps1-oe16-esams in netbox, so nothing to rename.
  • - receive in the new PDU from T230143 and add to netbox with the hostname ps1-oe16-esams
  • - wire up all serial/mgmt/network/power. Power will work without the PDU mgmt being configured.
  • - list off/update every single cable connection in netbox, this includes all serial/mgmt/network and POWER as these are switched PDU units and we'll have to configure remote power. This means that if an item plugs into port 1 on tower A, it should plug into port 1 on tower B. @ayounsi is familiar with this (so if uncertain please ask him.)
  • - setup PDU following directions on https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/ServerTech#Initial_Setup - @RobH can assist with this part once serial is established to the PDU.
  • - update PDU model in puppet per T233129. - @RobH can assist with this part once serial is established to the PDU.

ps2-oe16-esams:

  • - rename the old ps2-oe16-esams to 'old-ps2-oe16-esams'
  • - once the new pdu is in place, unrack the old one and set its netbox entry to offline.
  • - receive in the new PDU from T230143 and add to netbox with the hostname ps2-oe16-esams
  • - connect the link cable from ps1-oe16-esams to ps2-oe16-esams.
  • - list off/update every single cable connection in netbox, this includes all serial/mgmt/network and POWER as these are switched PDU units and we'll have to configure remote power. This means that if an item plugs into port 1 on tower A, it should plug into port 1 on tower B. @ayounsi is familiar with this (so if uncertain please ask him.)
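
The naming used in the checklists above can be sketched in Python. This is only an illustration, not tooling from this task: the regular expression and the helper names `old_name` and `expected_pdus` are assumptions made for the example, inferred from hostnames such as ps1-oe14-esams.

```python
import re

# Assumed hostname shape, inferred from names in this task: ps<tower>-<rack>-<site>
PDU_NAME = re.compile(r"^ps(?P<tower>[12])-(?P<rack>[a-z]+\d+)-(?P<site>[a-z]+)$")


def old_name(hostname: str) -> str:
    """Return the placeholder name for a PDU that is being replaced."""
    if not PDU_NAME.match(hostname):
        raise ValueError(f"not a PDU hostname: {hostname!r}")
    return f"old-{hostname}"


def expected_pdus(site: str, racks: list[str]) -> list[str]:
    """List the hostnames the new PDUs should end up with, two per rack."""
    return [f"ps{tower}-{rack}-{site}" for rack in racks for tower in (1, 2)]


print(old_name("ps1-oe14-esams"))
print(expected_pdus("esams", ["oe14", "oe15", "oe16"]))
```

Comparing a list like this against the hostnames actually entered in netbox is one way to catch a copy-paste slip between racks.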

Event Timeline

mark triaged this task as Medium priority.
RobH renamed this task from Procure and install new PDUs to rack/setup/install ps[12]-oe1[456]-esams. Oct 22 2019, 9:34 PM
RobH claimed this task.
RobH updated the task description.
RobH added a parent task: Unknown Object (Task).
RobH added subscribers: wiki_willy, RobH, ayounsi.
RobH updated the task description.

I have remotely set up both ps1-oe15-esams and ps1-oe16-esams with network configuration. They have NOT had their ports labeled, as this task doesn't list what is plugged into each port. At this time, no cable mappings have been copied to the google sheet.

Mentioned in SAL (#wikimedia-operations) [2019-10-25T15:35:35Z] <robh> ps1-oe14-esams ip info set, rebooting (wont affect servers) via T184066

Change 546674 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] adding esams pdus to monitoring

https://gerrit.wikimedia.org/r/546674

Change 546674 merged by RobH:
[operations/puppet@production] adding esams pdus to monitoring

https://gerrit.wikimedia.org/r/546674

RobH reassigned this task from RobH to Papaul. Edited Oct 31 2019, 2:31 PM
RobH added a subscriber: Papaul.

So while the cable mapping tab on this google sheet was imported into netbox, it seems that no one on-site recorded the power cord mappings?

Without this, I cannot finish the setup of the switched ports on these.

@Papaul: Are the power cable mappings written down offline somewhere?

IRC Update:

@ayounsi let me know they ran out of time, and T237009 has been opened to apply labels to cords missing them.

Once all cords have labels, then they can be inputted into the PDU configuration. I've gone ahead and checked off the other boxes (after confirming their completion by logging into netbox and the PDUs directly.)

Other than the setup of the individual ports, this task is complete.

Papaul removed Papaul as the assignee of this task. Nov 5 2019, 4:51 PM

qfx5100-spare1, psu 0 {#20156} to ps2-oe15-esams:17
qfx5100-spare2, psu 0 {#20157} to ps2-oe15-esams:16
qfx5100-spare1, psu 1 {#20159} to ps1-oe15-esams:2
qfx5100-spare2, psu 1 {#20158} to ps1-oe15-esams:3
scs1-oe15-esams:psu1 {#20163} to ps2-oe15-esams:34
scs1-oe15-esams:psu2 {#20164} to ps1-oe15-esams:34
asw2-oe16-esams:psu0 {#20162} to ps2-oe16-esams:26
asw2-oe16-esams:psu1 {#20164} to ps1-oe16-esams:26

scs1-oe16-esams:psu1 {#20163} to ps2-oe16-esams:34
scs1-oe16-esams:psu2 {#20164} to ps1-oe16-esams:34

asw2-oe16-esams:psu0 {#20162} to ps2-oe16-esams:26
asw2-oe16-esams:psu1 {#20165} to ps1-oe16-esams:26

updated in netbox
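
Mapping lines like the ones above can be read mechanically before they are entered anywhere. The following is a rough sketch under stated assumptions: the `PowerLink` tuple, the regular expression, and the helpers `parse_link` and `reused_labels` are hypothetical names for this example, and the number in braces is assumed to be the cable label.

```python
import re
from collections import Counter
from typing import NamedTuple


class PowerLink(NamedTuple):
    device: str
    psu: int
    label: int   # number in braces, assumed to be the cable label
    pdu: str
    outlet: int


# Accepts both spellings seen in the comments:
#   "qfx5100-spare1, psu 0 {#20156} to ps2-oe15-esams:17"
#   "asw2-oe16-esams:psu0 {#20162} to ps2-oe16-esams:26"
LINE = re.compile(
    r"^(?P<device>[\w-]+)\s*[,:]\s*psu\s*(?P<psu>\d+)\s*"
    r"\{#(?P<label>\d+)\}\s+to\s+(?P<pdu>[\w-]+):(?P<outlet>\d+)$"
)


def parse_link(line: str) -> PowerLink:
    """Parse one mapping line into a structured record."""
    m = LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised mapping line: {line!r}")
    return PowerLink(m["device"], int(m["psu"]), int(m["label"]),
                     m["pdu"], int(m["outlet"]))


def reused_labels(links) -> list[int]:
    """Return labels that appear on more than one link, sorted."""
    counts = Counter(link.label for link in links)
    return sorted(label for label, n in counts.items() if n > 1)
```

Run over the first listing in this thread, a check like `reused_labels` would report label 20164, which appears on both an asw2-oe16-esams and an scs1-oe16-esams cord there; the later listing shows 20165 for the asw2-oe16-esams cord.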

qfx5100-spare1, psu 0 {#20156} to ps2-oe15-esams:17
qfx5100-spare2, psu 0 {#20157} to ps2-oe15-esams:16
qfx5100-spare1, psu 1 {#20159} to ps1-oe15-esams:2
qfx5100-spare2, psu 1 {#20158} to ps1-oe15-esams:3
asw2-oe16-esams:psu0 {#20162} to ps2-oe16-esams:26
asw2-oe16-esams:psu1 {#20164} to ps1-oe16-esams:26

All the above are done, but NOT

scs1-oe15-esams:psu1 {#20163} to ps2-oe15-esams:34
scs1-oe15-esams:psu2 {#20164} to ps1-oe15-esams:34

as there is no scs-oe15-esams, not sure what that is. Mark's comment T184066#5694430 covers scs-oe16-esams.

That was a mistake I made on IRC, then corrected here, but Papaul copied it over. I should really only do this in Phabricator from now on, it gets confusing :)

There is indeed no SCS in OE15, I was purely talking about the new one in OE16.

So we have a google sheet with all of the power cords, and they do NOT match up.

Example:

cp3054 is plugged in to tower 1 on port 2 and into tower 2 on port 7.

I don't like that none of these match up; they really should use the same outlet number on each PDU tower (a server plugged into port 2 on both towers, or port 7 on both, not mixed). I'd like to get this fixed/normalized before we import into netbox and then set up per-outlet control. When I discussed with @wiki_willy, these are the planned next steps:

  • resolve this task, as the PDUs are set up.
  • create a new on-site esams task to normalize the PDU outlet usage
  • update the cable mapping google sheet
  • import the updated cable mapping sheet into netbox
  • set up per-outlet switching on the PDU towers

I've flagged this for followup in a few days to ensure this plan is acceptable to everyone, and will then create sub-tasks/next steps.
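
The mismatch described above (same device, different outlet number on each tower) is easy to detect once the sheet is in tabular form. A minimal sketch, assuming the rows have already been reduced to (device, tower, outlet) tuples; the function name `mismatched_outlets` and the second sample host are hypothetical, and only the cp3054 values come from this task.

```python
from collections import defaultdict


def mismatched_outlets(links):
    """Return devices whose outlet number differs between PDU towers.

    links: iterable of (device, tower, outlet) tuples.
    """
    by_device = defaultdict(dict)
    for device, tower, outlet in links:
        by_device[device][tower] = outlet
    return {
        device: towers
        for device, towers in by_device.items()
        if len(set(towers.values())) > 1
    }


sample = [
    ("cp3054", 1, 2),        # example from this task: tower 1, port 2
    ("cp3054", 2, 7),        # and tower 2, port 7
    ("example-host", 1, 5),  # hypothetical device that is already normalized
    ("example-host", 2, 5),
]
print(mismatched_outlets(sample))  # {'cp3054': {1: 2, 2: 7}}
```

The result doubles as a worklist for whoever does the on-site normalization, since it names each device and the outlets it currently occupies.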

Could we import into Netbox now, and then change & document the setup at our convenience? It feels like documenting the existing situation and changing it are orthogonal to each other - any reason to block one on the other?

Separately, switching a bunch of power cords carries a risk, especially if done by smart hands. If I understand this correctly, things are functional right now, just not very normalized/as pretty as we would like. Is there a practical reason for changing this now?

I've created a sub-task for the correction, and indeed we can proceed and import now. However, if we're going to allow remote hands to fix it within the next 60 days, I advise NOT importing it. If we import, we'll have to manually 'unlink'/delete all of the cable connections before re-importing the updated sheet.

So if the fix via T243088 will happen in 60 days or less, I'd say hold off import. If we're going to have it go longer, then import now.

Either works, will do what is approved =]

Agreed that power cord swapping has a risk, and the task T243088 outlines that we want Traffic and/or DC-Ops around when this work is done!

Update: I'm going to clean up and import what we have into netbox as part of the PDU setup task T184066; once imported I'll resolve T184066. Then T243088 will be set to lowest priority for the next time we have a WMF SRE staff on-site (will likely be months out), and they will normalize the power cables to the same outlets on each PDU.

While non-normalized outlet use is an urgent technical issue on 3-phase power, it is simply an organizational thing on single-phase, so it isn't vital to fix quickly.

All of the power ports are documented in netbox and labeled on the PDU towers. The port groups for network hardware have been set up so that an entire group of ports can be rebooted easily.

resolving this task.