
Mon, Sept 14th - PDU Upgrade Racks D5 and D6
Closed, Resolved | Public | Request

Description

ps1-d5-eqiad & ps2-d5-eqiad:

  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - update netbox for new pdus (ps1 and ps2)
  • - check the existing PDU and all connected cables. Ensure all are properly seated and all items are receiving power from both A and B sides before continuing. Anything not seated or not receiving dual power will be rebooted by continuing this checklist.
  • - install new PDU brackets for the link tower in the rack (see above note on orientation of the brackets.)
  • - install link PDU into the cabinet
  • - de-power old/existing B side power, and plug in new B side link PDU
  • - migrate all B side power connections to new link PDU
  • - Note all B side power connections and enter them into netbox for every power port used.
  • - When relocating power cables, please try to ensure that the A and B sides use the same port. If server bast1001 plugs into port 5 on tower B, please also have it plug into port 5 on tower A.
  • - audit all B side connections to ensure all devices are receiving full power on the B side connection (any not receiving power will be rebooted when we move the A side connections next.)
  • - BEFORE UNPLUGGING THE A SIDE ORIGINAL TOWER: Log in to the PDU via the HTTPS interface and reset it to factory defaults!
  • - Unmount existing PDU tower and set aside (if possible) to install new PDU brackets into the rack.
  • - Install new PDU tower into the rack, and route power cable for easy cut-over.
  • - de-power old/existing A side power, and plug in new A side link PDU
  • - migrate all A side power connections to new link PDU
  • - audit all A side connections to ensure all devices are receiving full power on the A side connection.
  • - connect serial to new PDU, ensure serial connection is functional
  • - setup network configuration of new PDU via serial
  • - setup remaining pdu configuration via https interface
  • - update puppet repo file: modules/facilities/manifests/init.pp to add the sentry4 line to the PDU entry.
  • - update librenms to reflect new PDU. (Unclear whether you must delete the old device and add the new one, or whether the entry will update once the new PDU is fully online. So far this has only been done by removing the old device and adding the new one.)
  • - Update IP address entries in netbox. For now, leave the IP tied to the old PDU netbox entry (rob will change this to a more detailed entry later).
  • - ensure all errors clear in icinga and netbox after work completes
  • - attach temp/humidity leads
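
The port-matching guideline in the checklist above (same port number on towers A and B) can be checked mechanically once the connections have been noted. This is only a minimal sketch: it assumes the connections were written down as plain hostname-to-port dictionaries, and the function name `find_port_mismatches` and the port numbers are hypothetical, not taken from netbox or from this task.

```python
from typing import Dict, List, Optional, Tuple


def find_port_mismatches(
    side_a: Dict[str, int], side_b: Dict[str, int]
) -> List[Tuple[str, Optional[int], Optional[int]]]:
    """Return (host, port_a, port_b) for hosts whose A and B ports differ.

    A host missing from one side is reported with None for that side.
    """
    mismatches = []
    for host in sorted(set(side_a) | set(side_b)):
        port_a = side_a.get(host)
        port_b = side_b.get(host)
        if port_a != port_b:
            mismatches.append((host, port_a, port_b))
    return mismatches


# Illustrative values only; the real port assignments are not in this task.
tower_a = {"bast1001": 5, "db1137": 7}
tower_b = {"bast1001": 5, "db1137": 8}
print(find_port_mismatches(tower_a, tower_b))
```

An empty result means every recorded device uses the same port on both towers.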

ps1-d6-eqiad & ps2-d6-eqiad:

  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - update netbox for new pdus (ps1 and ps2)
  • - check the existing PDU and all connected cables. Ensure all are properly seated and all items are receiving power from both A and B sides before continuing. Anything not seated or not receiving dual power will be rebooted by continuing this checklist.
  • - install new PDU brackets for the link tower in the rack (see above note on orientation of the brackets.)
  • - install link PDU into the cabinet
  • - de-power old/existing B side power, and plug in new B side link PDU
  • - migrate all B side power connections to new link PDU
  • - Note all B side power connections and enter them into netbox for every power port used.
  • - When relocating power cables, please try to ensure that the A and B sides use the same port. If server bast1001 plugs into port 5 on tower B, please also have it plug into port 5 on tower A.
  • - audit all B side connections to ensure all devices are receiving full power on the B side connection (any not receiving power will be rebooted when we move the A side connections next.)
  • - BEFORE UNPLUGGING THE A SIDE ORIGINAL TOWER: Log in to the PDU via the HTTPS interface and reset it to factory defaults!
  • - Unmount existing PDU tower and set aside (if possible) to install new PDU brackets into the rack.
  • - Install new PDU tower into the rack, and route power cable for easy cut-over.
  • - de-power old/existing A side power, and plug in new A side link PDU
  • - migrate all A side power connections to new link PDU
  • - audit all A side connections to ensure all devices are receiving full power on the A side connection.
  • - connect serial to new PDU, ensure serial connection is functional
  • - setup network configuration of new PDU via serial
  • - setup remaining pdu configuration via https interface
  • - update puppet repo file: modules/facilities/manifests/init.pp to add the sentry4 line to the PDU entry.
  • - update librenms to reflect new PDU. (Unclear whether you must delete the old device and add the new one, or whether the entry will update once the new PDU is fully online. So far this has only been done by removing the old device and adding the new one.)
  • - Update IP address entries in netbox. For now, leave the IP tied to the old PDU netbox entry (rob will change this to a more detailed entry later).
  • - ensure all errors clear in icinga and netbox after work completes
  • - attach temp/humidity leads
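
The audit steps in the checklist above come down to one rule: before a side is de-powered, every device must already be drawing power from the other side. The following is a rough sketch of that rule only. It assumes the per-device feed status has been collected by hand or from monitoring into a dictionary; the `FeedStatus` type, the `devices_at_risk` helper, and the sample statuses are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FeedStatus:
    a_powered: bool  # device currently receives power on the A side
    b_powered: bool  # device currently receives power on the B side


def devices_at_risk(status: Dict[str, FeedStatus], side_to_cut: str) -> List[str]:
    """List devices that would lose all power if `side_to_cut` is de-powered."""
    if side_to_cut not in ("A", "B"):
        raise ValueError("side_to_cut must be 'A' or 'B'")
    at_risk = []
    for host, feeds in sorted(status.items()):
        # A device survives the cut only if the other side is live.
        surviving = feeds.b_powered if side_to_cut == "A" else feeds.a_powered
        if not surviving:
            at_risk.append(host)
    return at_risk


# Illustrative statuses only; they do not describe the real state of rack D6.
status = {
    "db1122": FeedStatus(a_powered=True, b_powered=True),
    "mw1366": FeedStatus(a_powered=True, b_powered=False),
}
print(devices_at_risk(status, "A"))
```

Any hostname returned for a given side should be reseated or fixed before that side is cut over.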

Event Timeline

Hostnames in racks D5 and D6:

an-worker1113 D5
an-worker1114 D5
asw2-d5-eqiad D5
cablemgmt-wmf5288 D5
cloudcephosd1010 D5
cloudcephosd1011 D5
cloudcephosd1012 D5
cloudcephosd1013 D5
cloudcephosd1014 D5
cloudcephosd1015 D5
cloudsw1-d5-eqiad D5
cloudvirt1036 D5
cloudvirt1037 D5
cloudvirt1038 D5
cloudvirt1039 D5
db1137 D5
druid1008 D5
elastic1065 D5
ganeti1020 D5
msw-d5-eqiad D5
ps1-d5-eqiad D5
ps2-d5-eqiad D5
restbase1026 D5
scandium D5
thumbor1003 D5
thumbor1004 D5
an-conf1003 D6
an-test-worker1003 D6
asw2-d6-eqiad D6
cablemgmt-wmf5289 D6
db1122 D6
db1149 D6
druid1006 D6
elastic1066 D6
elastic1067 D6
es1023 D6
kubernetes1014 D6
labpuppetmaster1002 D6
msw-d6-eqiad D6
mw1366 D6
mw1367 D6
mw1368 D6
mw1369 D6
mw1370 D6
mw1371 D6
mw1372 D6
mw1373 D6
mw1374 D6
mw1375 D6
mw1376 D6
mw1377 D6
mw1378 D6
mw1379 D6
mw1380 D6
mw1381 D6
mw1382 D6
ores1009 D6
ps1-d6-eqiad D6
ps2-d6-eqiad D6
restbase1027 D6
sretest1002 D6
wmf5178 D6
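
For planning, the list above can be split per rack. A small sketch, assuming the list is available as plain text in the same "hostname rack" layout; `group_by_rack` is a hypothetical helper, and the sample uses only a few of the hosts listed above.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_rack(listing: str) -> Dict[str, List[str]]:
    """Group 'hostname rack' lines into {rack: [hostnames]}."""
    racks = defaultdict(list)
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        host, rack = parts
        racks[rack].append(host)
    return dict(racks)


sample = """\
db1137 D5
ps1-d5-eqiad D5
db1122 D6
ps1-d6-eqiad D6
"""
for rack, hosts in group_by_rack(sample).items():
    print(rack, hosts)
```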

Please take extra care of db1122, as it is an eqiad master and lots of slaves hang from it. We might stop MySQL there before the maintenance, just in case.

Mentioned in SAL (#wikimedia-operations) [2020-09-09T11:03:53Z] <marostegui> Stop MySQL on s2 eqiad master to prepare for the PDU maintenance (this will generate lag on s2 on labsdb) T261453

Update: Due to the accident/injury at the data center today, @Jclark-ctr will try and complete the upgrade of these 2x PDUs tomorrow (on Thur, Sept 10) after T261454 is completed, should there be enough time during the scheduled maintenance window of 12pm-4pm UTC.

Latest update: Due to another, separate injury, the upgrades for these 2x PDUs will be postponed again to a later date. No PDU upgrades for the rest of this week. Thanks, Willy

wiki_willy renamed this task from Wed, Sept 9 PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 to New Date - Tue, Sept 15: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6. Sep 11 2020, 7:35 PM
wiki_willy reassigned this task from Jclark-ctr to Cmjohnson.
wiki_willy added a subscriber: Jclark-ctr.

Change 627330 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] ps1-d[56] update

https://gerrit.wikimedia.org/r/627330

Change 627330 merged by RobH:
[operations/puppet@production] ps1-d[56] update

https://gerrit.wikimedia.org/r/627330

Please note that ps1-d6-eqiad does not see ps2-d6-eqiad; I suspect the two are not linked correctly via cable.

The new netbox entries for these two PDUs have not yet been created by on-site staff, so I cannot migrate the IP info yet.

PDUs show correctly in icinga, so the errors for them are legit: ps1-d6-eqiad doesn't see ps2, so it has errors.

wiki_willy renamed this task from New Date - Tue, Sept 15: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 to New Date - Mon, Sept 14: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6. Sep 14 2020, 7:14 PM
RobH renamed this task from New Date - Mon, Sept 14: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 to PDU Upgrade 12pm-4pm UTC- Racks D5 and D6. Sep 14 2020, 7:31 PM
RobH updated the task description.
RobH updated the task description.
This comment was removed by RobH.
RobH renamed this task from PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 to PDU Upgrade Racks D5 and D6. Sep 14 2020, 7:38 PM
RobH updated the task description.

So the only pending item (added to the checklists): Chris has to plug the temp/humidity leads back in.

RobH renamed this task from PDU Upgrade Racks D5 and D6 to Mon, Sept 14th - PDU Upgrade Racks D5 and D6. Sep 14 2020, 7:58 PM
RobH updated the task description.
RobH removed a project: DBA.

These were waiting on the temperature leads to be connected. That is finished, so I am resolving the task.