
Tue, Sept 15 PDU Upgrade 12pm-4pm UTC - Racks C4 and C5
Closed, Resolved | Public | Request

Description

<ps1-c4-eqiad & ps2-c4-eqiad>:

  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - netbox updated
  • - check the existing PDU and all connected cables. Ensure all are properly seated and all items are receiving power from both A and B sides before continuing. Anything not seated or not receiving dual power will be rebooted by continuing this checklist.
  • - install new PDU brackets for the link tower in the rack (see above note on orientation of the brackets.)
  • - install link PDU into the cabinet
  • - de-power old/existing B side power, and plug in new B side link PDU
  • - migrate all B side power connections to new link PDU
  • - Note all B side power connections, input into netbox for every single power port used.
  • - When relocating power cables, please try to ensure that the A and B sides use the same port. If server bast1001 plugs into port 5 on tower B, please also have it plug into port 5 on tower A.
  • - audit all B side connections to ensure all devices are receiving full power on the B side connection (any not receiving power will be rebooted when we move the A side connections next.)
  • - BEFORE UNPLUGGING THE A SIDE ORIGINAL TOWER: Log in to the PDU via the HTTPS interface and reset it to factory defaults!
  • - Unmount existing PDU tower and set aside (if possible) to install new PDU brackets into the rack.
  • - Install new PDU tower into the rack, and route power cable for easy cut-over.
  • - de-power old/existing A side power, and plug in new A side link PDU
  • - migrate all A side power connections to new link PDU
  • - Note all A side power connections, input into netbox for every single power port used.
  • - audit all A side connections to ensure all devices are receiving full power on the A side connection.
  • - connect serial to new PDU, ensure serial connection is functional
  • - (Rob) setup network configuration of new PDU via serial
  • - (Rob) setup remaining pdu configuration via https interface
  • - (Rob) update puppet repo file: modules/facilities/manifests/init.pp to add the sentry4 line to the PDU entry.
  • - (Rob) Update librenms to reflect new PDU. (Unclear whether you must delete the old device and add the new one, or whether the new one will update once it's wholly online; so far this has only been done by removing the old device and adding the new one.)
  • - (Rob) Update IP address entries in netbox; for now, just leave the IP tied to the old PDU netbox entry.
  • - ensure all errors clear in icinga and netbox after work completes

<ps1-c5-eqiad & ps2-c5-eqiad>:

  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - netbox updated
  • - check the existing PDU and all connected cables. Ensure all are properly seated and all items are receiving power from both A and B sides before continuing. Anything not seated or not receiving dual power will be rebooted by continuing this checklist.
  • - install new PDU brackets for the link tower in the rack (see above note on orientation of the brackets.)
  • - install link PDU into the cabinet
  • - de-power old/existing B side power, and plug in new B side link PDU
  • - migrate all B side power connections to new link PDU
  • - Note all B side power connections, input into netbox for every single power port used.
  • - When relocating power cables, please try to ensure that the A and B sides use the same port. If server bast1001 plugs into port 5 on tower B, please also have it plug into port 5 on tower A.
  • - audit all B side connections to ensure all devices are receiving full power on the B side connection (any not receiving power will be rebooted when we move the A side connections next.)
  • - BEFORE UNPLUGGING THE A SIDE ORIGINAL TOWER: Log in to the PDU via the HTTPS interface and reset it to factory defaults!
  • - Unmount existing PDU tower and set aside (if possible) to install new PDU brackets into the rack.
  • - Install new PDU tower into the rack, and route power cable for easy cut-over.
  • - de-power old/existing A side power, and plug in new A side link PDU
  • - migrate all A side power connections to new link PDU
  • - Note all A side power connections, input into netbox for every single power port used.
  • - audit all A side connections to ensure all devices are receiving full power on the A side connection.
  • - connect serial to new PDU, ensure serial connection is functional
  • - (Rob) setup network configuration of new PDU via serial
  • - (Rob) setup remaining pdu configuration via https interface
  • - (Rob) update puppet repo file: modules/facilities/manifests/init.pp to add the sentry4 line to the PDU entry.
  • - (Rob) Update librenms to reflect new PDU. (Unclear whether you must delete the old device and add the new one, or whether the new one will update once it's wholly online; so far this has only been done by removing the old device and adding the new one.)
  • - (Rob) Update IP address entries in netbox; for now, just leave the IP tied to the old PDU netbox entry.
  • - ensure all errors clear in icinga and netbox after work completes
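
The two audit rules in the checklists above (every device must be fed from both the A and B towers, and ideally on the same port number on each) can be checked mechanically once the connections have been noted. The following is only a minimal sketch: the `Feed` record and `audit_feeds` function are hypothetical names, the data is typed in by hand rather than read from netbox, and apart from the bast1001/port 5 example taken from the checklist, all hosts and ports are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feed:
    """One power connection (hypothetical record, not a netbox object)."""
    device: str  # hostname, e.g. "bast1001"
    tower: str   # "A" or "B"
    port: int    # outlet number on that tower


def audit_feeds(feeds):
    """Return (single_fed, port_mismatch) as sorted lists of device names."""
    by_device = {}
    for feed in feeds:
        by_device.setdefault(feed.device, {})[feed.tower] = feed.port

    both = {"A", "B"}
    # Devices missing a feed on either side would lose power during a cut-over.
    single_fed = sorted(
        dev for dev, ports in by_device.items() if not both <= set(ports)
    )
    # Devices fed from both sides, but on different port numbers.
    port_mismatch = sorted(
        dev for dev, ports in by_device.items()
        if both <= set(ports) and ports["A"] != ports["B"]
    )
    return single_fed, port_mismatch


# Hypothetical sample data; only bast1001 on port 5 comes from the checklist.
feeds = [
    Feed("bast1001", "A", 5),
    Feed("bast1001", "B", 5),
    Feed("example-host-1", "A", 7),
    Feed("example-host-1", "B", 9),
    Feed("example-host-2", "B", 3),
]
single_fed, port_mismatch = audit_feeds(feeds)
print("single-fed devices:", single_fed)
print("A/B port mismatch:", port_mismatch)
```

A device in the first list should be fixed before any tower is de-powered; a device in the second list is only a tidiness issue, since the checklist asks for matching ports on a best-effort basis.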

Event Timeline

List of hostnames in C4 and C5:

an-worker1089 C4
an-worker1090 C4
an-worker1100 C4
an-worker1105 C4
an-worker1106 C4
an-worker1107 C4
an-worker1108 C4
asw2-c4-eqiad C4
asw-c4-eqiad C4
cablemgmt-wmf5281 C4
cp1083 C4
cp1084 C4
deploy1001 C4
kafka-jumbo1005 C4
kafka-jumbo1007 C4
labsdb1010 C4
logstash1028 C4
ms-be1024 C4
ms-be1025 C4
ms-be1054 C4
ms-fe1007 C4
msw-c4-eqiad C4
mwlog1001 C4
ores1006 C4
ps1-c4-eqiad C4
snapshot1006 C4
thanos-be1003 C4
wmf3570 C4
an-conf1002 C5
an-test-worker1002 C5
aqs1005 C5
asw2-c5-eqiad C5
asw-c5-eqiad C5
cablemgmt-wmf5282 C5
cloudcontrol1005 C5
cloudmetrics1001 C5
db1120 C5
db1145 C5
db1146 C5
dbproxy1018 C5
dbproxy1019 C5
dbproxy1020 C5
druid1002 C5
elastic1040 C5
elastic1041 C5
elastic1042 C5
elastic1043 C5
es1022 C5
ganeti1010 C5
kubernetes1003 C5
kubernetes1012 C5
labsdb1011 C5
labstore1005 C5
logstash1022 C5
maps1003 C5
mc1028 C5
mc1029 C5
mc1030 C5
mc1031 C5
mc1032 C5
msw-c5-eqiad C5
ps1-c5-eqiad C5
relforge1002 C5
wdqs1013 C5
wtp1037 C5
wtp1038 C5
wtp1039 C5

Be careful with dbproxy1018, dbproxy1019, and labsdb1010: they serve cloud infra, and that service will not be switched to codfw.
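
For anyone who wants to work with the list programmatically, here is a rough sketch that groups "hostname rack" lines by rack and flags the hosts named in the warning above. It uses only a short excerpt of the list; `group_by_rack` and `SENSITIVE` are illustrative names, not part of any existing tooling.

```python
from collections import defaultdict

# Excerpt of the list above; each line is "<hostname> <rack>".
HOSTS = """\
deploy1001 C4
labsdb1010 C4
ps1-c4-eqiad C4
dbproxy1018 C5
dbproxy1019 C5
ps1-c5-eqiad C5
"""

# Hosts called out in the comment above as needing special care.
SENSITIVE = {"dbproxy1018", "dbproxy1019", "labsdb1010"}


def group_by_rack(text):
    """Map rack name to the list of hostnames in it, in input order."""
    racks = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        hostname, rack = line.split()
        racks[rack].append(hostname)
    return dict(racks)


racks = group_by_rack(HOSTS)
for rack, hosts in sorted(racks.items()):
    flagged = [h for h in hosts if h in SENSITIVE]
    print(f"{rack}: {len(hosts)} hosts, handle with care: {flagged}")
```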

Change 627409 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy: Depool labsdb1010

https://gerrit.wikimedia.org/r/627409

Change 627409 merged by Marostegui:
[operations/puppet@production] dbproxy: Depool labsdb1010

https://gerrit.wikimedia.org/r/627409

I have depooled labsdb1010 and will stop MySQL in a couple of hours.

Mentioned in SAL (#wikimedia-operations) [2020-09-15T08:13:22Z] <marostegui> Stop MySQL on labsdb1010 for PDU maintenance T261456

MySQL on labsdb1010 has been stopped.

@Cmjohnson please take special care of: dbproxy1018 and dbproxy1019, and labsdb1011 as those hosts are serving a

Change 627558 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] ps1-c[45]-eqiad update

https://gerrit.wikimedia.org/r/627558

Change 627558 merged by RobH:
[operations/puppet@production] ps1-c[45]-eqiad update

https://gerrit.wikimedia.org/r/627558

RobH updated the task description.

Was the on-site work here fully completed, or is there anything pending that requires power changes? :)

Cmjohnson updated the task description.