New Date - Tue, Sept 15 PDU Upgrade 12pm-4pm UTC- Racks C2 and C3
Closed, Resolved | Public | Request

Description

<ps1-c2-eqiad & ps2-c2-eqiad>:

  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - netbox updated
  • - check the existing PDU and all connected cables. Ensure all are properly seated and all items are receiving power from both A and B sides before continuing. Anything not seated or not receiving dual power will be rebooted by continuing this checklist.
  • - install new PDU brackets for the link tower in the rack (see above note on orientation of the brackets.)
  • - install link PDU into the cabinet
  • - de-power old/existing B side power, and plug in new B side link PDU
  • - migrate all B side power connections to new link PDU
  • - Note all B side power connections, input into netbox for every single power port used.
  • - When relocating power cables, please try to ensure that the A and B sides use the same port. If server bast1001 plugs into port 5 on tower B, please also have it plug into port 5 on tower A.
  • - audit all B side connections to ensure all devices are receiving full power on the B side connection (any not receiving power will be rebooted when we move the A side connections next.)
  • - BEFORE UNPLUGGING THE A SIDE ORIGINAL TOWER: Login to the PDU via the HTTPS interface and reset it to factory defaults!
  • - Unmount existing PDU tower and set aside (if possible) to install new PDU brackets into the rack.
  • - Install new PDU tower into the rack, and route power cable for easy cut-over.
  • - de-power old/existing A side power, and plug in new A side link PDU
  • - migrate all A side power connections to new link PDU
  • - Note all A side power connections, input into netbox for every single power port used.
  • - audit all A side connections to ensure all devices are receiving full power on the A side connection.
  • - connect serial to new PDU, ensure serial connection is functional
  • - (Rob) setup network configuration of new PDU via serial
  • - (Rob) setup remaining pdu configuration via https interface
  • - (Rob) update puppet repo file: modules/facilities/manifests/init.pp to add the sentry4 line to the PDU entry.
  • - (Rob) Update librenms to reflect new PDU. (Unclear whether you must delete the old device and add the new one, or whether the entry will update once the new PDU is wholly online; so far this has only been done by removing the old device and adding the new one.)
  • - (Rob) Update IP address entries in netbox, for now just leave the ip tied to old PDU netbox entry.
  • - ensure all errors clear in icinga and netbox after work completes
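
The two audit rules in the checklist above (every device fed from both A and B, and each device on the same port number on both towers) can be sketched as a small check. This is only an illustration: the `PowerFeed` record, the `audit` helper, and the port numbers are hypothetical and do not reflect the real netbox data model or the actual outlet assignments in C2/C3.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PowerFeed:
    """Hypothetical record of one device's outlets on the A and B towers."""
    device: str
    port_a: Optional[int]  # outlet number on tower A, None if not connected
    port_b: Optional[int]  # outlet number on tower B, None if not connected


def audit(feeds: List[PowerFeed]) -> Tuple[List[str], List[str]]:
    """Return (devices missing dual power, devices whose A/B ports differ)."""
    missing = sorted(
        f.device for f in feeds if f.port_a is None or f.port_b is None
    )
    mismatched = sorted(
        f.device
        for f in feeds
        if f.port_a is not None and f.port_b is not None and f.port_a != f.port_b
    )
    return missing, mismatched


# Made-up port assignments, for illustration only.
feeds = [
    PowerFeed("bast1001", 5, 5),     # matches the checklist example: port 5 on both
    PowerFeed("db1087", 7, 8),       # dual-fed, but on different port numbers
    PowerFeed("es1015", 3, None),    # no B side feed: would lose power on A cut-over
]

missing, mismatched = audit(feeds)
print("Not dual-powered:", missing)
print("A/B port mismatch:", mismatched)
```

Anything reported as not dual-powered should be fixed before de-powering the other side, since the checklist notes such devices will otherwise reboot.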

Event Timeline

List of hostnames in C2 and C3 below:

analytics1028 C2
analytics1029 C2
analytics1030 C2
analytics1031 C2
analytics1064 C2
analytics1065 C2
analytics1066 C2
analytics1074 C2
an-worker1088 C2
an-worker1099 C2
an-worker1104 C2
asw2-c2-eqiad C2
asw-c2-eqiad C2
brokenasw-c2-eqiad C2
cloudelastic1003 C2
db1087 C2
db1088 C2
db1100 C2
db1101 C2
db1108 C2
es1015 C2
es1016 C2
kafka-jumbo1004 C2
labstore1004 C2
ms-be1049 C2
ms-be1050 C2
msw-c2-eqiad C2
ps1-c2-eqiad C2
thanos-fe1003 C2
analytics1033 C3
analytics1034 C3
an-druid1002 C3
an-test-master1002 C3
asw2-c3-eqiad C3
asw-c3-eqiad C3
cablemgmt-wmf5277 C3
cumin1001 C3
db1078 C3
db1089 C3
db1090 C3
db1095 C3
db1105 C3
db1110 C3
db1133 C3
dbstore1005 C3
elastic1057 C3
elastic1058 C3
es1017 C3
ganeti1009 C3
kubernetes1011 C3
msw-c3-eqiad C3
mw1405 C3
mw1406 C3
mw1407 C3
ores1005 C3
pc1009 C3
ps1-c3-eqiad C3
sessionstore1002 C3
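
A host list in this "hostname rack" form is easy to group by rack, for example to pull out the database hosts that need attention before the window. The sketch below is a minimal illustration; `group_by_rack` is a hypothetical helper and the sample is only a subset of the list above.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_rack(listing: str) -> Dict[str, List[str]]:
    """Group 'hostname rack' lines by rack, preserving the input order."""
    racks: Dict[str, List[str]] = defaultdict(list)
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        host, rack = parts
        racks[rack].append(host)
    return dict(racks)


# Subset of the host list above, in the same format.
sample = """\
db1087 C2
db1100 C2
es1015 C2
cumin1001 C3
db1078 C3
"""

racks = group_by_rack(sample)

# Hosts whose name starts with "db", per rack.
db_hosts = {
    rack: [h for h in hosts if h.startswith("db")]
    for rack, hosts in racks.items()
}
print(db_hosts)
```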

wiki_willy renamed this task from Tue, Sept 14 PDU Upgrade 12pm-4pm UTC- Racks C2 and C3 to Mon, Sept 14 PDU Upgrade 12pm-4pm UTC- Racks C2 and C3. Aug 27 2020, 8:35 PM

Please take extra care with db1087, db1100 and db1109: they are eqiad masters and lots of slaves hang from them. We might stop MySQL just in case.

Mentioned in SAL (#wikimedia-operations) [2020-09-14T11:09:56Z] <marostegui> Stop MySQL on s5 and s8 eqiad primary master - lag will show up on labsdb hosts T261455

Please take extra care with db1087, db1100 and db1109: they are eqiad masters and lots of slaves hang from them. We might stop MySQL just in case.

MySQL stopped on those hosts

Change 627319 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] ps1-c[23]-eqiad update

https://gerrit.wikimedia.org/r/627319

Change 627319 merged by RobH:
[operations/puppet@production] ps1-c[23]-eqiad update

https://gerrit.wikimedia.org/r/627319

wiki_willy renamed this task from Mon, Sept 14 PDU Upgrade 12pm-4pm UTC- Racks C2 and C3 to New Date - Tue, Sept 15 PDU Upgrade 12pm-4pm UTC- Racks C2 and C3. Sep 14 2020, 7:12 PM

Mentioned in SAL (#wikimedia-operations) [2020-09-15T09:22:11Z] <marostegui> Stop MySQL on s5 and s8 eqiad primary master - lag will show up on labsdb hosts T261455

Please take extra care with db1087, db1100 and db1109: they are eqiad masters and lots of slaves hang from them. We might stop MySQL just in case.

MySQL stopped on those hosts

Stopped them again for today's maintenance.

Change 627547 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] setting ps1-c[23]-eqiad monitoring

https://gerrit.wikimedia.org/r/627547

Change 627547 merged by RobH:
[operations/puppet@production] setting ps1-c[23]-eqiad monitoring

https://gerrit.wikimedia.org/r/627547

Cmjohnson updated the task description.