
New Date - Thur, Sept 17: PDU Upgrade 12pm-4pm UTC- Racks D1 and D2
Closed, Resolved · Public · Request

Description

ps1-d1-eqiad & ps2-d1-eqiad:

  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - update netbox
  • - check the existing PDU and all connected cables. Ensure all are properly seated and all items are receiving power from both A and B sides before continuing. Anything not seated or not receiving dual power will be rebooted by continuing this checklist.
  • - install new PDU brackets for the link tower in the rack (see above note on orientation of the brackets.)
  • - install link PDU into the cabinet
  • - de-power old/existing B side power, and plug in new B side link PDU
  • - migrate all B side power connections to new link PDU
  • - Note all B side power connections and enter them into netbox for every power port used.
  • - When relocating power cables, please try to ensure that the A and B sides use the same port. If server bast1001 plugs into port 5 on tower B, please also have it plug into port 5 on tower A.
  • - audit all B side connections to ensure all devices are receiving full power on the B side connection (any not receiving power will be rebooted when we move the A side connections next.)
  • - BEFORE UNPLUGGING THE A SIDE ORIGINAL TOWER: Log in to the PDU via the HTTPS interface and reset it to factory defaults!
  • - Unmount existing PDU tower and set aside (if possible) to install new PDU brackets into the rack.
  • - Install new PDU tower into the rack, and route power cable for easy cut-over.
  • - de-power old/existing A side power, and plug in new A side link PDU
  • - migrate all A side power connections to new link PDU
  • - Note all A side power connections and enter them into netbox for every power port used.
  • - audit all A side connections to ensure all devices are receiving full power on the A side connection.
  • - connect serial to new PDU, ensure serial connection is functional
  • - (Rob) setup network configuration of new PDU via serial
  • - (Rob) setup remaining pdu configuration via https interface
  • - (Rob) update puppet repo file: modules/facilities/manifests/init.pp to add the sentry4 line to the PDU entry.
  • - (Rob) Update librenms to reflect new PDU. (It is unclear whether the old device must be deleted and a new one added, or whether the entry will update once the new PDU is wholly online; so far this has only been done by removing the old device and adding the new one.)
  • - (Rob) Update IP address entries in netbox; for now, just leave the IP tied to the old PDU netbox entry.
  • - ensure all errors clear in icinga and netbox after work completes
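
The port-matching step in the checklist above (same port number on tower A and tower B for each device) could be double-checked with something like the following. This is only a rough sketch: the `connections` dictionary and the `find_port_mismatches` helper are hypothetical and are not the netbox data model or API, and all port numbers other than the bast1001 example are made up for illustration.

```python
# Hypothetical record of power connections: device -> port number per tower.
# This structure is an illustration only; it is not how netbox stores the data.
def find_port_mismatches(connections):
    """Return devices whose A-side and B-side ports differ or are missing."""
    mismatches = {}
    for device, ports in sorted(connections.items()):
        a_port = ports.get("A")
        b_port = ports.get("B")
        if a_port is None or b_port is None or a_port != b_port:
            mismatches[device] = (a_port, b_port)
    return mismatches


connections = {
    "bast1001": {"A": 5, "B": 5},  # the example from the checklist
    "db1125": {"A": 7, "B": 8},    # made-up mismatch for illustration
    "dns1002": {"B": 3},           # made-up entry with no A-side port recorded
}
result = find_port_mismatches(connections)
print(result)
```

Any device reported here would need either a cable move or a corrected record before the netbox entries are considered final.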

ps1-d2-eqiad & ps2-d2-eqiad:

  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - update netbox
  • - check the existing PDU and all connected cables. Ensure all are properly seated and all items are receiving power from both A and B sides before continuing. Anything not seated or not receiving dual power will be rebooted by continuing this checklist.
  • - install new PDU brackets for the link tower in the rack (see above note on orientation of the brackets.)
  • - install link PDU into the cabinet
  • - de-power old/existing B side power, and plug in new B side link PDU
  • - migrate all B side power connections to new link PDU
  • - Note all B side power connections and enter them into netbox for every power port used.
  • - When relocating power cables, please try to ensure that the A and B sides use the same port. If server bast1001 plugs into port 5 on tower B, please also have it plug into port 5 on tower A.
  • - audit all B side connections to ensure all devices are receiving full power on the B side connection (any not receiving power will be rebooted when we move the A side connections next.)
  • - BEFORE UNPLUGGING THE A SIDE ORIGINAL TOWER: Log in to the PDU via the HTTPS interface and reset it to factory defaults!
  • - Unmount existing PDU tower and set aside (if possible) to install new PDU brackets into the rack.
  • - Install new PDU tower into the rack, and route power cable for easy cut-over.
  • - de-power old/existing A side power, and plug in new A side link PDU
  • - migrate all A side power connections to new link PDU
  • - Note all A side power connections and enter them into netbox for every power port used.
  • - audit all A side connections to ensure all devices are receiving full power on the A side connection.
  • - connect serial to new PDU, ensure serial connection is functional
  • - (Rob) setup network configuration of new PDU via serial
  • - (Rob) setup remaining pdu configuration via https interface
  • - (Rob) update puppet repo file: modules/facilities/manifests/init.pp to add the sentry4 line to the PDU entry.
  • - (Rob) Update librenms to reflect new PDU. (It is unclear whether the old device must be deleted and a new one added, or whether the entry will update once the new PDU is wholly online; so far this has only been done by removing the old device and adding the new one.)
  • - (Rob) Update IP address entries in netbox; for now, just leave the IP tied to the old PDU netbox entry.
  • - ensure all errors clear in icinga and netbox after work completes
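
The audit steps in the checklist above warn that any device not powered from the other side will reboot when one side is moved. A minimal sketch of that check, assuming the per-device feed status has already been collected by hand, might look like this. The `feed_status` dictionary and the `devices_at_risk` helper are hypothetical names, and the status values shown are invented for illustration.

```python
def devices_at_risk(feed_status, side_being_moved):
    """List devices that would lose all power when one side is unplugged.

    feed_status maps device -> {"A": bool, "B": bool}, where True means the
    device is currently receiving power on that side (hypothetical format).
    """
    if side_being_moved not in ("A", "B"):
        raise ValueError("side_being_moved must be 'A' or 'B'")
    other_side = "B" if side_being_moved == "A" else "A"
    return sorted(
        device
        for device, feeds in feed_status.items()
        if not feeds.get(other_side, False)
    )


feed_status = {
    "cp1087": {"A": True, "B": True},      # dual powered (made-up status)
    "ms-be1043": {"A": True, "B": False},  # made-up: B side not seated
}
at_risk = devices_at_risk(feed_status, "A")
print(at_risk)
```

An empty list before each cut-over is the condition the checklist asks for; anything listed should be reseated or drained first.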

Event Timeline

Restricted Application added a project: SRE. · Aug 27 2020, 8:43 PM

List of hostnames in racks D1 and D2:

asw2-d1-eqiad D1
cablemgmt-wmf5284 D1
centrallog1001 D1
db1125 D1
db1136 D1
db1148 D1
dbproxy1016 D1
dns1002 D1
dumpsdata1002 D1
elastic1060 D1
elastic1061 D1
es1018 D1
kafka-jumbo1006 D1
logstash1012 D1
msw-d1-eqiad D1
mw1349 D1
mw1350 D1
mw1351 D1
mw1352 D1
mw1353 D1
mw1354 D1
mw1355 D1
mw1356 D1
mw1357 D1
mw1358 D1
mw1359 D1
mw1360 D1
mw1361 D1
mw1362 D1
phab1003 D1
ps1-d1-eqiad D1
ps2-d1-eqiad D1
rdb1010 D1
restbase-dev1006 D1
snapshot1009 D1
stat1005 D1
wdqs1008 D1
analytics1076 D2
an-presto1001 D2
an-worker1092 D2
an-worker1093 D2
an-worker1112 D2
asw2-d2-eqiad D2
backup1001 D2
backup1001-array1 D2
backup1001-array2 D2
cablemgmt-wmf5285 D2
cloudelastic1004 D2
cloudstore1008 D2
cloudstore1008-array1 D2
cp1087 D2
cp1088 D2
flerovium D2
flerovium-array1 D2
kafka-main1004 D2
labstore1007 D2
labstore1007-array1 D2
labstore1007-array2 D2
ms-be1043 D2
ms-be1048 D2
ms-be1055 D2
ms-be1059 D2
msw-d2-eqiad D2
ps1-d2-eqiad D2
ps2-d2-eqiad D2
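
For anyone who wants the list above split per rack (for example to notify service owners separately), a small sketch follows. It assumes each line has the form "hostname rack" as shown above; `group_by_rack` is a hypothetical helper and the sample is only a subset of the hosts.

```python
from collections import defaultdict


def group_by_rack(lines):
    """Group 'hostname rack' lines into a dict of rack -> list of hostnames."""
    racks = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        hostname, rack = line.rsplit(None, 1)  # rack is the last field
        racks[rack].append(hostname)
    return dict(racks)


# A few entries copied from the list above.
sample = """
asw2-d1-eqiad D1
db1125 D1
dbproxy1016 D1
backup1001 D2
cp1087 D2
"""
by_rack = group_by_rack(sample.splitlines())
print(by_rack)
```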

Marostegui moved this task from Triage to Blocked external/Not db team on the DBA board.
Marostegui moved this task from Blocked external/Not db team to Next on the DBA board.

dbproxy1016 needs to be failed over. I will take care of that now.
Please take extra care of db1125.

Thanks!

Change 623349 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Failover m3 dbproxy

https://gerrit.wikimedia.org/r/623349

Change 623349 merged by Marostegui:
[operations/dns@master] wmnet: Failover m3 dbproxy

https://gerrit.wikimedia.org/r/623349

Mentioned in SAL (#wikimedia-operations) [2020-08-31T13:07:09Z] <marostegui> Failover m3 (phabricator) proxy from dbproxy1016 to dbproxy1020 - T261459

dbproxy1016 is no longer active, its service has been failed over.

wiki_willy renamed this task from "Mon, Sept 21 PDU Upgrade 12pm-4pm UTC- Racks D1 and D2" to "New Date - Thur, Sept 17: PDU Upgrade 12pm-4pm UTC- Racks D1 and D2". · Sep 11 2020, 7:39 PM
RobH updated the task description. · Sep 11 2020, 8:57 PM
Marostegui moved this task from Next to In progress on the DBA board. · Sep 16 2020, 4:16 PM

Mentioned in SAL (#wikimedia-operations) [2020-09-17T10:58:26Z] <marostegui> Stop mysql on db1125 for PDU mainteanance, lag will appear on s2, s4, s6 and s7 on labsdb hosts T261459

mysql stopped on db1125 on all the instances

Mentioned in SAL (#wikimedia-operations) [2020-09-17T14:02:20Z] <marostegui> Start mysql on db1125 after PDU maintenance T261459

mysql started back on db1125

Change 628123 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] updating pdus from upgrade

https://gerrit.wikimedia.org/r/628123

Change 628123 merged by RobH:
[operations/puppet@production] updating pdus from upgrade

https://gerrit.wikimedia.org/r/628123

RobH updated the task description. · Sep 17 2020, 3:58 PM
Cmjohnson updated the task description. · Sep 17 2020, 4:02 PM
Cmjohnson closed this task as Resolved. · Sep 17 2020, 4:04 PM

Resolving this task; the new PDUs are installed.