
(Need By:TBD) rack/setup/install row C new PDUs
Closed, Resolved, Public

Description

This task tracks the racking, setup and configuration of all PDUs in row C. We do not expect any servers to go down during this work. The only thing to keep in mind is that while the PDUs in a rack are being replaced, the management switch in that rack will not be available.

If you are a service owner and think that you need to depool your server(s) during the maintenance window, please put a "YES" in the "List of Servers and network devices" table below.
Thanks

Schedule

| Rack | Date | Time | Comments |
| C1 | August 3rd | 11:30 am CT / 4:30 pm UTC | |
| C2 | August 4th | 1:30 pm CT / 6:30 pm UTC | |
| C3 | | | Already replaced |
| C4 | August 4th | 09:30 am CT / 2:30 pm UTC | |
| C5 | August 4th | 10:00 am CT / 3:00 pm UTC | |
| C6 | August 4th | 10:30 am CT / 3:30 pm UTC | |
| C7 | August 4th | 11:00 am CT / 4:00 pm UTC | |
| C8 | August 11th | 09:30 am CT / 2:30 pm UTC | |
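
Each slot above is given in both Central Time and UTC. As a quick sanity check, the following minimal sketch converts a few of the listed CT times to UTC. It assumes the year is 2022 (taken from the event timeline) and that "CT" in August means Central Daylight Time (UTC-5); the helper name `ct_to_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: in August, "CT" is Central Daylight Time, i.e. UTC-5.
CDT = timezone(timedelta(hours=-5), "CDT")

def ct_to_utc(year, month, day, hour, minute):
    """Convert a Central Daylight Time wall-clock time to UTC."""
    local = datetime(year, month, day, hour, minute, tzinfo=CDT)
    return local.astimezone(timezone.utc)

# A few slots from the schedule (24-hour clock); the year 2022 is assumed.
slots = {
    "C1": (2022, 8, 3, 11, 30),
    "C4": (2022, 8, 4, 9, 30),
    "C5": (2022, 8, 4, 10, 0),
    "C8": (2022, 8, 11, 9, 30),
}

for rack, when in slots.items():
    print(rack, ct_to_utc(*when).strftime("%Y-%m-%d %H:%M UTC"))
```

A fixed offset is used here instead of a named time zone so the sketch does not depend on a time zone database being installed.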

Per-PDU setup checklist

ps1-c1-codfw/ps2-c1-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack C1

| Servers | Do you need to depool? |
| es2032 | |
| es2031 | |
| kubernetes2012 | Powered down |
| kubernetes2011 | Powered down |
| db2138 | |
| restbase2022 | depooled and downtimed - can be powered off |
| db2125 | |
| ganeti2010 | |
| ganeti2009 | |
| db2112 | |
| restbase2015 | depooled and downtimed - can be powered off |
| cloudcontrol2003-dev | |
| ores2005 | |
| mc2027 | YES will need to be powered off/depooled? right before the maintenance, and back on right after |
| elastic2031 | Depooled and powered down |
| restbase2011 | |
| mc2037 | YES will need to be powered off/depooled? right before the maintenance, and back on right after |
| pc2013 | |
| wcqs2002 | Depooled and powered down |
| cumin2002 | Powered down |
| db2149 | |

ps1-c2-codfw/ps2-c2-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack C2

| Servers | Do you need to depool? |
| lvs2009 | |
| ms-fe2011 | |
| moss-fe2001 | |
| kafka-logging2003 | |
| elastic2045 | |
| cp2036 | powered off |
| cp2035 | powered off |
| dns2001 | powered off |
| ml-cache2003 | |
| ms-be2055 | |
| elastic2047 | |
| elastic2046 | |
| ms-be2048 | |
| ms-be2035 | |
| ms-be2034 | |
| ms-be2042 | |
| backup2009 | |
| ms-be2068 | |
| backup2006 | |

ps1-c4-codfw/ps2-c4-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack C4

| Servers | Do you need to depool? |
| elastic2082 | No, not in production |
| elastic2081 | No, not in production |
| mc2048 | |
| mc2047 | |
| wdqs2011 | |
| elastic2066 | powered down and ready for maintenance |
| elastic2065 | powered down and ready for maintenance |
| ms-backup2001 | |
| logstash2035 | |
| ms-be2064 | |
| backup2003 | |
| ms-be2058 | |

ps1-c5-codfw/ps2-c5-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack C5

| Servers | Do you need to depool? |
| db2166 | |
| db2126 | |
| db2114 | |
| logstash2002 | |
| restbase2016 | |
| ores2006 | |
| db2165 | |
| db2090 | |
| parse2013 | |
| parse2012 | |
| db2102 | |
| restbase2020 | |
| ganeti2012 | |
| mc2031 | |
| mc2030 | |
| elastic2033 | Powered off/ready for work |
| elastic2032 | Powered off/ready for work |
| wdqs2001 | Powered off/ready for work |
| phab2001 | |
| parse2011 | |
| gitlab-runner2003 | |
| ganeti2011 | |
| restbase2025 | |
| ml-serve2003 | |

ps1-c6-codfw/ps2-c6-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack C6

Servers | Do you need to depool?
db2180
db2179
db2168
db2167
parse2015
parse2014
mw2365
db2127
mw2364
mw2363
mw2362
mw2361
mw2360
mw2359
mw2358
mw2357
mw2356
ganeti2014
ganeti2013
mw2355
es2022
dbproxy2003
db2116
db2099
mw2354
mw2353
mw2352
mw2351
mw2350
wdqs2008
db2135
db2095
db2115

ps1-c7-codfw/ps2-c7-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack C7

| Servers | Do you need to depool? |
| elastic2083 | Not in production/ready for work |
| mc2050 | |
| mc2049 | |
| elastic2071 | |
| cp2038 | powered off |
| cp2037 | powered off |
| elastic2059 | Powered down/ready for work |
| logstash2028 | |
| ms-be2054 | |
| cloudbackup2002 | |
| cloudbackup2002-array1 | |
| kafka-main2003 | |
| ms-be2036 | |
| thanos-be2003 | |
| elastic2048 | Powered down/ready for work |
| ms-be2049 | |

ps1-c8-codfw/ps2-c8-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack C8

Servers | Do you need to depool?
frdb2001
pay-lvs2002
frlog2001
frdb2002
frban2001
frauth2001
frmon2001
pfw3-codfw.wikimedia.org:7
fasw-c-codfw.mgmt.codfw.wmnet:1
pfw3-codfw.wikimedia.org:0
fasw-c-codfw.mgmt.codfw.wmnet:0
frqueue2001
frbast2001
pay-lvs2001
civi2001
payments2003
frpm2001
frpig2001
payments2002
payments2001
frbackup2002
frdb2003
fran2001
frqueue2002
frdata2001
frmx2001

Event Timeline


Change 820067 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Disable notifications pdu C rows

https://gerrit.wikimedia.org/r/820067

Change 820067 merged by Marostegui:

[operations/puppet@production] mariadb: Disable notifications pdu C rows

https://gerrit.wikimedia.org/r/820067

Mentioned in SAL (#wikimedia-operations) [2022-08-03T11:49:30Z] <root@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cumin2002.codfw.wmnet with reason: PDU maintenance, T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-03T11:49:55Z] <root@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cumin2002.codfw.wmnet with reason: PDU maintenance, T310145
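
The START and END entries above come in pairs, so the elapsed time of a cookbook run can be read from their timestamps. The following is a minimal sketch of that, assuming the line layout seen in this timeline (`[timestamp] <who> message`); the regular expression and the helper name `parse_sal` are illustrative and not part of any SAL tooling.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the entries in this timeline:
#   ... [YYYY-MM-DDTHH:MM:SSZ] <who> message
SAL_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\]\s+"
    r"<(?P<who>[^>]+)>\s+(?P<msg>.*)"
)

def parse_sal(line):
    """Return the timestamp, author and message of a SAL mention, or None."""
    m = SAL_RE.search(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S")
    return {"ts": ts, "who": m.group("who"), "msg": m.group("msg")}

start = parse_sal(
    "Mentioned in SAL (#wikimedia-operations) [2022-08-03T11:49:30Z] "
    "<root@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 "
    "on cumin2002.codfw.wmnet with reason: PDU maintenance, T310145"
)
end = parse_sal(
    "Mentioned in SAL (#wikimedia-operations) [2022-08-03T11:49:55Z] "
    "<root@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) "
    "for 1 day, 0:00:00 on cumin2002.codfw.wmnet with reason: PDU maintenance, T310145"
)

elapsed = end["ts"] - start["ts"]
print(elapsed.total_seconds())  # 25.0
```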

JMeybohm updated the task description.

Icinga downtime and Alertmanager silence (ID=0444b6dc-d394-43ed-8847-01dae0f308ee) set by mvernon@cumin1001 for 1 day, 0:00:00 on 8 host(s) and their services with reason: PDU work

moss-fe2001.codfw.wmnet,ms-be[2034-2035,2042,2048,2055,2068].codfw.wmnet,ms-fe2011.codfw.wmnet
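
The host expression above uses a compact bracket notation in which `ms-be[2034-2035,2042,...]` stands for several hostnames. The following hand-rolled sketch expands such an expression so the result can be checked against the "8 host(s)" count in the downtime message. It is not the parser used by the tooling that produced these entries; it handles only the simple cases seen in this task (at most one bracket group per item, numeric ranges), and the function names are hypothetical.

```python
import re

def split_top_level(expr):
    """Split on commas that are not inside square brackets."""
    parts, depth, current = [], 0, []
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts

# prefix[body]suffix, with a single bracket group per item (assumption).
BRACKET_RE = re.compile(r"^(.*?)\[([^\]]+)\](.*)$")

def expand(expr):
    """Expand a bracketed host expression into a list of hostnames."""
    hosts = []
    for item in split_top_level(expr):
        m = BRACKET_RE.match(item)
        if m is None:
            hosts.append(item)
            continue
        prefix, body, suffix = m.groups()
        for piece in body.split(","):
            if "-" in piece:
                start, end = piece.split("-")
                width = len(start)  # keep any zero padding
                for n in range(int(start), int(end) + 1):
                    hosts.append(f"{prefix}{str(n).zfill(width)}{suffix}")
            else:
                hosts.append(f"{prefix}{piece}{suffix}")
    return hosts

hosts = expand(
    "moss-fe2001.codfw.wmnet,"
    "ms-be[2034-2035,2042,2048,2055,2068].codfw.wmnet,"
    "ms-fe2011.codfw.wmnet"
)
print(len(hosts))  # 8
```

The same function also expands the shorter form used in the shutdown log entries, for example `ms-be20[58,64].codfw.wmnet`.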

Mentioned in SAL (#wikimedia-operations) [2022-08-03T16:48:20Z] <Emperor> shutdown moss-fe2001.codfw.wmnet,ms-fe2011.codfw.wmnet,ms-be20[34,35,42,48,55,68].codfw.wmnet PDU work T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-03T17:08:25Z] <ryankemper> T310145 elastic2031 and wcqs2002 powered off in preparation for C1 maintenance

Change 820370 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Disable notifications DBs in C5

https://gerrit.wikimedia.org/r/820370

Change 820370 merged by Marostegui:

[operations/puppet@production] mariadb: Disable notifications DBs in C5

https://gerrit.wikimedia.org/r/820370

Change 820371 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Disable notifications DBs in C6

https://gerrit.wikimedia.org/r/820371

Change 820371 merged by Marostegui:

[operations/puppet@production] mariadb: Disable notifications DBs in C6

https://gerrit.wikimedia.org/r/820371

@Papaul we have some doubts about whether C1 was done or not. Can you update the list of racks that were done yesterday? Thanks!

Mentioned in SAL (#wikimedia-operations) [2022-08-04T13:14:17Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2065.codfw.wmnet with reason: T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-04T13:14:31Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2065.codfw.wmnet with reason: T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-04T13:39:48Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2066.codfw.wmnet with reason: T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-04T13:40:02Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2066.codfw.wmnet with reason: T310145

Since C2 has been moved to today, work on it needs to wait until all the ms-* nodes in D2 are fully back up.

Icinga downtime and Alertmanager silence (ID=ec46c9c7-d251-4875-87f9-040b391ea22a) set by mvernon@cumin1001 for 1 day, 0:00:00 on 2 host(s) and their services with reason: PDU work

ms-be[2058,2064].codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-08-04T14:04:51Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2033.codfw.wmnet with reason: T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-04T14:05:05Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2033.codfw.wmnet with reason: T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-04T14:21:21Z] <Emperor> shutdown ms-be20[58,64].codfw.wmnet for PDU swap T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-04T14:22:57Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2032.codfw.wmnet with reason: T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-04T14:23:11Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2032.codfw.wmnet with reason: T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-04T14:24:47Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2001.codfw.wmnet with reason: T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-04T14:25:00Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2001.codfw.wmnet with reason: T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-04T14:31:52Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2011.codfw.wmnet with reason: T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-04T14:32:18Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2011.codfw.wmnet with reason: T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-04T15:11:21Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool hosts for PDU maint (T310145)', diff saved to https://phabricator.wikimedia.org/P32284 and previous config saved to /var/cache/conftool/dbconfig/20220804-151121-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-08-04T15:13:47Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 10:00:00 on db[2114,2126,2166].codfw.wmnet with reason: Maintenance (T310145)

Mentioned in SAL (#wikimedia-operations) [2022-08-04T15:13:51Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2114,2126,2166].codfw.wmnet with reason: Maintenance (T310145)

Mentioned in SAL (#wikimedia-operations) [2022-08-04T15:19:59Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool C6 for PDU maint (T310145)', diff saved to https://phabricator.wikimedia.org/P32285 and previous config saved to /var/cache/conftool/dbconfig/20220804-151958-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-08-04T15:20:48Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 10:00:00 on db[2116,2127,2167-2168].codfw.wmnet,es2022.codfw.wmnet with reason: Maintenance (T310145)

Mentioned in SAL (#wikimedia-operations) [2022-08-04T15:21:04Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2116,2127,2167-2168].codfw.wmnet,es2022.codfw.wmnet with reason: Maintenance (T310145)

Mentioned in SAL (#wikimedia-operations) [2022-08-04T15:50:42Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2048.codfw.wmnet with reason: T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-04T15:50:56Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2048.codfw.wmnet with reason: T310145

Icinga downtime and Alertmanager silence (ID=0f30d2ec-1037-4449-b903-79ae6c2ccede) set by mvernon@cumin1001 for 1 day, 0:00:00 on 4 host(s) and their services with reason: PDU work

ms-be[2036,2049,2054].codfw.wmnet,thanos-be2003.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-08-04T16:06:29Z] <Emperor> shutdown ms-be20[39,49,54].codfw.wmnet,thanos-be2003 for PDU swap T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-04T16:35:01Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2059.codfw.wmnet with reason: T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-04T16:35:15Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2059.codfw.wmnet with reason: T310145

Icinga downtime and Alertmanager silence (ID=b9c7901f-43da-4f03-8147-b41491323e54) set by mvernon@cumin1001 for 1 day, 0:00:00 on 8 host(s) and their services with reason: PDU work

moss-fe2001.codfw.wmnet,ms-be[2034-2035,2042,2048,2055,2068].codfw.wmnet,ms-fe2011.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-08-04T18:22:14Z] <Emperor> shutdown moss-fe2001.codfw.wmnet,ms-fe2011.codfw.wmnet,ms-be20[34,35,42,48,68].codfw.wmnet PDU work T310145

Mentioned in SAL (#wikimedia-operations) [2022-08-04T18:41:51Z] <cwhite> poweroff kafka-logging2003 - T310145

Change 820559 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new PDU model for ps1-c2,c4,c5,c6,c7,d2,d3

https://gerrit.wikimedia.org/r/820559

Change 820559 merged by Papaul:

[operations/puppet@production] Add new PDU model for ps1-c2,c4,c5,c6,c7,d2,d3

https://gerrit.wikimedia.org/r/820559

Mentioned in SAL (#wikimedia-operations) [2022-08-05T11:34:36Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repool after PDU maint on C5 (T310145)', diff saved to https://phabricator.wikimedia.org/P32289 and previous config saved to /var/cache/conftool/dbconfig/20220805-113436-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-08-05T11:35:56Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repool after PDU maint on C6 (T310145)', diff saved to https://phabricator.wikimedia.org/P32290 and previous config saved to /var/cache/conftool/dbconfig/20220805-113555-ladsgroup.json

Change 822436 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new PDU model for ps1-c8

https://gerrit.wikimedia.org/r/822436

Change 822436 merged by Papaul:

[operations/puppet@production] Add new PDU model for ps1-c8

https://gerrit.wikimedia.org/r/822436