(Need By:TBD) rack/setup/install row D new PDUs
Closed, Resolved, Public

Description

This task will track the racking, setup and configuration of all PDUs in row D. Servers will lose power during the work. Please shut them down beforehand using one of the following: sudo poweroff from the OS, racadm serveraction powerdown from the DRAC, or power off hard from the iLO.

During the PDU replacement process, the management switch in that particular rack will not be available.

If you are a service owner and think that you need to depool your server(s) during the maintenance window, please put a "YES" in the "List of Servers and network devices" table below.
Thanks

Schedule

Rack  Date         Time                        Comments
D1    June 28th    9:30 am CT / 2:30 pm UTC    ~1 hour 30 mins
D2    August 4th   11:30 am CT / 4:30 pm UTC
D3    August 4th   12:00 pm CT / 5:00 pm UTC
D4    August 10th  9:30 am CT / 2:30 pm UTC
D5    August 10th  10:00 am CT / 3:00 pm UTC
D6    August 10th  10:30 am CT / 3:30 pm UTC
D7    August 10th  11:00 am CT / 4:00 pm UTC
D8    August 10th  11:30 am CT / 4:30 pm UTC
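
The CT/UTC pairs in the schedule can be sanity-checked with a short script. This is only a sketch under two assumptions that are not stated in the task description: that "CT" means Central Daylight Time (UTC-5) on all of these dates, and that the year is 2022, as the timeline entries suggest. The names `ct_to_utc`, `SCHEDULE` and `utc_schedule` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: every scheduled date falls in US daylight time, so CT = CDT = UTC-5.
CDT = timezone(timedelta(hours=-5), name="CDT")


def ct_to_utc(year, month, day, hour, minute):
    """Convert a Central Daylight Time wall-clock time to UTC."""
    local = datetime(year, month, day, hour, minute, tzinfo=CDT)
    return local.astimezone(timezone.utc)


# (rack, month, day, hour, minute) in CT, copied from the schedule table.
SCHEDULE = [
    ("D1", 6, 28, 9, 30),
    ("D2", 8, 4, 11, 30),
    ("D3", 8, 4, 12, 0),
    ("D4", 8, 10, 9, 30),
    ("D5", 8, 10, 10, 0),
    ("D6", 8, 10, 10, 30),
    ("D7", 8, 10, 11, 0),
    ("D8", 8, 10, 11, 30),
]


def utc_schedule(year=2022):
    """Return {rack: 'HH:MM'} with each start time expressed in UTC."""
    return {
        rack: ct_to_utc(year, month, day, hour, minute).strftime("%H:%M")
        for rack, month, day, hour, minute in SCHEDULE
    }


if __name__ == "__main__":
    for rack, start in utc_schedule().items():
        print(rack, start, "UTC")
```

Under those assumptions each rack's computed UTC start matches the UTC time listed next to it in the table.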

Per PDU setup Checklist

ps1-d1-codfw/ps2-d1-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack D1

Servers               Do you need to depool?
cloudcontrol2004-dev
db2078                No, host decommissioned
db2088                No, host will be decommissioned before the date
db2100                Yes, backup source
db2117                Yes
db2128                Yes, sanitarium s5 master, needs downtiming
db2139                Yes, backup source
db2151                Yes, mediabackups metadata/misc
elastic2034
es2033                Yes
es2034                Yes
ganeti2015
ganeti2025
mc2032
ml-staging2002
ores2007
pc2014                No, spare.
restbase2012
restbase2017
wcqs2003
asw-d1-codfw
msw-d1-codfw

ps1-d2-codfw/ps2-d2-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack D2

Servers    Do you need to depool?
lvs2010    powered off
ms-fe2012
moss-fe2002
ms-backup2002
thanos-fe2003
elastic2052
elastic2051
backup2001-array2
elastic2050
backup2001-array1
backup2001
ms-be2043
ms-be2038
ms-be2037
ms-be2069
ms-be2065
ms-be2061

ps1-d3-codfw/ps2-d3-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack D3

Servers    Do you need to depool?
mw2376
mw2375
mw2374
mw2373
mw2372
mw2371
mw2370
mw2369
mw2368
mw2367
mw2366
es2023
ganeti2016
db2119
db2118
kubernetes2022
thumbor2006
maps2008
restbase-dev2003
mw2279
mw2278
mw2277
mw2276
mw2275
mw2274
mw2273
mw2272
mw2271

ps1-d4-codfw/ps2-d4-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack D4

Servers           Do you need to depool?
elastic2085       not in production/ready for work
elastic2084       not in production/ready for work
mc2052
mc2051
elastic2072       powered off/ready for work
dbprov2003        already shut down
cp2040            powered off
cp2039            powered off
dns2002           powered off
mc-gp2003
logstash2029
kafka-main2004
sessionstore2003
mw2290
mw2289
mw2288
mw2287
mw2286
mw2285
mw2284
mw2283
mw2282
mw2281
wdqs2006          powered off/ready for work
ores2008          powered off/ready for work
mc2033

ps1-d5-codfw/ps2-d5-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack D5

Servers            Do you need to depool?
logstash2003
restbase2018
db2093
netmon2001
ores2009           powered off/ready for work
mc2035
mc2034
elastic2036        powered off/ready for work
puppetmaster2002
wdqs2002           powered off/ready for work
db2172
krb2002
gitlab-runner2004
gerrit2001
restbase2027
ml-serve2008       powered off/ready for work
parse2017
restbase2026
rdb2010
parse2016
db2129
ganeti2017
db2120

ps1-d6-codfw/ps2-d6-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack D6

Servers    Do you need to depool?
aqs2012
aqs2011
aqs2010
aqs2009
dbproxy2004
db2130
db2101
ganeti2026
ml-serve2004
maps2010
kubernetes2014
kubernetes2013
db2140

ps1-d7-codfw/ps2-d7-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack D7

Servers         Do you need to depool?
elastic2086     powered off/ready for work
mc2054
mc2053
wdqs2012        powered off
elastic2068     powered off/ready for work
elastic2067     powered off/ready for work
cp2042          powered off
cp2041          powered off
elastic2060     powered off/ready for work
ms-be2056
kafka-main2005
elastic2054     powered off/ready for work
elastic2053     powered off/ready for work
ms-be2050
ms-be2039
backup2007      already shutdown
ms-be2059
thanos-be2004

ps1-d8-codfw/ps2-d8-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack D8

Servers    Do you need to depool?
db2181     powered off
db2182     powered off
db2174
db2173
gerrit2002
parse2020
parse2019
parse2018
restbase2023
db2131
ganeti2018
theemin
conf2006
krb2001
db2152
mc2036

Event Timeline

Change 809177 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new PDU model to ps1-d1-codfw

https://gerrit.wikimedia.org/r/809177

Change 809177 merged by Papaul:

[operations/puppet@production] Add new PDU model to ps1-d1-codfw

https://gerrit.wikimedia.org/r/809177

Marostegui added a subscriber: jcrespo.
Marostegui subscribed.

All mysql hosts need mysql to be stopped before the maintenance.

For D7, please ping @jbond once done so he can confirm the ms-be* nodes have come back up OK.

In rack D2, ms-fe2012 needs depooling before the power goes, and if you could ping me once the rack is done so I can check all the ms-be* nodes come back up again, that'd be kind.

Change 820369 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Downtime D3 databases

https://gerrit.wikimedia.org/r/820369

Change 820369 merged by Marostegui:

[operations/puppet@production] mariadb: Downtime D3 databases

https://gerrit.wikimedia.org/r/820369

All db*, es* hosts powered off.

All the ms-* nodes in C4 & C7 must be back and properly in service before we can start on D2, I'm afraid. I'll be on IRC, but please don't start on D2 until I've OK'd the state of the C ms nodes.

[from my pov D3 needn't be thus blocked as it has no swift nodes in]

Mentioned in SAL (#wikimedia-operations) [2022-08-04T17:12:02Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2050.codfw.wmnet with reason: T310146

Mentioned in SAL (#wikimedia-operations) [2022-08-04T17:12:16Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2050.codfw.wmnet with reason: T310146

Icinga downtime and Alertmanager silence (ID=05fabe0d-d28a-48be-a4e5-7d427e293a41) set by mvernon@cumin1001 for 1 day, 0:00:00 on 9 host(s) and their services with reason: PDU work

moss-fe2002.codfw.wmnet,ms-be[2037-2038,2043,2061,2065,2069].codfw.wmnet,ms-fe2012.codfw.wmnet,thanos-fe2003.codfw.wmnet
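
The bracketed host notation in these log entries (for example `ms-be[2037-2038,2043]`) packs several hostnames into one expression. As a rough illustration of how to read it, here is a simplified expander. It is not the parser used by the tooling that produced the log, it handles only one bracket group per item, and the function names are hypothetical.

```python
import re


def split_top_level(expr):
    """Split on commas that are not inside square brackets."""
    parts, depth, current = [], 0, []
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand_item(item):
    """Expand one 'prefix[a-b,c]suffix' item; plain hostnames pass through."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", item)
    if not match:
        return [item]
    prefix, body, suffix = match.groups()
    hosts = []
    for piece in body.split(","):
        if "-" in piece:
            start, end = piece.split("-")
            width = len(start)  # keep any zero padding from the range start
            for number in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{str(number).zfill(width)}{suffix}")
        else:
            hosts.append(f"{prefix}{piece}{suffix}")
    return hosts


def expand_hosts(expr):
    """Expand a comma-separated host expression into individual hostnames."""
    hosts = []
    for item in split_top_level(expr):
        hosts.extend(expand_item(item.strip()))
    return hosts


if __name__ == "__main__":
    line = (
        "moss-fe2002.codfw.wmnet,"
        "ms-be[2037-2038,2043,2061,2065,2069].codfw.wmnet,"
        "ms-fe2012.codfw.wmnet,thanos-fe2003.codfw.wmnet"
    )
    for host in expand_hosts(line):
        print(host)
```

Run on the host list above, the sketch yields nine hostnames, which agrees with the "9 host(s)" count in the downtime message.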

Mentioned in SAL (#wikimedia-operations) [2022-08-04T17:16:08Z] <Emperor> shutdown of moss-fe2002.codfw.wmnet,ms-be20[37,38,43,61,65,69].codfw.wmnet,ms-fe2012.codfw.wmnet,thanos-fe2003.codfw.wmnet for power work T310146

Mentioned in SAL (#wikimedia-operations) [2022-08-04T19:31:16Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2071.codfw.wmnet with reason: T310146

Mentioned in SAL (#wikimedia-operations) [2022-08-04T19:31:29Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2071.codfw.wmnet with reason: T310146

Change 820592 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new PDU model for ps1-d3-codfw

https://gerrit.wikimedia.org/r/820592

Change 820592 merged by Papaul:

[operations/puppet@production] Add new PDU model for ps1-d3-codfw

https://gerrit.wikimedia.org/r/820592

Mentioned in SAL (#wikimedia-operations) [2022-08-05T11:37:29Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repool after PDU maint on D3 (T310146)', diff saved to https://phabricator.wikimedia.org/P32291 and previous config saved to /var/cache/conftool/dbconfig/20220805-113729-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-08-09T21:57:00Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: T310146

Mentioned in SAL (#wikimedia-operations) [2022-08-09T21:57:24Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2006.codfw.wmnet with reason: T310146

bking updated the task description.

Mentioned in SAL (#wikimedia-operations) [2022-08-10T08:49:28Z] <jynus> shutdown dbprov2003 before pdu upgrade T310146

Mentioned in SAL (#wikimedia-operations) [2022-08-10T09:10:39Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool D5 dbs (T310146)', diff saved to https://phabricator.wikimedia.org/P32339 and previous config saved to /var/cache/conftool/dbconfig/20220810-091038-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-08-10T09:15:25Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint (T310146)

Mentioned in SAL (#wikimedia-operations) [2022-08-10T09:15:40Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2093,2120,2129,2172].codfw.wmnet with reason: D5 PDU maint (T310146)

Mentioned in SAL (#wikimedia-operations) [2022-08-10T09:28:42Z] <jynus> shutdown backup2007 before pdu upgrade T310146

Mentioned in SAL (#wikimedia-operations) [2022-08-10T09:34:33Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool D6 dbs (T310146)', diff saved to https://phabricator.wikimedia.org/P32340 and previous config saved to /var/cache/conftool/dbconfig/20220810-093433-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-08-10T09:36:00Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint (T310146)

Mentioned in SAL (#wikimedia-operations) [2022-08-10T09:36:15Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2101,2130,2140].codfw.wmnet,dbproxy2004.codfw.wmnet with reason: D6 PDU maint (T310146)

Mentioned in SAL (#wikimedia-operations) [2022-08-10T09:51:00Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool D8 DBs for PDU maint (T310146)', diff saved to https://phabricator.wikimedia.org/P32341 and previous config saved to /var/cache/conftool/dbconfig/20220810-095059-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-08-10T09:52:51Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint (T310146)

Mentioned in SAL (#wikimedia-operations) [2022-08-10T09:53:07Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: D8 PDU Maint (T310146)

I removed db2181 and db2182 from D8 list because they have been decommissioned recently (after creation of this task): T311623: decommission db2081 and T313003: decommission db2082

My bad, these were db2081 and db2082, may they rest in peace.

Mentioned in SAL (#wikimedia-operations) [2022-08-10T10:31:14Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint (T310146)

Mentioned in SAL (#wikimedia-operations) [2022-08-10T10:31:18Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2181-2182].codfw.wmnet with reason: D6 PDU maint (T310146)

Mentioned in SAL (#wikimedia-operations) [2022-08-10T12:37:39Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: T310146

Mentioned in SAL (#wikimedia-operations) [2022-08-10T12:37:54Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[2072,2084-2085].codfw.wmnet with reason: T310146

Change 821742 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] Depool codfw for PDU upgrade (row D)

https://gerrit.wikimedia.org/r/821742

Mentioned in SAL (#wikimedia-operations) [2022-08-10T13:30:02Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: T310146

Mentioned in SAL (#wikimedia-operations) [2022-08-10T13:30:19Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: T310146

Change 821742 merged by Ssingh:

[operations/dns@master] Depool codfw for PDU upgrade (row D)

https://gerrit.wikimedia.org/r/821742

Mentioned in SAL (#wikimedia-operations) [2022-08-10T14:43:30Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint (T310146)

Mentioned in SAL (#wikimedia-operations) [2022-08-10T14:43:43Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: PDU Maint (T310146)

Icinga downtime and Alertmanager silence (ID=ba2eda0b-8bbe-4755-9a59-5480b01ae495) set by mvernon@cumin1001 for 1 day, 0:00:00 on 5 host(s) and their services with reason: PDU work

ms-be[2039,2050,2056,2059].codfw.wmnet,thanos-be2004.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-08-10T18:05:29Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repool D8 DBs after PDU maint (T310146)', diff saved to https://phabricator.wikimedia.org/P32346 and previous config saved to /var/cache/conftool/dbconfig/20220810-180529-ladsgroup.json

Change 822174 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new PDU model for ps1-d[4-8]-codfw

https://gerrit.wikimedia.org/r/822174

Change 822174 merged by Papaul:

[operations/puppet@production] Add new PDU model for ps1-d[4-8]-codfw

https://gerrit.wikimedia.org/r/822174

Papaul updated the task description.

Row D maintenance complete