
(Need By:TBD) rack/setup/install row B new PDUs
Closed, Resolved, Public

Description

This task will track the racking, setup and configuration of all PDUs in row B. We do not expect this work to cause any servers to go down. The only thing to keep in mind is that while the PDUs in a rack are being replaced, the management switch in that rack will not be available.

If you are a service owner and think that you need to depool your server(s) during the maintenance window, please put a "YES" in the "List of Servers and network devices" table below.
Thanks

Schedule

| Rack | Date | Time | Comments |
| --- | --- | --- | --- |
| B1 | August 2nd | 10:00am CT / 3:00pm UTC | |
| B2 | August 2nd | 10:30am CT / 3:30pm UTC | |
| B3 | August 3rd | 9:30am CT / 2:30pm UTC | |
| B4 | August 2nd | 11:00am CT / 4:00pm UTC | |
| B5 | August 2nd | 11:30am CT / 4:30pm UTC | |
| B6 | August 3rd | 10:00am CT / 3:00pm UTC | |
| B7 | August 3rd | 10:30am CT / 3:30pm UTC | |
| B8 | August 3rd | 11:00am CT / 4:00pm UTC | |
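
The CT/UTC pairs in the schedule can be double-checked with a few lines of Python. This is only a sketch: it assumes "CT" means US Central Daylight Time (UTC-5, in effect in early August) and that the year is 2022, as the event timeline suggests. The helper name `ct_to_utc` is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "CT" in the schedule is Central Daylight Time (UTC-5).
CDT = timezone(timedelta(hours=-5), "CDT")


def ct_to_utc(year, month, day, hour, minute):
    """Convert a Central Daylight Time wall-clock time to UTC."""
    local = datetime(year, month, day, hour, minute, tzinfo=CDT)
    return local.astimezone(timezone.utc)


# Rack B1: August 2nd, 10:00am CT (year 2022 is an assumption).
b1 = ct_to_utc(2022, 8, 2, 10, 0)
print(b1.strftime("%Y-%m-%d %H:%M UTC"))  # 2022-08-02 15:00 UTC
```

A fixed UTC-5 offset is used instead of a named time zone so the snippet does not depend on a time zone database being installed.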

Per PDU setup Checklist

ps1-b1-codfw/ps2-b1-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU
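
The per-rack checklist headings on this task follow a `ps1-<rack>-codfw/ps2-<rack>-codfw` pattern. As a rough sketch, assuming that pattern holds for all eight racks in row B, the headings could be generated as shown here (the function `pdu_hostnames` is hypothetical, not an existing tool):

```python
def pdu_hostnames(row, rack, site="codfw"):
    """Return the (ps1, ps2) PDU hostnames for one rack.

    Assumption: names follow ps<N>-<row><rack>-<site>, as in the
    checklist headings on this task.
    """
    return tuple(f"ps{n}-{row}{rack}-{site}" for n in (1, 2))


# Racks B1 through B8, matching the schedule table.
for rack in range(1, 9):
    ps1, ps2 = pdu_hostnames("b", rack)
    print(f"{ps1}/{ps2}")
```

Generating the names from one pattern makes it easier to spot a heading that does not match.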

List of Servers and network devices in rack B1

| Server | Depool and power down? |
| --- | --- |
| asw-b1-codfw | |
| cloudcephmon2004-dev | yes |
| cloudcephosd2001-dev | yes |
| cloudcephosd2002-dev | yes |
| cloudcephosd2003-dev | yes |
| cloudcontrol2001-dev | yes |
| cloudgw2001-dev | yes |
| cloudgw2002-dev | yes |
| cloudvirt2001-dev | yes |
| cloudvirt2002-dev | yes |
| cloudvirt2003-dev | yes |
| cloudcephmon2005-dev | yes |
| cloudcephmon2006-dev | yes |
| cloudcontrol2005-dev | yes |
| clouddb2002-dev | yes |
| cloudgw2003-dev | yes |
| cloudnet2005-dev | yes |
| cloudnet2006-dev | yes |
| cloudservices2004-dev | yes |
| cloudservices2005-dev | yes |
| cloudweb2002-dev | yes |

ps1-b2-codfw/ps2-b2-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack B2

| Server | Depool and power down? |
| --- | --- |
| asw-b2-codfw | |
| cp2031 | |
| cp2032 | |
| elastic2041 | |
| elastic2042 | |
| elastic2057 | |
| lvs2008 | |
| ms-be2031 | |
| ms-be2032 | |
| ms-be2041 | |
| ms-be2046 | |
| ms-fe2010 | |

ps1-b3-codfw/ps2-b3-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack B3

| Server | Depool and power down? |
| --- | --- |
| asw-b3-codfw | |
| conf2004 | NO, should be powered off only right before the maintenance, and on right after |
| db2108 | Yes |
| db2123 | Yes, s5 master, needs downtiming |
| es2021 | Yes, es4 master, needs downtiming |
| mw2254 | Decommed, don't power on, T313730 |
| mw2255 | Decommed, don't power on, T313730 |
| mw2257 | Decommed, don't power on, T313730 |
| mw2258 | Decommed, don't power on, T313730 |
| mw2259 | Yes |
| mw2260 | Yes |
| mw2261 | Yes |
| mw2262 | Yes |
| mw2263 | Yes |
| mw2264 | Yes |
| mw2265 | Yes |
| mw2266 | Yes |
| mw2267 | Yes |
| mw2268 | Yes |
| mw2269 | Yes |
| mw2270 | Yes |
| mw2310 | Yes |
| mw2311 | Yes |
| mw2312 | Yes |
| mw2313 | Yes |
| mw2314 | Yes |
| mw2315 | Yes |
| mw2316 | Yes |
| mw2317 | Yes |
| mw2318 | Yes |
| mw2319 | Yes |
| mw2320 | Yes |
| mw2321 | Yes |
| mw2322 | Yes |
| mw2323 | Yes |
| mw2324 | Yes |
| ores2003 | |
| restbase2021 | Yes |
| thumbor2003 | NO, should be powered off only right before the maintenance, and on right after |
| thumbor2004 | NO, should be powered off only right before the maintenance, and on right after |

ps1-b4-codfw/ps2-b4-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack B4

| Server | Depool and power down? |
| --- | --- |
| asw-b4-codfw | |
| backup2005 | |
| cp2033 | |
| cp2034 | |
| dbprov2002 | |
| elastic2058 | |
| kafka-main2002 | |
| mc-gp2002 | |
| ms-be2053 | |
| ms-be2057 | |
| ms-be2063 | |
| sessionstore2001 | |

ps1-b5-codfw/ps2-b5-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack B5

| Server | Depool and power down? |
| --- | --- |
| asw-b5-codfw | |
| bast2002 | |
| centrallog2002 | |
| db2086 | No, will be decommissioned before the date |
| db2107 | Yes |
| db2109 | Yes |
| db2137 | Yes |
| db2143 | Yes |
| db2147 | Yes |
| db2159 | Yes, sanitarium master s7, needs downtiming |
| db2160 | Yes |
| db2177 | No, host not provisioned yet |
| db2178 | No, host not provisioned yet |
| elastic2028 | |
| ganeti2021 | |
| ganeti2022 | |
| mc2024 | |
| ml-serve2002 | |
| ores2004 | |
| parse2006 | |
| parse2007 | |
| pc2012 | Yes, pc2 master, needs downtiming |
| prometheus2005 | |
| puppetmaster2003 | |
| restbase2013 | |
| restbase2019 | |
| wdqs2005 | |

ps1-b6-codfw/ps2-b6-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack B6

| Server | Depool and power down? |
| --- | --- |
| asw-b6-codfw | |
| db2072 | Yes, sanitarium master for s1 |
| db2096 | Yes |
| db2098 | Yes |
| db2110 | Yes, s4 master, needs downtiming |
| db2111 | Yes |
| db2124 | Yes |
| db2134 | Yes, m3 master, needs downtiming |
| db2161 | Yes |
| db2162 | Yes |
| dbproxy2002 | Needs downtiming |
| kubernetes2009 | Yes |
| kubernetes2010 | Yes |
| mc2023 | NO, will need to be powered off right before the maintenance, and back on right after |
| ml-serve2006 | |
| mw2325 | Yes |
| mw2326 | Yes |
| mw2327 | Yes |
| mw2328 | Yes |
| mw2329 | Yes |
| mw2330 | Yes |
| mw2331 | Yes |
| mw2332 | Yes |
| mw2333 | Yes |
| mw2334 | Yes |
| rdb2008 | Yes |

ps1-b7-codfw/ps2-b7-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack B7

| Server | Depool and power down? |
| --- | --- |
| elastic2080 | Not in production; work on it anytime |
| elastic2079 | Not in production; work on it anytime |
| mc2046 | Yes |
| elastic2043 | Yes |
| elastic2044 | Yes |
| furud-array7 | |
| furud-array6 | |
| furud-array5 | |
| furud-array4 | |
| furud-array3 | |
| furud-array2 | |
| furud-array1 | |
| ms-be2033 | |
| furud | |
| thanos-be2002 | |
| ms-be2047 | |

ps1-b8-codfw/ps2-b8-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack B8

| Server | Depool and power down? |
| --- | --- |
| db2164 | |
| gitlab-runner2002 | |
| es2030 | |
| es2029 | |
| mc2026 | NO, will need to be powered off right before the maintenance, and back on right after |
| mc2025 | NO, will need to be powered off right before the maintenance, and back on right after |
| elastic2030 | Yes |
| kubestage2002 | Yes |
| ganeti2020 | |
| ganeti2019 | |
| db2163 | |
| db2148 | |
| wdqs2007 | |
| es2025 | |
| elastic2029 | Yes |
| restbase2010 | Yes |
| parse2010 | Yes |
| parse2009 | Yes |
| parse2008 | Yes |
| restbase2014 | Yes |
| clouddb2001-dev | |

Event Timeline

Papaul triaged this task as Medium priority. Jun 13 2022, 1:29 PM

Hi,
In B2, ms-fe2010 and thanos-fe2002 will need depooling.

Please make sure the ms nodes in Rack A7 (ms-be2030 ms-be2045 ms-be2052) are all fully OK before starting on rack B2 or B4 (B1 and B5 have no ms nodes in them, so there is no issue there).

Marostegui moved this task from Triage to Ready on the DBA board.

All mysql hosts need mysql to be stopped before the maintenance.

Change 819497 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Disable notifications rack b5

https://gerrit.wikimedia.org/r/819497

Change 819497 merged by Marostegui:

[operations/puppet@production] mariadb: Disable notifications rack b5

https://gerrit.wikimedia.org/r/819497

Mentioned in SAL (#wikimedia-operations) [2022-08-02T08:46:14Z] <marostegui> stop mysql on db2095 db2107 db2109 db2137 db2147 db2159 db2160 pc2012 for pdu maintenance on codfw b5 T310070

Change 819502 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2143: Disable notifications

https://gerrit.wikimedia.org/r/819502

Change 819502 merged by Marostegui:

[operations/puppet@production] db2143: Disable notifications

https://gerrit.wikimedia.org/r/819502

@Papaul all hosts in B5 have mysql stopped, so you can power them off as needed.

Mentioned in SAL (#wikimedia-operations) [2022-08-02T10:05:40Z] <jynus> shutdown dbprov2002 backup2005 backup2008 T310070

Change 819591 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] Depool codfw for PDU upgrade

https://gerrit.wikimedia.org/r/819591

Change 819591 merged by Ssingh:

[operations/dns@master] Depool codfw for PDU upgrade

https://gerrit.wikimedia.org/r/819591

Mentioned in SAL (#wikimedia-operations) [2022-08-02T13:53:34Z] <godog> depool and poweroff prometheus2005 - T310070

Mentioned in SAL (#wikimedia-operations) [2022-08-02T13:56:14Z] <godog> schedule poweroff for centrallog2002 at 16 utc - T310070

Mentioned in SAL (#wikimedia-operations) [2022-08-02T15:06:17Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 4:00:00 on mc-gp2002.codfw.wmnet with reason: Power down for PDU maintenance, T310070

Mentioned in SAL (#wikimedia-operations) [2022-08-02T15:06:32Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc-gp2002.codfw.wmnet with reason: Power down for PDU maintenance, T310070

Mentioned in SAL (#wikimedia-operations) [2022-08-02T15:10:29Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 4:00:00 on ganeti-test[2001-2003].codfw.wmnet with reason: Power down for PDU maintenance, T310070

Mentioned in SAL (#wikimedia-operations) [2022-08-02T15:10:48Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ganeti-test[2001-2003].codfw.wmnet with reason: Power down for PDU maintenance, T310070

Mentioned in SAL (#wikimedia-operations) [2022-08-02T17:31:58Z] <ryankemper@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on elastic[2041-2042,2057].codfw.wmnet with reason: T310070

Mentioned in SAL (#wikimedia-operations) [2022-08-02T17:32:13Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic[2041-2042,2057].codfw.wmnet with reason: T310070
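
The log entries above use a folded host notation such as `elastic[2041-2042,2057].codfw.wmnet`. For readers unfamiliar with it, here is a minimal Python sketch that expands one such pattern into individual hostnames. It is not the parser the cookbooks themselves use. The function name `expand_hosts` is hypothetical, and it handles only a single bracket expression per pattern, not a comma-separated list of patterns.

```python
import re


def expand_hosts(pattern):
    """Expand one bracketed host pattern into a list of hostnames."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]  # no bracket expression: treat as a single host
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        # Each part is either "start-end" or a single number.
        start, _, end = part.partition("-")
        end = end or start
        width = len(start)  # preserve any zero padding
        for n in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{str(n).zfill(width)}{suffix}")
    return hosts


for host in expand_hosts("elastic[2041-2042,2057].codfw.wmnet"):
    print(host)
# elastic2041.codfw.wmnet
# elastic2042.codfw.wmnet
# elastic2057.codfw.wmnet
```

The same function also covers the shorter forms that appear in the timeline, for example `ganeti-test[2001-2003].codfw.wmnet`.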

Change 819763 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new PDU model

https://gerrit.wikimedia.org/r/819763

Change 819763 merged by Papaul:

[operations/puppet@production] Add new PDU model

https://gerrit.wikimedia.org/r/819763

Mentioned in SAL (#wikimedia-operations) [2022-08-03T06:45:59Z] <godog> power up centrallog2002 and prometheus2005 - T310070

Change 820066 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Disable notifications on codfw racks

https://gerrit.wikimedia.org/r/820066

Change 820066 merged by Marostegui:

[operations/puppet@production] mariadb: Disable notifications on codfw racks

https://gerrit.wikimedia.org/r/820066

Databases in the remaining B* racks are ready

Mentioned in SAL (#wikimedia-operations) [2022-08-03T09:04:35Z] <jynus> stop backup2006 backup2009 for T310070

Change 820086 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] lvs: Use conf2005 in codfw

https://gerrit.wikimedia.org/r/820086

Change 820086 merged by Vgutierrez:

[operations/puppet@production] lvs: Use conf2005 in codfw

https://gerrit.wikimedia.org/r/820086

Mentioned in SAL (#wikimedia-operations) [2022-08-03T09:43:47Z] <vgutierrez> rolling restart of pybal in codfw lvs instances - T310070

Mentioned in SAL (#wikimedia-operations) [2022-08-03T12:59:10Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2043.codfw.wmnet with reason: T310070

Mentioned in SAL (#wikimedia-operations) [2022-08-03T12:59:24Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2043.codfw.wmnet with reason: T310070

Mentioned in SAL (#wikimedia-operations) [2022-08-03T13:05:04Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2044.codfw.wmnet with reason: T310070

Mentioned in SAL (#wikimedia-operations) [2022-08-03T13:05:18Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2044.codfw.wmnet with reason: T310070
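
The downtime durations in these entries (`4:00:00`, `2:00:00`, `1 day, 0:00:00`) have the same shape as Python's `str(timedelta)` output. The sketch below estimates when a downtime expires, assuming it begins at the timestamp of the START entry, which is an approximation. `parse_duration` is a hypothetical helper written for this example.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_duration(text):
    """Parse 'H:MM:SS' or 'N day(s), H:MM:SS' into a timedelta."""
    match = re.fullmatch(r"(?:(\d+) days?, )?(\d+):(\d{2}):(\d{2})", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


# START entry for elastic2044: 2022-08-03T13:05:04Z, duration "1 day, 0:00:00".
start = datetime(2022, 8, 3, 13, 5, 4, tzinfo=timezone.utc)
print(start + parse_duration("1 day, 0:00:00"))  # 2022-08-04 13:05:04+00:00
```

This can help confirm that a downtime window comfortably covers the scheduled maintenance slot for a rack.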

Change 820123 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new pdu model for pdu in rack b3 b6-b8 and c1

https://gerrit.wikimedia.org/r/820123

Icinga downtime and Alertmanager silence (ID=664eda2d-5203-44ca-92c1-3213c3996b5f) set by mvernon@cumin1001 for 1 day, 0:00:00 on 4 host(s) and their services with reason: PDU work

aqs[2005-2008].codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-08-03T14:32:08Z] <Emperor> shutdown aqs200[5-8] prior to PDU work T310070

Mentioned in SAL (#wikimedia-operations) [2022-08-03T14:33:51Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2029.codfw.wmnet with reason: T310070

Mentioned in SAL (#wikimedia-operations) [2022-08-03T14:34:05Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2029.codfw.wmnet with reason: T310070

Mentioned in SAL (#wikimedia-operations) [2022-08-03T15:19:01Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2030.codfw.wmnet with reason: T310070

Mentioned in SAL (#wikimedia-operations) [2022-08-03T15:19:15Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2030.codfw.wmnet with reason: T310070

Icinga downtime and Alertmanager silence (ID=353f1e46-07cd-47d6-9a06-44c3a93b5b51) set by mvernon@cumin1001 for 1 day, 0:00:00 on 3 host(s) and their services with reason: PDU work

ms-be[2033,2047].codfw.wmnet,thanos-be2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-08-03T15:59:09Z] <Emperor> shutdown ms-be20[33,47],thanos-be2002 prior to PDU work T310070

Change 820123 merged by Papaul:

[operations/puppet@production] Add new pdu model for pdu in rack b3 b6-b8 and c1

https://gerrit.wikimedia.org/r/820123

Mentioned in SAL (#wikimedia-operations) [2022-08-09T09:12:23Z] <vgutierrez> rolling restart of pybal in codfw - T310070

Mentioned in SAL (#wikimedia-operations) [2022-08-09T09:53:05Z] <vgutierrez> rolling restart of pybal in eqsin - T310070