
(Need By:TBD) rack/setup/install row A new PDUs
Open, Medium, Public

Description

This task tracks the racking, setup, and configuration of the new PDUs in row A. For every rack other than the network racks (A1 and A8), the replacement should not cause any servers to go down. The one thing to keep in mind is that the management switch in a rack will be unavailable while that rack's PDUs are being replaced.

If you are a service owner and think that you need to depool your server(s) during the maintenance window, please put a "YES" in the "List of Servers and network devices" table below.
Thanks

Schedule

| Rack | Date | Time | Comments |
| --- | --- | --- | --- |
| A1 | | | Waiting for PDUs |
| A2 | June 21st | 9:30am CT / 2:30pm UTC | ~2 hours to complete |
| A3 | June 23rd | 9:30am CT / 2:30pm UTC | ~2 hours 15 minutes to complete |
| A4 | June 30th | 9:30am CT / 2:30pm UTC | ~1 hour 15 minutes to complete |
| A5 | July 12th | 9:30am CT / 2:30pm UTC | CY1 disconnected the whole rack by mistake |
| A6 | July 14th | 9:30am CT / 2:30pm UTC | ~1 hour 45 minutes |
| A7 | August 2nd | 9:30am CT / 2:30pm UTC | |
| A8 | | | Waiting for PDUs |
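
As a quick sanity check on the "9:30am CT / 2:30pm UTC" pairing in the schedule, here is a minimal sketch of the conversion. It assumes "CT" means Central Daylight Time (UTC-5), which applies to all of the June through August dates listed, and it takes the year 2022 from the event timeline. The helper name `window_start_utc` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "CT" in the schedule is Central Daylight Time (UTC-5),
# which US Central observes on every date in the table.
CDT = timezone(timedelta(hours=-5), name="CDT")


def window_start_utc(year: int, month: int, day: int) -> datetime:
    """Return the 9:30am CT start of a maintenance window as a UTC datetime."""
    local_start = datetime(year, month, day, 9, 30, tzinfo=CDT)
    return local_start.astimezone(timezone.utc)


# Rack A2 was scheduled for June 21st.
a2_start = window_start_utc(2022, 6, 21)
print(a2_start.strftime("%Y-%m-%d %H:%M UTC"))  # 2022-06-21 14:30 UTC
```

A fixed offset is used instead of a named time zone so the sketch does not depend on a time zone database being installed; a window scheduled outside daylight saving time would need UTC-6 instead.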

Per PDU setup Checklist

ps1-a1-codfw/ps2-a1-codfw

Send out a notification to let everybody know that the management network will be unavailable for the whole site. Send a separate notification to net-ops to let them know about the ongoing maintenance.

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" script to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack A1

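| Servers | Do you need to depool? |
| --- | --- |
| db2075 | No, host decommissioned. |
| db2136 | Yes |
| es2026 | Yes |
| gitlab2002 | |
| kubestage2001 | |
| mc2019 | |
| ml-serve2005 | |
| cr1-codfw | |
| mr1-codfw | |
| msw1-codfw | |
| scs-a1-codfw | |
| asw-a1-codfw | |
| msw-a1-codfw | |
| atlas-codfw | |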

ps1-a2-codfw/ps2-a2-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" script to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack A2

| Servers | Do you need to depool? |
| --- | --- |
| authdns2001 | |
| elastic2037 | |
| elastic2038 | |
| elastic2055 | |
| lvs2007 | |
| ms-be2028 | |
| ms-be2029 | |
| ms-be2040 | |
| ms-be2044 | |
| ms-be2051 | |
| thanos-fe2001 | |
| asw-a2-codfw | |
| msw-a2-codfw | |

ps1-a3-codfw/ps2-a3-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" script to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack A3

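| Servers | Do you need to depool? |
| --- | --- |
| db2089 | No, will be decommissioned before the date. |
| db2103 | Yes, master, needs downtime the whole chain |
| db2142 | Yes, x2 master |
| es2020 | Yes |
| mw2291 | |
| mw2292 | |
| mw2293 | |
| mw2294 | |
| mw2295 | |
| mw2296 | |
| mw2297 | |
| mw2298 | |
| mw2299 | |
| mw2300 | |
| mw2377 | |
| mw2378 | |
| mw2379 | |
| mw2380 | |
| mw2381 | |
| mw2382 | |
| mw2383 | |
| mw2384 | |
| mw2385 | |
| mw2386 | |
| mw2387 | |
| mw2388 | |
| mw2389 | |
| mw2390 | |
| mw2391 | |
| mw2392 | |
| mw2393 | |
| mw2394 | |
| mw2395 | |
| mw2396 | |
| mw2397 | |
| mw2398 | |
| mw2399 | |
| mw2400 | |
| mw2401 | |
| asw-a3-codfw | |
| msw-a3-codfw | |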

ps1-a4-codfw/ps2-a4-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" script to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack A4

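| Servers | Do you need to depool? |
| --- | --- |
| asw-a4-codfw | |
| backup2002 | |
| backup2002-array1 | |
| backup2004 | |
| cp2027 | |
| cp2028 | |
| dbprov2001 | |
| ganeti2027 | |
| kafka-main2001 | |
| mc-gp2001 | |
| ms-be2060 | |
| ms-be2062 | |
| msw-a4-codfw | |
| mw2251 | |
| mw2252 | |
| mw2253 | |
| ores2001 | |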

ps1-a5-codfw/ps2-a5-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" script to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack A5

| Servers | Do you need to depool? |
| --- | --- |
| asw-a5-codfw | |
| contint2001 | |
| db2079 | Yes (master, but will be switched over this week: T313798) |
| db2085 | No, host decommissioned |
| db2104 | Yes, s2 master, needs downtime |
| db2121 | Yes, s7 master, needs downtime |
| db2132 | Yes, m1 master, needs downtime |
| db2145 | Yes |
| elastic2025 | |
| ganeti2023 | |
| ganeti2024 | |
| graphite2003 | |
| kubernetes2018 | |
| logstash2001 | |
| maps2005 | |
| mc2020 | |
| ml-serve2001 | |
| msw-a5-codfw | |
| mw2402 | |
| mw2403 | |
| mw2404 | |
| mw2405 | |
| mw2406 | |
| mw2407 | |
| mw2408 | |
| mw2409 | |
| mw2410 | |
| mw2411 | |
| parse2001 | |
| parse2002 | |
| parse2003 | |
| pc2011 | |
| puppetmaster2001 | |
| wdqs2003 | |

ps1-a6-codfw/ps2-a6-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" script to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack A6

| Servers | Do you need to depool? |
| --- | --- |

ps1-a7-codfw/ps2-a7-codfw

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" script to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack A7

| Servers | Do you need to depool? |
| --- | --- |
| asw-a7-codfw | |
| cloudbackup2001 | Done |
| cp2029 | Done |
| cp2030 | Done |
| elastic2039 | |
| elastic2040 | |
| elastic2056 | |
| ganeti2028 | Powered down |
| ms-be2030 | make sure server is up before moving to rack B2/B4 |
| ms-be2045 | make sure server is up before moving to rack B2/B4 |
| ms-be2052 | make sure server is up before moving to rack B2/B4 |
| thanos-be2001 | |

ps1-a8-codfw/ps2-a8-codfw

Send out a notification to net-ops to let them know of the ongoing maintenance.

  • - receive in new PDUs on T303460
  • - apply asset tags to each tower (both primary and link towers) as well as hostname labels.
  • - add new PDUs into netbox
  • - Downtime the old PDU in Icinga
  • - Run the "Move devices attributes" script to move all settings from old PDU to new PDU
  • - Log in to the master PDU and do the configuration
  • - Make sure Icinga is seeing the new PDU

List of Servers and network devices in rack A8

| Servers | Do you need to depool? |
| --- | --- |

Event Timeline

Papaul triaged this task as Medium priority. Jun 6 2022, 4:28 AM

Testing the "Move devices attributes" script before using it on the new PDUs: moving all configuration from ps1-a2-codfw to ps1-a2-codfw-new gives the output below.

[success] [dst] Setting primary_ip4 to 10.193.0.26/16
[info] [src] Removing primary_ip4
[success] [dst] Setting rack to A2
[success] Moved interfaces net
[success] Moved consoleports console0
[success] Updated cable 10036 termination B
[success] All done! WMF5964 replaced WMF5967
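
For anyone repeating this on the remaining racks, the following is a rough sketch of how the script output could be checked mechanically. It only relies on the `[level] message` line shape and the final "All done!" line visible in the output quoted above; any other level the script might print (for example an error level) is an assumption, and `summarize` is a made-up helper name, not part of the script.

```python
import re

# Sample copied from the script output quoted above.
SAMPLE_OUTPUT = """\
[success] [dst] Setting primary_ip4 to 10.193.0.26/16
[info] [src] Removing primary_ip4
[success] [dst] Setting rack to A2
[success] Moved interfaces net
[success] Moved consoleports console0
[success] Updated cable 10036 termination B
[success] All done! WMF5964 replaced WMF5967
"""

# Each line starts with a bracketed level, followed by the message text.
LINE_RE = re.compile(r"^\[(?P<level>\w+)\]\s+(?P<message>.*)$")


def summarize(output: str) -> dict:
    """Count lines per level and note whether the final 'All done!' line appeared."""
    counts = {}
    done = False
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # ignore anything that is not a "[level] message" line
        level = match.group("level")
        counts[level] = counts.get(level, 0) + 1
        if level == "success" and match.group("message").startswith("All done!"):
            done = True
    return {"counts": counts, "done": done}


result = summarize(SAMPLE_OUTPUT)
print(result)  # {'counts': {'success': 6, 'info': 1}, 'done': True}
```

A run where `done` is false, or where an unexpected level shows up in `counts`, would be worth a manual look in Netbox before touching the physical PDU.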

A1: serviceops: gitlab2002 is still in state "in setup". While we were going to change that we will hold back until this is done.

@Dzahn we are not doing rack A1 until maybe the end of the year because we don't have the PDUs for that rack yet; same for A8.

ah ACK, ok, in that case we will just move forward as planned. Thanks Papaul

Change 807171 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new pdu model for ps1-a2-codfw

https://gerrit.wikimedia.org/r/807171

Change 807171 merged by Papaul:

[operations/puppet@production] Add new pdu model for ps1-a2-codfw

https://gerrit.wikimedia.org/r/807171

Change 808023 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new model to ps1-a3-codfw

https://gerrit.wikimedia.org/r/808023

Change 808023 merged by Papaul:

[operations/puppet@production] Add new model to ps1-a3-codfw

https://gerrit.wikimedia.org/r/808023

Change 809977 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] ADD new PDU model to ps1-a4-codfw

https://gerrit.wikimedia.org/r/809977

Change 809977 merged by Papaul:

[operations/puppet@production] ADD new PDU model to ps1-a4-codfw

https://gerrit.wikimedia.org/r/809977

Change 813264 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new pdu model for ps1-a5-codfw

https://gerrit.wikimedia.org/r/813264

Change 813264 merged by Papaul:

[operations/puppet@production] Add new pdu model for ps1-a5-codfw

https://gerrit.wikimedia.org/r/813264

Would it be better if Service Owners depooled and/or down-timed services before the remainder of these?

A2, A4 & A5 have all had power losses during the maintenance (3/4 done so far).

Change 813902 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new PDU model for ps1-a6-codfw

https://gerrit.wikimedia.org/r/813902

Change 813902 merged by Papaul:

[operations/puppet@production] Add new PDU model for ps1-a6-codfw

https://gerrit.wikimedia.org/r/813902

Papaul updated the task description.

I will need to check the state of the swift backends in A7 before it'll be safe to start on B2/4 (but B1/5 have no swift backends in).

All mysql hosts need mysql to be stopped before the maintenance.

Removing the DBA tag as this will only affect A7 and we don't have any DBs there.

Mentioned in SAL (#wikimedia-operations) [2022-08-02T13:30:01Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 4:00:00 on ganeti2028.codfw.wmnet with reason: Power down for PDU maintenance, T309957

Mentioned in SAL (#wikimedia-operations) [2022-08-02T13:30:16Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ganeti2028.codfw.wmnet with reason: Power down for PDU maintenance, T309957
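
The SAL entries in this timeline share a regular shape (UTC timestamp in brackets, then `<user@host>`, then the message), so they can be pulled apart with a small parser. This is only a sketch based on the two lines just above; `SalEntry` and `parse_sal` are illustrative names, and the regex is inferred from these examples rather than from any SAL specification.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class SalEntry(NamedTuple):
    when: datetime
    who: str
    message: str


# Inferred shape: "[<UTC timestamp>Z] <user@host> message"
SAL_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\]\s+"
    r"<(?P<who>[^>]+)>\s+"
    r"(?P<message>.*)$"
)


def parse_sal(line: str) -> Optional[SalEntry]:
    """Extract timestamp, author and message from one SAL line, or None if it does not match."""
    match = SAL_RE.search(line)
    if match is None:
        return None
    when = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    return SalEntry(when, match["who"], match["message"])


start = parse_sal(
    "Mentioned in SAL (#wikimedia-operations) [2022-08-02T13:30:01Z] <jmm@cumin2002> "
    "START - Cookbook sre.hosts.downtime for 4:00:00 on ganeti2028.codfw.wmnet "
    "with reason: Power down for PDU maintenance, T309957"
)
end = parse_sal(
    "Mentioned in SAL (#wikimedia-operations) [2022-08-02T13:30:16Z] <jmm@cumin2002> "
    "END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on "
    "ganeti2028.codfw.wmnet with reason: Power down for PDU maintenance, T309957"
)
elapsed = (end.when - start.when).total_seconds()
print(start.who, elapsed)  # jmm@cumin2002 15.0
```

Note that the 15 seconds is how long the cookbook took to run, not the length of the downtime it scheduled (4:00:00).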

Change 819591 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] Depool codfw for PDU upgrade

https://gerrit.wikimedia.org/r/819591

Change 819591 merged by Ssingh:

[operations/dns@master] Depool codfw for PDU upgrade

https://gerrit.wikimedia.org/r/819591

Icinga downtime and Alertmanager silence (ID=cd0b03ef-75d5-4a98-8161-1d31bb05694f) set by mvernon@cumin2002 for 1:00:00 on 3 host(s) and their services with reason: shutdown for PDU replacement

ms-be[2030,2045,2052].codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-08-02T14:23:29Z] <Emperor> shutdown ms-be20[30,45,52] for PDU work T309957

Mentioned in SAL (#wikimedia-operations) [2022-08-02T14:59:29Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2025.codfw.wmnet with reason: T309957

Mentioned in SAL (#wikimedia-operations) [2022-08-02T14:59:43Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2025.codfw.wmnet with reason: T309957

Mentioned in SAL (#wikimedia-operations) [2022-08-02T15:04:37Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2037.codfw.wmnet with reason: T309957

Mentioned in SAL (#wikimedia-operations) [2022-08-02T15:04:51Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2037.codfw.wmnet with reason: T309957

Icinga downtime and Alertmanager silence (ID=1afb1eaa-338e-4346-baff-e22c312e16f5) set by mvernon@cumin2002 for 3:00:00 on 3 host(s) and their services with reason: shutdown for PDU replacement

ms-be[2030,2045,2052].codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-08-02T15:45:45Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2039.codfw.wmnet with reason: T309957

Mentioned in SAL (#wikimedia-operations) [2022-08-02T15:45:59Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2039.codfw.wmnet with reason: T309957

Mentioned in SAL (#wikimedia-operations) [2022-08-02T15:49:45Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2040.codfw.wmnet with reason: T309957

Mentioned in SAL (#wikimedia-operations) [2022-08-02T15:49:58Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2040.codfw.wmnet with reason: T309957

Mentioned in SAL (#wikimedia-operations) [2022-08-02T15:50:55Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2056.codfw.wmnet with reason: T309957

Mentioned in SAL (#wikimedia-operations) [2022-08-02T15:51:08Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2056.codfw.wmnet with reason: T309957

Icinga downtime and Alertmanager silence (ID=d85c427a-fe27-4337-ba4f-b92100f4ccf6) set by mvernon@cumin2002 for 1 day, 0:00:00 on 6 host(s) and their services with reason: shutdown for PDU replacement

ms-be[2031-2032,2041,2046].codfw.wmnet,ms-fe2010.codfw.wmnet,thanos-fe2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-08-02T17:05:29Z] <Emperor> ms-be20[31,32,41,46].codfw.wmnet,ms-fe2010.codfw.wmnet,thanos-fe2002.codfw.wmnet downtime for PDU work T309957
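
The downtime entries above use a compact bracket notation for host lists, e.g. `ms-be[2031-2032,2041,2046].codfw.wmnet`. Below is a minimal sketch of expanding such an expression into individual hostnames, handy for cross-checking against the per-rack server tables. It is a hand-rolled subset (one bracket group per name, comma lists and numeric ranges only) and not the implementation used by the tooling that wrote these log lines; `expand` and its helpers are illustrative names.

```python
import re

# One optional bracket group per host pattern: prefix[body]suffix
BRACKET_RE = re.compile(r"^(?P<prefix>[^\[\]]*)\[(?P<body>[^\[\]]+)\](?P<suffix>[^\[\]]*)$")


def split_top_level(expression: str) -> list:
    """Split on commas that are not inside square brackets."""
    parts, current, depth = [], [], 0
    for char in expression:
        if char == "[":
            depth += 1
        elif char == "]":
            depth -= 1
        if char == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    if current:
        parts.append("".join(current))
    return parts


def expand_one(pattern: str) -> list:
    """Expand a single pattern such as 'ms-be[2031-2032,2041].codfw.wmnet'."""
    match = BRACKET_RE.match(pattern)
    if match is None:
        return [pattern]  # plain hostname, nothing to expand
    hosts = []
    for item in match["body"].split(","):
        if "-" in item:
            low, high = item.split("-", 1)
            for number in range(int(low), int(high) + 1):
                # zfill keeps any zero padding used in the range bounds
                hosts.append(f"{match['prefix']}{str(number).zfill(len(low))}{match['suffix']}")
        else:
            hosts.append(f"{match['prefix']}{item}{match['suffix']}")
    return hosts


def expand(expression: str) -> list:
    """Expand a comma-separated list of host patterns into individual hostnames."""
    hosts = []
    for pattern in split_top_level(expression):
        hosts.extend(expand_one(pattern.strip()))
    return hosts


hosts = expand(
    "ms-be[2031-2032,2041,2046].codfw.wmnet,ms-fe2010.codfw.wmnet,thanos-fe2002.codfw.wmnet"
)
print(len(hosts))  # 6
```

The count of 6 matches the "6 host(s)" reported in the corresponding Icinga downtime entry.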