codfw row C switches upgrade
For the reasons detailed in T327248 (eqiad/codfw virtual-chassis upgrades), we're going to upgrade the codfw row C switches during the scheduled DC switchover.
Scheduled for May 2nd, 13:00-15:00 UTC. Please let us know if there is any issue with the scheduled time.
This means a 30-minute hard downtime for the whole row if everything goes well (in practice, closer to 15 minutes). It is also a good opportunity to test the host depool mechanisms and the row redundancy of services.
The impacted servers and teams for this row are listed below.
The "action needed" fields are free-form. For example:
- NONE if no action is needed
- the cookbook/command to run, if it can be done by a third party
- who will be around to take care of the depool
- a link to the relevant doc
- etc.
The two main types of action needed are depooling and monitoring downtime.
All servers will be downtimed with sudo cookbook sre.hosts.downtime --hours 2 -r "codfw row C upgrade" -t T334049 'P{P:netbox::host%location ~ "C.*codfw"}', but specific services might need their own downtimes.
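The server tables below use a compact bracket notation for groups of hosts, such as ores[2005-2006] or ml-serve[2003,2007]. As a reading aid, here is a minimal Python sketch of how such an expression expands into individual host names. The function name expand_hosts is hypothetical and is not part of any cookbook or of Cumin; the sketch only handles a single bracket group of numbers and ranges, which is all that appears in this task.

```python
import re


def expand_hosts(expr):
    """Expand 'name[a-b,c]' into a list of individual host names.

    Hypothetical helper for illustration only: it supports one bracket
    group containing comma-separated numbers and inclusive ranges.
    """
    expr = expr.strip()
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        # Plain host name such as 'dns2001': nothing to expand.
        return [expr]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.strip().partition("-")
        end = end or start  # a single number is a range of one
        width = len(start)  # keep zero padding if the range uses it
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    for expr in ("ores[2005-2006]", "ml-serve[2003,2007]", "dns2001"):
        print(expr, "->", expand_hosts(expr))
```

For instance, mw[2335-2339,2350-2365,2412-2419,2436-2443] expands to 37 hosts under this reading (5 + 16 + 8 + 8).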
WMCS
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
cloudbackup2002 | |||
cloudcumin2001 | |||
Machine Learning
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
ml-cache2003 | none | none | |
ml-etcd2002 | none | none | |
ml-serve[2003,2007] | none | none | |
ml-serve-ctrl2001 | none | none | |
ml-staging-etcd2003 | none | none | |
ores[2005-2006] | depool | pool | done |
Data Engineering
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
kafka-stretch2001 | None | None | |
schema2003 | depool | pool | Depooled |
Traffic
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
cp[2035-2038] | NOOP, codfw will be depooled | NOOP | |
dns2001 | move ns1 to dns2002 | move ns1 back to dns2001 | DONE |
durum2001 | disable puppet and stop pybal | DONE | |
lvs2009 | NOOP, codfw will be depooled | NOOP | |
Core Platform
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
maps2007 | None | None | |
restbase[2015-2016,2020,2022,2025] | depool | pool | |
sessionstore2002 | None | None | |
ServiceOps-Collab
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
doc2002 | NONE | NONE | |
gitlab-runner2003 | will be paused ahead of maintenance in admin interface | will be unpaused after maintenance in admin interface | unpaused again by @Jelto |
phab2002 | NONE | NONE | |
vrts2001 | NONE | NONE | |
Search Platform
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
elastic[2045-2048,2059,2065-2066,2071,2081-2083] | NONE | NONE | |
wcqs2002 | NONE | NONE | |
wdqs[2008,2011,2017-2019] | NONE | NONE | |
Observability
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
alert2001 | |||
kafka-logging2003 | |||
logstash[2002,2028,2032,2035,2037] | drain shards on 2002, 2028, 2035, 2037; depool 2032 and set downtime | allocate shards, repool | shards allocating, re-pooled |
mwlog2002 | n/a | n/a | no action |
prometheus2006 | depool and remove from AM | pool and put back in AM | incomplete |
thanos-fe2004 | n/a | n/a | not in production yet |
webperf2003 | no action | ||
Infrastructure Foundations
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
build2001 | NONE | NONE | |
cumin2002 | NONE | NONE | |
debmonitor2002 | NONE | NONE | |
failoid2002 | NONE | NONE | |
ganeti[2009-2014] | NONE | NONE | |
idp-test2002 | NONE | NONE | |
ldap-replica2005 | depooled | repooled | OK |
netflow2002 | None | None | |
puppetboard2002 | NONE | NONE | |
puppetmaster2005 | NONE | NONE | |
urldownloader2003 | NONE | NONE | |
Data Persistence
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
backup[2003,2006,2009] | mediabackups should be paused | restart codfw media backups | |
cassandra-dev2002 | None | None | |
db[2099,2102,2112-2116,2125-2127,2135,2138,2141,2144,2149-2150,2165-2169,2179-2180,2184,2186] | Nothing - codfw will be depooled | ||
dbprov2004 | db backups should be paused | restart db backups, if any | |
dbproxy2003 | Nothing, not active | ||
es[2022,2031-2032] | Nothing - codfw will be depooled | ||
moss-fe2001 | None | None | |
ms-backup2001 | mediabackups should be paused | restart codfw media backups | |
ms-be[2042,2048-2049,2054-2055,2058,2064,2068,2072] | None | None | |
ms-fe2011 | None | None | |
pc2013 | Nothing - codfw will be depooled | ||
thanos-be2003 | None | None | |
ServiceOps
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
conf2005 | |||
deploy2002 | |||
dragonfly-supernode2001 | |||
kafka-main2003 | |||
kubernetes[2011-2012,2015,2017,2021] | |||
kubestagetcd2002 | |||
kubetcd2005 | |||
mc[2047-2050] | |||
mc-wf2001 | |||
mw[2335-2339,2350-2365,2412-2419,2436-2443] | |||
mwmaint2002 | |||
parse[2011-2015] | |||
rdb2009 | |||
No owner
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
irc2002 | NONE | NONE | |