codfw row A switches upgrade
For reasons detailed in T327248: eqiad/codfw virtual-chassis upgrades, we're going to upgrade the codfw row A switches.
This is scheduled for Feb 7th, 14:00-16:00 UTC. Please let us know if there is any issue with the scheduled time.
It means a 30-minute hard downtime for the whole row if everything goes well. It is also a good opportunity to test the host depool mechanisms and the row redundancy of services.
The impacted servers and teams for this row are listed below.
The "action needed" columns are free-form:
- please write NONE if no action is needed,
- the cookbook/command to run if it can be done by a 3rd party
- who will be around to take care of the depool
- Link to the relevant doc
- etc
The two main types of action needed are depooling and setting monitoring downtime.
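To illustrate the conventions above, here is a minimal sketch (not part of any existing tooling) of how one might flag rows where the action columns have not been filled in yet. It assumes the tables are available as plain pipe-separated text in the `Servers | Depool action needed | Repool action needed | Status |` layout used in this task. `parse_row` and `unanswered` are hypothetical helper names.

```python
FIELDS = ("servers", "depool", "repool", "status")


def parse_row(line):
    """Split one pipe-separated table row into the four columns used in this task."""
    cells = [cell.strip() for cell in line.split("|")]
    # A trailing pipe yields one extra empty cell; drop it.
    if cells and cells[-1] == "":
        cells.pop()
    # Pad or trim to exactly four columns so short rows still parse.
    cells = (cells + [""] * len(FIELDS))[: len(FIELDS)]
    return dict(zip(FIELDS, cells))


def unanswered(lines):
    """Return (servers, [blank action columns]) for rows with a blank action cell."""
    missing = []
    for line in lines:
        row = parse_row(line)
        blanks = [name for name in ("depool", "repool") if not row[name]]
        if blanks:
            missing.append((row["servers"], blanks))
    return missing


# Rows copied from the tables in this task.
sample = [
    "aqs[2001-2004] | None | None | |",
    "maps2005 | |||",
    "rpki2002 | None | ||",
]
for servers, blanks in unanswered(sample):
    print(f"{servers}: no answer yet for {', '.join(blanks)}")
```

A row marked NONE counts as answered; only truly blank cells are reported.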
Data Engineering
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
aqs[2001-2004] | None | None | |
Observability
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
grafana2001 | set downtime | none | depooled |
kafka-logging2001 | set downtime, stop kafka service | start kafka service, confirm kafka logging dashboard returns green | depooled |
kafkamon2002 | set downtime | none | depooled |
logstash[2001,2023,2026,2033] | conftool 2023, drain shards 2001,2026,2033 | conftool 2023, allocate shards 2001,2026,2033 | all re-pooled |
xhgui2001 | none | none | n/a |
Observability and Data Persistence
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
thanos-fe2001 | conftool depool, while making sure another thanos-fe host is pooled for the thanos-web service | conftool pool, then make sure only one thanos-fe host is pooled for the thanos-web service | |
Search Platform
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
elastic[2037-2040,2055-2056,2061-2062,2069,2073-2076] | None | None | Search team will depool & ban hosts from cluster one day prior to upgrade |
wdqs[2003-2004,2009] | None | None | Search team will depool one day prior to upgrade |
Core Platform
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
maps2005 | | | |
thumbor2005 | | | |
WMCS
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
cloudbackup2001 | NONE | NONE | |
ServiceOps-Collab
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
contint2001 | NONE | NONE | |
doc2001 | NONE | NONE | |
gitlab2002 | NONE | NONE | |
planet2002 | NONE | NONE | |
Data Persistence
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
backup[2002,2004] | They are not a service, but storage. Jaime will make sure earlier in the week they are not active at the time of the maintenance. | Jaime will restart some delayed backups, if any. | no blockers |
db[2094,2097,2103-2106,2121-2122,2132-2133,2136,2142,2145-2146,2153-2158,2175-2176,2183] | All MW need to be depooled, and some masters need to be switched over (misc masters do not need switchover/depooling) | @Marostegui will repool everything | @Marostegui No longer masters: db2103, db2104, db2105, db2121, db2142 - the rest of the masters are misc so they can be ignored - what needs to be depooled is already depooled |
dbprov2001 | They are not a service, but storage. Jaime will make sure earlier in the week they are not active at the time of the maintenance. | None | no blockers |
dbproxy2001 | None | Reload haproxy | |
es[2020,2024,2026-2028] | All need to be depooled (@Marostegui will do it) | @Marostegui will repool everything | Depooled |
moss-be2001 | N/A | N/A | Not currently in production service |
ms-be[2040,2044-2045,2051-2052,2060,2062,2066] | None | None | |
ms-fe2009 | sudo depool | sudo pool | |
pc2011 | To be depooled | To be repooled once it is all done | Already depooled by @Marostegui |
thanos-be2001 | None | None | |
Infrastructure Foundations
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
ganeti[2023-2024,2027-2030] | None | | |
ganeti-test[2001-2003] | None | | |
netbox2002 | None | None | |
netboxdb2002 | None | None | |
pki2001 | None | None | N/A |
puppetdb2002 | sudo cumin 'A:codfw or A:esams or A:ulsfo' 'disable-puppet "Switch reboot: T327925"' | sudo cumin 'A:codfw or A:esams or A:ulsfo' 'enable-puppet "Switch reboot: T327925"' | jbond will handle |
puppetmaster[2001,2004] | sudo cumin 'A:codfw or A:esams or A:ulsfo' 'disable-puppet "Switch reboot: T327925"' | sudo cumin 'A:codfw or A:esams or A:ulsfo' 'enable-puppet "Switch reboot: T327925"' | jbond will handle |
rpki2002 | None | | |
test-reimage2001 | None | | |
testvm[2001-2005] | None | | |
urldownloader2001 | None | | |
Infrastructure Foundations and Observability
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
netmon2002 | None | None | |
Machine Learning
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
ml-cache2001 | - | - | |
ml-serve[2001,2005] | - | - | |
ml-staging2001 | - | - | |
ml-staging-etcd2001 | - | - | |
ores[2001-2002] | sudo -i depool | sudo -i pool | |
orespoolcounter2003 | - | - | - |
Traffic
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
acmechief2001 | N/A | | |
acmechief-test2001 | N/A | | |
authdns2001 | redirect to authdns1001 | the opposite | done |
cp[2027-2030] | N/A | N/A | |
doh2001 | disable puppet & stop bird.service | the opposite | done |
lvs2007 | N/A | N/A | |
ncredir2001 | N/A | N/A | |
ServiceOps
Due to the large number of services potentially affected (multiple mw appservers, kubernetes workers), a global depool of active/active (a/a) services was done:
sre.discovery.datacenter-route depool --reason T327925 codfw
After the maintenance, repool:
sre.discovery.datacenter-route pool --reason T327925 codfw
Depool restbase-async from eqiad:
cookbook sre.discovery.service-route --reason T327925 depool --wipe-cache eqiad restbase-async
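For reviewing the commands before anyone runs them, a dry-run sketch like the one below can help. It only assembles and prints the three command lines quoted above, so the task ID and site can be checked in one place. It executes nothing, and it does not model the cookbooks' real option parsing. `plan` and its labels are hypothetical names. The command strings are copied from this task as written.

```python
TASK = "T327925"  # task ID passed as --reason in the commands quoted above


def plan(task=TASK, site="codfw"):
    """Return the commands from this task, keyed by a short descriptive label."""
    return {
        "global depool": (
            f"sre.discovery.datacenter-route depool --reason {task} {site}"
        ),
        "repool after maintenance": (
            f"sre.discovery.datacenter-route pool --reason {task} {site}"
        ),
        "restbase-async depool from eqiad": (
            f"cookbook sre.discovery.service-route --reason {task} "
            "depool --wipe-cache eqiad restbase-async"
        ),
    }


# Dry run: print only, never execute.
for label, command in plan().items():
    print(f"{label}: {command}")
```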
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
kafka-main2001 | done | | |
kubemaster2001 | done | | |
kubernetes[2005,2007-2008,2018-2019] | done | | |
kubestage2001 | done | | |
kubetcd2004 | done | | |
mc[2038-2041,2055] | done | | |
mc-gp2001 | done | | |
mw[2291-2309,2377-2411] | done | | |
mwdebug2001 | done | | |
parse[2001-2005] | done | | |
poolcounter2003 | done | | |
rdb2007 | done | | |
registry2003 | done | | |
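The Servers columns above use a compact bracket notation such as `kubernetes[2005,2007-2008,2018-2019]`. The sketch below shows one way to expand an entry into individual hostnames, for example to count affected hosts. It is a plain-Python approximation that handles only the single-bracket form seen in this task, not the parser any existing tool uses. `expand` is a hypothetical helper name.

```python
import re


def expand(spec):
    """Expand 'name[2001-2003,2005]' into a list of individual hostnames."""
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", spec)
    if match is None:
        return [spec]  # plain hostname such as 'grafana2001'
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        # '2007-2008' is an inclusive range; '2005' is a single host.
        start, _, end = part.partition("-")
        end = end or start
        width = len(start)  # keep any zero padding
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}")
    return hosts


print(expand("kubernetes[2005,2007-2008,2018-2019]"))
print(len(expand("mw[2291-2309,2377-2411]")))
```

Prefixes containing hyphens, such as `ml-serve[2001,2005]`, work because only the bracketed part is split on `-`.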