codfw row D switches upgrade
For the reasons detailed in T327248: eqiad/codfw virtual-chassis upgrades, we're going to upgrade the codfw row D switches.
The upgrade is scheduled for May 16th, 13:00-15:00 UTC. Please let us know if there is any issue with the scheduled time.
This means a 30min hard downtime for the whole row if everything goes well (in practice, closer to 15min). It is also a good opportunity to test the host depool mechanisms and the row redundancy of services.
The impacted servers and teams for this row are listed below.
The actions needed are quite free-form:
- please write NONE if no action is needed,
- the cookbook/command to run if it can be done by a 3rd party
- who will be around to take care of the depool
- Link to the relevant doc
- etc
The two main types of actions needed are depool and monitoring downtime.
All servers will be downtimed with sudo cookbook sre.hosts.downtime --hours 2 -r "codfw row D upgrade" -t XXX 'P{P:netbox::host%location ~ "D.*codfw"}' but specific services might need specific downtimes.
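The server tables below use a compact bracket notation for host ranges, such as maps[2008,2010] or restbase[2012,2017-2018,2023,2026-2027]. For anyone who wants the individual host names (for example, to check which hosts a depool covers), here is a minimal Python sketch. The function name expand_hosts is hypothetical and is not part of any cookbook or Wikimedia tooling; it only assumes the notation is a name prefix followed by comma-separated numbers or inclusive number ranges in square brackets.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand e.g. 'maps[2008,2010]' into ['maps2008', 'maps2010'].

    Hypothetical helper: assumes 'prefix[a,b-c,...]suffix' notation,
    where b-c is an inclusive numeric range. Names without brackets
    are returned unchanged.
    """
    expr = expr.strip()
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.strip().partition("-")
        end = end or start  # a single number is a range of one
        width = len(start)  # keep any zero padding
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    print(expand_hosts("krb[2001-2002]"))  # ['krb2001', 'krb2002']
```

Names such as cloudcontrol2004-dev or mc-gp2003 contain no brackets, so they pass through as a single-element list.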
Core Platform
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
maps[2008,2010] | depool | pool | |
restbase[2012,2017-2018,2023,2026-2027] | depool | pool | |
sessionstore2003 | None | None | |
thumbor2006 | None | None | |
Infrastructure Foundations
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
ganeti[2015-2018,2025-2026] | NONE | NONE | OK |
idm2001 | NONE | NONE | OK |
idp2002 | NONE | NONE | OK |
install2004 | NONE | NONE | OK |
ldap-replica2006 | depooled | repooled | OK |
netbox-dev2002 | NONE | NONE | OK |
ping2003 | redirect ICMP traffic | revert | |
puppetdb2003 | NONE | NONE | OK |
puppetmaster2002 | Puppet disabled in codfw/esams/ulsfo | Puppet re-enabled | OK |
urldownloader2004 | NONE | NONE | OK |
Infrastructure Foundations and Data Engineering
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
krb[2001-2002] | NONE | NONE | |
WMCS
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
cloudcontrol2004-dev | NONE | NONE | |
ServiceOps
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
chartmuseum2001 | |||
conf2006 | |||
kafka-main[2004-2005] | |||
kubernetes[2013-2014,2016,2022,2024] | |||
kubestagemaster2001 | |||
kubestagetcd2003 | |||
mc[2051-2054] | |||
mc-gp2003 | |||
mc-wf2002 | |||
mw[2271-2279,2281-2290,2366-2376,2444-2451] | |||
parse[2016-2020] | |||
rdb2010 | |||
Observability
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
arclamp2001 | nothing to do, not active | ||
dispatch-be2001 | not in production | ||
kafka-logging2005 | no action needed | ||
logstash[2003,2029-2031] | drain shards on 2003,2029; depool 2030,2031; set downtime | allocate shards, re-pool | done |
Observability and Data Persistence
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
thanos-fe2003 | sudo depool | sudo pool | @MatthewVernon done |
Search Platform
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
apifeatureusage2001 | none | none | |
elastic[2050-2054,2060,2067-2068,2072,2084-2086] | inflatador/rkemper ban/depool day before | inflatador/rkemper unban/repool | |
search-loader2001 | none | none | |
wcqs2003 | none | none | |
wdqs[2006,2012,2015,2021-2022] | none | none | |
ServiceOps-Collab
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
aphlict2001 | NONE | NONE | |
gerrit2002 | NONE | NONE | |
gitlab-runner2004 | Will be paused in admin menu | Will be unpaused in admin menu | paused |
miscweb2003 | NONE | NONE | |
Traffic
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
cp[2039-2042] | NOOP, codfw depooled | DONE | |
dns2002 | stop puppet and disable bird | DONE | |
durum2002 | stop puppet and disable bird | DONE | |
lvs2010 | NOOP, codfw depooled | DONE | |
Machine Learning
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
ml-etcd2003 | none | none | |
ml-serve[2004,2008] | none | none | |
ml-serve-ctrl2002 | none | none | |
ml-staging2002 | none | none | |
ml-staging-ctrl2002 | none | none | |
ores[2007-2009] | depool | pool | |
Data Engineering
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
aqs[2009-2012] | None | None | |
kafka-stretch2002 | None | None | |
schema2004 | depool | pool | Depooled |
Data Persistence
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
backup[2001,2007] | to make sure media backups are paused | resume media backups | |
cassandra-dev2003 | |||
db[2100-2101,2117-2120,2128-2131,2139-2140,2151-2152,2170-2174,2181-2182,2187] | Nothing - codfw will be depooled | ||
dbprov2003 | to make sure db backups are not ongoing | retry db backups if failed | |
dbproxy2004 | Nothing, not in use | Reload haproxy | |
es[2023,2033-2034] | Nothing - codfw will be depooled | ||
moss-fe2002 | Nothing, not in service | ||
ms-backup2002 | to make sure media backups are paused | resume media backups | |
ms-be[2043,2050,2056,2059,2061,2065,2069,2073] | no action required | ||
ms-fe2012 | sudo depool | sudo pool | Done |
pc2014 | Nothing - codfw will be depooled | ||
thanos-be2004 | no action required | ||