= codfw row D switches upgrade =
For the reasons detailed in {T327248}, we're going to upgrade the codfw row D switches.
**Scheduled for May 16th, 13:00-15:00 UTC.** Please let us know if the scheduled time causes any issues.
This means a !!30min hard downtime!! for the whole row if everything goes well (in practice, more like 15min). It is also a good opportunity to test the host depool mechanisms and the row redundancy of services.
The impacted servers and teams for this row are listed below.
The "action needed" columns are quite free-form, for example:
* `NONE` if no action is needed,
* the cookbook/command to run, if it can be done by a third party,
* who will be around to take care of the depool,
* a link to the relevant doc,
* etc.
The two main types of action needed are depooling and monitoring downtime.
NOTE: If the servers can handle a longer depool, it's preferable to depool them several hours or a day ahead (and mark `NONE` in the table), so there are fewer moving parts close to the maintenance window.
All servers will be downtimed with `sudo cookbook sre.hosts.downtime --hours 2 -r "codfw row D upgrade" -t XXX 'P{P:netbox::host%location ~ "D.*codfw"}'`, but specific services might need their own downtimes.
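The server lists in the tables below use a compact bracket notation (for example `maps[2008,2010]` or `krb[2001-2002]`). To check whether a specific host is covered, the notation can be expanded with a few lines of Python. This is only a minimal sketch: `expand_hosts` is a hypothetical helper written for illustration, not part of any cookbook or existing tooling, and it assumes plain numeric ranges without zero-padding, as used in this task.

```python
import re

# Matches "prefix[spec]suffix". Names without brackets do not match
# and are returned unchanged by expand_hosts().
_BRACKETS = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<spec>[^\[\]]+)\](?P<suffix>[^\[\]]*)$"
)


def expand_hosts(expr: str) -> list[str]:
    """Expand e.g. 'maps[2008,2010]' or 'krb[2001-2002]' into hostnames.

    Hypothetical helper: assumes comma-separated numbers or
    'start-end' ranges inside a single pair of brackets.
    """
    expr = expr.strip()
    match = _BRACKETS.match(expr)
    if match is None:
        return [expr]  # plain hostname such as 'sessionstore2003'
    hosts = []
    for part in match["spec"].split(","):
        start, _, end = part.partition("-")
        end = end or start  # a single number is a one-element range
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{match['prefix']}{number}{match['suffix']}")
    return hosts


if __name__ == "__main__":
    # Print one hostname per line for a row taken from the tables below.
    for host in expand_hosts("restbase[2012,2017-2018,2023,2026-2027]"):
        print(host)
```

Hostnames that contain a dash but no brackets, such as `cloudcontrol2004-dev`, are returned unchanged, because only the part inside the brackets is parsed for ranges.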
== Core Platform ==
#core-platform-team
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|maps[2008,2010]| `depool`|`pool` | |
|restbase[2012,2017-2018,2023,2026-2027]|`depool` |`pool` | |
|sessionstore2003| None|None | |
|thumbor2006| None|None | |
== Infrastructure Foundations ==
#infrastructure-foundations
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|ganeti[2015-2018,2025-2026]|NONE |NONE | |
|idm2001|NONE |NONE | |
|idp2002|NONE |NONE | |
|install2004|NONE |NONE | |
|ldap-replica2006|depooled |repool | |
|netbox-dev2002|NONE |NONE | |
|ping2003|redirect ICMP traffic |revert | |
|puppetdb2003|NONE |NONE | |
|puppetmaster2002| | | |
|urldownloader2004|NONE |NONE | |
== Infrastructure Foundations and Data Engineering ==
#infrastructure-foundations #data-engineering
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|krb[2001-2002]|NONE |NONE | |
== WMCS ==
#cloud-services-team
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|cloudcontrol2004-dev| NONE | NONE | |
== ServiceOps ==
#serviceops
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|chartmuseum2001| | | |
|conf2006| | | |
|kafka-main[2004-2005]| | | |
|kubernetes[2013-2014,2016,2022,2024]| | | |
|kubestagemaster2001| | | |
|kubestagetcd2003| | | |
|mc[2051-2054]| | | |
|mc-gp2003| | | |
|mc-wf2002| | | |
|mw[2271-2279,2281-2290,2366-2376,2444-2451]| | | |
|parse[2016-2020]| | | |
|rdb2010| | | |
== Observability ==
#sre_observability
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|arclamp2001| | | nothing to do, not active|
|dispatch-be2001| | | not in production |
|kafka-logging2005| | | no action needed |
|logstash[2003,2029-2031]| drain shards (2003,2029), depool (2030,2031) & set downtime | allocate shards, re-pool | shards draining, depooled, downtime set |
== Observability and Data Persistence ==
#sre_observability #data-persistence
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|thanos-fe2003|`sudo depool` |`sudo pool` |@MatthewVernon to do|
== Search Platform ==
#discovery-search
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|apifeatureusage2001|none |none | |
|elastic[2050-2054,2060,2067-2068,2072,2084-2086]| inflatador/rkemper ban/depool day before | inflatador/rkemper unban/repool | |
|search-loader2001| none | none | |
|wcqs2003| none | none | |
|wdqs[2006,2012,2015,2021-2022]|none | none | |
== ServiceOps-Collab ==
#serviceops-collab
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|aphlict2001| NONE | NONE | |
|gerrit2002| NONE | NONE | |
|gitlab-runner2004| Will be paused in [admin menu](https://gitlab.wikimedia.org/admin/runners/823#/) | Will be unpaused in [admin menu](https://gitlab.wikimedia.org/admin/runners/823#/) | paused |
|miscweb2003| NONE | NONE | |
== Traffic ==
#traffic
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|cp[2039-2042]|NOOP, codfw depooled| |N/A|
|dns2002|stop puppet and disable pybal| | |
|durum2002|stop puppet and disable bird| | |
|lvs2010|NOOP, codfw depooled| |N/A|
== Machine Learning ==
#machine-learning-team
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|ml-etcd2003|none |none | |
|ml-serve[2004,2008]| none| none| |
|ml-serve-ctrl2002|none |none | |
|ml-staging2002| none| none| |
|ml-staging-ctrl2002| none| none| |
|ores[2007-2009]|`depool`| `pool`| |
== Data Engineering ==
#data-engineering
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|aqs[2009-2012]| None | None | |
|kafka-stretch2002| None | None | |
|schema2004| `depool` | `pool` | Depooled|
== Data Persistence ==
#data-persistence
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|backup[2001,2007]|make sure media backups are paused|resume media backups | |
|cassandra-dev2003| | | |
|db[2100-2101,2117-2120,2128-2131,2139-2140,2151-2152,2170-2174,2181-2182,2187]|Nothing - codfw will be depooled | | |
|dbprov2003| make sure no db backups are ongoing| retry db backups if failed| |
|dbproxy2004|Nothing, not in use |Reload haproxy | |
|es[2023,2033-2034]|Nothing - codfw will be depooled | | |
|moss-fe2002|Nothing, not in service | | |
|ms-backup2002| make sure media backups are paused| resume media backups | |
|ms-be[2043,2050,2056,2059,2061,2065,2069,2073]|no action required | | |
|ms-fe2012|`sudo depool` |`sudo pool` |@MatthewVernon to do|
|pc2014|Nothing - codfw will be depooled | | |
|thanos-be2004|no action required | | |