= eqiad row A switches upgrade =
For reasons detailed in {T327248}, we're going to upgrade the eqiad row A switches during the scheduled DC switchover.
This is scheduled for **March 7th - 14:00-16:00 UTC**. Please let us know if there is any issue with the scheduled time.
If everything goes well, this means a !!30min hard downtime!! for the whole row. It is also a good opportunity to test the host depool mechanisms and the row redundancy of services.
The impacted servers and teams for this row are listed below.
The "action needed" columns are free form:
* please write `NONE` if no action is needed,
* the cookbook/command to run if it can be done by a 3rd party
* who will be around to take care of the depool
* Link to the relevant doc
* etc
The two main types of actions needed are depool and monitoring downtime.
NOTE: If the servers can handle a longer depool, it's preferred to depool them several hours or a day before the window (and mark `None` in the table) so there are fewer moving parts closer to the maintenance window.
All servers will be downtimed with `sudo cookbook sre.hosts.downtime --hours 2 -r "eqiad row A upgrade" -t T329073 'P{P:netbox::host%location ~ "A.*eqiad"}'` but specific services might need specific downtimes.
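The tables below list hosts in a compact bracket notation such as `logstash[1010,1023-1024,1026,1033]`. As a convenience for anyone scripting per-host checks, here is a minimal sketch of how that notation could be expanded into individual host names. The function name `expand_hosts` is hypothetical and only handles a single bracket group per expression, which is all this page uses; it is not part of any cookbook or existing tooling.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand 'name[1001,1003-1005]' into individual host names.

    Hypothetical helper for illustration only: it supports a single
    bracket group that directly follows the host name prefix.
    """
    expr = expr.strip()
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", expr)
    if match is None:
        # No bracket group: the expression is already a single host name.
        return [expr]
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.strip().partition("-")
        # A bare number such as '1010' is treated as a range of one.
        end = end or start
        # Keep the zero-padding width of the first number in the range.
        width = len(start)
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}")
    return hosts


print(expand_hosts("ores[1001-1002]"))  # ['ores1001', 'ores1002']
```

Names without brackets, such as `gitlab1003`, are returned unchanged as a one-element list, so the same call works for every row in the tables.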
== ServiceOps-Collab ==
#serviceops-collab
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|gitlab1003| None | None | |
|moscovium| None | None | |
|planet1002| None| None| |
== Observability ==
#sre_observability
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|dispatch-be1001| n/a | n/a | not in production |
|grafana1002| failover to codfw ahead of maint window | failback to eqiad | depooled (failed over to codfw) |
|kafkamon1002| set downtime | n/a | downtime scheduled |
|logstash[1010,1023-1024,1026,1033]| drain shards on 1010,1026,1033; depool 1023,1024; set downtime | allocate shards, pool | shards draining, hosts depooled, downtime scheduled |
|prometheus1005| `depool` on the host | `pool` on the host | depooled |
== Observability and Data Persistence ==
#sre_observability #data-persistence
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|thanos-fe1001| `depool` and make sure another host in eqiad is pooled for `thanos-web` service | `pool` and put back thanos-fe1001 as the single host pooled for `thanos-web` |@matthewvernon to do |
== WMCS ==
#cloud-services-team
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|clouddb[1013-1014,1021]| Likely yes? | Likely yes? | Please coordinate with @Marostegui |
|cloudmetrics1003| No | No | Contact: @aborrero |
|cloudservices1004| No | No | Contact: @Andrew |
== Machine Learning ==
#machine-learning-team
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|ml-serve1001| none | none | |
|ores[1001-1002]| sudo -i depool | sudo -i pool | depooled |
|orespoolcounter1003| none | none | |
== Search Platform ==
**Search team has already done the depool/downtime/etc for the relevant hosts, so all of these hosts are ready for the operation.**
#discovery-search
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|cloudelastic[1001,1005]| | |no action needed |
|elastic[1053-1054,1068-1073,1084]| | |no action needed |
|relforge1003| | |no action needed |
|wcqs1001| | |no action needed |
|wdqs[1003-1004,1006,1011]| | | no action needed|
== ServiceOps ==
#serviceops
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|conf1007| | | |
|kafka-main1001| | | |
|kubemaster1001| | | |
|kubernetes[1005,1007-1008,1017-1018]| | | |
|kubestagetcd1004| | | |
|kubetcd1005| | | |
|mc[1037-1040]| | | |
|mc-gp1001| | | |
|mw[1385-1392,1414-1422,1448-1465]| | | |
|mwdebug1002| | | |
|parse[1001-1006]| | | |
|poolcounter1004| | | |
|rdb1011| | | |
|registry1003| | | |
== Traffic ==
#traffic
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|cp[1075-1078]| | |nothing to be done, eqiad depooled|
|dns1001|removed from authdns_servers| |depooled |
|lvs[1013,1017]| | |nothing to be done, eqiad depooled|
|ncredir1002| | |nothing to be done, eqiad depooled|
== Data Engineering ==
#data-engineering
We will likely stop a number of pipelines, e.g. oozie, airflow.
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|an-db1001| None | None | |
|an-druid[1001,1003]| None | None | |
|an-master1001| [[https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Manual_Failover|Fail over to an-master1002]] - who: DE SREs | fail back to an-master1001 | Failed over to standby |
|an-presto[1002,1005]| None | None | |
|an-test-client1001| None | None | |
|an-test-master1001| [[https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Manual_Failover|Fail over to an-test-master1002]] - who: DE SREs | fail back to an-test-master1001 | Failed over to standby |
|an-test-worker1001| None | None | |
|an-tool1011| None | None | |
|an-worker[1078-1082,1096,1102-1103,1118-1123,1129,1139-1141]| [[https://gerrit.wikimedia.org/r/c/operations/puppet/+/894537|Stop ingestion via gobblin]] & Put HDFS into safe mode - who: DE SREs | Re-enable gobblin jobs & Take HDFS out of safe mode - who: DE SREs | gobblin timers absented |
|analytics[1058-1060,1070-1071]| [[https://gerrit.wikimedia.org/r/c/operations/puppet/+/894537|Stop ingestion via gobblin]] & Put HDFS into safe mode | Re-enable gobblin jobs & Take HDFS out of safe mode | gobblin timers absented |
|aqs[1010,1016]| `sudo -i depool` | `sudo -i pool` | depooled |
|archiva1002| None (archiva.wikimedia.org will be unavailable) | None | |
|datahubsearch1001| `sudo -i depool` | `sudo -i pool` | depooled |
|dbstore1003| None (s1, s5, s7 will be unavailable) | None | |
|druid1004| `sudo -i depool` | `sudo -i pool` | depooled |
|kafka-jumbo[1001-1002]| None (to be checked) | None (to be checked) | |
|karapace1001| Stop DataHub ingestion - who: DE SREs| Restart DataHub ingestion | |
|stat[1004,1008]| Announce downtime via analytics-announce@lists and #data-engineering Slack channel| None | |
== Data Engineering and Machine Learning ==
#data-engineering #machine-learning-team
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|dse-k8s-ctrl1001| none | none | |
|dse-k8s-etcd1001| None | | |
|dse-k8s-worker1001| None | | |
== Infrastructure Foundations ==
#infrastructure-foundations
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|apt1001|none | | |
|aux-k8s-ctrl[1001-1002]| | | |
|aux-k8s-etcd[1001-1003]| | | |
|aux-k8s-worker[1001-1002]| | | |
|ganeti[1023,1025-1026,1029-1032]|none | | |
|install1003|none | | |
|netbox1002| | | |
|netboxdb1002| | | |
|pki1001| DNS failover | | |
|puppetmaster1004| | | |
|urldownloader1001|failed over |no need, can remain on 1002| |
== Infrastructure Foundations and Data Engineering ==
#infrastructure-foundations #data-engineering
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|krb1001|none |none | |
== Data Persistence ==
#data-persistence
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|backup[1004,1008]| | | |
|db[1103,1107,1111,1115-1117,1126-1129,1141-1142,1151,1154,1156-1161,1176-1177,1185-1186]| db1176 (m1 master) needs to be switched over. db1159 (m5 master) needs to be switched over (T331384) | | db1176 is no longer a master after finishing T329259, db1159 is no longer a master (T331384) |
|dbprov1001| | | |
|dbproxy[1012-1013]| | |dbproxy1013 (m2-master) has been switched over |
|es[1020,1024,1026-1028]| | |Nothing to be done, eqiad will be depooled |
|moss-fe1001|`sudo depool` |`sudo pool` |@matthewvernon to do |
|ms-backup1001| | | |
|ms-be[1040,1044-1046,1051,1057,1060,1064]| | |no action required |
|ms-fe1009|`sudo depool` |`sudo pool` |@matthewvernon to do |
|pc1011| | | Nothing to be done, eqiad will be depooled|
|thanos-be1001| | |no action required |
== Core Platform ==
#core-platform-team
|Servers|Depool action needed|Repool action needed|Status|
|---|---|---|---|
|dumpsdata1004| | | |
|htmldumper1001| | | |
|maps[1005-1006]|None | None | |
|restbase[1016,1019-1021,1028,1031]|`sudo -i depool` |`sudo -i pool` | |
|sessionstore1001| None| None | |
|snapshot[1011-1012]| | | |
|thumbor1005| None | None | |