eqiad row C switches upgrade
For the reasons detailed in T327248 (eqiad/codfw virtual-chassis upgrades), we're going to upgrade the eqiad row C switches during the scheduled DC switchover.
The upgrade is scheduled for April 4th, 13:00-15:00 UTC. Please let us know if there is any issue with the scheduled time.
It means a 30min hard downtime for the whole row if everything goes well (realistically closer to 15min). It is also a good opportunity to test the host depool mechanisms and the row redundancy of services.
The impacted servers and teams for this row are listed below.
The "action needed" fields are quite free form:
- please write NONE if no action is needed
- the cookbook/command to run if it can be done by a 3rd party
- who will be around to take care of the depool
- a link to the relevant doc
- etc.
The two main types of action needed are depool and monitoring downtime.
All servers will be downtimed with sudo cookbook sre.hosts.downtime --hours 2 -r "eqiad row C upgrade" -t T331882 'P{P:netbox::host%location ~ "C.*eqiad"}' but specific services might need specific downtimes.
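The server tables below use a compact bracket notation for host groups, for example `wdqs[1010,1013-1014]`. For anyone who needs per-host names for a service-specific downtime or depool, here is a minimal Python sketch that expands that notation. The function name `expand_hosts` is hypothetical, and the parser only handles the single-bracket form seen in this task, not the full syntax supported by cumin.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand 'name[1-3,5]' into ['name1', 'name2', 'name3', 'name5'].

    Assumption: at most one bracket group per expression, as in the
    tables of this task.
    """
    expr = expr.strip()
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        # No bracket group: a single host name such as 'alert1001'.
        return [expr]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        # Each part is either a single number or a 'start-end' range.
        start, _, end = part.strip().partition("-")
        end = end or start
        width = len(start)  # preserve zero padding if there is any
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


print(expand_hosts("wdqs[1010,1013-1014]"))
# ['wdqs1010', 'wdqs1013', 'wdqs1014']
```

Hyphens in the host prefix (as in `kafka-jumbo[1004-1005,1007]`) are not an issue, because ranges are only parsed inside the brackets.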
Observability
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
alert1001 | fail services over to alert2001 | fail services back to alert1001 | incomplete |
kafka-logging1002 | schedule downtime | | incomplete |
logstash[1025,1028,1034] | drain shards (1028, 1034), depool 1025 & set downtime | allocate shards, pool | shards allocating, pooled |
mwlog1002 | schedule downtime, deploy MW patch | revert MW patch | incomplete |
webperf1003 | none | none | |
Observability and Data Persistence
SRE Observability Data-Persistence
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
thanos-fe1003 | depool | pool | done |
Core Platform
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
dumpsdata[1003,1005] | |||
maps1009 | N/A | N/A | |
sessionstore1002 | None | None | |
snapshot1014 | |||
thumbor1006 | depool | pool | |
ServiceOps-Collab
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
doc1002 | None | None | |
etherpad1003 | None | None | |
gitlab-runner1003 | pause in admin interface | unpause in admin interface | unpaused/repooled Tue, April 4th by @Jelto |
miscweb1002 | None | None | |
Search Platform
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
cloudelastic1003 | None | ||
elastic[1057-1059,1080-1083,1087-1088] | None | ||
wcqs1003 | None | ||
wdqs[1010,1013-1014] | None | ||
Data Engineering
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
an-conf1002 | None | None | |
an-coord1002 | None | None | |
an-db1002 | None | None | |
an-druid1002 | None | None | |
an-test-master1002 | None | None | |
an-test-worker1002 | None | None | |
an-tool[1005,1007,1010] | Announce downtime for Superset and Turnilo | None | done |
an-worker[1088-1091,1099-1100,1104-1111,1131-1133] | Stop gobblin ingestion with puppet (1 hour ahead), Stop YARN queues with puppet (30 minutes ahead), Put HDFS into safe mode (5 minutes ahead) | Reverse these three steps | Done |
analytics[1064-1066,1074-1075] | Stop gobblin ingestion with puppet (1 hour ahead), Stop YARN queues with puppet (30 minutes ahead), Put HDFS into safe mode (5 minutes ahead) | Reverse these three steps | Done |
aqs[1012-1013,1018] | depool | pool | Done |
datahubsearch1003 | depool | pool | Done |
db1108 | None | None | |
dbstore1005 | None | None | |
kafka-jumbo[1004-1005,1007] | None | None | |
matomo1002 | Announce downtime for matomo | None | done |
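The Hadoop rows above (an-worker and analytics) list three preparation steps with lead times relative to the maintenance window. As a rough sketch, the wall-clock times can be derived from the 13:00 UTC window start as follows. The names `PREP_STEPS` and `prep_schedule` are hypothetical, and the year is an assumption since the task only states "April 4th".

```python
from datetime import datetime, timedelta, timezone

# Lead times copied from the "Depool action needed" column of the
# an-worker and analytics rows.
PREP_STEPS = [
    ("Stop gobblin ingestion with puppet", timedelta(hours=1)),
    ("Stop YARN queues with puppet", timedelta(minutes=30)),
    ("Put HDFS into safe mode", timedelta(minutes=5)),
]


def prep_schedule(window_start: datetime) -> list[tuple[datetime, str]]:
    """Return (time, step) pairs, earliest first."""
    return [(window_start - lead, step) for step, lead in PREP_STEPS]


# Assumption: the year is not given in the task; 2023 is used for illustration.
start = datetime(2023, 4, 4, 13, 0, tzinfo=timezone.utc)
for when, step in prep_schedule(start):
    print(f"{when:%H:%M} UTC  {step}")
# 12:00 UTC  Stop gobblin ingestion with puppet
# 12:30 UTC  Stop YARN queues with puppet
# 12:55 UTC  Put HDFS into safe mode
```

The repool side ("Reverse these three steps") would undo the steps after the maintenance; the table does not give timings for that.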
Data Engineering or Search Platform
Data-Engineering Discovery-Search
Currently without owners, see https://gerrit.wikimedia.org/r/c/operations/puppet/+/903686 for the fix
an-airflow1005.eqiad.wmnet
as well as an-airflow1001.eqiad.wmnet
Data Engineering and Machine Learning
Data-Engineering Machine-Learning-Team
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
dse-k8s-etcd1003 | none | none | none |
dse-k8s-worker1003 | none | none | none |
Infrastructure Foundations
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
cumin1001 | NONE | NONE | |
ganeti[1009-1012,1024,1027-1028] | NONE | NONE | |
idp-test1002 | NONE | NONE | |
install1004 | NONE | NONE | |
mx1001 | NONE | NONE | |
puppetdb1002 | disable puppet fleet-wide: cumin '*' 'disable-puppet "Switch maintenance: T331882"' | cumin '*' 'enable-puppet "Switch maintenance: T331882"' | |
puppetmaster1005 | NONE | NONE | |
rpki1001 | NONE | NONE | |
seaborgium | NONE | NONE | |
urldownloader[1002-1003] | Failover to 1001 | not needed (can remain on 1001) | DONE |
Traffic
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
acmechief1001 | N/A | ||
acmechief-test1001 | N/A | ||
cp[1083-1086] | depool | eqiad will be depooled, NOOP | |
doh1001 | stop puppet && disable bird | | done |
lvs1015 | N/A | ||
lvs1019 | stop pybal && disable puppet | eqiad will be depooled, NOOP | |
ncredir1001 | eqiad will be depooled, NOOP | ||
Machine Learning
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
ml-cache1002 | none | none | none |
ml-etcd1002 | none | none | none |
ml-serve1003 | none | none | none |
ores[1005-1006] | sudo -i depool | sudo -i pool | |
orespoolcounter1004 | none | none | none |
Data Persistence
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
backup[1002,1006] | ES backups will fail; to be delayed until after maintenance. | Restart es eqiad backups | |
db[1100-1101,1110,1120-1121,1131,1133-1135,1145-1147,1150,1166-1171,1180-1181,1189] | db1101 will need to be failed over (T333123), as it is going to become the m1 master as part of T331510 to allow row B maintenance | ||
dbprov1003 | Jaime to make sure they are idle during maintenance. | ||
dbproxy[1020-1021] | Nothing to be done, they are not active at the moment | ||
es[1022,1031-1032] | Nothing to be done as eqiad will be depooled | ||
moss-be1002 | n/a | n/a | Not in production |
ms-backup1002 | Jaime to make sure they are idle during maintenance | ||
ms-be[1042,1049-1050,1054,1062,1066] | None | None | |
ms-fe1011 | sudo depool | sudo pool | done |
pc1013 | Nothing to be done as eqiad will be depooled | ||
thanos-be1003 | None | None | |
ServiceOps
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
deploy1002 | |||
kafka-main1003 | |||
kubemaster1002 | |||
kubernetes[1006,1011-1012,1020,1023] | |||
kubestagetcd1006 | |||
kubetcd1004 | |||
mc[1045-1050] | |||
mc-gp1002 | |||
mw[1405-1413,1434-1436,1482-1486] | |||
mwdebug1001 | |||
parse[1013-1016] | |||
poolcounter1005 | |||
registry1004 | |||
WMCS
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
cloudcontrol1005 | None - no action needed | None - no action needed | Active, but no depool is required, just alert downtime. |
clouddb[1017-1018] | @fnegri taking care of this | ||
clouddumps1002 | https://gerrit.wikimedia.org/r/905628 | https://gerrit.wikimedia.org/r/905610 | @aborrero taking care of it |
cloudmetrics1004 | None - no action needed | None - no action needed | Active, but no depool is required, just alert downtime. |
cloudrabbit1002 | None - no action needed | None - no action needed | Active, but no depool is required, just alert downtime. Clients should know how to use another rabbit server. |
dbproxy1018 | @fnegri taking care of this | ||
labstore[1004-1005] | to be decom - no action needed | to be decom - no action needed | to be decom - no action needed |