eqiad row D switches upgrade
For the reasons detailed in T327248 (eqiad/codfw virtual-chassis upgrades), we're going to upgrade the eqiad row D switches during the scheduled DC switchover.
Scheduled for April 18th, 13:00-15:00 UTC. Please let us know if there is any issue with the scheduled time.
This means 30min of hard downtime for the whole row if everything goes well (in practice, closer to 15min). It is also a good opportunity to test the host depool mechanisms and the row redundancy of services.
The impacted servers and teams for this row are listed below.
The "action needed" fields are quite free-form:
- please write NONE if no action is needed,
- the cookbook/command to run, if it can be done by a 3rd party,
- who will be around to take care of the depool,
- a link to the relevant doc,
- etc.
The two main types of actions needed are depool and monitoring downtime.
All servers will be downtimed with the command below, but specific services might need specific downtimes:
sudo cookbook sre.hosts.downtime --hours 2 -r "eqiad row D upgrade" -t XXX 'P{P:netbox::host%location ~ "D.*eqiad"}'
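The server tables below use a compact bracket notation such as logstash[1012,1029-1031,1035]. For anyone who needs the individual host names (for example to run a depool command per host), here is a minimal Python sketch of how that notation expands. The helper name expand_hosts is hypothetical and only handles a single bracket group; in practice ClusterShell's NodeSet handles this notation, so treat this as an illustration rather than a replacement.

```python
import re
from typing import List

# Hypothetical helper for illustration only; not part of Cumin or ClusterShell.
# Matches "prefix[ranges]suffix" with exactly one bracket group.
_PATTERN = re.compile(
    r"^(?P<prefix>[^\[\]]+)\[(?P<ranges>[0-9,\-]+)\](?P<suffix>[^\[\]]*)$"
)


def expand_hosts(expr: str) -> List[str]:
    """Expand 'name[1,3-5]' into ['name1', 'name3', 'name4', 'name5']."""
    expr = expr.strip()
    match = _PATTERN.match(expr)
    if match is None:
        # No bracket group: treat the expression as a single host name.
        return [expr]
    hosts = []
    for part in match.group("ranges").split(","):
        start, _, end = part.partition("-")
        end = end or start  # a lone number is a range of one
        width = len(start)  # keep zero padding if the range uses it
        for number in range(int(start), int(end) + 1):
            hosts.append(
                f"{match.group('prefix')}{number:0{width}d}{match.group('suffix')}"
            )
    return hosts


if __name__ == "__main__":
    for host in expand_hosts("logstash[1012,1029-1031,1035]"):
        print(host)
```

The same helper also covers the stat100[5-6] form used in the Data Engineering table, which expands to stat1005 and stat1006.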
Observability
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
kafka-logging1003 | |||
logstash[1012,1029-1031,1035] | drain shards 1012,1029,1035 depool 1030,1031 & set downtime | allocate shards, repool | shards allocating, pooled |
xhgui1001 | |||
Core Platform
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
dumpsdata1002 | None | None | |
maps1010 | None | None | |
restbase[1018,1025-1027,1030,1033] | depool | pool | Done |
sessionstore1003 | None | None | |
snapshot[1009,1015] | None | None | |
Infrastructure Foundations
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
bast1003 | sent announcement | NONE | |
cuminunpriv1001 | NONE | NONE | |
ganeti[1019-1022,1033-1034] | NONE | NONE | |
idm1001 | NONE | NONE | |
idm-test1001 | NONE | NONE | |
ldap-replica1004 | depooled | repooled | OK |
ping1003 | Remove ping redirect config on CR routers in eqiad | Re-run homer to add deleted firewall term back | repooled |
pki-root1001 | NONE | NONE | |
puppetboard1002 | NONE | NONE | |
puppetmaster1002 | sudo cumin '*' 'disable-puppet "Switch reboot: T333377"' | sudo cumin '*' 'enable-puppet "Switch reboot: T333377"' | |
sretest1001 | NONE | NONE | |
urldownloader1004 | NONE | NONE | |
Unowned
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
irc1001 | failed over to 2001 | not needed, can remain on irc2001 | OK |
irc1002 | NONE | NONE | OK |
Search Platform
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
apifeatureusage1001 | NONE | NONE | |
cloudelastic1004 | NONE | NONE | |
elastic[1060-1067] | NONE | NONE | |
search-loader1001 | NONE | NONE | |
wdqs[1005,1008] | NONE | NONE | |
ServiceOps-Collab
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
aphlict1001 | NONE | NONE | |
gitlab-runner1004 | Will be paused in admin interface | Will be unpaused in admin interface | paused |
miscweb1003 | NONE | NONE | |
releases1002 | |||
Machine Learning
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
ml-etcd1003 | none | none | none |
ml-serve1004 | none | none | none |
ml-serve-ctrl1002 | none | none | none |
ores[1007-1009] | sudo -i depool | sudo -i pool | repooled |
Traffic
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
cp[1087-1090] | eqiad will be depooled, NOOP | done | |
dns1002 | disable puppet and stop bird | done | |
doh1002 | disable puppet and stop bird | done | |
durum1001 | disable puppet and stop bird | done | |
lvs[1016,1020] | eqiad will be depooled, NOOP | done | |
Data Engineering
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
an-airflow[1003-1004] | Announce downtime for these machines - Remind users to pause/unpause DAGs | None | |
an-conf1003 | None | None | |
an-druid1005 | None | None | |
an-presto[1001,1003] | None | None | |
an-test-coord1001 | None | None | |
an-test-druid1001 | None | None | |
an-test-presto1001 | None | None | |
an-test-worker1003 | None | None | |
an-worker[1092-1095,1101,1112-1116,1134-1138] | Stop gobblin ingestion with puppet (1 hour ahead), Stop YARN queues with puppet (30 minutes ahead), Put HDFS into safe mode (5 minutes ahead) | Reverse these three steps | Complete |
analytics[1067-1068,1076-1077] | Stop gobblin ingestion with puppet (1 hour ahead), Stop YARN queues with puppet (30 minutes ahead), Put HDFS into safe mode (5 minutes ahead) | Reverse these three steps | Complete |
aqs[1014-1015,1019] | depool | pool | Complete |
dbstore1007 | None | None | |
druid[1006,1008] | None | None | |
eventlog1003 | None | None | |
flerovium | None | None | |
kafka-jumbo[1006,1008-1009] | None | None | |
schema1004 | depool | pool | Complete |
stat[1005-1006] | Announce downtime for stat100[5-6] | None | |
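The lead times in the an-worker and analytics rows above can be turned into wall-clock times with a little arithmetic. The sketch below assumes that "ahead" is measured from the start of the 13:00-15:00 UTC window, and that "reverse these three steps" means undoing them in the opposite order; both are assumptions, and the function names are hypothetical.

```python
# Assumption: "ahead" is measured from the start of the 13:00-15:00 UTC window.
WINDOW_START_MINUTES = 13 * 60

# (step, minutes before the window opens), as listed in the table rows.
DEPOOL_STEPS = [
    ("Stop gobblin ingestion with puppet", 60),
    ("Stop YARN queues with puppet", 30),
    ("Put HDFS into safe mode", 5),
]


def depool_schedule(window_start=WINDOW_START_MINUTES, steps=DEPOOL_STEPS):
    """Return (HH:MM, step) pairs in UTC, earliest first."""
    plan = []
    for name, lead in sorted(steps, key=lambda step: -step[1]):
        hours, minutes = divmod(window_start - lead, 60)
        plan.append((f"{hours:02d}:{minutes:02d}", name))
    return plan


def repool_order(steps=DEPOOL_STEPS):
    """Assumption: the repool undoes the depool steps in the opposite order."""
    return [name for name, _ in reversed(steps)]


if __name__ == "__main__":
    for when, name in depool_schedule():
        print(f"{when} UTC  {name}")
```

Under those assumptions, gobblin ingestion stops at 12:00 UTC, the YARN queues at 12:30 UTC, and HDFS enters safe mode at 12:55 UTC.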
Data Engineering and Machine Learning
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
dse-k8s-worker1004 | none | none | none |
Data Persistence
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
backup[1001,1007] | 1) make sure mediabackups on eqiad are stopped 2) ongoing bacula backups will fail - should be minimal disruption | retry failed backups for faster recovery / check and restart media backups on eqiad | |
backupmon1001 | just a monitoring host, downtime would be enough | Making sure checks work as usual | |
db[1102,1106,1114,1122-1123,1125,1136-1138,1140,1148-1149,1153,1172-1175,1182,1184,1221-1225] | None, eqiad will be depooled | ||
dborch1001 | Nothing to be done | ||
dbprov1004 | Make sure no ongoing backup | Retry failed, if any | |
dbproxy[1016-1017] | Failover m3-master and m5-master | Reload proxies | Both failed over already by @Marostegui |
es[1023,1033-1034] | None, eqiad will be depooled | ||
moss-fe1002 | n/a | n/a | Not in production |
ms-be[1043,1048,1055-1056,1059,1063,1067] | None | None | |
pc1014 | None, eqiad will be depooled | ||
thanos-be1004 | None | None | |
ms-fe1013 | |||
WMCS
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
cloudcontrol1007 | |||
cloudcumin1001 | |||
clouddb[1019-1020] | |||
cloudrabbit1003 | |||
cloudweb1004 | |||
ServiceOps
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
chartmuseum1001 | |||
conf1009 | |||
kafka-main[1004-1005] | |||
kubernetes[1013-1014,1016,1021,1024] | |||
kubestage1004 | |||
mc[1051-1054] | |||
mc-gp1003 | |||
mc-wf1002 | |||
mw[1349-1384,1437-1447,1487-1488] | |||
parse[1018-1024] | |||
rdb[1010,1012] | |||
scandium | |||
testreduce1001 | None | None | |