eqiad row B switches upgrade
For reasons detailed in T327248: eqiad/codfw virtual-chassis upgrades, we're going to upgrade the eqiad row B switches during the scheduled DC switchover.
This has been rescheduled to March 28th, 14:00-16:00 UTC (one week later than originally planned, to avoid a conflict with Sprint week). Please let us know if there is any issue with the scheduled time.
This means a 30-minute hard downtime for the whole row if everything goes well (in practice, closer to 15 minutes). It is also a good opportunity to test the host depool mechanisms and the row redundancy of services.
The impacted servers and teams for this row are listed below.
The "action needed" columns are free-form:
- please write NONE if no action is needed
- the cookbook/command to run, if it can be done by a third party
- who will be around to take care of the depool
- a link to the relevant doc
- etc.
The two main types of action needed are depool and monitoring downtime.
All servers will be downtimed with sudo cookbook sre.hosts.downtime --hours 2 -r "eqiad row B upgrade" -t T330165 'P{P:netbox::host%location ~ "B.*eqiad"}', but specific services might need their own downtimes.
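The server lists below use a compact bracket notation for host ranges, for example cp[1079-1082] or cloudbackup[1001-1002]-dev. To count or iterate over the individual hosts a team owns in this row, the notation can be expanded with a small helper. The following is a minimal sketch, not the tooling used for the maintenance: the function name expand_hosts is hypothetical, and it only handles a single bracket group with comma-separated numbers and ranges. Libraries such as ClusterShell's NodeSet cover the notation more completely.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand 'name[a-b,c]suffix' into individual host names.

    Hypothetical helper: supports one bracket group only.
    """
    expr = expr.strip()
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        # No bracket group: the expression is already a single host name.
        return [expr]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        # "1079-1082" is a range; "1001" is a single number.
        start, _, end = part.strip().partition("-")
        end = end or start
        width = len(start)  # preserve zero padding, if any
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


print(expand_hosts("cp[1079-1082]"))
# ['cp1079', 'cp1080', 'cp1081', 'cp1082']
print(len(expand_hosts("mw[1393-1404,1423-1433,1466-1481]")))
# 39
```

The count for the mw entry follows from the three ranges: 12 + 11 + 16 = 39 hosts.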
Traffic
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
dns1003 (formerly authdns1001) | disable puppet and stop bird | done | |
cp[1079-1082] | depool | eqiad will be depooled, no action | |
durum1002 | disable puppet and stop bird | done | |
lvs1014 | N/A | ||
lvs1018 | disable puppet & stop pybal | eqiad will be depooled, no action | |
ServiceOps-Collab
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
contint1002 | NONE | NONE | |
gitlab1004 | NONE | NONE | NONE |
gitlab-runner1002 | pause in admin interface | unpause in admin interface | unpaused/pooled again |
otrs1001 | NONE | NONE | downtime will be announced |
phab1004 | NONE | NONE | downtime announced in wikitech-l |
Infrastructure Foundations
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
debmonitor1002 | none | none | |
failoid1002 | none | none | |
ganeti[1013-1018] | none | none | |
idp1002 | failover idp.w.o | ready to proceed | |
ldap-replica1003 | depool | depooled | |
mirror1001 | none | none | |
netflow1002 | none | none | |
puppetdb1003 | none | none | |
puppetmaster[1001,1003] | disable puppet fleet-wide (as puppet.wikimedia.org goes to eqiad) | puppet disabled | |
Infrastructure Foundations and Observability
Infrastructure-Foundations SRE Observability
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
netmon1003 | None | None | |
Observability
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
arclamp1001 | none | none | downtime scheduled |
centrallog1002 | none | none | downtime scheduled |
graphite1005 | failover to codfw | fail back to eqiad | moved back to eqiad |
kafka-logging1001 | downtime scheduled | ||
logstash[1011,1027,1032] | drain shards on 1011,1027; depool 1032 | allocate shards on 1011,1027; repool 1032 | shards allocating, hosts pooled |
prometheus1006 | depool and remove from AM | pool and put back in AM | repool completed |
Observability and Data Persistence
SRE Observability Data-Persistence
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
thanos-fe1002 | depool | pool | completed |
Core Platform
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
dumpsdata1001 | none | ||
maps[1007-1008] | depool | pool | |
restbase[1017,1022-1024,1029,1032] | depool | pool | |
snapshot[1008,1010,1013] | none | ||
thumbor[1001-1002] | depool | pool | |
Search Platform
Discovery-Search
contact: @bking (inflatador on IRC)
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
cloudelastic[1002,1006] | no action needed | ||
elastic[1055-1056,1074-1079,1085-1086] | no action needed | ||
relforge1004 | no action needed | ||
wcqs1002 | no action needed | ||
wdqs[1007,1009,1012] | no action needed | ||
Data Persistence
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
backup[1003,1005] | Jaime to make sure they are idle during downtime | ||
db[1104,1112-1113,1118-1119,1124,1130,1132,1139,1143-1144,1152,1155,1162-1165,1178-1179,1183,1187-1188,1206] | db1183 needs to be switched over (T330847), db1164 needs to be switched over (T331510) | db1183 and db1164 are no longer masters (T330847, T331510) | |
dbprov1002 | Jaime to make sure they are idle during downtime | ||
dbproxy[1014-1015] | @Marostegui will failover dbproxy1014 and dbproxy1015 | Reload the proxies | Both proxies have been failed over and are not active |
es[1021,1025,1029-1030] | Jaime: may require an es backup retry after downtime | Nothing on MW side as eqiad is depooled | |
ms-be[1041,1047,1052-1053,1058,1061,1065] | nothing needed | ||
ms-fe1010 | depool | pool | depooled |
pc1012 | Nothing to do, eqiad is depooled | ||
thanos-be1002 | nothing needed | ||
Machine Learning
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
ml-etcd1001 | none | none | none |
ml-serve1002 | none | none | none |
ml-serve-ctrl1001 | none | none | none |
ores[1003-1004] | sudo -i depool | sudo -i pool | |
Data Engineering
Data-Engineering
Announce downtime for other teams' pipelines + announce downtime for Hive, Presto + Superset limited functionality
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
an-conf1001 | NONE | NONE | |
an-coord1001 | Fail over Hive to an-coord1002. N.B.: we will lose MariaDB, and therefore Superset, some Druid functionality, Hive, DataHub | Fail back Hive to an-coord1001 | Failed over Hive |
an-druid1004 | NONE | NONE | |
an-launcher1002 | Disable Gobblin ingestion at 12:50 UTC | Re-enable ingestion | patch for Gobblin; Gobblin jobs absented |
an-master1002 | Putting into safe mode + Disabling YARN | Taking out of safe mode + Enabling YARN | Scheduled for 13:30 UTC: patch for YARN; YARN queues stopped, safe mode entered |
an-presto1004 | NONE | NONE | |
an-test-coord1002 | NONE | NONE | |
an-test-ui1001 | NONE | NONE | |
an-tool[1008-1009] | NONE + Announce downtime for Hue | NONE | Done |
an-web1001 | NONE + Announce downtime for wikistats and analytics.wikimedia.org | NONE | Done |
an-worker[1083-1087,1097-1098,1117,1124-1128,1130] | Putting HDFS into safe mode | Taking HDFS out of safe mode | Safe mode entered |
analytics[1061-1063,1072-1073] | Putting HDFS into safe mode | Taking HDFS out of safe mode | Safe mode entered |
aqs[1011,1017] | NONE | NONE | |
datahubsearch1002 | NONE | NONE | |
druid[1005,1007] | NONE | NONE | |
kafka-jumbo1003 | NONE | NONE | |
schema1003 | NONE | NONE | |
stat[1007,1009] | Announce downtime for these two stats servers | Done | |
Data Engineering and Machine Learning
Data-Engineering Machine-Learning-Team
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
dse-k8s-ctrl1002 | NONE | NONE | |
dse-k8s-etcd1002 | NONE | NONE | |
dse-k8s-worker1002 | NONE | NONE | |
WMCS
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
dbproxy1019 | |||
cloudbackup[1001-1002]-dev | |||
cloudcephmon1001 | |||
cloudcephosd1003 | |||
cloudcontrol1006 | |||
clouddb[1015-1016] | |||
clouddumps1001 | |||
cloudrabbit1001 | |||
cloudservices1005 | |||
cloudvirt[1017,1019-1024] | |||
cloudvirt-wdqs[1001-1003] | |||
cloudweb1003 | |||
ServiceOps
Servers | Depool action needed | Repool action needed | Status |
---|---|---|---|
conf1008 | |||
dragonfly-supernode1001 | |||
kafka-main1002 | |||
kubernetes[1009-1010,1015,1019,1022] | |||
kubestage1003 | |||
kubestagemaster1001 | |||
kubestagetcd1005 | |||
kubetcd1006 | |||
mc[1041-1044] | |||
mc-wf1001 | |||
mw[1393-1404,1423-1433,1466-1481] | |||
mwmaint1002 | |||
parse[1007-1012,1017] | |||
rdb1009 | |||
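Since teams are asked to write NONE when no action is needed, an empty "Depool action needed" cell means the row has not been filled in yet (as in the WMCS and ServiceOps tables above). A minimal sketch for spotting such rows follows, assuming the tables are available as the pipe-delimited text shown in this task. The function name unfilled_rows is hypothetical and not part of any existing tooling.

```python
def unfilled_rows(table_lines):
    """Return server entries whose 'Depool action needed' cell is empty."""
    pending = []
    for line in table_lines:
        cells = [cell.strip() for cell in line.split("|")]
        # Skip blank lines, the header row and the '---' separator row.
        if cells[0] == "Servers" or set(cells[0]) <= {"-"}:
            continue
        # cells[1] is the "Depool action needed" column.
        if len(cells) > 1 and cells[1] == "":
            pending.append(cells[0])
    return pending


rows = [
    "Servers | Depool action needed | Repool action needed | Status |",
    "---|---|---|---|",
    "conf1008 | |||",
    "lvs1014 | N/A | ||",
    "ms-fe1010 | depool | pool | depooled |",
]
print(unfilled_rows(rows))
# ['conf1008']
```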