eqiad row D switches upgrade
Closed, Resolved (Public)

Description

For the reasons detailed in T327248 (eqiad/codfw virtual-chassis upgrades), we're going to upgrade the eqiad row D switches during the scheduled DC switchover.

Scheduled for April 18th, 13:00-15:00 UTC. Please let us know if there is any issue with the scheduled time.
This means a 30-minute hard downtime for the whole row if everything goes well (realistically closer to 15 minutes). It is also a good opportunity to test the host depool mechanisms and the row redundancy of services.

The impacted servers and teams for this row are listed below.
The format for the actions needed is quite free-form:

  • please write NONE if no action is needed,
  • the cookbook/command to run if it can be done by a 3rd party
  • who will be around to take care of the depool
  • Link to the relevant doc
  • etc

The two main types of actions needed are depool and monitoring downtime.

NOTE: If the servers can handle a longer depool, it's preferred to depool them many hours or the day before (and mark None in the table) so there are fewer moving parts closer to the maintenance window.

All servers will be downtimed with the following command, but specific services might need specific downtimes:

  sudo cookbook sre.hosts.downtime --hours 2 -r "eqiad row D upgrade" -t XXX 'P{P:netbox::host%location ~ "D.*eqiad"}'
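
The server lists in this task use a compact bracket notation for host ranges, for example logstash[1012,1029-1031,1035]. As an illustration only, here is a minimal Python sketch that expands such an expression into individual host names. The helper name expand_hosts is hypothetical; this is not how Cumin or the cookbooks implement it, and it assumes at most one bracket group per expression.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand 'name[1,3-5]suffix' into individual host names.

    Hypothetical helper for illustration; handles a single bracket group.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        # No bracket group: the expression is already a single host name.
        return [expr]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a lone number is a range of one
        width = len(start)  # preserve any zero padding
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


print(expand_hosts("logstash[1012,1029-1031,1035]"))
```

An expression without brackets, such as xhgui1001, is returned unchanged as a one-element list.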

Observability

SRE Observability

Servers | Depool action needed | Repool action needed | Status
kafka-logging1003 | | |
logstash[1012,1029-1031,1035] | drain shards 1012,1029,1035; depool 1030,1031 & set downtime | allocate shards, repool | shards allocating, pooled
xhgui1001 | | |

Core Platform

Platform Engineering

Servers | Depool action needed | Repool action needed | Status
dumpsdata1002 | None | None |
maps1010 | None | None |
restbase[1018,1025-1027,1030,1033] | depool | pool | Done
sessionstore1003 | None | None |
snapshot[1009,1015] | None | None |

Infrastructure Foundations

Infrastructure-Foundations

Servers | Depool action needed | Repool action needed | Status
bast1003 | sent announcement | NONE |
cuminunpriv1001 | NONE | NONE |
ganeti[1019-1022,1033-1034] | NONE | NONE |
idm1001 | NONE | NONE |
idm-test1001 | NONE | NONE |
ldap-replica1004 | depooled | repooled | OK
ping1003 | Remove ping redirect config on CR routers in eqiad | Re-run homer to add deleted firewall term back | repooled
pki-root1001 | NONE | NONE |
puppetboard1002 | NONE | NONE |
puppetmaster1002 | sudo cumin '*' 'disable-puppet "Switch reboot: T333377"' | sudo cumin '*' 'enable-puppet "Switch reboot: T333377"' |
sretest1001 | NONE | NONE |
urldownloader1004 | NONE | NONE |

Unowned

Servers | Depool action needed | Repool action needed | Status
irc1001 | failed over to 2001 | not needed, can remain on irc2001 | OK
irc1002 | NONE | NONE | OK

Search Platform

Discovery-Search

Servers | Depool action needed | Repool action needed | Status
apifeatureusage1001 | NONE | NONE |
cloudelastic1004 | NONE | NONE |
elastic[1060-1067] | NONE | NONE |
search-loader1001 | NONE | NONE |
wdqs[1005,1008] | NONE | NONE |

ServiceOps-Collab

collaboration-services

Servers | Depool action needed | Repool action needed | Status
aphlict1001 | NONE | NONE |
gitlab-runner1004 | Will be paused in admin interface | Will be unpaused in admin interface | paused
miscweb1003 | NONE | NONE |
releases1002 | | |

Machine Learning

Machine-Learning-Team

Servers | Depool action needed | Repool action needed | Status
ml-etcd1003 | none | none | none
ml-serve1004 | none | none | none
ml-serve-ctrl1002 | none | none | none
ores[1007-1009] | sudo -i depool | sudo -i pool | repooled

Traffic

Traffic

Servers | Depool action needed | Repool action needed | Status
cp[1087-1090] | eqiad will be depooled, NOOP | | done
dns1002 | disable puppet and stop bird | | done
doh1002 | disable puppet and stop bird | | done
durum1001 | disable puppet and stop bird | | done
lvs[1016,1020] | eqiad will be depooled, NOOP | | done

Data Engineering

Data-Engineering

Servers | Depool action needed | Repool action needed | Status
an-airflow[1003-1004] | Announce downtime for these machines - Remind users to pause/unpause DAGs | None |
an-conf1003 | None | None |
an-druid1005 | None | None |
an-presto[1001,1003] | None | None |
an-test-coord1001 | None | None |
an-test-druid1001 | None | None |
an-test-presto1001 | None | None |
an-test-worker1003 | None | None |
an-worker[1092-1095,1101,1112-1116,1134-1138] | Stop gobblin ingestion with puppet (1 hour ahead), Stop YARN queues with puppet (30 minutes ahead), Put HDFS into safe mode (5 minutes ahead) | Reverse these three steps | Complete
analytics[1067-1068,1076-1077] | Stop gobblin ingestion with puppet (1 hour ahead), Stop YARN queues with puppet (30 minutes ahead), Put HDFS into safe mode (5 minutes ahead) | Reverse these three steps | Complete
aqs[1014-1015,1019] | depool | pool | Complete
dbstore1007 | None | None |
druid[1006,1008] | None | None |
eventlog1003 | None | None |
flerovium | None | None |
kafka-jumbo[1006,1008-1009] | None | None |
schema1004 | depool | pool | Complete
stat[1005-1006] | Announce downtime for stat100[5-6] | None |
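
The an-worker and analytics rows above stage their preparation at 1 hour, 30 minutes and 5 minutes ahead. As a small worked example, the sketch below turns those lead times into clock times. It assumes the lead times are counted back from the start of the scheduled window (13:00 UTC on April 18th; the year 2023 is taken from the SAL timestamps in the timeline), which the table does not state explicitly. The names window_start, steps and schedule are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: lead times are measured from the start of the maintenance window.
window_start = datetime(2023, 4, 18, 13, 0, tzinfo=timezone.utc)

# Steps and lead times as listed in the Data Engineering table.
steps = [
    ("Stop gobblin ingestion with puppet", timedelta(hours=1)),
    ("Stop YARN queues with puppet", timedelta(minutes=30)),
    ("Put HDFS into safe mode", timedelta(minutes=5)),
]


def schedule(start, steps):
    """Return (time, step name) pairs, one per step, in the given order."""
    return [(start - lead, name) for name, lead in steps]


for when, name in schedule(window_start, steps):
    print(when.strftime("%H:%M UTC"), name)
```

Under that assumption the three steps fall at 12:00, 12:30 and 12:55 UTC; the repool side would run the same steps in reverse order after the window.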

Data Engineering and Machine Learning

Data-Engineering Machine-Learning-Team

Servers | Depool action needed | Repool action needed | Status
dse-k8s-worker1004 | none | none | none

Data Persistence

Data-Persistence

Servers | Depool action needed | Repool action needed | Status
backup[1001,1007] | 1) make sure mediabackups on eqiad are stopped 2) ongoing bacula backups will fail - should be minimal disruption | retry failed backups for faster recovery/check and restart media backups on eqiad |
backupmon1001 | just a monitoring host, downtime would be enough | Making sure checks work as usual |
db[1102,1106,1114,1122-1123,1125,1136-1138,1140,1148-1149,1153,1172-1175,1182,1184,1221-1225] | None, eqiad will be depooled | |
dborch1001 | Nothing to be done | |
dbprov1004 | Make sure no ongoing backup | Retry failed, if any |
dbproxy[1016-1017] | Failover m3-master and m5-master | Reload proxies | Both failed over already by @Marostegui
es[1023,1033-1034] | None, eqiad will be depooled | |
moss-fe1002 | n/a | n/a | Not in production
ms-be[1043,1048,1055-1056,1059,1063,1067] | None | None |
pc1014 | None, eqiad will be depooled | |
thanos-be1004 | None | None |
ms-fe1013 | | |

WMCS

cloud-services-team

Servers | Depool action needed | Repool action needed | Status
cloudcontrol1007 | | |
cloudcumin1001 | | |
clouddb[1019-1020] | | |
cloudrabbit1003 | | |
cloudweb1004 | | |

ServiceOps

serviceops

Servers | Depool action needed | Repool action needed | Status
chartmuseum1001 | | |
conf1009 | | |
kafka-main[1004-1005] | | |
kubernetes[1013-1014,1016,1021,1024] | | |
kubestage1004 | | |
mc[1051-1054] | | |
mc-gp1003 | | |
mc-wf1002 | | |
mw[1349-1384,1437-1447,1487-1488] | | |
parse[1018-1024] | | |
rdb[1010,1012] | | |
scandium | | |
testreduce1001 | None | None |

Event Timeline

@ayounsi we are placing new DB hosts in production, can you run the same query you ran to gather the affected DBs just in case we have new ones being affected?

Updated, the diff is the addition of db1221-1225 as well as ms-fe1013

Thank you, nothing changes from our DB side!

Mentioned in SAL (#wikimedia-operations) [2023-04-17T21:17:57Z] <inflatador> bking@cumin1001 ban cloudelastic1004 for upcoming switch maintenance T333377

Mentioned in SAL (#wikimedia-operations) [2023-04-17T21:59:38Z] <ryankemper@cumin2002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 13 hosts with reason: T333377 maint

Mentioned in SAL (#wikimedia-operations) [2023-04-17T21:59:59Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 13 hosts with reason: T333377 maint

Change 909608 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Disable the gobblin timers temporarily on the prod cluster

https://gerrit.wikimedia.org/r/909608

Change 909616 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Failover irc.wikimedia.org to irc2001

https://gerrit.wikimedia.org/r/909616

Change 909616 merged by Muehlenhoff:

[operations/dns@master] Failover irc.wikimedia.org to irc2001

https://gerrit.wikimedia.org/r/909616

Change 909621 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Stop the YARN queues temporarily to facilitate switch maintenance

https://gerrit.wikimedia.org/r/909621

Change 909608 merged by Btullis:

[operations/puppet@production] Disable the gobblin timers temporarily on the prod cluster

https://gerrit.wikimedia.org/r/909608

Change 909621 merged by Btullis:

[operations/puppet@production] Stop the YARN queues temporarily to facilitate switch maintenance

https://gerrit.wikimedia.org/r/909621

Change 909653 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: temporarily removed dns1002 from authdns_servers

https://gerrit.wikimedia.org/r/909653

Mentioned in SAL (#wikimedia-operations) [2023-04-18T11:48:46Z] <effie> depooling eqiad due to eqiad row D switches upgrade - T333377

jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade - T333377 started.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T11:50:55Z] <jiji@cumin1001> START - Cookbook sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade - T333377

Change 909653 merged by Ssingh:

[operations/puppet@production] hiera: temporarily removed dns1002 from authdns_servers

https://gerrit.wikimedia.org/r/909653

jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade - T333377 failed.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T12:26:32Z] <jiji@cumin1001> END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) depool all active/active services in eqiad: eqiad row D switches upgrade - T333377

jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade - T333377 started.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T12:27:14Z] <jiji@cumin1001> START - Cookbook sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade - T333377

jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade - T333377 completed.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T12:27:19Z] <jiji@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all active/active services in eqiad: eqiad row D switches upgrade - T333377

Change 909662 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] depool eqiad

https://gerrit.wikimedia.org/r/909662

Change 909662 merged by Ssingh:

[operations/dns@master] depool eqiad

https://gerrit.wikimedia.org/r/909662

Mentioned in SAL (#wikimedia-operations) [2023-04-18T13:06:50Z] <topranks> disabling ping offload on cr1-eqiad and cr2-eqiad in advance of row D switch upgrade T333377

Icinga downtime and Alertmanager silence (ID=7fc7ae6f-d3b2-43ed-b030-194ed6367c80) set by cmooney@cumin1001 for 2:00:00 on 270 host(s) and their services with reason: eqiad row D upgrade

an-airflow[1003-1004].eqiad.wmnet,an-conf1003.eqiad.wmnet,an-druid1005.eqiad.wmnet,an-presto[1001,1003].eqiad.wmnet,an-test-coord1001.eqiad.wmnet,an-test-druid1001.eqiad.wmnet,an-test-presto1001.eqiad.wmnet,an-test-worker1003.eqiad.wmnet,an-worker[1092-1095,1101,1112-1116,1134-1138].eqiad.wmnet,analytics[1067-1069,1076-1077].eqiad.wmnet,aphlict[1001-1002].eqiad.wmnet,apifeatureusage1001.eqiad.wmnet,aqs[1014-1015,1019].eqiad.wmnet,backup[1001,1007].eqiad.wmnet,backupmon1001.eqiad.wmnet,bast1003.wikimedia.org,chartmuseum1001.eqiad.wmnet,cloudbackup1004.eqiad.wmnet,cloudcephmon1002.eqiad.wmnet,cloudcephosd[1011-1015,1019-1020,1023-1024].eqiad.wmnet,cloudcontrol1007.wikimedia.org,cloudcumin1001.eqiad.wmnet,clouddb[1019-1020].eqiad.wmnet,cloudelastic1004.wikimedia.org,cloudgw1002.eqiad.wmnet,cloudnet1006.eqiad.wmnet,cloudrabbit1003.wikimedia.org,cloudvirt[1028-1030,1036-1047].eqiad.wmnet,cloudvirtlocal1001.eqiad.wmnet,cloudweb1004.wikimedia.org,conf1009.eqiad.wmnet,cp[1087-1090].eqiad.wmnet,cuminunpriv1001.eqiad.wmnet,db[1106,1114,1122-1123,1125,1136-1138,1140,1148-1149,1153,1172-1175,1182,1184,1221-1225].eqiad.wmnet,dborch1001.wikimedia.org,dbprov1004.eqiad.wmnet,dbproxy[1016-1017].eqiad.wmnet,dbstore1007.eqiad.wmnet,dns1002.wikimedia.org,doh1002.wikimedia.org,druid[1006,1008].eqiad.wmnet,dse-k8s-worker1004.eqiad.wmnet,dumpsdata1002.eqiad.wmnet,durum1001.eqiad.wmnet,elastic[1060-1067].eqiad.wmnet,es[1023,1033-1034].eqiad.wmnet,eventlog1003.eqiad.wmnet,flerovium.eqiad.wmnet,ganeti[1019-1022,1033-1034].eqiad.wmnet,gitlab-runner1004.eqiad.wmnet,idm1001.wikimedia.org,idm-test1001.wikimedia.org,irc[1001-1002].wikimedia.org,kafka-jumbo[1006,1008-1009].eqiad.wmnet,kafka-logging1003.eqiad.wmnet,kafka-main[1004-1005].eqiad.wmnet,kubernetes[1013-1014,1016,1021,1024].eqiad.wmnet,kubestage1004.eqiad.wmnet,ldap-replica1004.wikimedia.org,logstash[1012,1029-1031,1035].eqiad.wmnet,lvs[1016,1020].eqiad.wmnet,maps1010.eqiad.wmnet,mc[1051-1054].eqiad.wmnet,mc-gp1003.eqiad.wmnet,mc-wf1002.eqiad.wmnet,miscweb1003.eqiad.wmnet,ml-etcd1003.eqiad.wmnet,ml-serve1004.eqiad.wmnet,ml-serve-ctrl1002.eqiad.wmnet,moss-fe1002.eqiad.wmnet,ms-be[1043,1048,1055-1056,1059,1063,1067].eqiad.wmnet,ms-fe1013.eqiad.wmnet,mw[1349-1384,1437-1447,1487-1488].eqiad.wmnet,ores[1007-1009].eqiad.wmnet,parse[1018-1024].eqiad.wmnet,pc1014.eqiad.wmnet,ping1003.eqiad.wmnet,pki-root1001.eqiad.wmnet,puppetboard1002.eqiad.wmnet,puppetmaster1002.eqiad.wmnet,rdb[1010,1012].eqiad.wmnet,releases1002.eqiad.wmnet,restbase[1018,1025-1027,1030,1033].eqiad.wmnet,scandium.eqiad.wmnet,schema1004.eqiad.wmnet,search-loader1001.eqiad.wmnet,sessionstore1003.eqiad.wmnet,snapshot[1009,1015].eqiad.wmnet,sretest1001.eqiad.wmnet,stat[1005-1006].eqiad.wmnet,testreduce1001.eqiad.wmnet,thanos-be1004.eqiad.wmnet,urldownloader1004.wikimedia.org,wdqs[1005,1008].eqiad.wmnet,xhgui1001.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=e714b564-285e-4f22-b860-267d7c23208d) set by cmooney@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: eqiad row D upgrade

asw2-d-eqiad

Mentioned in SAL (#wikimedia-operations) [2023-04-18T13:25:08Z] <topranks> Rebooting asw2-d-eqiad virtual-chassis (all row D top-of-rack switches) to upgrade JunOS. Row D going down T333377

Mentioned in SAL (#wikimedia-operations) [2023-04-18T14:04:33Z] <sukhe> running authdns-update to repool eqiad after switch maint: T333377

Mentioned in SAL (#wikimedia-operations) [2023-04-18T15:07:04Z] <claime> repooling all eqiad active active services post T333377

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 started.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T15:07:56Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 failed.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T15:38:29Z] <cgoubert@cumin1001> END (ERROR) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in eqiad: End of maintenance - T333377

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 started.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T15:38:34Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 failed.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T15:54:41Z] <cgoubert@cumin1001> END (ERROR) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in eqiad: End of maintenance - T333377

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 started.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T15:54:48Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 failed.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T16:00:08Z] <cgoubert@cumin1001> END (ERROR) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in eqiad: End of maintenance - T333377

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 started.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T16:00:19Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 failed.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T16:03:38Z] <cgoubert@cumin1001> END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in eqiad: End of maintenance - T333377

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 started.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T16:03:43Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 failed.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T16:04:50Z] <cgoubert@cumin1001> END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in eqiad: End of maintenance - T333377

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 started.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T16:08:06Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of maintenance - T333377 completed.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T16:08:11Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in eqiad: End of maintenance - T333377
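
The cookbook START/END entries in the SAL lines above follow a regular shape, which makes it easy to pull out timestamps and outcomes when reviewing a run. The following is a rough sketch only: the regular expression and the parse_sal helper are assumptions written against the lines quoted in this task, not an official SAL format definition or tool.

```python
import re
from datetime import datetime

# Pattern inferred from the entries in this task; treat it as an assumption.
SAL_RE = re.compile(
    r"\[(?P<ts>[0-9T:\-]+Z)\] <(?P<actor>[^>]+)> "
    r"(?P<phase>START|END)(?: \((?P<status>[A-Z]+)\))? - "
    r"Cookbook (?P<cookbook>\S+)"
)


def parse_sal(line):
    """Return a dict for a cookbook START/END entry, or None for other lines."""
    match = SAL_RE.search(line)
    if match is None:
        return None
    entry = match.groupdict()
    # %z accepts the trailing 'Z' (UTC) on Python 3.7 and later.
    entry["ts"] = datetime.strptime(entry["ts"], "%Y-%m-%dT%H:%M:%S%z")
    return entry


start = parse_sal(
    "Mentioned in SAL (#wikimedia-operations) [2023-04-18T16:08:06Z] "
    "<cgoubert@cumin1001> START - Cookbook sre.discovery.datacenter pool all "
    "active/active services in eqiad: End of maintenance - T333377"
)
end = parse_sal(
    "Mentioned in SAL (#wikimedia-operations) [2023-04-18T16:08:11Z] "
    "<cgoubert@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter "
    "(exit_code=0) pool all active/active services in eqiad: "
    "End of maintenance - T333377"
)
print(end["status"], (end["ts"] - start["ts"]).total_seconds())
```

For the final pool run quoted above, this gives a PASS status and 5 seconds between the START and END entries. That figure is only the gap between two log lines, not a measurement of service impact.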

All works complete, no issues to report.