
eqiad row C switches upgrade
Closed, Resolved · Public

Description

eqiad row C switches upgrade

For reasons detailed in T327248: eqiad/codfw virtual-chassis upgrades we're going to upgrade eqiad row C switches during the scheduled DC switchover.

Scheduled for April 4th, 13:00-15:00 UTC; please let us know if there is any issue with the scheduled time.
This means up to 30 min of hard downtime for the whole row if everything goes well (realistically, closer to 15 min). It is also a good opportunity to test the hosts' depool mechanisms and the row redundancy of services.

The impacted servers and teams for this row are listed below.
The "actions needed" column is quite free-form:

  • please write NONE if no action is needed,
  • the cookbook/command to run if it can be done by a 3rd party,
  • who will be around to take care of the depool,
  • a link to the relevant doc,
  • etc.

The two main types of actions needed are depool and monitoring downtime.

NOTE: If the servers can handle a longer depool, it's preferred to depool them many hours or even the day before (and mark None in the table) so there are fewer moving parts closer to the maintenance window.

All servers will be downtimed with sudo cookbook sre.hosts.downtime --hours 2 -r "eqiad row C upgrade" -t T331882 'P{P:netbox::host%location ~ "C.*eqiad"}' but specific services might need specific downtimes.
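The host lists in this task use ClusterShell-style range notation (e.g. logstash[1025,1028,1034]), which cumin expands before targeting hosts. A minimal Python sketch of that expansion, for readers unfamiliar with the notation (illustrative only; cumin uses the real ClusterShell NodeSet implementation, not this helper):

```python
import re

def expand_hosts(pattern: str) -> list[str]:
    """Expand ClusterShell-style range notation such as
    'logstash[1025,1028,1034]' or 'wdqs[1010,1013-1014].eqiad.wmnet'
    into individual hostnames. Illustrative sketch only."""
    m = re.fullmatch(r"([\w.-]+)\[([\d,-]+)\](.*)", pattern)
    if not m:
        return [pattern]  # no bracket expression: already a single host
    prefix, ranges, suffix = m.groups()
    hosts = []
    for part in ranges.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # preserve zero-padding, e.g. '0012'
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{str(n).zfill(width)}{suffix}")
        else:
            hosts.append(f"{prefix}{part}{suffix}")
    return hosts
```

For example, expand_hosts("wdqs[1010,1013-1014].eqiad.wmnet") yields the three wdqs hostnames downtimed in the SAL entries further down.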

Observability

SRE Observability

Servers | Depool action needed | Repool action needed | Status
alert1001 | fail services over to alert2001 | fail services back to alert1001 | incomplete
kafka-logging1002 | schedule downtime | | incomplete
logstash[1025,1028,1034] | drain shards 1028,1034; depool 1025 & set downtime | allocate shards, pool | shards allocating, pooled
mwlog1002 | schedule downtime, deploy MW patch | revert MW patch | incomplete
webperf1003 | none | none |

Observability and Data Persistence

SRE Observability Data-Persistence

Servers | Depool action needed | Repool action needed | Status
thanos-fe1003 | depool | pool | done

Core Platform

Platform Engineering

Servers | Depool action needed | Repool action needed | Status
dumpsdata[1003,1005] | | |
maps1009 | N/A | N/A |
sessionstore1002 | None | None |
snapshot1014 | | |
thumbor1006 | depool | pool |

ServiceOps-Collab

collaboration-services

Servers | Depool action needed | Repool action needed | Status
doc1002 | None | None |
etherpad1003 | None | None |
gitlab-runner1003 | pause in admin interface | unpause in admin interface | unpaused/repooled Tue, April 4th by @Jelto
miscweb1002 | None | None |

Search Platform

Discovery-Search

Servers | Depool action needed | Repool action needed | Status
cloudelastic1003 | None | |
elastic[1057-1059,1080-1083,1087-1088] | None | |
wcqs1003 | None | |
wdqs[1010,1013-1014] | None | |

Data Engineering

Data-Engineering

Servers | Depool action needed | Repool action needed | Status
an-conf1002 | None | None |
an-coord1002 | None | None |
an-db1002 | None | None |
an-druid1002 | None | None |
an-test-master1002 | None | None |
an-test-worker1002 | None | None |
an-tool[1005,1007,1010] | Announce downtime for Superset and Turnilo | None | done
an-worker[1088-1091,1099-1100,1104-1111,1131-1133] | Stop Gobblin ingestion with puppet (1 hour ahead), stop YARN queues with puppet (30 minutes ahead), put HDFS into safe mode (5 minutes ahead) | Reverse these three steps | Done
analytics[1064-1066,1074-1075] | Stop Gobblin ingestion with puppet (1 hour ahead), stop YARN queues with puppet (30 minutes ahead), put HDFS into safe mode (5 minutes ahead) | Reverse these three steps | Done
aqs[1012-1013,1018] | depool | pool | Done
datahubsearch1003 | depool | pool | Done
db1108 | None | None |
dbstore1005 | None | None |
kafka-jumbo[1004-1005,1007] | None | None |
matomo1002 | Announce downtime for Matomo | None | done
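The staggered lead times in the Hadoop rows above (Gobblin 1 hour ahead, YARN queues 30 minutes ahead, HDFS safe mode 5 minutes ahead) translate into a simple pre-maintenance schedule. A small sketch of that calculation; the step names and offsets come from the table, everything else is illustrative:

```python
from datetime import datetime, timedelta

# Lead times from the table above: how long before the maintenance
# window each drain step should happen.
DRAIN_STEPS = [
    ("stop Gobblin ingestion (via puppet)", timedelta(hours=1)),
    ("stop YARN queues (via puppet)", timedelta(minutes=30)),
    ("put HDFS into safe mode", timedelta(minutes=5)),
]

def drain_schedule(window_start: datetime) -> list[tuple[str, datetime]]:
    """Return (step, when-to-run) pairs, earliest first."""
    return [(step, window_start - lead) for step, lead in DRAIN_STEPS]

# Example: the April 4th window started at 13:00 UTC.
for step, when in drain_schedule(datetime(2023, 4, 4, 13, 0)):
    print(f"{when:%H:%M} UTC  {step}")
```

So for a 13:00 UTC window, Gobblin stops at 12:00, YARN queues at 12:30, and HDFS enters safe mode at 12:55 (which matches the ~12:57 safe-mode SAL entries further down).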

Data Engineering or Search Platform

Data-Engineering Discovery-Search
Currently without owners, see https://gerrit.wikimedia.org/r/c/operations/puppet/+/903686 for the fix
an-airflow1005.eqiad.wmnet and an-airflow1001.eqiad.wmnet

Data Engineering and Machine Learning

Data-Engineering Machine-Learning-Team

Servers | Depool action needed | Repool action needed | Status
dse-k8s-etcd1003 | none | none | none
dse-k8s-worker1003 | none | none | none

Infrastructure Foundations

Infrastructure-Foundations

Servers | Depool action needed | Repool action needed | Status
cumin1001 | NONE | NONE |
ganeti[1009-1012,1024,1027-1028] | NONE | NONE |
idp-test1002 | NONE | NONE |
install1004 | NONE | NONE |
mx1001 | NONE | NONE |
puppetdb1002 | disable puppet fleet-wide: cumin '*' 'disable-puppet "Switch maintenance: T331882"' | cumin '*' 'enable-puppet "Switch maintenance: T331882"' |
puppetmaster1005 | NONE | NONE |
rpki1001 | NONE | NONE |
seaborgium | NONE | NONE |
urldownloader[1002-1003] | Failover to 1001 | not needed (can remain on 1001) | DONE

Traffic

Traffic

Servers | Depool action needed | Repool action needed | Status
acmechief1001 | N/A | |
acmechief-test1001 | N/A | |
cp[1083-1086] | depool | | eqiad will be depooled, NOOP
doh1001 | stop puppet && disable bird | | done
lvs1015 | N/A | |
lvs1019 | stop pybal && disable puppet | | eqiad will be depooled, NOOP
ncredir1001 | | | eqiad will be depooled, NOOP

Machine Learning

Machine-Learning-Team

Servers | Depool action needed | Repool action needed | Status
ml-cache1002 | none | none | none
ml-etcd1002 | none | none | none
ml-serve1003 | none | none | none
ores[1005-1006] | sudo -i depool | sudo -i pool |
orespoolcounter1004 | none | none | none

Data Persistence

Data-Persistence

Servers | Depool action needed | Repool action needed | Status
backup[1002,1006] | ES backups will fail; to be delayed until after maintenance | Restart es eqiad backups |
db[1100-1101,1110,1120-1121,1131,1133-1135,1145-1147,1150,1166-1171,1180-1181,1189] | db1101 will need to be failed over (T333123) as it is going to become m1 master as part of T331510 to allow row B maintenance | |
dbprov1003 | Jaime to make sure they are idle during maintenance | |
dbproxy[1020-1021] | Nothing to be done, they are not active at the moment | |
es[1022,1031-1032] | Nothing to be done as eqiad will be depooled | |
moss-be1002 | n/a | n/a | Not in production
ms-backup1002 | Jaime to make sure they are idle during maintenance | |
ms-be[1042,1049-1050,1054,1062,1066] | None | None |
ms-fe1011 | sudo depool | sudo pool | done
pc1013 | Nothing to be done as eqiad will be depooled | |
thanos-be1003 | None | None |

ServiceOps

serviceops

Servers | Depool action needed | Repool action needed | Status
deploy1002 | | |
kafka-main1003 | | |
kubemaster1002 | | |
kubernetes[1006,1011-1012,1020,1023] | | |
kubestagetcd1006 | | |
kubetcd1004 | | |
mc[1045-1050] | | |
mc-gp1002 | | |
mw[1405-1413,1434-1436,1482-1486] | | |
mwdebug1001 | | |
parse[1013-1016] | | |
poolcounter1005 | | |
registry1004 | | |

WMCS

cloud-services-team

Servers | Depool action needed | Repool action needed | Status
cloudcontrol1005 | None - no action needed | None - no action needed | Active, but no depool is required, just alert downtime.
clouddb[1017-1018] | @fnegri taking care of this | |
clouddumps1002 | https://gerrit.wikimedia.org/r/905628 | https://gerrit.wikimedia.org/r/905610 | @aborrero taking care of it
cloudmetrics1004 | None - no action needed | None - no action needed | Active, but no depool is required, just alert downtime.
cloudrabbit1002 | None - no action needed | None - no action needed | Active, but no depool is required, just alert downtime. Clients should know how to use the other rabbit servers.
dbproxy1018 | @fnegri taking care of this | |
labstore[1004-1005] | to be decommissioned - no action needed | to be decommissioned - no action needed | to be decommissioned - no action needed

Event Timeline


Change 901322 had a related patch set uploaded (by Krinkle; author: Tim Starling):

[operations/mediawiki-config@master] Temporarily disable xenon/excimer for mwlog1002 switch maintenance

https://gerrit.wikimedia.org/r/901322

Change 899629 merged by Herron:

[operations/puppet@production] alerting_host: failover icinga and alertmanger from eqiad to codfw

https://gerrit.wikimedia.org/r/899629

jbond updated the task description.

Mentioned in SAL (#wikimedia-operations) [2023-04-03T21:12:33Z] <ryankemper@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wcqs1003.eqiad.wmnet,wdqs[1010,1013-1014].eqiad.wmnet with reason: T331882 eqiad row C maint

Mentioned in SAL (#wikimedia-operations) [2023-04-03T21:12:52Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wcqs1003.eqiad.wmnet,wdqs[1010,1013-1014].eqiad.wmnet with reason: T331882 eqiad row C maint

Mentioned in SAL (#wikimedia-operations) [2023-04-03T21:16:48Z] <ryankemper@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 10 hosts with reason: T331882 eqiad row C maint

Mentioned in SAL (#wikimedia-operations) [2023-04-03T21:17:04Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 10 hosts with reason: T331882 eqiad row C maint

Mentioned in SAL (#wikimedia-operations) [2023-04-03T21:25:51Z] <inflatador> bking@cumin ban cloudelastic1003 from all cloudelastic clusters T331882

Mentioned in SAL (#wikimedia-operations) [2023-04-04T06:09:48Z] <XioNoX> stage new Junos on asw2-c-eqiad - T331882

Change 905469 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Failover urldownloader in eqiad for row C switch maintenance

https://gerrit.wikimedia.org/r/905469

Change 905469 merged by Muehlenhoff:

[operations/dns@master] Failover urldownloader in eqiad for row C switch maintenance

https://gerrit.wikimedia.org/r/905469

MW section masters:

  • db1100: s5
  • db1131: s6
  • db1181: s7

Need to downtime the whole sections for these. I'll do it a bit later.

Change 905596 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Stop Yarn queues and Gobblin timers

https://gerrit.wikimedia.org/r/905596

Change 901322 merged by jenkins-bot:

[operations/mediawiki-config@master] Temporarily disable xenon/excimer for mwlog1002 switch maintenance

https://gerrit.wikimedia.org/r/901322

Mentioned in SAL (#wikimedia-operations) [2023-04-04T12:17:30Z] <tstarling@deploy2002> Synchronized src/Profiler.php: T331882 disable profiling for switch maintenance (duration: 05m 58s)

Change 905603 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] Depool eqiad for network maintenance

https://gerrit.wikimedia.org/r/905603

Change 905603 merged by Ssingh:

[operations/dns@master] Depool eqiad for network maintenance

https://gerrit.wikimedia.org/r/905603

Mentioned in SAL (#wikimedia-operations) [2023-04-04T12:31:03Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 6:00:00 on 38 hosts with reason: Row c switch maint T331882

Mentioned in SAL (#wikimedia-operations) [2023-04-04T12:31:58Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 38 hosts with reason: Row c switch maint T331882

aborrero updated the task description.

Change 905596 merged by Stevemunene:

[operations/puppet@production] Stop Hadoop Yarn queues to ease network maintenance

https://gerrit.wikimedia.org/r/905596

Change 905628 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] clouddumps: depool clouddumps1002

https://gerrit.wikimedia.org/r/905628

Change 905628 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] clouddumps: depool clouddumps1002

https://gerrit.wikimedia.org/r/905628

akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row C switches upgrade - T331882 started.

Mentioned in SAL (#wikimedia-operations) [2023-04-04T12:44:54Z] <akosiaris@cumin1001> START - Cookbook sre.discovery.datacenter depool all active/active services in eqiad: eqiad row C switches upgrade - T331882

Mentioned in SAL (#wikimedia-operations) [2023-04-04T12:57:20Z] <steve_munene> putting pdfs into safe mode as part of T331882

Mentioned in SAL (#wikimedia-analytics) [2023-04-04T12:57:57Z] <steve_munene> putting hdfs into safe mode as part of T331882

Icinga downtime and Alertmanager silence (ID=80a32cef-9700-4047-8185-415ffca1aaa2) set by ayounsi@cumin1001 for 2:00:00 on 227 host(s) and their services with reason: eqiad row C upgrade

acmechief1001.eqiad.wmnet,acmechief-test1001.eqiad.wmnet,an-airflow[1001,1005].eqiad.wmnet,an-conf1002.eqiad.wmnet,an-coord1002.eqiad.wmnet,an-db1002.eqiad.wmnet,an-druid1002.eqiad.wmnet,an-test-client1002.eqiad.wmnet,an-test-master1002.eqiad.wmnet,an-test-worker1002.eqiad.wmnet,an-tool[1005,1007,1010].eqiad.wmnet,an-worker[1088-1091,1099-1100,1104-1111,1131-1133].eqiad.wmnet,analytics[1064-1066,1074-1075].eqiad.wmnet,aqs[1012-1013,1018].eqiad.wmnet,backup[1002,1006].eqiad.wmnet,cloudbackup1003.eqiad.wmnet,cloudcephmon1003.eqiad.wmnet,cloudcephosd[1006-1009,1016-1018,1021-1022].eqiad.wmnet,cloudcontrol1005.wikimedia.org,clouddb[1017-1018].eqiad.wmnet,clouddumps1002.wikimedia.org,cloudelastic1003.wikimedia.org,cloudgw1001.eqiad.wmnet,cloudmetrics1004.eqiad.wmnet,cloudnet1005.eqiad.wmnet,cloudrabbit1002.wikimedia.org,cloudvirt[1025-1027,1031-1035].eqiad.wmnet,cp[1083-1086].eqiad.wmnet,cumin1001.eqiad.wmnet,datahubsearch1003.eqiad.wmnet,db[1100-1101,1108,1110,1120-1121,1131,1133-1135,1145-1147,1150,1166-1171,1180-1181,1189,1217-1220].eqiad.wmnet,dbprov1003.eqiad.wmnet,dbproxy[1018,1020-1021].eqiad.wmnet,dbstore1005.eqiad.wmnet,deploy1002.eqiad.wmnet,doc[1002-1003].eqiad.wmnet,doh1001.wikimedia.org,dse-k8s-etcd1003.eqiad.wmnet,dse-k8s-worker1003.eqiad.wmnet,dumpsdata[1003,1005].eqiad.wmnet,elastic[1057-1059,1080-1083,1087-1088].eqiad.wmnet,es[1022,1031-1032].eqiad.wmnet,etherpad1003.eqiad.wmnet,ganeti[1009-1012,1024,1027-1028].eqiad.wmnet,gitlab-runner1003.eqiad.wmnet,idp-test1002.wikimedia.org,install1004.wikimedia.org,kafka-jumbo[1004-1005,1007].eqiad.wmnet,kafka-logging1002.eqiad.wmnet,kafka-main1003.eqiad.wmnet,kubemaster1002.eqiad.wmnet,kubernetes[1006,1011-1012,1020,1023].eqiad.wmnet,kubestagetcd1006.eqiad.wmnet,kubetcd1004.eqiad.wmnet,labstore[1004-1005].eqiad.wmnet,logstash[1025,1028,1034].eqiad.wmnet,lvs[1015,1019].eqiad.wmnet,maps1009.eqiad.wmnet,matomo1002.eqiad.wmnet,mc[1045-1050].eqiad.wmnet,mc-gp1002.eqiad.wmnet,miscweb1002.eqiad.wmnet,ml-cache1002.eqiad.wmnet,ml-etcd1002.eqiad.wmnet,ml-serve1003.eqiad.wmnet,moss-be1002.eqiad.wmnet,ms-backup1002.eqiad.wmnet,ms-be[1042,1049-1050,1054,1062,1066].eqiad.wmnet,ms-fe1011.eqiad.wmnet,mw[1405-1413,1434-1436,1482-1486].eqiad.wmnet,mwdebug1001.eqiad.wmnet,mwlog1002.eqiad.wmnet,mx1001.wikimedia.org,ncredir1001.eqiad.wmnet,ores[1005-1006].eqiad.wmnet,orespoolcounter1004.eqiad.wmnet,parse[1013-1016].eqiad.wmnet,pc1013.eqiad.wmnet,poolcounter1005.eqiad.wmnet,puppetdb1002.eqiad.wmnet,puppetmaster1005.eqiad.wmnet,registry1004.eqiad.wmnet,rpki1001.eqiad.wmnet,seaborgium.wikimedia.org,sessionstore1002.eqiad.wmnet,snapshot1014.eqiad.wmnet,thanos-be1003.eqiad.wmnet,thanos-fe1003.eqiad.wmnet,thumbor1006.eqiad.wmnet,urldownloader[1002-1003].wikimedia.org,wcqs1003.eqiad.wmnet,wdqs[1010,1013-1014].eqiad.wmnet,webperf1003.eqiad.wmnet

akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row C switches upgrade - T331882 completed.

Mentioned in SAL (#wikimedia-operations) [2023-04-04T13:05:42Z] <akosiaris@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all active/active services in eqiad: eqiad row C switches upgrade - T331882

Mentioned in SAL (#wikimedia-operations) [2023-04-04T13:11:06Z] <XioNoX> asw2-c-eqiad> request system reboot all-members - T331882

ayounsi claimed this task.

Closing the task as the upgrade is done.

It went extremely smoothly, thank you everybody! See you in 2 weeks for eqiad row D.

jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row C switches upgrade - T331882 started.

Mentioned in SAL (#wikimedia-operations) [2023-04-04T14:43:25Z] <jiji@cumin1001> START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: eqiad row C switches upgrade - T331882

jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row C switches upgrade - T331882 failed.

Mentioned in SAL (#wikimedia-operations) [2023-04-04T15:19:38Z] <jiji@cumin1001> END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in eqiad: eqiad row C switches upgrade - T331882

Mentioned in SAL (#wikimedia-operations) [2023-04-04T20:23:47Z] <inflatador> bking@cumin1001 unban elastic nodes post switch maintenance T331882

Mentioned in SAL (#wikimedia-operations) [2023-04-04T23:25:13Z] <tstarling@deploy2002> Synchronized src/Profiler.php: re-enable excimer T331882 (duration: 06m 25s)