eqiad row D switches upgrade
Closed, Resolved, Public

Description

For reasons detailed in T327248 (eqiad/codfw virtual-chassis upgrades), we're going to upgrade the eqiad row D switches during the scheduled DC switchover.

Scheduled for April 18th, 13:00-15:00 UTC. Please let us know if there is any issue with the scheduled time.
This means a 30-minute hard downtime for the whole row if everything goes well (in practice, closer to 15 minutes). It is also a good opportunity to test the host depool mechanisms and the row redundancy of services.

The impacted servers and teams for this row are listed below.
The "action needed" entries are fairly free-form:

please write NONE if no action is needed,
the cookbook/command to run if it can be done by a 3rd party
who will be around to take care of the depool
Link to the relevant doc
etc

The two main types of action needed are depool and monitoring downtime.

NOTE: If the servers can handle a longer depool, it's preferred to depool them many hours or the day before (and mark None in the table) so there are fewer moving parts closer to the maintenance window.

All servers will be downtimed with `sudo cookbook sre.hosts.downtime --hours 2 -r "eqiad row D upgrade" -t XXX 'P{P:netbox::host%location ~ "D.*eqiad"}'`, but specific services might need specific downtimes.
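The downtime command above mixes two levels of quoting (a reason with spaces and a query containing double quotes), which is easy to get wrong when the command is generated by a script. A minimal sketch, assuming only the flags shown in the task description: the helper name `build_downtime_command` is hypothetical, and the `XXX` task placeholder is kept as written.

```python
import shlex


def build_downtime_command(hours, reason, task, query):
    """Return the argv list for the downtime cookbook call shown in the task.

    Hypothetical helper: only the flags that appear in the task description
    (--hours, -r, -t) are used here.
    """
    return [
        "sudo", "cookbook", "sre.hosts.downtime",
        "--hours", str(hours),
        "-r", reason,
        "-t", task,
        query,
    ]


argv = build_downtime_command(
    hours=2,
    reason="eqiad row D upgrade",
    task="XXX",  # placeholder kept from the task description
    query='P{P:netbox::host%location ~ "D.*eqiad"}',
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

Building the command as a list and quoting it only at the end keeps the query string intact regardless of the characters it contains.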

Observability

SRE Observability

Servers	Depool action needed	Repool action needed	Status
kafka-logging1003
logstash[1012,1029-1031,1035]	drain shards 1012,1029,1035 depool 1030,1031 & set downtime	allocate shards, repool	shards allocating, pooled
xhgui1001

Core Platform

Platform Engineering

Servers	Depool action needed	Repool action needed	Status
dumpsdata1002	None	None
maps1010	None	None
restbase[1018,1025-1027,1030,1033]	`depool`	`pool`	Done
sessionstore1003	None	None
snapshot[1009,1015]	None	None

Infrastructure Foundations

Infrastructure-Foundations

Servers	Depool action needed	Repool action needed	Status
bast1003	sent announcement	NONE
cuminunpriv1001	NONE	NONE
ganeti[1019-1022,1033-1034]	NONE	NONE
idm1001	NONE	NONE
idm-test1001	NONE	NONE
ldap-replica1004	depooled	repooled	OK
ping1003	Remove ping redirect config on CR routers in eqiad	Re-run homer to add deleted firewall term back	repooled
pki-root1001	NONE	NONE
puppetboard1002	NONE	NONE
puppetmaster1002	sudo cumin '*' 'disable-puppet "Switch reboot: T333377"'	sudo cumin '*' 'enable-puppet "Switch reboot: T333377"'
sretest1001	NONE	NONE
urldownloader1004	NONE	NONE

Unowned

Servers	Depool action needed	Repool action needed	Status
irc1001	failed over to 2001	not needed, can remain on irc2001	OK
irc1002	NONE	NONE	OK

Search Platform

Discovery-Search

Servers	Depool action needed	Repool action needed
apifeatureusage1001	NONE	NONE
cloudelastic1004	NONE	NONE
elastic[1060-1067]	NONE	NONE
search-loader1001	NONE	NONE
wdqs[1005,1008]	NONE	NONE

ServiceOps-Collab

collaboration-services

Servers	Depool action needed	Repool action needed	Status
aphlict1001	NONE	NONE
gitlab-runner1004	Will be paused in admin interface	Will be unpaused in admin interface	paused
miscweb1003	NONE	NONE
releases1002

Machine Learning

Machine-Learning-Team

Servers	Depool action needed	Repool action needed	Status
ml-etcd1003	none	none	none
ml-serve1004	none	none	none
ml-serve-ctrl1002	none	none	none
ores[1007-1009]	sudo -i depool	sudo -i pool	repooled

Traffic

Servers	Depool action needed	Status
cp[1087-1090]	eqiad will be depooled, NOOP	done
dns1002	disable puppet and stop bird	done
doh1002	disable puppet and stop bird	done
durum1001	disable puppet and stop bird	done
lvs[1016,1020]	eqiad will be depooled, NOOP	done

Data Engineering

Data-Engineering

Servers	Depool action needed	Repool action needed	Status
an-airflow[1003-1004]	Announce downtime for these machines - Remind users to pause/unpause DAGs	None
an-conf1003	None	None
an-druid1005	None	None
an-presto[1001,1003]	None	None
an-test-coord1001	None	None
an-test-druid1001	None	None
an-test-presto1001	None	None
an-test-worker1003	None	None
an-worker[1092-1095,1101,1112-1116,1134-1138]	Stop gobblin ingestion with puppet (1 hour ahead), Stop YARN queues with puppet (30 minutes ahead), Put HDFS into safe mode (5 minutes ahead)	Reverse these three steps	Complete
analytics[1067-1068,1076-1077]	Stop gobblin ingestion with puppet (1 hour ahead), Stop YARN queues with puppet (30 minutes ahead), Put HDFS into safe mode (5 minutes ahead)	Reverse these three steps	Complete
aqs[1014-1015,1019]	`depool`	`pool`	Complete
dbstore1007	None	None
druid[1006,1008]	None	None
eventlog1003	None	None
flerovium	None	None
kafka-jumbo[1006,1008-1009]	None	None
schema1004	`depool`	`pool`	Complete
stat[1005-1006]	Announce downtime for stat100[5-6]	None
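The lead times listed for the an-worker and analytics hosts (1 hour, 30 minutes, and 5 minutes ahead) can be turned into wall-clock times relative to the start of the window. A minimal sketch, assuming the 13:00 UTC start on April 18th 2023 from the description; the names `WINDOW_START`, `LEAD_TIMES`, and `planned_schedule` are illustrative only, and the times actually used on the day may differ.

```python
from datetime import datetime, timedelta, timezone

# Start of the maintenance window, taken from the task description.
WINDOW_START = datetime(2023, 4, 18, 13, 0, tzinfo=timezone.utc)

# (step, lead time before the window) as listed in the table.
LEAD_TIMES = [
    ("Stop gobblin ingestion with puppet", timedelta(hours=1)),
    ("Stop YARN queues with puppet", timedelta(minutes=30)),
    ("Put HDFS into safe mode", timedelta(minutes=5)),
]


def planned_schedule(window_start, lead_times):
    """Return (planned time, step) pairs, earliest first."""
    schedule = [(window_start - lead, step) for step, lead in lead_times]
    return sorted(schedule)


for when, step in planned_schedule(WINDOW_START, LEAD_TIMES):
    print(when.strftime("%H:%M UTC"), step)
```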

Data Engineering and Machine Learning

Data-Engineering Machine-Learning-Team

Servers	Depool action needed	Repool action needed	Status
dse-k8s-worker1004	none	none	none

Data Persistence

Data-Persistence

Servers	Depool action needed	Repool action needed	Status
backup[1001,1007]	1) make sure mediabackups on eqiad are stopped 2) ongoing bacula backups will fail - should be minimal disruption	retry failed backups for faster recovery/check and restart media backups on eqiad
backupmon1001	just a monitoring host, downtime would be enough	Making sure checks work as usual
db[1102,1106,1114,1122-1123,1125,1136-1138,1140,1148-1149,1153,1172-1175,1182,1184,1221-1225]	None, eqiad will be depooled
dborch1001	Nothing to be done
dbprov1004	Make sure no ongoing backup	Retry failed, if any
dbproxy[1016-1017]	Failover m3-master and m5-master	Reload proxies	Both failed over already by @Marostegui
es[1023,1033-1034]	None, eqiad will be depooled
moss-fe1002	n/a	n/a	Not in production
ms-be[1043,1048,1055-1056,1059,1063,1067]	None	None
pc1014	None, eqiad will be depooled
thanos-be1004	None	None
ms-fe1013

WMCS

cloud-services-team

Servers	Depool action needed	Repool action needed	Status
cloudcontrol1007
cloudcumin1001
clouddb[1019-1020]
cloudrabbit1003
cloudweb1004

ServiceOps

serviceops

Servers	Depool action needed	Repool action needed
chartmuseum1001
conf1009
kafka-main[1004-1005]
kubernetes[1013-1014,1016,1021,1024]
kubestage1004
mc[1051-1054]
mc-gp1003
mc-wf1002
mw[1349-1384,1437-1447,1487-1488]
parse[1018-1024]
rdb[1010,1012]
scandium
testreduce1001	None	None

Details

Subject	Repo	Branch	Lines +/-
Disable the gobblin timers temporarily on the prod cluster	operations/puppet	production	+1 -1
hiera: temporarily removed dns1002 from authdns_servers	operations/puppet	production	+0 -1
Stop the YARN queues temporarily to facilitate switch maintenance	operations/puppet	production	+4 -4
depool eqiad	operations/dns	master	+2 -0
Failover irc.wikimedia.org to irc2001	operations/dns	master	+1 -1
wmnet: Failover m5-master	operations/dns	master	+1 -1
wmnet: Failover m3-master	operations/dns	master	+1 -1

Related Objects

Status	Assigned	Task
Open	None	T253824 planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1)
Resolved	ayounsi	T254013 all network devices must run OpenSSH >= 7.2p1 but != 7.4p1
Resolved	ayounsi	T317175 Junos: resolve DNS through mgmt_junos
Resolved	ayounsi	T327862 Use mgmt_junos on all network devices
		Restricted Task
Open	None	T316539 Upgrade network devices to Junos 20+
Resolved	ayounsi	T327248 eqiad/codfw virtual-chassis upgrades
Resolved	cmooney	T291627 Packet Drops on Eqiad ASW -> CR uplinks
Resolved	Jclark-ctr	T313463 eqiad: upgrade row C and D uplinks from 4x10G to 1x40G
Resolved	ayounsi	T320566 Cr1-eqiad comms problem when moving to 40G row D handoff
Resolved	Clement_Goubert	T327920 March 2023 Datacenter Switchover
Resolved	cmooney	T333377 eqiad row D switches upgrade

Event Timeline

Jelto updated the task description. Apr 11 2023, 4:07 PM

@ayounsi we are placing new DB hosts in production, can you run the same query you ran to gather the affected DBs just in case we have new ones being affected?

ayounsi added a parent task: T327920: March 2023 Datacenter Switchover. Apr 12 2023, 12:30 PM

In T333377#8775126, @Marostegui wrote:

@ayounsi we are placing new DB hosts in production, can you run the same query you ran to gather the affected DBs just in case we have new ones being affected?

Updated, the diff is the addition of db1221-1225 as well as ms-fe1013

Thank you, nothing changes from our DB side!

ayounsi assigned this task to cmooney. Apr 14 2023, 7:03 AM

Vgutierrez updated the task description. Apr 14 2023, 11:22 AM

colewhite updated the task description. Apr 14 2023, 9:08 PM

BTullis updated the task description. Apr 17 2023, 10:45 AM

ssingh updated the task description. Apr 17 2023, 8:41 PM

Mentioned in SAL (#wikimedia-operations) [2023-04-17T21:17:57Z] <inflatador> bking@cumin1001 ban cloudelastic1004 for upcoming switch maintenance T333377

bking updated the task description. Apr 17 2023, 9:59 PM

Mentioned in SAL (#wikimedia-operations) [2023-04-17T21:59:38Z] <ryankemper@cumin2002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 13 hosts with reason: T333377 maint

Mentioned in SAL (#wikimedia-operations) [2023-04-17T21:59:59Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 13 hosts with reason: T333377 maint

Marostegui updated the task description. Apr 18 2023, 7:02 AM

Marostegui updated the task description. Apr 18 2023, 7:10 AM

Change 909608 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Disable the gobblin timers temporarily on the prod cluster

https://gerrit.wikimedia.org/r/909608

gerritbot added a project: Patch-For-Review. Apr 18 2023, 8:51 AM

Change 909616 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Failover irc.wikimedia.org to irc2001

https://gerrit.wikimedia.org/r/909616

Change 909616 merged by Muehlenhoff:

[operations/dns@master] Failover irc.wikimedia.org to irc2001

https://gerrit.wikimedia.org/r/909616

Change 909621 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Stop the YARN queues temporarily to facilitate switch maintenance

https://gerrit.wikimedia.org/r/909621

MoritzMuehlenhoff updated the task description. Apr 18 2023, 9:51 AM

MoritzMuehlenhoff updated the task description.

ArielGlenn updated the task description. Apr 18 2023, 9:56 AM

eoghan updated the task description. Apr 18 2023, 10:17 AM

MoritzMuehlenhoff updated the task description. Apr 18 2023, 10:23 AM

MoritzMuehlenhoff updated the task description. Apr 18 2023, 10:26 AM

MoritzMuehlenhoff updated the task description. Apr 18 2023, 10:30 AM

jbond updated the task description. Apr 18 2023, 10:38 AM

Change 909608 merged by Btullis:

[operations/puppet@production] Disable the gobblin timers temporarily on the prod cluster

https://gerrit.wikimedia.org/r/909608

Mentioned in SAL (#wikimedia-analytics) [2023-04-18T11:34:27Z] <btullis> disable gobblin timers T333377

Mentioned in SAL (#wikimedia-analytics) [2023-04-18T11:36:25Z] <btullis> stopping YARN queues T333377

Change 909621 merged by Btullis:

[operations/puppet@production] Stop the YARN queues temporarily to facilitate switch maintenance

https://gerrit.wikimedia.org/r/909621

Change 909653 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: temporarily removed dns1002 from authdns_servers

https://gerrit.wikimedia.org/r/909653

Mentioned in SAL (#wikimedia-analytics) [2023-04-18T11:45:20Z] <btullis> depooled schema1004 T333377

Mentioned in SAL (#wikimedia-operations) [2023-04-18T11:48:46Z] <effie> depooling eqiad due to eqiad row D switches upgrade - T333377

jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade - T333377 started.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T11:50:55Z] <jiji@cumin1001> START - Cookbook sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade - T333377

BTullis updated the task description. Apr 18 2023, 11:54 AM

BTullis updated the task description. Apr 18 2023, 11:57 AM

hnowlan updated the task description. Apr 18 2023, 11:58 AM

Change 909653 merged by Ssingh:

[operations/puppet@production] hiera: temporarily removed dns1002 from authdns_servers

https://gerrit.wikimedia.org/r/909653

Maintenance_bot removed a project: Patch-For-Review. Apr 18 2023, 12:20 PM

ssingh updated the task description. Apr 18 2023, 12:21 PM

jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade - T333377 failed.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T12:26:32Z] <jiji@cumin1001> END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) depool all active/active services in eqiad: eqiad row D switches upgrade - T333377

jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade - T333377 started.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T12:27:14Z] <jiji@cumin1001> START - Cookbook sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade - T333377

jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row D switches upgrade - T333377 completed.

Mentioned in SAL (#wikimedia-operations) [2023-04-18T12:27:19Z] <jiji@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all active/active services in eqiad: eqiad row D switches upgrade - T333377

Change 909662 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] depool eqiad

https://gerrit.wikimedia.org/r/909662

gerritbot added a project: Patch-For-Review. Apr 18 2023, 12:33 PM

Change 909662 merged by Ssingh:

[operations/dns@master] depool eqiad

https://gerrit.wikimedia.org/r/909662

Mentioned in SAL (#wikimedia-operations) [2023-04-18T13:06:50Z] <topranks> disabling ping offload on cr1-eqiad and cr2-eqiad in advance of row D switch upgrade T333377

Icinga downtime and Alertmanager silence (ID=7fc7ae6f-d3b2-43ed-b030-194ed6367c80) set by cmooney@cumin1001 for 2:00:00 on 270 host(s) and their services with reason: eqiad row D upgrade

an-airflow[1003-1004].eqiad.wmnet,an-conf1003.eqiad.wmnet,an-druid1005.eqiad.wmnet,an-presto[1001,1003].eqiad.wmnet,an-test-coord1001.eqiad.wmnet,an-test-druid1001.eqiad.wmnet,an-test-presto1001.eqiad.wmnet,an-test-worker1003.eqiad.wmnet,an-worker[1092-1095,1101,1112-1116,1134-1138].eqiad.wmnet,analytics[1067-1069,1076-1077].eqiad.wmnet,aphlict[1001-1002].eqiad.wmnet,apifeatureusage1001.eqiad.wmnet,aqs[1014-1015,1019].eqiad.wmnet,backup[1001,1007].eqiad.wmnet,backupmon1001.eqiad.wmnet,bast1003.wikimedia.org,chartmuseum1001.eqiad.wmnet,cloudbackup1004.eqiad.wmnet,cloudcephmon1002.eqiad.wmnet,cloudcephosd[1011-1015,1019-1020,1023-1024].eqiad.wmnet,cloudcontrol1007.wikimedia.org,cloudcumin1001.eqiad.wmnet,clouddb[1019-1020].eqiad.wmnet,cloudelastic1004.wikimedia.org,cloudgw1002.eqiad.wmnet,cloudnet1006.eqiad.wmnet,cloudrabbit1003.wikimedia.org,cloudvirt[1028-1030,1036-1047].eqiad.wmnet,cloudvirtlocal1001.eqiad.wmnet,cloudweb1004.wikimedia.org,conf1009.eqiad.wmnet,cp[1087-1090].eqiad.wmnet,cuminunpriv1001.eqiad.wmnet,db[1106,1114,1122-1123,1125,1136-1138,1140,1148-1149,1153,1172-1175,1182,1184,1221-1225].eqiad.wmnet,dborch1001.wikimedia.org,dbprov1004.eqiad.wmnet,dbproxy[1016-1017].eqiad.wmnet,dbstore1007.eqiad.wmnet,dns1002.wikimedia.org,doh1002.wikimedia.org,druid[1006,1008].eqiad.wmnet,dse-k8s-worker1004.eqiad.wmnet,dumpsdata1002.eqiad.wmnet,durum1001.eqiad.wmnet,elastic[1060-1067].eqiad.wmnet,es[1023,1033-1034].eqiad.wmnet,eventlog1003.eqiad.wmnet,flerovium.eqiad.wmnet,ganeti[1019-1022,1033-1034].eqiad.wmnet,gitlab-runner1004.eqiad.wmnet,idm1001.wikimedia.org,idm-test1001.wikimedia.org,irc[1001-1002].wikimedia.org,kafka-jumbo[1006,1008-1009].eqiad.wmnet,kafka-logging1003.eqiad.wmnet,kafka-main[1004-1005].eqiad.wmnet,kubernetes[1013-1014,1016,1021,1024].eqiad.wmnet,kubestage1004.eqiad.wmnet,ldap-replica1004.wikimedia.org,logstash[1012,1029-1031,1035].eqiad.wmnet,lvs[1016,1020].eqiad.wmnet,maps1010.eqiad.wmnet,mc[1051-1054].eqiad.wmnet,mc-gp1003.eqiad.wmnet,mc-wf1002.eqiad.wmnet,miscweb1003.eqiad.wmnet,ml-etcd1003.eqiad.wmnet,ml-serve1004.eqiad.wmnet,ml-serve-ctrl1002.eqiad.wmnet,moss-fe1002.eqiad.wmnet,ms-be[1043,1048,1055-1056,1059,1063,1067].eqiad.wmnet,ms-fe1013.eqiad.wmnet,mw[1349-1384,1437-1447,1487-1488].eqiad.wmnet,ores[1007-1009].eqiad.wmnet,parse[1018-1024].eqiad.wmnet,pc1014.eqiad.wmnet,ping1003.eqiad.wmnet,pki-root1001.eqiad.wmnet,puppetboard1002.eqiad.wmnet,puppetmaster1002.eqiad.wmnet,rdb[1010,1012].eqiad.wmnet,releases1002.eqiad.wmnet,restbase[1018,1025-1027,1030,1033].eqiad.wmnet,scandium.eqiad.wmnet,schema1004.eqiad.wmnet,search-loader1001.eqiad.wmnet,sessionstore1003.eqiad.wmnet,snapshot[1009,1015].eqiad.wmnet,sretest1001.eqiad.wmnet,stat[1005-1006].eqiad.wmnet,testreduce1001.eqiad.wmnet,thanos-be1004.eqiad.wmnet,urldownloader1004.wikimedia.org,wdqs[1005,1008].eqiad.wmnet,xhgui1001.eqiad.wmnet
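The host list above uses a compact bracket notation (`name[1016-1017].domain`), which makes it hard to check by eye whether a given host is covered or how many hosts an entry stands for. A minimal sketch of expanding that notation in plain Python; the function names are hypothetical, it only handles the single-bracket numeric form seen in this task, and a dedicated library such as ClusterShell's NodeSet covers the notation far more completely.

```python
import re


def split_top_level(hostlist):
    """Split a host list on commas that are not inside [...] brackets."""
    parts, depth, current = [], 0, []
    for ch in hostlist:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand(pattern):
    """Expand one entry such as 'db[1106,1122-1123].eqiad.wmnet'."""
    m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if m is None:
        return [pattern]  # plain hostname, nothing to expand
    prefix, body, suffix = m.groups()
    hosts = []
    for item in body.split(","):
        if "-" in item:
            lo, hi = item.split("-")
            width = len(lo)  # keep any zero padding
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
        else:
            hosts.append(f"{prefix}{item}{suffix}")
    return hosts


def expand_hostlist(hostlist):
    """Expand a full comma-separated host list into individual hostnames."""
    return [h for part in split_top_level(hostlist) for h in expand(part)]


sample = "dbproxy[1016-1017].eqiad.wmnet,irc[1001-1002].wikimedia.org,flerovium.eqiad.wmnet"
for host in expand_hostlist(sample):
    print(host)
```

Running `len(expand_hostlist(...))` on the full list is one way to cross-check it against the host count reported by the downtime message.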

cmooney updated the task description. Apr 18 2023, 1:11 PM

Mentioned in SAL (#wikimedia-operations) [2023-04-18T13:12:11Z] <jbond> disable puppet fleet wide T333377

Icinga downtime and Alertmanager silence (ID=e714b564-285e-4f22-b860-267d7c23208d) set by cmooney@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: eqiad row D upgrade

asw2-d-eqiad

klausman updated the task description. Apr 18 2023, 1:21 PM

Mentioned in SAL (#wikimedia-operations) [2023-04-18T13:25:08Z] <topranks> Rebooting asw2-d-eqiad virtual-chassis (all row D top-of-rack switches) to upgrade JunOS. Row D going down T333377

Maintenance_bot removed a project: Patch-For-Review. Apr 18 2023, 1:26 PM

dbproxy[1016-1017] reloaded

klausman updated the task description. Apr 18 2023, 1:50 PM

MoritzMuehlenhoff updated the task description. Apr 18 2023, 1:52 PM

cmooney updated the task description. Apr 18 2023, 2:01 PM

Mentioned in SAL (#wikimedia-operations) [2023-04-18T14:04:33Z] <sukhe> running authdns-update to repool eqiad after switch maint: T333377

ayounsi added a parent task: T320566: Cr1-eqiad comms problem when moving to 40G row D handoff. Apr 18 2023, 2:53 PM

Mentioned in SAL (#wikimedia-operations) [2023-04-18T15:07:04Z] <claime> repooling all eqiad active active services post T333377