eqiad row B switches upgrade
Closed, Resolved (Public)

Description

For the reasons detailed in T327248: eqiad/codfw virtual-chassis upgrades, we're going to upgrade the eqiad row B switches during the scheduled DC switchover.

This has been rescheduled to March 28th, 14:00-16:00 UTC (one week later than originally planned, to avoid a conflict with Sprint week). Please let us know if there is any issue with the scheduled time.
This means roughly 30 minutes of hard downtime for the whole row if everything goes well (in practice, closer to 15 minutes). It is also a good opportunity to test the host depool mechanisms and the row redundancy of services.

The impacted servers and teams for this row are listed below.
The "actions needed" entries are quite free-form:

please write NONE if no action is needed,
the cookbook/command to run if it can be done by a 3rd party
who will be around to take care of the depool
Link to the relevant doc
etc

The two main types of actions needed are depool and monitoring downtime

NOTE: If the servers can handle a longer depool, it's preferred to depool them many hours or the day before (and mark None in the table) so there are fewer moving parts closer to the maintenance window.

All servers will be downtimed with `sudo cookbook sre.hosts.downtime --hours 2 -r "eqiad row B upgrade" -t T330165 'P{P:netbox::host%location ~ "B.*eqiad"}'`, but specific services might need specific downtimes.
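
The selector in the downtime command picks hosts by matching a regular expression against their Netbox location. A minimal Python sketch of that kind of filter follows. The inventory, the host names and the location strings are invented for illustration, and the unanchored-match behaviour is an assumption, not a statement about how the real query engine evaluates `~`.

```python
import re

# Hypothetical inventory (host -> location string), invented for this sketch.
# Real Netbox location values may be formatted differently.
INVENTORY = {
    "web1001.example": "B2 eqiad",
    "web1002.example": "B5 eqiad",
    "web1003.example": "A4 eqiad",
    "web2001.example": "B3 codfw",
}


def hosts_in_location(inventory, pattern):
    """Return the sorted hosts whose location matches `pattern`.

    Assumption: the match is an unanchored regex search.
    """
    regex = re.compile(pattern)
    return sorted(host for host, loc in inventory.items() if regex.search(loc))


# Same pattern as in the downtime command above.
row_b = hosts_in_location(INVENTORY, r"B.*eqiad")
print(row_b)  # ['web1001.example', 'web1002.example']
```

Running a dry selection like this before the window is one way to review which hosts a broad pattern would cover.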

Traffic

Servers	Depool action needed	Status
dns1003 (formerly authdns1001)	disable puppet and stop bird	done
cp[1079-1082]	depool	eqiad will be depooled, no action
durum1002	disable puppet and stop bird	done
lvs1014	N/A
lvs1018	disable puppet & stop pybal	eqiad will be depooled, no action

ServiceOps-Collab

collaboration-services

Servers	Depool action needed	Repool action needed	Status
contint1002	`NONE`	`NONE`
gitlab1004	`NONE`	`NONE`	`NONE`
gitlab-runner1002	pause in admin interface	unpause in admin interface	`unpaused/pooled` again
otrs1001	`NONE`	`NONE`	downtime will be announced
phab1004	`NONE`	`NONE`	downtime announced in wikitech-l

Infrastructure Foundations

Infrastructure-Foundations

Servers	Depool action needed	Repool action needed	Status
debmonitor1002	none	none
failoid1002	none	none
ganeti[1013-1018]	none	none
idp1002	failover idp.w.o		ready to proceed
ldap-replica1003	depool		depooled
mirror1001	none	none
netflow1002	none	none
puppetdb1003	none	none
puppetmaster[1001,1003]	disable puppet fleet-wide (as puppet.wikimedia.org goes to eqiad)		puppet disabled

Infrastructure Foundations and Observability

Infrastructure-Foundations SRE Observability

Servers	Depool action needed	Repool action needed	Status
netmon1003	None	None

Observability

SRE Observability

Servers	Depool action needed	Repool action needed	Status
arclamp1001	none	none	downtime scheduled
centrallog1002	none	none	downtime scheduled
graphite1005	failover to codfw	fail back to eqiad	moved back to eqiad
kafka-logging1001			downtime scheduled
logstash[1011,1027,1032]	drain shards on 1011,1027; depool 1032	allocate shards on 1011,1027; repool 1032	shards allocating, hosts pooled
prometheus1006	`depool` and remove from AM	`pool` and put back in AM	repool completed

Observability and Data Persistence

SRE Observability Data-Persistence

Servers	Depool action needed	Repool action needed	Status
thanos-fe1002	`depool`	`pool`	completed

Core Platform

Platform Engineering

Servers	Depool action needed	Repool action needed
dumpsdata1001	none
maps[1007-1008]	`depool`	`pool`
restbase[1017,1022-1024,1029,1032]	`depool`	`pool`
snapshot[1008,1010,1013]	none
thumbor[1001-1002]	`depool`	`pool`

Search Platform

Discovery-Search
contact: @bking (inflatador on IRC)

Servers	Depool action needed	Repool action needed	Status
cloudelastic[1002,1006]			no action needed
elastic[1055-1056,1074-1079,1085-1086]			no action needed
relforge1004			no action needed
wcqs1002			no action needed
wdqs[1007,1009,1012]			no action needed

Data Persistence

Data-Persistence

Servers	Depool action needed	Repool action needed	Status
backup[1003,1005]	Jaime to make sure they are idle during downtime
db[1104,1112-1113,1118-1119,1124,1130,1132,1139,1143-1144,1152,1155,1162-1165,1178-1179,1183,1187-1188,1206]	db1183 needs to be switched over (T330847); db1164 needs to be switched over (T331510)		db1183 and db1164 are no longer masters (T330847 T331510)
dbprov1002	Jaime to make sure they are idle during downtime
dbproxy[1014-1015]	@Marostegui will failover dbproxy1014 and dbproxy1015	Reload the proxies	Both proxies have been failed over and are not active
es[1021,1025,1029-1030]		Jaime: may require an es backup retry after downtime	Nothing on MW side as eqiad is depooled
ms-be[1041,1047,1052-1053,1058,1061,1065]			nothing needed
ms-fe1010	`depool`	`pool`	depooled
pc1012			Nothing to do, eqiad is depooled
thanos-be1002			nothing needed

Machine Learning

Machine-Learning-Team

Servers	Depool action needed	Repool action needed	Status
ml-etcd1001	none	none	none
ml-serve1002	none	none	none
ml-serve-ctrl1001	none	none	none
ores[1003-1004]	sudo -i depool	sudo -i pool

Data Engineering

Data-Engineering
Announce downtime for other teams' pipelines + announce downtime for Hive, Presto + limited functionality for Superset

Servers	Depool action needed	Repool action needed	Status
an-conf1001	NONE	NONE
an-coord1001	Fail over Hive to an-coord1002. N.B.: we will lose MariaDB, and therefore Superset, some Druid functionality, Hive and DataHub	Fail back Hive to an-coord1001	Failed over hive
an-druid1004	NONE	NONE
an-launcher1002	Disable gobblin ingestion at 12:50 UTC	Re-enable ingestion	patch for gobblin; Gobblin jobs absented
an-master1002	Putting into safe mode + Disabling YARN	Taking out of safe mode + Enabling YARN	Scheduled for 13:30 UTC: patch for YARN; YARN queues stopped, safe mode entered
an-presto1004	NONE	NONE
an-test-coord1002	NONE	NONE
an-test-ui1001	NONE	NONE
an-tool[1008-1009]	NONE + Announce downtime for Hue	NONE	Done
an-web1001	NONE + Announce downtime for wikistats and analytics.wikimedia.org	NONE	Done
an-worker[1083-1087,1097-1098,1117,1124-1128,1130]	Putting HDFS into safe mode	Taking HDFS out of safe mode	Safe mode entered
analytics[1061-1063,1072-1073]	Putting HDFS into safe mode	Taking HDFS out of safe mode	Safe mode entered
aqs[1011,1017]	NONE	NONE
datahubsearch1002	NONE	NONE
druid[1005,1007]	NONE	NONE
kafka-jumbo1003	NONE	NONE
schema1003	NONE	NONE
stat[1007,1009]	Announce downtime for these two stats servers		Done

Data Engineering and Machine Learning

Data-Engineering Machine-Learning-Team

Servers	Depool action needed	Repool action needed
dse-k8s-ctrl1002	NONE	NONE
dse-k8s-etcd1002	NONE	NONE
dse-k8s-worker1002	NONE	NONE

WMCS

cloud-services-team

Servers	Depool action needed	Repool action needed	Status
dbproxy1019
cloudbackup[1001-1002]-dev
cloudcephmon1001
cloudcephosd1003
cloudcontrol1006
clouddb[1015-1016]
clouddumps1001
cloudrabbit1001
cloudservices1005
cloudvirt[1017,1019-1024]
cloudvirt-wdqs[1001-1003]
cloudweb1003

ServiceOps

serviceops

Servers	Depool action needed	Repool action needed	Status
conf1008
dragonfly-supernode1001
kafka-main1002
kubernetes[1009-1010,1015,1019,1022]
kubestage1003
kubestagemaster1001
kubestagetcd1005
kubetcd1006
mc[1041-1044]
mc-wf1001
mw[1393-1404,1423-1433,1466-1481]
mwmaint1002
parse[1007-1012,1017]
rdb1009

Details

Subject	Repo	Branch	Lines +/-
statsd: move writes to graphite2004	operations/puppet	production	+2 -1
Failover statsd to graphite2004	operations/mediawiki-config	master	+3 -3
wmnet: move writes to graphite2004	operations/dns	master	+3 -3
graphite: check graphite2004	operations/puppet	production	+3 -3
wmnet: move reads to graphite2004	operations/dns	master	+2 -2
clouddumps: make clouddumps1002 the primary during switch maintenance	operations/puppet	production	+1 -1
hiera: temporarily removed dns1003 from authdns_servers	operations/puppet	production	+0 -1
Disable the gobblin timers temporarily for switch maintenance	operations/puppet	production	+2 -2
Revert "Depool eqiad frontends for network maintenance"	operations/dns	master	+0 -2
prometheus1006: depool from alertmanager	operations/puppet	production	+1 -0
Depool eqiad frontends for network maintenance	operations/dns	master	+2 -0
Disable job submission to YARN queues to faciliatate maintenance	operations/puppet	production	+4 -4
Failover hive services to the standby coordinator	operations/dns	master	+1 -1
wmnet: Failover m2-master	operations/dns	master	+1 -1
wmnet: Failover m1-master	operations/dns	master	+1 -1

Related Objects

Status	Assigned	Task
Open	None	T253824 planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1)
Resolved	ayounsi	T254013 all network devices must run OpenSSH >= 7.2p1 but != 7.4p1
Resolved	ayounsi	T317175 Junos: resolve DNS through mgmt_junos
Resolved	ayounsi	T327862 Use mgmt_junos on all network devices
		Restricted Task
Open	None	T316539 Upgrade network devices to Junos 20+
Resolved	ayounsi	T327248 eqiad/codfw virtual-chassis upgrades
Resolved	Clement_Goubert	T327920 March 2023 Datacenter Switchover
Resolved	ayounsi	T330165 eqiad row B switches upgrade
Invalid	Marostegui	T330977 Move db1183 to m1
Resolved	Marostegui	T331510 Switchover m1 master (db1164 -> db1101)
Resolved	Marostegui	T331511 Move db1101 to m1
Resolved	Marostegui	T330847 Switchover m5 master (db1183 -> db1176)

Event Timeline

MatthewVernon updated the task description. Mar 17 2023, 12:07 PM

Change 901322 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Temporarily disable xenon/excimer for switch maintenance

https://gerrit.wikimedia.org/r/901322

gerritbot added a project: Patch-For-Review. Mar 20 2023, 11:57 PM

cmooney updated the task description. Mar 21 2023, 9:46 AM

cmooney mentioned this in T327919: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it. Mar 21 2023, 9:50 AM

colewhite updated the task description. Mar 23 2023, 3:18 PM

Marostegui updated the task description. Mar 27 2023, 8:07 AM

Marostegui mentioned this in T333123: Switchover m1 master (db1101 -> db1164). Mar 27 2023, 8:13 AM

Change 900238 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus1006: depool from alertmanager

https://gerrit.wikimedia.org/r/900238

Marostegui closed subtask T331510: Switchover m1 master (db1164 -> db1101) as Resolved. Mar 27 2023, 8:20 AM

Change 903185 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wmnet: move reads to graphite2004

https://gerrit.wikimedia.org/r/903185

Change 903206 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: check graphite2004

https://gerrit.wikimedia.org/r/903206

Change 903207 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsd: move writes to graphite2004

https://gerrit.wikimedia.org/r/903207

Change 903208 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wmnet: move writes to graphite2004

https://gerrit.wikimedia.org/r/903208

Change 903209 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/mediawiki-config@master] Failover statsd to graphite2004

https://gerrit.wikimedia.org/r/903209

Change 900238 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus1006: depool from alertmanager

https://gerrit.wikimedia.org/r/900238

fgiunchedi updated the task description. Mar 27 2023, 8:56 AM

hnowlan updated the task description. Mar 27 2023, 10:01 AM

ArielGlenn updated the task description. Mar 27 2023, 10:01 AM

hnowlan updated the task description. Mar 27 2023, 10:02 AM

Jelto updated the task description. Mar 27 2023, 11:33 AM

Change 903246 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: temporarily removed dns1003 from authdns_servers

https://gerrit.wikimedia.org/r/903246

Change 903249 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] clouddumps: make clouddumps1002 the primary during switch maintenance

https://gerrit.wikimedia.org/r/903249

kamila subscribed. Mar 27 2023, 1:52 PM

Change 903249 merged by Andrew Bogott:

[operations/puppet@production] clouddumps: make clouddumps1002 the primary during switch maintenance

https://gerrit.wikimedia.org/r/903249

brennen subscribed. Mar 27 2023, 5:29 PM

Mentioned in SAL (#wikimedia-operations) [2023-03-27T21:45:34Z] <ryankemper> T330165 Depooled relevant search platform hosts: sudo -E cumin 'elastic[1055-1056,1074-1079,1085-1086]*,cloudelastic100[2,6]*,wcqs1002*,wdqs[1007,1012]*' 'sudo depool'

In T330165#8731601, @Stashbot wrote:

Mentioned in SAL (#wikimedia-operations) [2023-03-27T21:45:34Z] <ryankemper> T330165 Depooled relevant search platform hosts: sudo -E cumin 'elastic[1055-1056,1074-1079,1085-1086]*,cloudelastic100[2,6]*,wcqs1002*,wdqs[1007,1012]*' 'sudo depool'

Isn't this missing wdqs1009?
FYI you can also use a query like:

'P{elastic1*,cloudelastic1*,wcqs1*,wdqs1*} and P{P:netbox::host%location ~ "B.*eqiad"}'
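
The question above can be checked mechanically by expanding the bracketed host ranges and comparing the two sets. The following is a minimal Python sketch, assuming a simplified range syntax (a single bracket group per expression, trailing `*` globs stripped by hand); it is not the expansion logic used by the real tooling. The `planned` list is copied from the Search Platform table in the description and the `depooled` list from the logged command.

```python
import re


def expand(expr):
    """Expand 'name[1,3-5]suffix' into individual host names (simplified)."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]  # no bracket group: already a single host
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        low, _, high = part.partition("-")
        high = high or low  # a bare number is a one-element range
        for number in range(int(low), int(high) + 1):
            hosts.append(f"{prefix}{str(number).zfill(len(low))}{suffix}")
    return hosts


def expand_all(expressions):
    """Expand a list of range expressions into one set of host names."""
    return {host for expr in expressions for host in expand(expr)}


# From the Search Platform table in the task description.
planned = [
    "cloudelastic[1002,1006]",
    "elastic[1055-1056,1074-1079,1085-1086]",
    "relforge1004",
    "wcqs1002",
    "wdqs[1007,1009,1012]",
]
# From the logged depool command, with the trailing '*' globs removed.
depooled = [
    "elastic[1055-1056,1074-1079,1085-1086]",
    "cloudelastic100[2,6]",
    "wcqs1002",
    "wdqs[1007,1012]",
]

missing = sorted(expand_all(planned) - expand_all(depooled))
print(missing)  # ['relforge1004', 'wdqs1009']
```

The comparison confirms that wdqs1009 is absent from the command. It also lists relforge1004, which appears in the table but not in the command; the task does not say whether that host needs a depool at all.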

colewhite updated the task description. Mar 27 2023, 11:17 PM

Change 903185 merged by Filippo Giunchedi:

[operations/dns@master] wmnet: move reads to graphite2004

https://gerrit.wikimedia.org/r/903185

Change 903206 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: check graphite2004

https://gerrit.wikimedia.org/r/903206

Marostegui updated the task description. Mar 28 2023, 7:47 AM

Mentioned in SAL (#wikimedia-operations) [2023-03-28T08:00:06Z] <godog> move graphite reads to codfw - T330165

Change 903207 merged by Filippo Giunchedi:

[operations/puppet@production] statsd: move writes to graphite2004

https://gerrit.wikimedia.org/r/903207

Change 903208 merged by Filippo Giunchedi:

[operations/dns@master] wmnet: move writes to graphite2004

https://gerrit.wikimedia.org/r/903208

Change 903209 merged by jenkins-bot:

[operations/mediawiki-config@master] Failover statsd to graphite2004

https://gerrit.wikimedia.org/r/903209

Mentioned in SAL (#wikimedia-operations) [2023-03-28T08:02:36Z] <oblivian@deploy2002> Started scap: Backport for [[gerrit:903209|Failover statsd to graphite2004 (T330165)]]

Mentioned in SAL (#wikimedia-operations) [2023-03-28T08:04:11Z] <oblivian@deploy2002> oblivian and filippo: Backport for [[gerrit:903209|Failover statsd to graphite2004 (T330165)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-03-28T08:11:25Z] <oblivian@deploy2002> Finished scap: Backport for [[gerrit:903209|Failover statsd to graphite2004 (T330165)]] (duration: 08m 48s)

Change 903610 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Disable the gobblin timers temporarily for switch maintenance

https://gerrit.wikimedia.org/r/903610

BTullis updated the task description. Mar 28 2023, 9:51 AM

Change 903621 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Failover hive services to the standby coordinator

https://gerrit.wikimedia.org/r/903621

Change 903621 merged by Btullis:

[operations/dns@master] Failover hive services to the standby coordinator

https://gerrit.wikimedia.org/r/903621

Change 903627 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Disable job submission to YARN queues to faciliatate maintenance

https://gerrit.wikimedia.org/r/903627

BTullis updated the task description. Mar 28 2023, 11:27 AM

I "depooled" dbproxy1019 by following the procedure at https://wikitech.wikimedia.org/w/index.php?title=Portal:Data_Services/Admin/Runbooks/Depool_wikireplicas#Hardware_proxies

I modified the Prefix Puppet configuration so that HAProxy routes all traffic to 208.80.154.242, which is mapped to dbproxy1018 and is not affected by the switch upgrade.

Please note that LVS will likely trigger a few alerts when dbproxy1019 goes down... I added a note to the wiki page above asking if maybe we should modify the procedure.

ayounsi updated the task description. Mar 28 2023, 12:28 PM

ssingh updated the task description. Mar 28 2023, 12:50 PM

Change 903610 merged by Btullis:

[operations/puppet@production] Disable the gobblin timers temporarily for switch maintenance

https://gerrit.wikimedia.org/r/903610

Change 903642 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] Depool eqiad frontends for network maintenance

https://gerrit.wikimedia.org/r/903642

akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row B switches upgrade - T330165 started.

Mentioned in SAL (#wikimedia-operations) [2023-03-28T12:58:05Z] <akosiaris@cumin1001> START - Cookbook sre.discovery.datacenter depool all active/active services in eqiad: eqiad row B switches upgrade - T330165

Change 903642 merged by Ayounsi:

[operations/dns@master] Depool eqiad frontends for network maintenance

https://gerrit.wikimedia.org/r/903642

Mentioned in SAL (#wikimedia-operations) [2023-03-28T12:59:49Z] <XioNoX> depool eqiad for network maintenance - T330165

BTullis updated the task description. Mar 28 2023, 1:01 PM

akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row B switches upgrade - T330165 completed.

Mentioned in SAL (#wikimedia-operations) [2023-03-28T13:17:37Z] <akosiaris@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all active/active services in eqiad: eqiad row B switches upgrade - T330165

Change 903246 merged by Ssingh:

[operations/puppet@production] hiera: temporarily removed dns1003 from authdns_servers

https://gerrit.wikimedia.org/r/903246

ssingh updated the task description. Mar 28 2023, 1:22 PM

Change 903627 merged by Btullis:

[operations/puppet@production] Disable job submission to YARN queues to faciliatate maintenance

https://gerrit.wikimedia.org/r/903627

Mentioned in SAL (#wikimedia-analytics) [2023-03-28T13:31:15Z] <btullis> setting all four YARN queues to STOPPED https://gerrit.wikimedia.org/r/c/operations/puppet/+/903627 T330165

BTullis updated the task description. Mar 28 2023, 1:35 PM

Mentioned in SAL (#wikimedia-analytics) [2023-03-28T13:37:03Z] <btullis> refreshed YARN queues with: sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -refreshQueues on both an-master100[1-2] - T330165

Jelto updated the task description. Mar 28 2023, 1:40 PM

herron updated the task description. Mar 28 2023, 1:43 PM

Jelto updated the task description. Mar 28 2023, 1:45 PM

Icinga downtime and Alertmanager silence (ID=4c1e12e1-9d5e-4447-880a-f0ec09133a64) set by ayounsi@cumin1001 for 2:00:00 on 249 host(s) and their services with reason: eqiad row B upgrade

an-airflow1002.eqiad.wmnet,an-conf1001.eqiad.wmnet,an-coord1001.eqiad.wmnet,an-druid1004.eqiad.wmnet,an-launcher1002.eqiad.wmnet,an-master1002.eqiad.wmnet,an-presto1004.eqiad.wmnet,an-test-coord1002.eqiad.wmnet,an-test-ui1001.eqiad.wmnet,an-tool[1008-1009].eqiad.wmnet,an-web1001.eqiad.wmnet,an-worker[1083-1087,1097-1098,1117,1124-1128,1130].eqiad.wmnet,analytics[1061-1063,1072-1073].eqiad.wmnet,aqs[1011,1017].eqiad.wmnet,arclamp1001.eqiad.wmnet,backup[1003,1005].eqiad.wmnet,centrallog1002.eqiad.wmnet,cloudbackup[1001-1002]-dev.eqiad.wmnet,cloudcephmon1001.eqiad.wmnet,cloudcontrol1006.wikimedia.org,clouddb[1015-1016].eqiad.wmnet,clouddumps1001.wikimedia.org,cloudelastic[1002,1006].wikimedia.org,cloudrabbit1001.wikimedia.org,cloudservices1005.wikimedia.org,cloudvirt[1019-1020,1023-1024].eqiad.wmnet,cloudvirt-wdqs[1001-1003].eqiad.wmnet,cloudweb1003.wikimedia.org,conf1008.eqiad.wmnet,contint1002.wikimedia.org,cp[1079-1082].eqiad.wmnet,datahubsearch1002.eqiad.wmnet,db[1104,1112-1113,1118-1119,1124,1130,1132,1139,1143-1144,1152,1155,1162-1165,1178-1179,1183,1187-1188,1206].eqiad.wmnet,dbprov1002.eqiad.wmnet,dbproxy[1014-1015,1019].eqiad.wmnet,debmonitor1002.eqiad.wmnet,dns1003.wikimedia.org,dragonfly-supernode1001.eqiad.wmnet,druid[1005,1007].eqiad.wmnet,dse-k8s-ctrl1002.eqiad.wmnet,dse-k8s-etcd1002.eqiad.wmnet,dse-k8s-worker1002.eqiad.wmnet,dumpsdata1001.eqiad.wmnet,durum1002.eqiad.wmnet,elastic[1055-1056,1074-1079,1085-1086].eqiad.wmnet,es[1021,1025,1029-1030].eqiad.wmnet,failoid1002.eqiad.wmnet,ganeti[1013-1018].eqiad.wmnet,gerrit1001.wikimedia.org,gitlab1004.wikimedia.org,gitlab-runner1002.eqiad.wmnet,graphite1005.eqiad.wmnet,idp1002.wikimedia.org,kafka-jumbo1003.eqiad.wmnet,kafka-logging1001.eqiad.wmnet,kafka-main1002.eqiad.wmnet,kafka-test[1006-1010].eqiad.wmnet,kubernetes[1009-1010,1015,1019,1022].eqiad.wmnet,kubestage1003.eqiad.wmnet,kubestagemaster1001.eqiad.wmnet,kubestagetcd1005.eqiad.wmnet,kubetcd1006.eqiad.wmnet,ldap-replica1003.wikimedia.org,logstash[1011,1027,1032].eqiad.wmnet,lvs[1014,1018].eqiad.wmnet,maps[1007-1008].eqiad.wmnet,mc[1041-1044].eqiad.wmnet,mc-wf1001.eqiad.wmnet,mirror1001.wikimedia.org,ml-etcd1001.eqiad.wmnet,ml-serve1002.eqiad.wmnet,ml-serve-ctrl1001.eqiad.wmnet,ms-be[1041,1047,1052-1053,1058,1061,1065].eqiad.wmnet,ms-fe1010.eqiad.wmnet,mw[1393-1404,1423-1433,1466-1481].eqiad.wmnet,mwmaint1002.eqiad.wmnet,netflow1002.eqiad.wmnet,netmon1003.wikimedia.org,ores[1003-1004].eqiad.wmnet,otrs1001.eqiad.wmnet,parse[1007-1012,1017].eqiad.wmnet,pc1012.eqiad.wmnet,phab1004.eqiad.wmnet,prometheus1006.eqiad.wmnet,puppetdb1003.eqiad.wmnet,puppetmaster[1001,1003].eqiad.wmnet,rdb1009.eqiad.wmnet,relforge1004.eqiad.wmnet,restbase[1017,1022-1024,1029,1032].eqiad.wmnet,schema1003.eqiad.wmnet,snapshot[1008,1010,1013].eqiad.wmnet,stat[1007,1009].eqiad.wmnet,thanos-be1002.eqiad.wmnet,thanos-fe1002.eqiad.wmnet,thumbor[1001-1002].eqiad.wmnet,wcqs1002.eqiad.wmnet,wdqs[1007,1009,1012].eqiad.wmnet,zookeeper-test1002.eqiad.wmnet

jbond updated the task description. Mar 28 2023, 1:53 PM

Mentioned in SAL (#wikimedia-operations) [2023-03-28T13:54:49Z] <Emperor> depool ms-fe1010 before switch work T330165

Mentioned in SAL (#wikimedia-analytics) [2023-03-28T13:54:56Z] <btullis> entering safe mode for analytics-hadoop cluster: T330165

MatthewVernon updated the task description. Mar 28 2023, 1:55 PM

BTullis updated the task description. Mar 28 2023, 1:56 PM

Mentioned in SAL (#wikimedia-operations) [2023-03-28T13:58:09Z] <godog> depool thanos-fe1002 - T330165

fgiunchedi updated the task description. Mar 28 2023, 1:59 PM

jbond updated the task description. Mar 28 2023, 2:01 PM

Mentioned in SAL (#wikimedia-operations) [2023-03-28T14:05:58Z] <XioNoX> reboot eqiad row B for upgrade - T330165

fgiunchedi updated the task description. Mar 28 2023, 2:26 PM

Change 903666 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] Revert "Depool eqiad frontends for network maintenance"

https://gerrit.wikimedia.org/r/903666

Change 903666 merged by Ssingh:

[operations/dns@master] Revert "Depool eqiad frontends for network maintenance"

https://gerrit.wikimedia.org/r/903666

Mentioned in SAL (#wikimedia-analytics) [2023-03-28T14:31:16Z] <btullis> re-enabling YARN queues: https://gerrit.wikimedia.org/r/c/operations/puppet/+/903565 T330165

akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row B switches upgrade done - T330165 started.

Mentioned in SAL (#wikimedia-operations) [2023-03-28T14:32:27Z] <akosiaris@cumin1001> START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: eqiad row B switches upgrade done - T330165

Mentioned in SAL (#wikimedia-analytics) [2023-03-28T14:35:03Z] <btullis> re-enabling gobblin timers: https://gerrit.wikimedia.org/r/c/operations/puppet/+/903668 T330165

akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row B switches upgrade done - T330165 failed.

ayounsi added a parent task: T327248: eqiad/codfw virtual-chassis upgrades. Mar 28 2023, 2:49 PM

Mentioned in SAL (#wikimedia-operations) [2023-03-28T14:51:55Z] <akosiaris@cumin1001> END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in eqiad: eqiad row B switches upgrade done - T330165
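
For a rough sense of how long the datacenter cookbook runs took, the START and END timestamps logged in SAL can be subtracted. A small Python sketch, using only the timestamps quoted in the log entries above (the helper name `elapsed` is just for this example):

```python
from datetime import datetime

# Timestamp format used in the SAL entries, e.g. 2023-03-28T12:58:05Z.
SAL_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def elapsed(start, end):
    """Return the time between two SAL timestamps as a timedelta."""
    return datetime.strptime(end, SAL_FORMAT) - datetime.strptime(start, SAL_FORMAT)


# Depool run: START 12:58:05, END (PASS) 13:17:37.
depool_run = elapsed("2023-03-28T12:58:05Z", "2023-03-28T13:17:37Z")
# Pool run: START 14:32:27, END (FAIL) 14:51:55.
pool_run = elapsed("2023-03-28T14:32:27Z", "2023-03-28T14:51:55Z")

print(depool_run)  # 0:19:32
print(pool_run)    # 0:19:28
```

Both runs took a little under 20 minutes, whether the cookbook passed or failed.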

The switch upgrade itself went smoothly as well, like the other rows.

One issue was that gerrit1001 was missing from the list. This is because the host didn't have any owner at the time I collected the data. It was fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/892587

I'll make sure it doesn't happen again for the future rows.

Mentioned in SAL (#wikimedia-operations) [2023-03-28T16:00:32Z] <inflatador> bking@cumin1001 unban elastic and cloudelastic nodes post maintenance T330165

aborrero mentioned this in T333370: Toolforge k8s: network connetivity problems. Mar 28 2023, 4:49 PM

colewhite updated the task description. Mar 28 2023, 8:53 PM

Jelto updated the task description. Mar 29 2023, 7:28 AM