codfw row B switches upgrade
Closed, ResolvedPublic

Description

For the reasons detailed in T327248: eqiad/codfw virtual-chassis upgrades, we're going to upgrade the codfw row B switches.

This is scheduled for Feb 21st, 14:00-16:00 UTC. Please let us know if there is any issue with the scheduled time.
It means a 30-minute hard downtime for the whole row if everything goes well. It is also a good opportunity to test the host depool mechanisms and service redundancy.

The impacted servers and teams for this row are listed below.
The "action needed" fields are quite free-form:

write NONE if no action is needed,
the cookbook/command to run if it can be done by a third party,
who will be around to take care of the depool,
a link to the relevant doc,
etc.

The two main types of action needed are depool and monitoring downtime.

NOTE: If the servers can handle a longer depool, it's preferred to depool them many hours or the day before (and mark None in the table) so there are fewer moving parts closer to the maintenance window.

Core Platform

Platform Engineering

Servers	Depool action needed	Repool action needed	Status
maps[2006,2009]
restbase[2013-2014,2019,2021,2024]	depool	repool
sessionstore2001	`confctl --object-type discovery select 'dnsdisc=sessionstore,name=codfw' set/pooled=false`	`confctl --object-type discovery select 'dnsdisc=sessionstore,name=codfw' set/pooled=true`	Out of an abundance of caution, we should depool the datacenter prior to maintenance (see: https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_sessionstore_quorum_issues)
thumbor[2003-2004]
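
The sessionstore row above pairs a depool command with its exact inverse. As a minimal sketch (the helper name `confctl_discovery_cmd` is hypothetical, and only the arguments shown in the table are used), both strings can be generated from one function so the repool cannot drift from the depool:

```python
import shlex


def confctl_discovery_cmd(service: str, datacenter: str, pooled: bool) -> str:
    """Build the confctl discovery command string shown in the table.

    Hypothetical helper: it only assembles a string and runs nothing.
    """
    selector = f"dnsdisc={service},name={datacenter}"
    state = "true" if pooled else "false"
    args = [
        "confctl",
        "--object-type", "discovery",
        "select", selector,
        f"set/pooled={state}",
    ]
    # shlex.join quotes an argument only when the shell requires it
    return shlex.join(args)


depool = confctl_discovery_cmd("sessionstore", "codfw", pooled=False)
repool = confctl_discovery_cmd("sessionstore", "codfw", pooled=True)
print(depool)
print(repool)
```

The selector contains no characters that need shell quoting, so the printed commands omit the single quotes used in the table; the two forms are equivalent to the shell.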

Search Platform

Discovery-Search

Servers	Depool action needed	Repool action needed	Status
elastic[2041-2044,2057-2058,2063-2064,2070,2077-2080]
wcqs2001
wdqs[2005,2007,2010]

ServiceOps-Collab

collaboration-services

Servers	Depool action needed	Repool action needed	Status
gitlab2003	`NONE`	`NONE`	insetup
gitlab-runner2002	pause gitlab-runner in admin interface	unpause gitlab-runner in admin interface	unpaused again by @Jelto
miscweb2002	NONE	NONE
releases2002	NONE	NONE
contint2002	NONE	NONE

Infrastructure Foundations

Infrastructure-Foundations

Servers	Depool action needed	Repool action needed	Status
apt2001	none
bast2002	none
ganeti[2019-2022,2031-2032]	none
install2003	none
mx2001	none
pki2002	none	none	none
puppetmaster2003	offline puppet master	revert offline change	done
serpens	none
urldownloader2002	failed over to 2001	not needed	done

Observability

SRE Observability

Servers	Depool action needed	Repool action needed	Status
centrallog2002	set downtimes	sync rotated logs for day of maint from eqiad (on the following day)
graphite2004	none	none	n/a
kafka-logging[2002,2004]	set downtimes, stop kafka service	start kafka service, ensure dashboard returns to green	repooled
logstash[2024-2025,2027,2034,2036]	conftool 202[45] disable shard allocation 2027,203[46]	conftool 202[45] allow shard allocation 2027,203[46]	shards re-allocating, repooled
prometheus2005	`depool` on the host	`pool`	repooled

Observability and Data Persistence

SRE Observability Data-Persistence

Servers	Depool action needed	Repool action needed	Status
thanos-fe2002	conftool depool, while making sure another thanos-fe host is pooled for service thanos-web	conftool pool, while making sure another thanos-fe host is pooled for service thanos-web	repooled

Traffic

Servers	Depool action needed	Status
cp[2031-2034]
doh2002	disable puppet and stop bird.service	depooled
lvs2008
ncredir2002
pybal-test[2001-2003]

Data Engineering

Data-Engineering

Servers	Depool action needed	Repool action needed	Status
aqs[2005-2008]	None	None
furud

Machine Learning

Machine-Learning-Team

Servers	Depool action needed	Repool action needed
ml-cache2002	-	-
ml-etcd2001	-	-
ml-serve[2002,2006]	-	-
ml-staging-ctrl2001	-	-
ml-staging-etcd2002	-	-
ores[2003-2004]	`sudo -i depool`	`sudo -i pool`
orespoolcounter2004	-	-

WMCS

cloud-services-team

Servers	Depool action needed	Repool action needed	Status
cloudcephmon[2004-2006]-dev
cloudcephosd[2001-2003]-dev
cloudcontrol[2001,2005]-dev
clouddb[2001-2002]-dev
cloudgw[2001-2003]-dev
cloudnet[2005-2006]-dev
cloudservices[2004-2005]-dev
cloudvirt[2001-2003]-dev
cloudweb2002-dev

Data Persistence

Data-Persistence

Servers	Depool action needed	Repool action needed	Status
backup[2005,2008]	They are not a service, but storage. Jaime will make sure earlier in the week they are not active at the time of the maintenance.	Jaime will restart some delayed backups, if any.
cassandra-dev2001	None	None
db[2096,2098,2107-2111,2123-2124,2134,2137,2143,2147-2148,2159-2164,2177-2178]	Nothing as codfw will be depooled		Nothing needed as codfw will be depooled. db2134 (m3 master) can be ignored
dbprov2002	They are not a service, but storage. Jaime will make sure earlier in the week they are not active at the time of the maintenance.	None
dbproxy2002	None	Reload haproxy
es[2021,2025,2029-2030]	Nothing as codfw will be depooled
moss-be2002	N/A	N/A	Not in production service
ms-be[2041,2046-2047,2053,2057,2063,2067]	None	None
ms-fe2010	`sudo depool`	`sudo pool`	@MatthewVernon is away, so @fgiunchedi will handle
pc2012	Nothing as codfw will be depooled
thanos-be2002	None	None

ServiceOps

serviceops
Due to the large number of services potentially affected (multiple mw appservers, kubernetes workers), a global depool of a/a services will be done:
sre.discovery.datacenter depool --reason T327991 codfw
After the maintenance, check the state of T329664: Update wikikube codfw to k8s 1.23, then repool:
sre.discovery.datacenter pool --reason T327991 codfw
Depool restbase-async from eqiad:
cookbook sre.discovery.service-route --reason T327991 depool --wipe-cache eqiad restbase-async
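
The three ServiceOps commands above form a small ordered runbook. A possible way to keep the ordering and the T329664 precondition explicit is sketched below. The `Step` type, the `before`/`after` phase labels, and `render` are assumptions made for illustration; the command strings are copied verbatim from the description, and the ordering simply follows the order in which they are listed.

```python
from dataclasses import dataclass
from typing import List

TASK = "T327991"


@dataclass(frozen=True)
class Step:
    phase: str     # "before" or "after" the maintenance window (assumed labels)
    command: str   # copied verbatim from the task description
    note: str = ""


RUNBOOK = [
    Step("before", f"sre.discovery.datacenter depool --reason {TASK} codfw"),
    Step("after", f"sre.discovery.datacenter pool --reason {TASK} codfw",
         note="check state of T329664 first"),
    Step("after", f"cookbook sre.discovery.service-route --reason {TASK} "
                  "depool --wipe-cache eqiad restbase-async"),
]


def render(runbook: List[Step], phase: str) -> List[str]:
    """Return the commands for one phase, in order, with notes as comments."""
    lines = []
    for step in runbook:
        if step.phase != phase:
            continue
        suffix = f"  # {step.note}" if step.note else ""
        lines.append(step.command + suffix)
    return lines


for phase in ("before", "after"):
    print(f"== {phase} ==")
    for line in render(RUNBOOK, phase):
        print(line)
```

This only prints the commands; nothing is executed.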

Servers	Depool action needed	Repool action needed	Status
conf2004
contint2002
kafka-main2002
kubemaster2002
kubernetes[2006,2009-2010,2020,2023]
kubestage2002
kubestagetcd2001
kubetcd2006
mc[2042-2046]
mc-gp2002
mw[2259-2270,2310-2334]
mwdebug2002
parse[2006-2010]
poolcounter2004
rdb2008
registry2004

Details

Due Date: Feb 21 2023, 1:00 PM

	Subject	Repo	Branch	Lines +/-
	puppetmaster: offline 2003 for switch upgrade	operations/puppet	production	+1 -1
	Switch urldownloader in codfw to 2001	operations/dns	master	+1 -1


Related Objects

Status	Assigned	Task
Open	None	T253824 planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1)
Resolved	ayounsi	T254013 all network devices must run OpenSSH >= 7.2p1 but != 7.4p1
Resolved	ayounsi	T317175 Junos: resolve DNS through mgmt_junos
Resolved	ayounsi	T327862 Use mgmt_junos on all network devices
		Restricted Task
Open	None	T316539 Upgrade network devices to Junos 20+
Resolved	ayounsi	T327248 eqiad/codfw virtual-chassis upgrades
Resolved	ayounsi	T327991 codfw row B switches upgrade
Resolved	Marostegui	T328022 Switchover s4 master (db2110 -> db2140)
Resolved	Marostegui	T328023 Switchover s5 master (db2123 -> db2113)
Resolved	Marostegui	T328024 Switchover s8 master (db2161 -> db2165)

Event Timeline


Adding Jaime for the backup hosts.

Marostegui updated the task description. Jan 26 2023, 1:36 PM

Marostegui added a subtask: T328022: Switchover s4 master (db2110 -> db2140). Jan 26 2023, 1:38 PM

Marostegui added a subtask: T328023: Switchover s5 master (db2123 -> db2113).

Marostegui added a subtask: T328024: Switchover s8 master (db2161 -> db2165). Jan 26 2023, 1:43 PM

Marostegui updated the task description.

Marostegui closed subtask T328023: Switchover s5 master (db2123 -> db2113) as Resolved. Jan 26 2023, 2:08 PM

Marostegui closed subtask T328024: Switchover s8 master (db2161 -> db2165) as Resolved. Jan 26 2023, 4:13 PM

colewhite updated the task description. Jan 26 2023, 4:38 PM

jcrespo updated the task description. Jan 26 2023, 4:44 PM

herron updated the task description. Jan 26 2023, 4:52 PM

Eevans updated the task description. Jan 26 2023, 5:17 PM

Eevans updated the task description. Jan 26 2023, 5:23 PM

Marostegui updated the task description. Jan 27 2023, 8:51 AM

Jelto updated the task description. Jan 27 2023, 9:54 AM

Jelto subscribed.

elukey updated the task description. Jan 27 2023, 10:16 AM

fgiunchedi updated the task description. Jan 27 2023, 10:29 AM

fgiunchedi updated the task description.

fgiunchedi subscribed.

MatthewVernon updated the task description. Jan 27 2023, 10:32 AM

MatthewVernon subscribed.

ayounsi moved this task from Backlog to This quarter on the netops board. Jan 27 2023, 10:43 AM

Marostegui closed subtask T328022: Switchover s4 master (db2110 -> db2140) as Resolved. Jan 30 2023, 3:19 PM

bking subscribed. Jan 30 2023, 4:29 PM

MPhamWMF moved this task from needs triage to Current work on the Discovery-Search board. Jan 30 2023, 4:29 PM

MPhamWMF edited projects, added Discovery-Search (Current work); removed Discovery-Search.

Jelto updated the task description. Jan 30 2023, 4:53 PM

Jelto updated the task description.

LSobanski updated the task description. Jan 30 2023, 4:56 PM

LSobanski moved this task from Incoming to Consultation on the collaboration-services board.

elukey moved this task from Unsorted to Watching on the Machine-Learning-Team board. Jan 31 2023, 9:41 AM

Clement_Goubert moved this task from Incoming 🐫 to 🛠 Upgrades and Hardware on the serviceops board. Feb 1 2023, 1:02 PM

@Joe @akosiaris I assume we'll depool codfw for this one too?

In T327991#8593396, @Marostegui wrote:

@Joe @akosiaris I assume we'll depool codfw for this one too?

Yeah, as a team we are similarly affected to row A maint. Multiple MW hosts, multiple kubernetes hosts.

Clement_Goubert updated the task description. Feb 7 2023, 3:29 PM

Clement_Goubert updated the task description.

MoritzMuehlenhoff updated the task description. Feb 8 2023, 8:36 AM

MoritzMuehlenhoff updated the task description. Feb 8 2023, 8:38 AM

Marostegui updated the task description. Feb 8 2023, 9:03 AM

Marostegui updated the task description.

KOfori moved this task from Backlog to Upcoming on the Traffic board. Feb 8 2023, 11:29 AM

Marostegui updated the task description. Feb 13 2023, 11:16 AM

Gehel moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board. Feb 13 2023, 4:38 PM

Gehel moved this task from Ready for Dev -- SWE to Ready for Dev -- SRE/Ops on the Discovery-Search (Current work) board.

JMeybohm mentioned this in T329664: Update wikikube codfw to k8s 1.23. Feb 14 2023, 6:31 PM

JMeybohm subscribed.

JMeybohm updated the task description. Feb 14 2023, 6:33 PM

Change 889477 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Switch urldownloader in codfw to 2001

https://gerrit.wikimedia.org/r/889477

gerritbot added a project: Patch-For-Review. Feb 15 2023, 8:10 AM

Change 889477 merged by Muehlenhoff:

[operations/dns@master] Switch urldownloader in codfw to 2001

https://gerrit.wikimedia.org/r/889477

MoritzMuehlenhoff updated the task description. Feb 15 2023, 9:42 AM

Maintenance_bot removed a project: Patch-For-Review. Feb 15 2023, 10:10 AM

colewhite updated the task description. Feb 17 2023, 9:09 PM

Jelto updated the task description. Feb 20 2023, 9:19 AM

Clement_Goubert updated the task description. Feb 20 2023, 11:53 AM

Mentioned in SAL (#wikimedia-operations) [2023-02-21T07:49:34Z] <XioNoX> Staging the new Junos version on the codfw row B switches - T327991

Global depool of a/a services from codfw is done.

jbond updated the task description. Feb 21 2023, 11:10 AM

Change 890792 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] puppetmaster: offline 2003 for switch upgrade

https://gerrit.wikimedia.org/r/890792

gerritbot added a project: Patch-For-Review. Feb 21 2023, 11:13 AM

Change 890792 merged by Jbond:

[operations/puppet@production] puppetmaster: offline 2003 for switch upgrade

https://gerrit.wikimedia.org/r/890792

jbond updated the task description. Feb 21 2023, 11:16 AM

fgiunchedi updated the task description. Feb 21 2023, 11:17 AM

Maintenance_bot removed a project: Patch-For-Review. Feb 21 2023, 11:30 AM

Icinga downtime and Alertmanager silence (ID=aec8ddda-9ad5-4b7f-8bca-c273e036a282) set by ayounsi@cumin1001 for 2:00:00 on 215 host(s) and their services with reason: codfw row B upgrade

apt2001.wikimedia.org,aqs[2005-2008].codfw.wmnet,backup[2005,2008].codfw.wmnet,bast2002.wikimedia.org,cassandra-dev2001.codfw.wmnet,centrallog2002.codfw.wmnet,cloudcephmon[2004-2006]-dev.codfw.wmnet,cloudcephosd[2001-2003]-dev.codfw.wmnet,cloudcontrol[2001,2005]-dev.wikimedia.org,clouddb2002-dev.codfw.wmnet,cloudgw[2002-2003]-dev.codfw.wmnet,cloudlb2001-dev.codfw.wmnet,cloudnet[2005-2006]-dev.codfw.wmnet,cloudservices[2004-2005]-dev.wikimedia.org,cloudvirt[2001-2003]-dev.codfw.wmnet,cloudweb2002-dev.wikimedia.org,conf2004.codfw.wmnet,contint2002.wikimedia.org,cp[2031-2034].codfw.wmnet,db[2096,2098,2107-2111,2123-2124,2134,2137,2143,2147-2148,2159-2164,2177-2178,2185].codfw.wmnet,dbprov2002.codfw.wmnet,dbproxy2002.codfw.wmnet,doh2002.wikimedia.org,elastic[2041-2044,2057-2058,2063-2064,2070,2077-2080].codfw.wmnet,es[2021,2025,2029-2030].codfw.wmnet,furud.codfw.wmnet,ganeti[2019-2022,2031-2032].codfw.wmnet,gitlab2003.wikimedia.org,gitlab-runner2002.codfw.wmnet,graphite2004.codfw.wmnet,irc2001.wikimedia.org,kafka-logging[2002,2004].codfw.wmnet,kafka-main2002.codfw.wmnet,kubemaster2002.codfw.wmnet,kubernetes[2006,2009-2010,2020,2023].codfw.wmnet,kubestage2002.codfw.wmnet,kubestagetcd2001.codfw.wmnet,kubetcd2006.codfw.wmnet,logstash[2024-2025,2027,2034,2036].codfw.wmnet,lvs2008.codfw.wmnet,maps[2006,2009].codfw.wmnet,mc[2042-2046].codfw.wmnet,mc-gp2002.codfw.wmnet,miscweb2002.codfw.wmnet,ml-cache2002.codfw.wmnet,ml-etcd2001.codfw.wmnet,ml-serve[2002,2006].codfw.wmnet,ml-staging-ctrl2001.codfw.wmnet,ml-staging-etcd2002.codfw.wmnet,moss-be2002.codfw.wmnet,ms-be[2041,2046-2047,2053,2057,2063,2067].codfw.wmnet,ms-fe2010.codfw.wmnet,mw[2259-2270,2310-2334,2428-2435].codfw.wmnet,mwdebug2002.codfw.wmnet,mx2001.wikimedia.org,ncredir2002.codfw.wmnet,ores[2003-2004].codfw.wmnet,orespoolcounter2004.codfw.wmnet,parse[2006-2010].codfw.wmnet,pc2012.codfw.wmnet,pki2002.codfw.wmnet,poolcounter2004.codfw.wmnet,prometheus2005.codfw.wmnet,puppetmaster2003.codfw.wmnet,pybal-test[2001-2003].codfw.wmnet,rdb2008.codfw.wmnet,registry2004.codfw.wmnet,releases2002.codfw.wmnet,restbase[2013-2014,2019,2021,2024].codfw.wmnet,serpens.wikimedia.org,sessionstore2001.codfw.wmnet,thanos-be2002.codfw.wmnet,thanos-fe2002.codfw.wmnet,thumbor[2003-2004].codfw.wmnet,urldownloader2002.wikimedia.org,wcqs2001.codfw.wmnet,wdqs[2005,2007,2010].codfw.wmnet
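
The host list above uses a folded notation in which `name[a-b,c]suffix` stands for several hosts. The sketch below is a simplified, hypothetical expander for that notation, handy for checking which hosts an expression covers. It assumes at most one bracket group per term and purely numeric ranges, and it is not the parser used by the tooling that produced the message.

```python
import re
from typing import List

# prefix[body]suffix, with exactly one bracket group (assumption)
_TERM = re.compile(r"^(?P<prefix>[^\[\]]*)\[(?P<body>[^\[\]]+)\](?P<suffix>[^\[\]]*)$")


def split_top_level(expr: str) -> List[str]:
    """Split on commas that are not inside square brackets."""
    parts, current, depth = [], [], 0
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand_term(term: str) -> List[str]:
    """Expand one term such as 'cp[2031-2034].codfw.wmnet'."""
    match = _TERM.match(term)
    if match is None:
        return [term]  # plain hostname, nothing to expand
    hosts = []
    for item in match["body"].split(","):
        low, _, high = item.partition("-")
        high = high or low          # a single number is a range of one
        width = len(low)            # keep any zero padding
        for n in range(int(low), int(high) + 1):
            hosts.append(f"{match['prefix']}{n:0{width}d}{match['suffix']}")
    return hosts


def expand(expr: str) -> List[str]:
    """Expand a full comma-separated expression into individual hostnames."""
    return [host for term in split_top_level(expr) for host in expand_term(term)]


print(expand("cloudcephmon[2004-2006]-dev.codfw.wmnet,apt2001.wikimedia.org"))
```

Running `len(expand(...))` on the full list is one possible sanity check against the host count reported in the downtime message.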

Mentioned in SAL (#wikimedia-operations) [2023-02-21T13:48:34Z] <godog> stop kafka on kafka-logging[2002,2004].codfw.wmnet - T327991

Mentioned in SAL (#wikimedia-operations) [2023-02-21T13:54:10Z] <gehel> depooling elastic[2041-2044,2057-2058,2063-2064,2070,2077-2080].codfw.wmnet for switch maintenance - T327991

Mentioned in SAL (#wikimedia-operations) [2023-02-21T13:54:14Z] <vgutierrez> depool doh2002 - T327991

Mentioned in SAL (#wikimedia-operations) [2023-02-21T13:54:54Z] <gehel> depooling wcqs2001.codfw.wmnet for switch maintenance - T327991

Vgutierrez updated the task description. Feb 21 2023, 1:55 PM

Mentioned in SAL (#wikimedia-operations) [2023-02-21T13:55:22Z] <gehel> depooling wdqs[2005,2007,2010].codfw.wmnet for switch maintenance - T327991

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:00:05Z] <vgutierrez> depooling codfw - T327991

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:02:48Z] <cgoubert@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: codfw maint (T327991)

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:03:07Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: codfw maint (T327991)

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:06:36Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on db[2134,2160].codfw.wmnet,db[1117,1159].eqiad.wmnet with reason: codfw maint (T327991)

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:06:51Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2134,2160].codfw.wmnet,db[1117,1159].eqiad.wmnet with reason: codfw maint (T327991)

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:07:01Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on 27 hosts with reason: codfw maint (T327991)

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:07:20Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 27 hosts with reason: codfw maint (T327991)

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:24:39Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: codfw maint (T327991)

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:24:44Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: codfw maint (T327991)

Upgrade went smoothly, less than 15min hard downtime here too.

fgiunchedi updated the task description. Feb 21 2023, 2:37 PM

jbond updated the task description. Feb 21 2023, 2:45 PM

fgiunchedi updated the task description. Feb 21 2023, 2:51 PM

I restarted es5 codfw backup job, the only backup-related thingy affected by the downtime.

KOfori moved this task from Upcoming to Ready for work on the Traffic board. Feb 21 2023, 3:39 PM

Jelto updated the task description. Feb 21 2023, 3:50 PM

Mentioned in SAL (#wikimedia-operations) [2023-02-21T16:49:57Z] <akosiaris@cumin1001> START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: T327991 - None

Mentioned in SAL (#wikimedia-operations) [2023-02-21T17:04:34Z] <akosiaris@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in codfw: T327991 - None

Mentioned in SAL (#wikimedia-operations) [2023-02-21T17:57:48Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.service-route depool restbase-async in eqiad: T327991

Mentioned in SAL (#wikimedia-operations) [2023-02-21T18:02:52Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in eqiad: T327991
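
The START/END pairs in the SAL entries above make it possible to work out how long each cookbook ran. A minimal sketch follows; the regular expression and the `parse_sal` and `cookbook_duration` helpers are assumptions based only on the line format visible in this timeline.

```python
import re
from datetime import datetime, timedelta
from typing import Tuple

# "[<UTC timestamp>Z] <user@host> message", as seen in the entries above
SAL_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\] <(?P<who>[^>]+)> (?P<msg>.*)$"
)


def parse_sal(line: str) -> Tuple[datetime, str, str]:
    """Return (timestamp, author, message) for one SAL mention line."""
    match = SAL_RE.search(line)
    if match is None:
        raise ValueError(f"not a SAL line: {line!r}")
    timestamp = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S")
    return timestamp, match["who"], match["msg"]


def cookbook_duration(start_line: str, end_line: str) -> timedelta:
    """Time elapsed between a START entry and its matching END entry."""
    return parse_sal(end_line)[0] - parse_sal(start_line)[0]


pool_start = ("Mentioned in SAL (#wikimedia-operations) [2023-02-21T16:49:57Z] "
              "<akosiaris@cumin1001> START - Cookbook sre.discovery.datacenter "
              "pool all active/active services in codfw: T327991 - None")
pool_end = ("Mentioned in SAL (#wikimedia-operations) [2023-02-21T17:04:34Z] "
            "<akosiaris@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter "
            "(exit_code=0) pool all active/active services in codfw: T327991 - None")

print(cookbook_duration(pool_start, pool_end))  # 0:14:37
```

With the timestamps logged here, the codfw repool took 14 minutes 37 seconds; the same calculation on the restbase-async entries (17:57:48 to 18:02:52) gives 5 minutes 4 seconds.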

colewhite updated the task description. Feb 21 2023, 8:04 PM

ayounsi closed this task as Resolved. Feb 22 2023, 7:14 AM

ayounsi claimed this task.

Maintenance_bot moved this task from In progress to Done on the DBA board. Feb 22 2023, 7:15 AM

calbon moved this task from Watching to 2023-2024 Q3 Done on the Machine-Learning-Team board. Nov 29 2023, 2:20 PM

	ayounsi
	Jan 26 2023, 6:49 AM
