codfw row B switches upgrade
Closed, Resolved (Public)

Description

For the reasons detailed in T327248 (eqiad/codfw virtual-chassis upgrades), we're going to upgrade the codfw row B switches.

This is scheduled for Feb 21st, 14:00-16:00 UTC. Please let us know if there is any issue with the scheduled time.
If everything goes well, this means about 30 minutes of hard downtime for the whole row. It is also a good opportunity to test the host depool mechanisms and service redundancy.

The impacted servers and teams for this row are listed below.
The "action needed" fields are quite free-form:

  • NONE, if no action is needed
  • the cookbook/command to run, if it can be done by a third party
  • who will be around to take care of the depool
  • a link to the relevant doc
  • etc.

The two main types of action needed are depool and monitoring downtime.

NOTE: If the servers can handle a longer depool, it's preferred to depool them many hours or the day before (and mark None in the table) so there are fewer moving parts closer to the maintenance window.

Core Platform

Platform Engineering

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| maps[2006,2009] | | | |
| restbase[2013-2014,2019,2021,2024] | depool | repool | |
| sessionstore2001 | confctl --object-type discovery select 'dnsdisc=sessionstore,name=codfw' set/pooled=false | confctl --object-type discovery select 'dnsdisc=sessionstore,name=codfw' set/pooled=true | Out of an abundance of caution, we should depool the datacenter prior to maintenance (see: https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_sessionstore_quorum_issues) |
| thumbor[2003-2004] | | | |
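
The sessionstore row pairs a depool command with its inverse; only the `set/pooled` value differs. A minimal sketch that builds both argument lists from the values in the table follows. The helper name `discovery_pool_argv` is hypothetical, and the code only constructs the argument list; it does not run `confctl`.

```python
def discovery_pool_argv(service, datacenter, pooled):
    """Build the confctl argument list for a discovery record.

    Hypothetical helper: it mirrors the command shape quoted in the
    sessionstore row and nothing more.
    """
    selector = f"dnsdisc={service},name={datacenter}"
    state = "true" if pooled else "false"
    return [
        "confctl", "--object-type", "discovery",
        "select", selector,
        f"set/pooled={state}",
    ]


# Depool before the maintenance, repool afterwards.
depool = discovery_pool_argv("sessionstore", "codfw", pooled=False)
repool = discovery_pool_argv("sessionstore", "codfw", pooled=True)

print(" ".join(depool))
print(" ".join(repool))
```

Keeping the command as a list rather than a single string avoids shell quoting issues around the selector if the list is later handed to a process runner.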

Search Platform

Discovery-Search

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| elastic[2041-2044,2057-2058,2063-2064,2070,2077-2080] | | | |
| wcqs2001 | | | |
| wdqs[2005,2007,2010] | | | |

ServiceOps-Collab

collaboration-services

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| gitlab2003 | NONE | NONE | insetup |
| gitlab-runner2002 | pause gitlab-runner in admin interface | unpause gitlab-runner in admin interface | unpaused again by @Jelto |
| miscweb2002 | NONE | NONE | |
| releases2002 | NONE | NONE | |
| contint2002 | NONE | NONE | |

Infrastructure Foundations

Infrastructure-Foundations

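| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| apt2001 | none | | |
| bast2002 | none | | |
| ganeti[2019-2022,2031-2032] | none | | |
| install2003 | none | | |
| mx2001 | none | | |
| pki2002 | none | none | none |
| puppetmaster2003 | offline puppet master | revert offline change | done |
| serpens | none | | |
| urldownloader2002 | failed over to 2001 | not needed | done |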

Observability

SRE Observability

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| centrallog2002 | set downtimes | sync rotated logs for day of maint from eqiad (on the following day) | |
| graphite2004 | none | none | n/a |
| kafka-logging[2002,2004] | set downtimes, stop kafka service | start kafka service, ensure dashboard returns to green | repooled |
| logstash[2024-2025,2027,2034,2036] | conftool 202[45] disable shard allocation 2027,203[46] | conftool 202[45] allow shard allocation 2027,203[46] | shards re-allocating, repooled |
| prometheus2005 | depool on the host | pool | repooled |

Observability and Data Persistence

SRE Observability Data-Persistence

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| thanos-fe2002 | conftool depool, while making sure another thanos-fe host is pooled for service thanos-web | conftool pool, while making sure another thanos-fe host is pooled for service thanos-web | repooled |

Traffic

Traffic

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| cp[2031-2034] | | | |
| doh2002 | disable puppet and stop bird.service | | depooled |
| lvs2008 | | | |
| ncredir2002 | | | |
| pybal-test[2001-2003] | | | |

Data Engineering

Data-Engineering

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| aqs[2005-2008] | None | None | |
| furud | | | |

Machine Learning

Machine-Learning-Team

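| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| ml-cache2002 | - | - | |
| ml-etcd2001 | - | - | |
| ml-serve[2002,2006] | - | - | |
| ml-staging-ctrl2001 | - | - | |
| ml-staging-etcd2002 | - | - | |
| ores[2003-2004] | sudo -i depool | sudo -i pool | |
| orespoolcounter2004 | - | - | |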

WMCS

cloud-services-team

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| cloudcephmon[2004-2006]-dev | | | |
| cloudcephosd[2001-2003]-dev | | | |
| cloudcontrol[2001,2005]-dev | | | |
| clouddb[2001-2002]-dev | | | |
| cloudgw[2001-2003]-dev | | | |
| cloudnet[2005-2006]-dev | | | |
| cloudservices[2004-2005]-dev | | | |
| cloudvirt[2001-2003]-dev | | | |
| cloudweb2002-dev | | | |

Data Persistence

Data-Persistence

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| backup[2005,2008] | They are not a service, but storage. Jaime will make sure earlier in the week they are not active at the time of the maintenance. | Jaime will restart some delayed backups, if any. | |
| cassandra-dev2001 | None | None | |
| db[2096,2098,2107-2111,2123-2124,2134,2137,2143,2147-2148,2159-2164,2177-2178] | Nothing as codfw will be depooled | Nothing needed as codfw will be depooled. db2134 (m3 master can be ignored) | |
| dbprov2002 | They are not a service, but storage. Jaime will make sure earlier in the week they are not active at the time of the maintenance. | None | |
| dbproxy2002 | None | Reload haproxy | |
| es[2021,2025,2029-2030] | Nothing as codfw will be depooled | | |
| moss-be2002 | N/A | N/A | Not in production service |
| ms-be[2041,2046-2047,2053,2057,2063,2067] | None | None | |
| ms-fe2010 | sudo depool | sudo pool | @MatthewVernon is away, so @fgiunchedi will handle |
| pc2012 | Nothing as codfw will be depooled | | |
| thanos-be2002 | None | None | |

ServiceOps

serviceops
Due to the large number of services potentially affected (multiple mw appservers, kubernetes workers), a global depool of a/a services will be done:
sre.discovery.datacenter depool --reason T327991 codfw
After the maintenance, check the state of T329664 (Update wikikube codfw to k8s 1.23), then repool:
sre.discovery.datacenter pool --reason T327991 codfw
Depool restbase-async from eqiad:
cookbook sre.discovery.service-route --reason T327991 depool --wipe-cache eqiad restbase-async

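| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| conf2004 | | | |
| contint2002 | | | |
| kafka-main2002 | | | |
| kubemaster2002 | | | |
| kubernetes[2006,2009-2010,2020,2023] | | | |
| kubestage2002 | | | |
| kubestagetcd2001 | | | |
| kubetcd2006 | | | |
| mc[2042-2046] | | | |
| mc-gp2002 | | | |
| mw[2259-2270,2310-2334] | | | |
| mwdebug2002 | | | |
| parse[2006-2010] | | | |
| poolcounter2004 | | | |
| rdb2008 | | | |
| registry2004 | | | |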

Event Timeline

Adding Jaime for the backup hosts.

fgiunchedi updated the task description.
fgiunchedi subscribed.
Jelto updated the task description.

@Joe @akosiaris I assume we'll depool codfw for this one too?

> @Joe @akosiaris I assume we'll depool codfw for this one too?

Yeah, as a team we are similarly affected to row A maint. Multiple MW hosts, multiple kubernetes hosts.

Change 889477 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Switch urldownloader in codfw to 2001

https://gerrit.wikimedia.org/r/889477

Change 889477 merged by Muehlenhoff:

[operations/dns@master] Switch urldownloader in codfw to 2001

https://gerrit.wikimedia.org/r/889477

Mentioned in SAL (#wikimedia-operations) [2023-02-21T07:49:34Z] <XioNoX> Staging the new Junos version on the codfw row B switches - T327991

Global depool of a/a services from codfw is done.

Change 890792 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] puppetmaster: offline 2003 for switch upgrade

https://gerrit.wikimedia.org/r/890792

Change 890792 merged by Jbond:

[operations/puppet@production] puppetmaster: offline 2003 for switch upgrade

https://gerrit.wikimedia.org/r/890792

Icinga downtime and Alertmanager silence (ID=aec8ddda-9ad5-4b7f-8bca-c273e036a282) set by ayounsi@cumin1001 for 2:00:00 on 215 host(s) and their services with reason: codfw row B upgrade

apt2001.wikimedia.org,aqs[2005-2008].codfw.wmnet,backup[2005,2008].codfw.wmnet,bast2002.wikimedia.org,cassandra-dev2001.codfw.wmnet,centrallog2002.codfw.wmnet,cloudcephmon[2004-2006]-dev.codfw.wmnet,cloudcephosd[2001-2003]-dev.codfw.wmnet,cloudcontrol[2001,2005]-dev.wikimedia.org,clouddb2002-dev.codfw.wmnet,cloudgw[2002-2003]-dev.codfw.wmnet,cloudlb2001-dev.codfw.wmnet,cloudnet[2005-2006]-dev.codfw.wmnet,cloudservices[2004-2005]-dev.wikimedia.org,cloudvirt[2001-2003]-dev.codfw.wmnet,cloudweb2002-dev.wikimedia.org,conf2004.codfw.wmnet,contint2002.wikimedia.org,cp[2031-2034].codfw.wmnet,db[2096,2098,2107-2111,2123-2124,2134,2137,2143,2147-2148,2159-2164,2177-2178,2185].codfw.wmnet,dbprov2002.codfw.wmnet,dbproxy2002.codfw.wmnet,doh2002.wikimedia.org,elastic[2041-2044,2057-2058,2063-2064,2070,2077-2080].codfw.wmnet,es[2021,2025,2029-2030].codfw.wmnet,furud.codfw.wmnet,ganeti[2019-2022,2031-2032].codfw.wmnet,gitlab2003.wikimedia.org,gitlab-runner2002.codfw.wmnet,graphite2004.codfw.wmnet,irc2001.wikimedia.org,kafka-logging[2002,2004].codfw.wmnet,kafka-main2002.codfw.wmnet,kubemaster2002.codfw.wmnet,kubernetes[2006,2009-2010,2020,2023].codfw.wmnet,kubestage2002.codfw.wmnet,kubestagetcd2001.codfw.wmnet,kubetcd2006.codfw.wmnet,logstash[2024-2025,2027,2034,2036].codfw.wmnet,lvs2008.codfw.wmnet,maps[2006,2009].codfw.wmnet,mc[2042-2046].codfw.wmnet,mc-gp2002.codfw.wmnet,miscweb2002.codfw.wmnet,ml-cache2002.codfw.wmnet,ml-etcd2001.codfw.wmnet,ml-serve[2002,2006].codfw.wmnet,ml-staging-ctrl2001.codfw.wmnet,ml-staging-etcd2002.codfw.wmnet,moss-be2002.codfw.wmnet,ms-be[2041,2046-2047,2053,2057,2063,2067].codfw.wmnet,ms-fe2010.codfw.wmnet,mw[2259-2270,2310-2334,2428-2435].codfw.wmnet,mwdebug2002.codfw.wmnet,mx2001.wikimedia.org,ncredir2002.codfw.wmnet,ores[2003-2004].codfw.wmnet,orespoolcounter2004.codfw.wmnet,parse[2006-2010].codfw.wmnet,pc2012.codfw.wmnet,pki2002.codfw.wmnet,poolcounter2004.codfw.wmnet,prometheus2005.codfw.wmnet,puppetmaster2003.codfw.wmnet,pybal-test[2001-2003].codfw.wmnet,rdb2008.codfw.wmnet,registry2004.codfw.wmnet,releases2002.codfw.wmnet,restbase[2013-2014,2019,2021,2024].codfw.wmnet,serpens.wikimedia.org,sessionstore2001.codfw.wmnet,thanos-be2002.codfw.wmnet,thanos-fe2002.codfw.wmnet,thumbor[2003-2004].codfw.wmnet,urldownloader2002.wikimedia.org,wcqs2001.codfw.wmnet,wdqs[2005,2007,2010].codfw.wmnet
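
The host list above uses a compact bracket notation (`name[2005,2007,2010]`, `name[2259-2270]`). For anyone who wants to turn such a list into individual hostnames, here is a minimal sketch in plain Python. It assumes at most one bracket group per pattern, as in every entry in this task, and the function names `expand_hosts` and `split_patterns` are hypothetical; it does not use the Cumin or ClusterShell libraries.

```python
import re

# Matches one bracket group such as "[2005,2007,2010]" or "[2259-2270,2310-2334]".
_GROUP = re.compile(r"\[([^\]]+)\]")


def expand_hosts(pattern):
    """Expand one host pattern containing at most one bracket group."""
    match = _GROUP.search(pattern)
    if match is None:
        return [pattern]  # plain hostname, nothing to expand
    prefix = pattern[:match.start()]
    suffix = pattern[match.end():]
    hosts = []
    for part in match.group(1).split(","):
        low, _, high = part.partition("-")
        high = high or low  # a single number is a range of length one
        width = len(low)    # preserve zero padding, if any
        for number in range(int(low), int(high) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


def split_patterns(host_list):
    """Split a comma-separated list, ignoring commas inside brackets."""
    return re.split(r",(?![^\[]*\])", host_list)


patterns = split_patterns("wcqs2001.codfw.wmnet,wdqs[2005,2007,2010].codfw.wmnet")
hosts = [host for pattern in patterns for host in expand_hosts(pattern)]
print(hosts)
print(len(hosts))  # 4
```

The negative lookahead in `split_patterns` skips any comma that is followed by a closing bracket before the next opening one, which is how commas inside a group are told apart from commas between patterns.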

Mentioned in SAL (#wikimedia-operations) [2023-02-21T13:48:34Z] <godog> stop kafka on kafka-logging[2002,2004].codfw.wmnet - T327991

Mentioned in SAL (#wikimedia-operations) [2023-02-21T13:54:10Z] <gehel> depooling elastic[2041-2044,2057-2058,2063-2064,2070,2077-2080].codfw.wmnet for switch maintenance - T327991

Mentioned in SAL (#wikimedia-operations) [2023-02-21T13:54:54Z] <gehel> depooling wcqs2001.codfw.wmnet for switch maintenance - T327991

Mentioned in SAL (#wikimedia-operations) [2023-02-21T13:55:22Z] <gehel> depooling wdqs[2005,2007,2010].codfw.wmnet for switch maintenance - T327991

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:02:48Z] <cgoubert@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: codfw maint (T327991)

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:03:07Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: codfw maint (T327991)

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:06:36Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on db[2134,2160].codfw.wmnet,db[1117,1159].eqiad.wmnet with reason: codfw maint (T327991)

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:06:51Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2134,2160].codfw.wmnet,db[1117,1159].eqiad.wmnet with reason: codfw maint (T327991)

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:07:01Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on 27 hosts with reason: codfw maint (T327991)

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:07:20Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 27 hosts with reason: codfw maint (T327991)

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:24:39Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: codfw maint (T327991)

Mentioned in SAL (#wikimedia-operations) [2023-02-21T14:24:44Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: codfw maint (T327991)

Upgrade went smoothly, less than 15min hard downtime here too.

I restarted es5 codfw backup job, the only backup-related thingy affected by the downtime.

Mentioned in SAL (#wikimedia-operations) [2023-02-21T16:49:57Z] <akosiaris@cumin1001> START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: T327991 - None

Mentioned in SAL (#wikimedia-operations) [2023-02-21T17:04:34Z] <akosiaris@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in codfw: T327991 - None
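
The START and END entries above carry UTC timestamps, so the time the repool cookbook took can be read directly from them. A minimal sketch, assuming the bracketed timestamp format seen in these SAL lines; `sal_time` is a hypothetical helper name and the two strings are shortened copies of the entries above.

```python
import re
from datetime import datetime, timezone

# SAL entries in this task carry a bracketed UTC timestamp, e.g. [2023-02-21T16:49:57Z].
SAL_TIMESTAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\]")


def sal_time(line):
    """Return the timestamp of a SAL line as an aware UTC datetime."""
    match = SAL_TIMESTAMP.search(line)
    if match is None:
        raise ValueError("no SAL timestamp found")
    parsed = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


start = sal_time("[2023-02-21T16:49:57Z] <akosiaris@cumin1001> START - Cookbook sre.discovery.datacenter")
end = sal_time("[2023-02-21T17:04:34Z] <akosiaris@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter")

elapsed = end - start
print(elapsed)  # 0:14:37
```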

Mentioned in SAL (#wikimedia-operations) [2023-02-21T17:57:48Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.service-route depool restbase-async in eqiad: T327991

Mentioned in SAL (#wikimedia-operations) [2023-02-21T18:02:52Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in eqiad: T327991

ayounsi claimed this task.