codfw row C switches upgrade
Closed, Resolved (Public)

Description

For the reasons detailed in T327248 (eqiad/codfw virtual-chassis upgrades), we're going to upgrade the codfw row C switches during the scheduled DC switchover.

Scheduled for May 2nd, 13:00-15:00 UTC. Please let us know if there is any issue with the scheduled time.
This means a 30-minute hard downtime for the whole row if everything goes well (realistically, closer to 15 minutes). It is also a good opportunity to test the host depool mechanisms and the row redundancy of services.

The impacted servers and teams for this row are listed below.
The "action needed" fields are fairly free-form:

  • please write NONE if no action is needed,
  • the cookbook/command to run if it can be done by a 3rd party
  • who will be around to take care of the depool
  • Link to the relevant doc
  • etc

The two main types of action needed are depooling and monitoring downtime.

NOTE: If the servers can handle a longer depool, it's preferred to depool them many hours or the day before (and mark None in the table) so there are fewer moving parts closer to the maintenance window.

All servers will be downtimed with sudo cookbook sre.hosts.downtime --hours 2 -r "codfw row C upgrade" -t T334049 'P{P:netbox::host%location ~ "C.*codfw"}' but specific services might need specific downtimes.

WMCS

cloud-services-team

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| cloudbackup2002 | | | |
| cloudcumin2001 | | | |

Machine Learning

Machine-Learning-Team

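| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| ml-cache2003 | none | none | |
| ml-etcd2002 | none | none | |
| ml-serve[2003,2007] | none | none | |
| ml-serve-ctrl2001 | none | none | |
| ml-staging-etcd2003 | none | none | |
| ores[2005-2006] | depool | pool | done |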

Data Engineering

Data-Engineering

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| kafka-stretch2001 | None | None | |
| schema2003 | depool | pool | Depooled |

Traffic

Traffic

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| cp[2035-2038] | NOOP, codfw will be depooled | NOOP | |
| dns2001 | move ns1 to dns2002 | move ns1 back to dns2001 | DONE |
| durum2001 | disable puppet and stop pybal | | DONE |
| lvs2009 | NOOP, codfw will be depooled | NOOP | |

Core Platform

Platform Engineering

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| maps2007 | None | None | |
| restbase[2015-2016,2020,2022,2025] | depool | pool | |
| sessionstore2002 | None | None | |

ServiceOps-Collab

collaboration-services

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| doc2002 | NONE | NONE | |
| gitlab-runner2003 | will be paused ahead of maintenance in admin interface | will be unpaused ahead of maintenance in admin interface | unpaused again by @Jelto |
| phab2002 | NONE | NONE | |
| vrts2001 | NONE | NONE | |

Search Platform

Discovery-Search

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| elastic[2045-2048,2059,2065-2066,2071,2081-2083] | NONE | NONE | |
| wcqs2002 | NONE | NONE | |
| wdqs[2008,2011,2017-2019] | NONE | NONE | |

Observability

SRE Observability

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| alert2001 | | | |
| kafka-logging2003 | | | |
| logstash[2002,2028,2032,2035,2037] | drain shards 2002,2028,2035,2037 depool 2032 & set downtime | allocate shards, repool | shards allocating, re-pooled |
| mwlog2002 | n/a | n/a | no action |
| prometheus2006 | depool and remove from AM | pool and put back in AM | incomplete |
| thanos-fe2004 | n/a | n/a | not in production yet |
| webperf2003 | no action | | |

Infrastructure Foundations

Infrastructure-Foundations

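| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| build2001 | NONE | NONE | |
| cumin2002 | NONE | NONE | |
| debmonitor2002 | NONE | NONE | |
| failoid2002 | NONE | NONE | |
| ganeti[2009-2014] | NONE | NONE | |
| idp-test2002 | NONE | NONE | |
| ldap-replica2005 | depooled | repooled | OK |
| netflow2002 | None | None | |
| puppetboard2002 | NONE | NONE | |
| puppetmaster2005 | NONE | NONE | |
| urldownloader2003 | NONE | NONE | |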

Data Persistence

Data-Persistence

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| backup[2003,2006,2009] | mediabackups should be paused | restart codfw media backups | |
| cassandra-dev2002 | None | None | |
| db[2099,2102,2112-2116,2125-2127,2135,2138,2141,2144,2149-2150,2165-2169,2179-2180,2184,2186] | Nothing - codfw will be depooled | | |
| dbprov2004 | db backups should be paused | restart db backups, if any | |
| dbproxy2003 | Nothing, not active | | |
| es[2022,2031-2032] | Nothing - codfw will be depooled | | |
| moss-fe2001 | None | None | |
| ms-backup2001 | mediabackups should be paused | restart codfw media backups | |
| ms-be[2042,2048-2049,2054-2055,2058,2064,2068,2072] | None | None | |
| ms-fe2011 | None | None | |
| pc2013 | Nothing - codfw will be depooled | | |
| thanos-be2003 | None | None | |

ServiceOps

serviceops

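| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| conf2005 | | | |
| deploy2002 | | | |
| dragonfly-supernode2001 | | | |
| kafka-main2003 | | | |
| kubernetes[2011-2012,2015,2017,2021] | | | |
| kubestagetcd2002 | | | |
| kubetcd2005 | | | |
| mc[2047-2050] | | | |
| mc-wf2001 | | | |
| mw[2335-2339,2350-2365,2412-2419,2436-2443] | | | |
| mwmaint2002 | | | |
| parse[2011-2015] | | | |
| rdb2009 | | | |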

No owner

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| irc2002 | NONE | NONE | |

Event Timeline

@ayounsi to confirm, codfw will be depooled before this maintenance right? @akosiaris @Joe ?

Yes, we'll have to depool codfw.

@ayounsi to confirm, codfw will be depooled before this maintenance right? @akosiaris @Joe ?

That's my understanding, yeah. It will be after the switch-back, so eqiad will be primary again.

Marostegui added a subscriber: jcrespo.

@jcrespo kindly check what the backup servers need. Thanks

klausman updated the task description.

@Ladsgroup During this operation, codfw -> eqiad replication is still active. Since codfw masters are involved (even though codfw will be depooled), remember to downtime the eqiad topologies too; otherwise they'll alert on broken replication.

Mentioned in SAL (#wikimedia-operations) [2023-05-01T14:04:52Z] <sukhe> move ns1 from dns2001 to dns2002: T334049

Mentioned in SAL (#wikimedia-operations) [2023-05-01T14:09:32Z] <sukhe> move backup routes for ns0 from dns2001 to dns2002: T334049

Change 913952 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: temporarily remove dns2001 from authdns_servers

https://gerrit.wikimedia.org/r/913952

Change 913966 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] ntp/codfw: point to dns2002 temporarily

https://gerrit.wikimedia.org/r/913966

Change 913966 merged by Ssingh:

[operations/dns@master] ntp/codfw: point to dns2002 temporarily

https://gerrit.wikimedia.org/r/913966

Mentioned in SAL (#wikimedia-operations) [2023-05-01T21:08:31Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic[2045-2048,2059,2065-2066,2071,2081-2083] for row C switch upgrade - bking@cumin1001 - T334049

Mentioned in SAL (#wikimedia-operations) [2023-05-01T21:08:35Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic[2045-2048,2059,2065-2066,2071,2081-2083] for row C switch upgrade - bking@cumin1001 - T334049

Mentioned in SAL (#wikimedia-operations) [2023-05-01T21:08:49Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic[2045-2048,2059,2065-2066,2071,2081-2083]* for row C switch upgrade - bking@cumin1001 - T334049

Mentioned in SAL (#wikimedia-operations) [2023-05-01T21:08:52Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic[2045-2048,2059,2065-2066,2071,2081-2083]* for row C switch upgrade - bking@cumin1001 - T334049

Mentioned in SAL (#wikimedia-operations) [2023-05-01T21:15:12Z] <ryankemper@cumin2002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 17 hosts with reason: T334049 maint

Mentioned in SAL (#wikimedia-operations) [2023-05-01T21:15:38Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 17 hosts with reason: T334049 maint

Mentioned in SAL (#wikimedia-operations) [2023-05-02T08:27:31Z] <XioNoX> stage Junos 21 on asw-c-codfw - T334049

akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row C switches upgrade - T334049 started.

Mentioned in SAL (#wikimedia-operations) [2023-05-02T09:59:25Z] <akosiaris@cumin1001> START - Cookbook sre.discovery.datacenter depool all active/active services in codfw: codfw row C switches upgrade - T334049

akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row C switches upgrade - T334049 completed.

Mentioned in SAL (#wikimedia-operations) [2023-05-02T10:52:40Z] <akosiaris@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all active/active services in codfw: codfw row C switches upgrade - T334049

Mentioned in SAL (#wikimedia-operations) [2023-05-02T11:49:15Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 6:00:00 on 41 hosts with reason: Row c switch maint T334049

Mentioned in SAL (#wikimedia-operations) [2023-05-02T11:49:42Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 41 hosts with reason: Row c switch maint T334049

Mentioned in SAL (#wikimedia-operations) [2023-05-02T11:51:27Z] <Amir1> stop slave on db1130 (eqiad master of s5) (T334049)

Mentioned in SAL (#wikimedia-operations) [2023-05-02T12:05:42Z] <Amir1> stop slave again on db1130 (eqiad master of s5) (T334049)

Mentioned in SAL (#wikimedia-operations) [2023-05-02T12:17:47Z] <Amir1> stop slave on eqiad masters of s1, x1, s8 (T334049)

Change 914314 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] depool codfw for switch upgrade

https://gerrit.wikimedia.org/r/914314

Change 914314 merged by Ssingh:

[operations/dns@master] depool codfw for switch upgrade

https://gerrit.wikimedia.org/r/914314

Mentioned in SAL (#wikimedia-operations) [2023-05-02T12:20:50Z] <sukhe> run authdns-update to depool codfwL T334049

Change 913952 merged by Ssingh:

[operations/puppet@production] hiera: temporarily remove dns2001 from authdns_servers

https://gerrit.wikimedia.org/r/913952

Change 914315 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore: disable client connections to sessionstore2002

https://gerrit.wikimedia.org/r/914315

Change 914315 abandoned by Eevans:

[operations/puppet@production] sessionstore: disable client connections to sessionstore2002

Reason:

Unnecessary with codfw depooled

https://gerrit.wikimedia.org/r/914315

Icinga downtime and Alertmanager silence (ID=21224f03-d3c2-4431-accb-64fcadd01a0f) set by ayounsi@cumin1001 for 2:00:00 on 185 host(s) and their services with reason: codfw row C upgrade

backup[2003,2006,2009-2010].codfw.wmnet,build2001.codfw.wmnet,cassandra-dev2002.codfw.wmnet,cloudbackup2002.codfw.wmnet,cloudcumin2001.codfw.wmnet,conf2005.codfw.wmnet,cp[2035-2038].codfw.wmnet,cumin2002.codfw.wmnet,db[2099,2102,2112-2116,2125-2127,2135,2138,2141,2144,2149-2150,2165-2169,2179-2180,2184,2186].codfw.wmnet,dbprov2004.codfw.wmnet,dbproxy2003.codfw.wmnet,debmonitor2002.codfw.wmnet,deploy2002.codfw.wmnet,dns[2001,2005].wikimedia.org,doc2002.codfw.wmnet,dragonfly-supernode2001.codfw.wmnet,durum2001.codfw.wmnet,elastic[2045-2048,2059,2065-2066,2071,2081-2083].codfw.wmnet,es[2022,2031-2032].codfw.wmnet,failoid2002.codfw.wmnet,ganeti[2009-2014].codfw.wmnet,gitlab-runner2003.codfw.wmnet,idp-test2002.wikimedia.org,irc2002.wikimedia.org,kafka-logging2003.codfw.wmnet,kafka-main2003.codfw.wmnet,kafka-stretch2001.codfw.wmnet,kubernetes[2011-2012,2015,2017,2021].codfw.wmnet,kubestagetcd2002.codfw.wmnet,kubetcd2005.codfw.wmnet,ldap-replica2005.wikimedia.org,logstash[2002,2028,2032,2035,2037].codfw.wmnet,lvs2009.codfw.wmnet,maps2007.codfw.wmnet,mc[2047-2050].codfw.wmnet,mc-wf2001.codfw.wmnet,ml-cache2003.codfw.wmnet,ml-etcd2002.codfw.wmnet,ml-serve[2003,2007].codfw.wmnet,ml-serve-ctrl2001.codfw.wmnet,ml-staging-etcd2003.codfw.wmnet,moss-fe2001.codfw.wmnet,ms-backup2001.codfw.wmnet,ms-be[2042,2048-2049,2054-2055,2058,2064,2068,2072].codfw.wmnet,ms-fe2011.codfw.wmnet,mw[2335-2339,2350-2365,2412-2419,2436-2443].codfw.wmnet,mwlog2002.codfw.wmnet,mwmaint2002.codfw.wmnet,netflow2002.codfw.wmnet,ores[2005-2006].codfw.wmnet,parse[2011-2015].codfw.wmnet,pc2013.codfw.wmnet,phab2002.codfw.wmnet,prometheus2006.codfw.wmnet,puppetboard2002.codfw.wmnet,puppetmaster2005.codfw.wmnet,rdb2009.codfw.wmnet,restbase[2015-2016,2020,2022,2025].codfw.wmnet,schema2003.codfw.wmnet,sessionstore2002.codfw.wmnet,sretest2002.codfw.wmnet,thanos-be2003.codfw.wmnet,thanos-fe2004.codfw.wmnet,urldownloader2003.wikimedia.org,vrts2001.codfw.wmnet,wcqs2002.codfw.wmnet,wdqs[2008,2011,2017-2019].codfw.wmnet,
webperf2003.codfw.wmnet
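
The host list above uses a bracketed range notation (for example `ores[2005-2006].codfw.wmnet`). As an illustration only, here is a minimal Python sketch of how such a list can be expanded into individual host names. It assumes at most one bracket group per pattern and plain numeric ranges; the function names are hypothetical, and this is not the parser used by Cumin or the cookbooks.

```python
import re


def split_patterns(host_list):
    """Split a host list on commas that are outside square brackets."""
    patterns, depth, current = [], 0, []
    for ch in host_list:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            patterns.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    patterns.append("".join(current).strip())
    return [p for p in patterns if p]


def expand(pattern):
    """Expand one pattern such as 'mc[2047-2050].codfw.wmnet' into host names."""
    m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if m is None:
        # No bracket group: the pattern is already a single host name.
        return [pattern]
    prefix, body, suffix = m.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a bare number is a one-element range
        width = len(start)  # keep any zero padding
        for n in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


def expand_all(host_list):
    """Expand a full comma-separated host list."""
    return [h for p in split_patterns(host_list) for h in expand(p)]


if __name__ == "__main__":
    for host in expand_all("ores[2005-2006].codfw.wmnet,webperf2003.codfw.wmnet"):
        print(host)
```

Counting the result of `expand_all` on the full list is one way to cross-check it against the "185 host(s)" figure reported by the downtime message.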

Mentioned in SAL (#wikimedia-operations) [2023-05-02T13:05:38Z] <XioNoX> rebooting asw-c-codfw for software upgrade - T334049

jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row C switches upgrade - T334049 started.

Mentioned in SAL (#wikimedia-operations) [2023-05-02T14:59:47Z] <jiji@cumin1001> START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: codfw row C switches upgrade - T334049

ayounsi claimed this task.

Upgrade went fine! Thanks everybody.

jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row C switches upgrade - T334049 completed.

Mentioned in SAL (#wikimedia-operations) [2023-05-02T15:16:48Z] <jiji@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in codfw: codfw row C switches upgrade - T334049
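
For reference, the run time of the two `sre.discovery.datacenter` cookbook runs can be derived from the START/END timestamps in the SAL entries above. The following is a small Python sketch using only those four timestamps; the helper name `elapsed` is made up for this example.

```python
from datetime import datetime

# Timestamp format used in the SAL entries, e.g. 2023-05-02T09:59:25Z
SAL_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def elapsed(start, end):
    """Return the time between two SAL timestamps as a timedelta."""
    return datetime.strptime(end, SAL_FORMAT) - datetime.strptime(start, SAL_FORMAT)


# START/END pairs copied from the SAL entries for sre.discovery.datacenter
depool = elapsed("2023-05-02T09:59:25Z", "2023-05-02T10:52:40Z")
pool = elapsed("2023-05-02T14:59:47Z", "2023-05-02T15:16:48Z")

if __name__ == "__main__":
    print("depool run:", depool)
    print("pool run:", pool)
```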

Change 919388 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] ntp/codfw: point to dns2003 temporarily

https://gerrit.wikimedia.org/r/919388

Change 919388 merged by Ssingh:

[operations/dns@master] ntp/codfw: point to dns2003 temporarily

https://gerrit.wikimedia.org/r/919388