
codfw row D switches upgrade
Closed, Resolved, Public

Description

For reasons detailed in T327248: eqiad/codfw virtual-chassis upgrades we're going to upgrade codfw row D.

Scheduled for May 16th, 13:00-15:00 UTC. Please let us know if there is any issue with the scheduled time.
This means up to 30 minutes of hard downtime for the whole row if everything goes well (in practice, closer to 15 minutes). It is also a good opportunity to test the host depool mechanisms and the row redundancy of services.

The impacted servers and teams for this row are listed below.
The "action needed" fields are free-form:

  • please write NONE if no action is needed,
  • the cookbook/command to run if it can be done by a 3rd party
  • who will be around to take care of the depool
  • Link to the relevant doc
  • etc

The two main types of action needed are depooling and monitoring downtime.

NOTE: If the servers can handle a longer depool, it's preferable to depool them many hours or a day before (and mark None in the table) so there are fewer moving parts closer to the maintenance window.

All servers will be downtimed with sudo cookbook sre.hosts.downtime --hours 2 -r "codfw row D upgrade" -t XXX 'P{P:netbox::host%location ~ "D.*codfw"}' but specific services might need specific downtimes.
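The server lists below use a folded range notation such as restbase[2012,2017-2018,2023,2026-2027]. As a rough illustration of how one such entry unfolds into individual hostnames, here is a minimal Python sketch. The function name expand_hosts is hypothetical, it handles a single bracket group only, and it is not the parser used by the cookbooks or by Cumin.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand one folded entry, e.g. 'maps[2008,2010]', into hostnames.

    Illustrative only: supports a single [..] group containing
    comma-separated numbers or inclusive ranges.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        # No bracket group: the entry is already a single hostname.
        return [expr]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a bare number is a range of one
        width = len(start)  # keep zero padding if the numbers have any
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    for host in expand_hosts("restbase[2012,2017-2018,2023,2026-2027]"):
        print(host)
```

Entries without brackets, such as cloudcontrol2004-dev, are returned unchanged, and a domain suffix after the bracket (for example .codfw.wmnet) is preserved.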

Core Platform

Platform Engineering

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| maps[2008,2010] | depool | pool | |
| restbase[2012,2017-2018,2023,2026-2027] | depool | pool | |
| sessionstore2003 | None | None | |
| thumbor2006 | None | None | |

Infrastructure Foundations

Infrastructure-Foundations

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| ganeti[2015-2018,2025-2026] | NONE | NONE | OK |
| idm2001 | NONE | NONE | OK |
| idp2002 | NONE | NONE | OK |
| install2004 | NONE | NONE | OK |
| ldap-replica2006 | depooled | repooled | OK |
| netbox-dev2002 | NONE | NONE | OK |
| ping2003 | redirect ICMP traffic | revert | |
| puppetdb2003 | NONE | NONE | OK |
| puppetmaster2002 | Puppet disabled in codfw/esams/ulsfo | Puppet re-enabled | OK |
| urldownloader2004 | NONE | NONE | OK |

Infrastructure Foundations and Data Engineering

Infrastructure-Foundations Data-Engineering

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| krb[2001-2002] | NONE | NONE | |

WMCS

cloud-services-team

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| cloudcontrol2004-dev | NONE | NONE | |

ServiceOps

serviceops

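| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| chartmuseum2001 | | | |
| conf2006 | | | |
| kafka-main[2004-2005] | | | |
| kubernetes[2013-2014,2016,2022,2024] | | | |
| kubestagemaster2001 | | | |
| kubestagetcd2003 | | | |
| mc[2051-2054] | | | |
| mc-gp2003 | | | |
| mc-wf2002 | | | |
| mw[2271-2279,2281-2290,2366-2376,2444-2451] | | | |
| parse[2016-2020] | | | |
| rdb2010 | | | |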

Observability

SRE Observability

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| arclamp2001 | nothing to do, not active | | |
| dispatch-be2001 | not in production | | |
| kafka-logging2005 | no action needed | | |
| logstash[2003,2029-2031] | drain shards 2003,2029; depool 2030,2031 & set downtime | allocate shards, re-pool | done |

Observability and Data Persistence

SRE Observability Data-Persistence

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| thanos-fe2003 | sudo depool | sudo pool | @MatthewVernon done |

Search Platform

Discovery-Search

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| apifeatureusage2001 | none | none | |
| elastic[2050-2054,2060,2067-2068,2072,2084-2086] | inflatador/rkemper ban/depool day before | inflatador/rkemper unban/repool | |
| search-loader2001 | none | none | |
| wcqs2003 | none | none | |
| wdqs[2006,2012,2015,2021-2022] | none | none | |

ServiceOps-Collab

collaboration-services

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| aphlict2001 | NONE | NONE | |
| gerrit2002 | NONE | NONE | |
| gitlab-runner2004 | Will be paused in admin menu | Will be unpaused in admin menu | paused |
| miscweb2003 | NONE | NONE | |

Traffic

Traffic

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| cp[2039-2042] | NOOP, codfw depooled | | DONE |
| dns2002 | stop puppet and disable bird | | DONE |
| durum2002 | stop puppet and disable bird | | DONE |
| lvs2010 | NOOP, codfw depooled | | DONE |

Machine Learning

Machine-Learning-Team

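| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| ml-etcd2003 | none | none | |
| ml-serve[2004,2008] | none | none | |
| ml-serve-ctrl2002 | none | none | |
| ml-staging2002 | none | none | |
| ml-staging-ctrl2002 | none | none | |
| ores[2007-2009] | depool | pool | |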

Data Engineering

Data-Engineering

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| aqs[2009-2012] | None | None | |
| kafka-stretch2002 | None | None | |
| schema2004 | depool | pool | Depooled |

Data Persistence

Data-Persistence

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| backup[2001,2007] | to make sure media backups are paused | resume media backups | |
| cassandra-dev2003 | | | |
| db[2100-2101,2117-2120,2128-2131,2139-2140,2151-2152,2170-2174,2181-2182,2187] | Nothing - codfw will be depooled | | |
| dbprov2003 | to make sure db backups are not ongoing | retry db backups if failed | |
| dbproxy2004 | Nothing, not in use | Reload haproxy | |
| es[2023,2033-2034] | Nothing - codfw will be depooled | | |
| moss-fe2002 | Nothing, not in service | | |
| ms-backup2002 | to make sure media backups are paused | resume media backups | |
| ms-be[2043,2050,2056,2059,2061,2065,2069,2073] | no action required | | |
| ms-fe2012 | sudo depool | sudo pool | Done |
| pc2014 | Nothing - codfw will be depooled | | |
| thanos-be2004 | no action required | | |

Event Timeline

Restricted Application added a subscriber: Aklapper. (Apr 19 2023, 2:57 PM)
colewhite updated the task description.
Marostegui updated the task description.
Marostegui added a subscriber: jcrespo.

@jcrespo kindly check what is needed for backup involved hosts, thanks!

> @jcrespo kindly check what is needed for backup involved hosts, thanks!

Done.

klausman updated the task description.
bking updated the task description.

Change 919847 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: temporarily remove dns2002 from authdns_servers

https://gerrit.wikimedia.org/r/919847

Mentioned in SAL (#wikimedia-operations) [2023-05-15T18:54:35Z] <sukhe> set routing-options static route 208.80.153.231/32 next-hop [ 208.80.153.48 208.80.153.10 ]: codfw row D maint 2023/05/16 [dns2002] T335042

Mentioned in SAL (#wikimedia-operations) [2023-05-15T19:47:01Z] <ryankemper@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 2:00:00 on 20 hosts with reason: T335042 maintenance

Mentioned in SAL (#wikimedia-operations) [2023-05-15T19:47:16Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 2:00:00 on 20 hosts with reason: T335042 maintenance

Mentioned in SAL (#wikimedia-operations) [2023-05-15T19:49:13Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic[2050-2054,2060,2067-2068,2072,2084-2086] for row D switch upgrade - bking@cumin1001 - T335042

Mentioned in SAL (#wikimedia-operations) [2023-05-15T19:49:17Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic[2050-2054,2060,2067-2068,2072,2084-2086] for row D switch upgrade - bking@cumin1001 - T335042

Mentioned in SAL (#wikimedia-operations) [2023-05-15T19:50:21Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic[2050-2054,2060,2067-2068,2072,2084-2086]* for row D switch upgrade - bking@cumin1001 - T335042

Mentioned in SAL (#wikimedia-operations) [2023-05-15T19:50:25Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic[2050-2054,2060,2067-2068,2072,2084-2086]* for row D switch upgrade - bking@cumin1001 - T335042

akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row D switches upgrade - T335042 started.

Mentioned in SAL (#wikimedia-operations) [2023-05-16T10:29:46Z] <akosiaris@cumin1001> START - Cookbook sre.discovery.datacenter depool all active/active services in codfw: codfw row D switches upgrade - T335042

akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row D switches upgrade - T335042 failed.

Mentioned in SAL (#wikimedia-operations) [2023-05-16T10:48:13Z] <akosiaris@cumin1001> END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) depool all active/active services in codfw: codfw row D switches upgrade - T335042

Mentioned in SAL (#wikimedia-operations) [2023-05-16T11:57:04Z] <XioNoX> stage upgrade on asw-d-codfw - T335042

Change 919847 merged by Ssingh:

[operations/puppet@production] hiera: temporarily remove dns2002 from authdns_servers

https://gerrit.wikimedia.org/r/919847

Change 920265 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] depool codfw for row D switch upgrade

https://gerrit.wikimedia.org/r/920265

Mentioned in SAL (#wikimedia-operations) [2023-05-16T12:21:27Z] <XioNoX> disable ping offload in codfw - T335042

Change 920265 merged by Ssingh:

[operations/dns@master] depool codfw for row D switch upgrade

https://gerrit.wikimedia.org/r/920265

Mentioned in SAL (#wikimedia-operations) [2023-05-16T12:22:51Z] <sukhe> running authdns-update to disable codfw for switch upgrade: T335042

Mentioned in SAL (#wikimedia-operations) [2023-05-16T12:24:31Z] <sukhe> [done] running authdns-update to disable codfw for switch upgrade: T335042

Icinga downtime and Alertmanager silence (ID=3a841f97-aecd-4c7a-8eb4-8acd1caa15b3) set by ayounsi@cumin1001 for 2:00:00 on 189 host(s) and their services with reason: codfw row D upgrade

aphlict2001.codfw.wmnet,apifeatureusage2001.codfw.wmnet,aqs[2009-2012].codfw.wmnet,arclamp2001.codfw.wmnet,backup[2001,2007,2011].codfw.wmnet,bast2003.wikimedia.org,cassandra-dev2003.codfw.wmnet,chartmuseum2001.codfw.wmnet,cloudcontrol2004-dev.wikimedia.org,conf2006.codfw.wmnet,cp[2039-2042].codfw.wmnet,db[2100-2101,2117-2120,2128-2131,2139-2140,2151-2152,2170-2174,2181-2182,2187].codfw.wmnet,dbprov2003.codfw.wmnet,dbproxy2004.codfw.wmnet,dispatch-be2001.codfw.wmnet,dns[2002,2006].wikimedia.org,durum2002.codfw.wmnet,elastic[2050-2054,2060,2067-2068,2072,2084-2086].codfw.wmnet,es[2023,2033-2034].codfw.wmnet,ganeti[2015-2018,2025-2026].codfw.wmnet,gerrit2002.wikimedia.org,gitlab-runner2004.codfw.wmnet,idm2001.wikimedia.org,idp2002.wikimedia.org,install2004.wikimedia.org,kafka-logging2005.codfw.wmnet,kafka-main[2004-2005].codfw.wmnet,kafka-stretch2002.codfw.wmnet,krb[2001-2002].codfw.wmnet,kubernetes[2013-2014,2016,2022,2024].codfw.wmnet,kubestagemaster2001.codfw.wmnet,kubestagetcd2003.codfw.wmnet,ldap-replica2006.wikimedia.org,logstash[2003,2029-2031].codfw.wmnet,lvs2010.codfw.wmnet,maps[2008,2010].codfw.wmnet,mc[2051-2054].codfw.wmnet,mc-gp2003.codfw.wmnet,mc-wf2002.codfw.wmnet,miscweb2003.codfw.wmnet,ml-etcd2003.codfw.wmnet,ml-serve[2004,2008].codfw.wmnet,ml-serve-ctrl2002.codfw.wmnet,ml-staging2002.codfw.wmnet,ml-staging-ctrl2002.codfw.wmnet,moss-fe2002.codfw.wmnet,ms-backup2002.codfw.wmnet,ms-be[2043,2050,2056,2059,2061,2065,2069,2073].codfw.wmnet,ms-fe2012.codfw.wmnet,mw[2271-2279,2281-2290,2366-2376,2444-2451].codfw.wmnet,netbox-dev2002.codfw.wmnet,ores[2007-2009].codfw.wmnet,parse[2016-2020].codfw.wmnet,pc2014.codfw.wmnet,ping2003.codfw.wmnet,puppetdb2003.codfw.wmnet,puppetmaster2002.codfw.wmnet,rdb2010.codfw.wmnet,restbase[2012,2017-2018,2023,2026-2027].codfw.wmnet,schema2004.codfw.wmnet,search-loader2001.codfw.wmnet,sessionstore2003.codfw.wmnet,thanos-be2004.codfw.wmnet,thanos-fe2003.codfw.wmnet,thumbor2006.codfw.wmnet,urldownloader2004.wikimedia.org,wcqs2003.codfw.wmnet,wdqs[2006,2012,2015,2021-2022].codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-05-16T12:50:06Z] <moritzm> disabling Puppet in codfw/esams/ulsfo for switch maintenance T335042

Mentioned in SAL (#wikimedia-operations) [2023-05-16T13:01:16Z] <XioNoX> asw-d-codfw> request system reboot all-members - T335042

Mentioned in SAL (#wikimedia-operations) [2023-05-16T13:25:23Z] <moritzm> enabled Puppet in codfw/esams/ulsfo for switch maintenance T335042
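For a rough sense of how long the maintenance took, the SAL timestamps can be subtracted directly. The Python sketch below measures the gap between the reboot entry (13:01:16Z) and the Puppet re-enable entry (13:25:23Z). Treating those two entries as the bounds of the outage is an assumption made for illustration, since the task does not record when the switches actually came back, and the helper name parse_sal_timestamp is hypothetical.

```python
from datetime import datetime, timedelta


def parse_sal_timestamp(stamp: str) -> datetime:
    """Parse a SAL timestamp such as '2023-05-16T13:01:16Z' (UTC)."""
    # Replace the trailing 'Z' with an explicit offset so that
    # datetime.fromisoformat accepts it on older Python 3 versions too.
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


# Timestamps copied from the two SAL entries above.
reboot_requested = parse_sal_timestamp("2023-05-16T13:01:16Z")
puppet_reenabled = parse_sal_timestamp("2023-05-16T13:25:23Z")

elapsed: timedelta = puppet_reenabled - reboot_requested
print(elapsed)  # 0:24:07
```

By this measure the window was about 24 minutes, which is an upper bound on the actual traffic interruption rather than a measurement of it.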

akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row D switches upgrade done - T335042 started.

Mentioned in SAL (#wikimedia-operations) [2023-05-16T13:54:21Z] <akosiaris@cumin1001> START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: codfw row D switches upgrade done - T335042

akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row D switches upgrade done - T335042 failed.

Mentioned in SAL (#wikimedia-operations) [2023-05-16T14:10:45Z] <akosiaris@cumin1001> END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in codfw: codfw row D switches upgrade done - T335042

ayounsi claimed this task.

Upgrade went very well. Thanks everybody! That was the last one!