
codfw row A switches upgrade
Closed, Resolved · Public

Description


For the reasons detailed in T327248: eqiad/codfw virtual-chassis upgrades, we're going to upgrade the codfw row A switches.

This is scheduled for Feb 7th, 14:00-16:00 UTC; please let us know if there is any issue with the scheduled time.
It means about 30 minutes of hard downtime for the whole row if everything goes well. It's also a good opportunity to test the hosts' depool mechanisms and the row redundancy of services.

The impacted servers and teams for this row are listed below.
The "action needed" entries are quite free-form:

  • please write NONE if no action is needed,
  • the cookbook/command to run if it can be done by a third party,
  • who will be around to take care of the depool,
  • a link to the relevant doc,
  • etc.

The two main types of actions needed are depool and monitoring downtime.

NOTE: If the servers can handle a longer depool, it's preferred to depool them many hours or the day before (and mark None in the table) so there are fewer moving parts closer to the maintenance window.
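
For most hosts the two actions look roughly like the commands below. This is a sketch assembled from commands used elsewhere in this task; the target host is a placeholder and exact flags vary per service:

  # depool a host from its load-balanced services (run on the host itself)
  sudo depool

  # downtime a host's alerts from a cumin host (target query is illustrative)
  sudo cookbook sre.hosts.downtime --hours 2 -r "codfw row A upgrade" -t T327925 'example2001.codfw.wmnet'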

Data Engineering

Data-Engineering

Servers | Depool action needed | Repool action needed | Status
aqs[2001-2004] | None | None |

Observability

SRE Observability

Servers | Depool action needed | Repool action needed | Status
grafana2001 | set downtime | none | depooled
kafka-logging2001 | set downtime, stop kafka service | start kafka service, confirm kafka logging dashboard returns green | depooled
kafkamon2002 | set downtime | none | depooled
logstash[2001,2023,2026,2033] | conftool 2023; drain shards 2001,2026,2033 | conftool 2023; allocate shards 2001,2026,2033 | all re-pooled
xhgui2001 | none | none | n/a
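
For reference, "drain shards" on the Elasticsearch-backed logstash hosts typically means excluding the node from shard allocation via the standard cluster-settings API; a minimal sketch, assuming that mechanism (the node name pattern is illustrative):

  # drain: exclude the node so Elasticsearch relocates its shards elsewhere
  curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
    -d '{"transient": {"cluster.routing.allocation.exclude._name": "logstash2001*"}}'

  # allocate (repool step): clear the exclusion so shards can move back
  curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
    -d '{"transient": {"cluster.routing.allocation.exclude._name": null}}'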

Observability and Data Persistence

SRE Observability Data-Persistence

Servers | Depool action needed | Repool action needed | Status
thanos-fe2001 | conftool depool, while making sure another thanos-fe host is pooled for the thanos-web service | conftool pool; make sure only one thanos-fe host is pooled for the thanos-web service |
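
The conftool depool/pool above is normally done with confctl from a cumin host; a sketch, assuming the usual confctl selectors:

  # depool thanos-fe2001 wherever it is pooled
  sudo confctl select 'name=thanos-fe2001.codfw.wmnet' set/pooled=no

  # check which thanos-fe hosts are pooled for thanos-web before proceeding
  sudo confctl select 'service=thanos-web' get

  # repool afterwards
  sudo confctl select 'name=thanos-fe2001.codfw.wmnet' set/pooled=yes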

Search Platform

Discovery-Search

Servers | Depool action needed | Repool action needed | Status
elastic[2037-2040,2055-2056,2061-2062,2069,2073-2076] | None | None | Search team will depool & ban hosts from cluster one day prior to upgrade
wdqs[2003-2004,2009] | None | None | Search team will depool one day prior to upgrade

Core Platform

Platform Engineering

Servers | Depool action needed | Repool action needed | Status
maps2005 | | |
thumbor2005 | | |

WMCS

cloud-services-team

Servers | Depool action needed | Repool action needed | Status
cloudbackup2001 | NONE | NONE |

ServiceOps-Collab

collaboration-services

Servers | Depool action needed | Repool action needed | Status
contint2001 | NONE | NONE |
doc2001 | NONE | NONE |
gitlab2002 | NONE | NONE |
planet2002 | NONE | NONE |

Data Persistence

Data-Persistence

Servers | Depool action needed | Repool action needed | Status
backup[2002,2004] | They are not a service, but storage. Jaime will make sure earlier in the week that they are not active at the time of the maintenance. | Jaime will restart some delayed backups, if any. | no blockers
db[2094,2097,2103-2106,2121-2122,2132-2133,2136,2142,2145-2146,2153-2158,2175-2176,2183] | All MW dbs need to be depooled, and some masters need to be switched over (misc masters do not need switchover/depooling) | @Marostegui will repool everything | @Marostegui: no longer masters: db2103, db2104, db2105, db2121, db2142; the rest of the masters are misc so they can be ignored; what needs to be depooled is already depooled
dbprov2001 | They are not a service, but storage. Jaime will make sure earlier in the week that they are not active at the time of the maintenance. | None | no blockers
dbproxy2001 | None | Reload haproxy |
es[2020,2024,2026-2028] | All need to be depooled (@Marostegui will do it) | @Marostegui will repool everything | Depooled
moss-be2001 | N/A | N/A | Not currently in production service
ms-be[2040,2044-2045,2051-2052,2060,2062,2066] | None | None |
ms-fe2009 | sudo depool | sudo pool |
pc2011 | To be depooled | To be repooled once it is all done | Already depooled by @Marostegui
thanos-be2001 | None | None |
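
The database depools above were done with dbctl (see the dbctl commits in the timeline below); a sketch of the usual per-instance flow:

  # depool one replica and commit the change (the commit is what applies it)
  sudo dbctl instance db2094 depool
  sudo dbctl config commit -m 'Depool db2094 T327925'

  # repool after the maintenance
  sudo dbctl instance db2094 pool
  sudo dbctl config commit -m 'Repool db2094 T327925'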

Infrastructure Foundations

Infrastructure-Foundations

Servers | Depool action needed | Repool action needed | Status
ganeti[2023-2024,2027-2030] | None | |
ganeti-test[2001-2003] | None | |
netbox2002 | None | None |
netboxdb2002 | None | None |
pki2001 | None | None | N/A
puppetdb2002 | sudo cumin 'A:codfw or A:esams or A:ulsfo' 'disable-puppet "Switch reboot: T327925"' | sudo cumin 'A:codfw or A:esams or A:ulsfo' 'enable-puppet "Switch reboot: T327925"' | jbond will handle
puppetmaster[2001,2004] | sudo cumin 'A:codfw or A:esams or A:ulsfo' 'disable-puppet "Switch reboot: T327925"' | sudo cumin 'A:codfw or A:esams or A:ulsfo' 'enable-puppet "Switch reboot: T327925"' | jbond will handle
rpki2002 | None | |
test-reimage2001 | None | |
testvm[2001-2005] | None | |
urldownloader2001 | None | |

Infrastructure Foundations and Observability

Infrastructure-Foundations SRE Observability

Servers | Depool action needed | Repool action needed | Status
netmon2002 | None | None |

Machine Learning

Machine-Learning-Team

Servers | Depool action needed | Repool action needed | Status
ml-cache2001 | - | - |
ml-serve[2001,2005] | - | - |
ml-staging2001 | - | - |
ml-staging-etcd2001 | - | - |
ores[2001-2002] | sudo -i depool | sudo -i pool |
orespoolcounter2003 | - | - | -

Traffic

Traffic

Servers | Depool action needed | Repool action needed | Status
acmechief2001 | N/A | |
acmechief-test2001 | N/A | |
authdns2001 | redirect to authdns1001 | the opposite | done
cp[2027-2030] | N/A | N/A |
doh2001 | disable puppet & stop bird.service | the opposite | done
lvs2007 | N/A | N/A |
ncredir2001 | N/A | N/A |
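
The doh2001 actions above expand to roughly the following; a sketch assuming the same disable-puppet/enable-puppet wrappers used in the Infrastructure Foundations rows:

  # depool: stop puppet from restarting bird, then stop the anycast daemon
  sudo disable-puppet "Switch reboot: T327925"
  sudo systemctl stop bird.service

  # repool: the opposite
  sudo systemctl start bird.service
  sudo enable-puppet "Switch reboot: T327925"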

ServiceOps

serviceops
Due to the large number of services potentially affected (multiple mw appservers, kubernetes workers), a global depool of a/a services was done:

  sre.discovery.datacenter-route depool --reason T327925 codfw

After the maintenance, repool:

  sre.discovery.datacenter-route pool --reason T327925 codfw

Depool restbase-async from eqiad:

  cookbook sre.discovery.service-route --reason T327925 depool --wipe-cache eqiad restbase-async
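
From a cumin host these are run as cookbooks; the depool invocation below matches the one recorded later in this task, and the pool form is the presumed repool counterpart:

  sudo cookbook sre.discovery.datacenter-route --reason 'T327925' depool codfw
  sudo cookbook sre.discovery.datacenter-route --reason 'T327925' pool codfw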

Servers | Depool action needed | Repool action needed | Status
kafka-main2001 | | | done
kubemaster2001 | | | done
kubernetes[2005,2007-2008,2018-2019] | | | done
kubestage2001 | | | done
kubetcd2004 | | | done
mc[2038-2041,2055] | | | done
mc-gp2001 | | | done
mw[2291-2309,2377-2411] | | | done
mwdebug2001 | | | done
parse[2001-2005] | | | done
poolcounter2003 | | | done
rdb2007 | | | done
registry2003 | | | done

Event Timeline

herron updated the task description.

I would suggest that instead of handling individual systems, we depool the whole datacenter from external and internal traffic for the duration of the maintenance.

That is, unless we want to verify we can survive the loss of one row, in which case I'd suggest we do nothing instead.

> I would suggest that instead of handling individual systems, we depool the whole datacenter from external and internal traffic for the duration of the maintenance.
>
> That is, unless we want to verify we can survive the loss of one row, in which case I'd suggest we do nothing instead.

+1 from my side (and I would apply this to all the future row maintenances that are already scheduled)

Mentioned in SAL (#wikimedia-operations) [2023-02-06T07:19:13Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool es2020 es2024 es2026 es2027 es2028 T327925', diff saved to https://phabricator.wikimedia.org/P43586 and previous config saved to /var/cache/conftool/dbconfig/20230206-071913-root.json

Mentioned in SAL (#wikimedia-operations) [2023-02-06T07:30:16Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2094 db2097 db2103 db2104 db2105 db2106 db2121 db2122 db2132 db2133 db2136 db2142 db2145 db2146 db2153 db2154 db2155 db2156 db2157 db2158 db2175 db2176 db2183 T327925', diff saved to https://phabricator.wikimedia.org/P43587 and previous config saved to /var/cache/conftool/dbconfig/20230206-073015-root.json

Change 886834 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Delay codfw es (db content) backups by one day

https://gerrit.wikimedia.org/r/886834

>> I would suggest that instead of handling individual systems, we depool the whole datacenter from external and internal traffic for the duration of the maintenance.
>>
>> That is, unless we want to verify we can survive the loss of one row, in which case I'd suggest we do nothing instead.
>
> +1 from my side (and I would apply this to all the future row maintenances that are already scheduled)

Yeah, let's avoid causing issues to end users and having to stress about that. Let's fully depool codfw.

Cool! I am going to repool the hosts then :)

Change 886834 merged by Jcrespo:

[operations/puppet@production] dbbackups: Delay codfw es (db content) backups by one day

https://gerrit.wikimedia.org/r/886834

Change 886812 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] Revert "dbbackups: Delay codfw es (db content) backups by one day"

https://gerrit.wikimedia.org/r/886812

I am repooling all the databases since we are going to fully depool codfw for reads.

If we're "just" depooling codfw, it's worth noting that we will still need to depool the affected ms-fe* nodes (since mw always tries to write to both DCs).

Mentioned in SAL (#wikimedia-operations) [2023-02-06T22:42:20Z] <inflatador> bking@cumin2002 banning Elastic nodes from cluster in preparation for T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-06T22:48:27Z] <ryankemper> T327925 Banned elastic[2037-2040,2055-2056,2061-2062,2069,2073-2076] on codfw elastic

Icinga downtime and Alertmanager silence (ID=e0e96453-af13-467f-a75e-ebd1c4122a32) set by bking@cumin2002 for 1 day, 0:00:00 on 13 host(s) and their services with reason: switch upgrade

elastic[2037-2040,2055-2056,2061-2062,2069,2073-2076].codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-02-06T22:55:13Z] <ryankemper> T327925 Depooled codfw wdqs hosts: ryankemper@cumin2002:~$ sudo -E cumin -b 3 'wdqs[2003-2004,2009]*' 'sudo depool'

Staging the new version on the switches:

  asw-a-codfw> request system software add force-host set [ /var/tmp/jinstall-ex-4300-21.4R3-S1.5-signed.tgz /var/tmp/jinstall-host-qfx-5-21.4R3-S1.5-signed.tgz ]

This went fine.
Ready for a "request system reboot all-members".

Change 887284 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/dns@master] admin_state: depool codfw

https://gerrit.wikimedia.org/r/887284

Mentioned in SAL (#wikimedia-operations) [2023-02-07T12:24:43Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 8:00:00 on doh2001.wikimedia.org with reason: depooled; T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-07T12:24:58Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on doh2001.wikimedia.org with reason: depooled; T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-07T12:28:55Z] <vgutierrez> depooling authdns2001 - T327925

To depool all services in codfw we will just need to run:

  sudo cookbook sre.discovery.datacenter-route --reason 'T327925' depool codfw

from one of the cumin hosts.

Please note: this won't depool docker-registry, which will still be active in codfw for the duration of the maintenance.

Mentioned in SAL (#wikimedia-operations) [2023-02-07T13:05:49Z] <jbond> disable puppet in codfw, ulsfo and esams for switch upgrade T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-07T13:11:17Z] <oblivian@cumin2002> START - Cookbook sre.discovery.datacenter-route depool all active/active services in codfw: T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-07T13:12:59Z] <jbond> enable puppet in codfw, ulsfo and esams to allow depools post switch upgrade T327925

Change 887284 merged by Vgutierrez:

[operations/dns@master] admin_state: depool codfw

https://gerrit.wikimedia.org/r/887284

Mentioned in SAL (#wikimedia-operations) [2023-02-07T13:31:52Z] <vgutierrez> depool codfw edge site - T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-07T13:32:43Z] <oblivian@cumin2002> END (PASS) - Cookbook sre.discovery.datacenter-route (exit_code=0) depool all active/active services in codfw: T327925

For the record, downtime for all hosts in the row was set with:

  sudo cookbook sre.hosts.downtime --hours 2 -r "codfw row A upgrade" -t T327925 'P{P:netbox::host%location ~ "A.*codfw"}'

Icinga downtime and Alertmanager silence (ID=295bf4d5-8856-488b-9ca9-06a0ff06db18) set by ayounsi@cumin1001 for 2:00:00 on 199 host(s) and their services with reason: codfw row A upgrade

acmechief2001.codfw.wmnet,acmechief-test2001.codfw.wmnet,aqs[2001-2004].codfw.wmnet,authdns2001.wikimedia.org,backup[2002,2004].codfw.wmnet,cloudbackup2001.codfw.wmnet,contint2001.wikimedia.org,cp[2027-2030].codfw.wmnet,db[2094,2097,2103-2106,2121-2122,2132-2133,2136,2142,2145-2146,2153-2158,2175-2176,2183].codfw.wmnet,dbprov2001.codfw.wmnet,dbproxy2001.codfw.wmnet,doc2001.codfw.wmnet,doh2001.wikimedia.org,elastic[2037-2040,2055-2056,2061-2062,2069,2073-2076].codfw.wmnet,es[2020,2024,2026-2028].codfw.wmnet,ganeti[2023-2024,2027-2030].codfw.wmnet,ganeti-test[2001-2003].codfw.wmnet,gitlab2002.wikimedia.org,grafana2001.codfw.wmnet,kafka-logging2001.codfw.wmnet,kafka-main2001.codfw.wmnet,kafkamon2002.codfw.wmnet,kubemaster2001.codfw.wmnet,kubernetes[2005,2007-2008,2018-2019].codfw.wmnet,kubestage2001.codfw.wmnet,kubetcd2004.codfw.wmnet,logstash[2001,2023,2026,2033].codfw.wmnet,lvs2007.codfw.wmnet,maps2005.codfw.wmnet,mc[2038-2041,2055].codfw.wmnet,mc-gp2001.codfw.wmnet,ml-cache2001.codfw.wmnet,ml-serve[2001,2005].codfw.wmnet,ml-staging2001.codfw.wmnet,ml-staging-etcd2001.codfw.wmnet,moss-be2001.codfw.wmnet,ms-be[2040,2044-2045,2051-2052,2060,2062,2066].codfw.wmnet,ms-fe2009.codfw.wmnet,mw[2291-2309,2377-2411].codfw.wmnet,mwdebug2001.codfw.wmnet,ncredir2001.codfw.wmnet,netbox2002.codfw.wmnet,netboxdb2002.codfw.wmnet,netmon2002.wikimedia.org,ores[2001-2002].codfw.wmnet,orespoolcounter2003.codfw.wmnet,parse[2001-2005].codfw.wmnet,pc2011.codfw.wmnet,people2002.codfw.wmnet,pki2001.codfw.wmnet,planet2002.codfw.wmnet,poolcounter2003.codfw.wmnet,puppetdb2002.codfw.wmnet,puppetmaster[2001,2004].codfw.wmnet,rdb2007.codfw.wmnet,registry2003.codfw.wmnet,rpki2002.codfw.wmnet,test-reimage2001.codfw.wmnet,testvm[2002,2004-2005].codfw.wmnet,thanos-be2001.codfw.wmnet,thanos-fe2001.codfw.wmnet,thumbor2005.codfw.wmnet,urldownloader2001.wikimedia.org,wdqs[2003-2004,2009].codfw.wmnet,xhgui2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-02-07T13:59:57Z] <XioNoX> disable puppet in ulsfo/esams/codfw for codfw row A switch upgrade - T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-07T14:06:51Z] <XioNoX> asw-a-codfw> request system reboot all-members - T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-07T14:08:45Z] <jbond> disable puppet in codfw, ulsfo, esams for switch upgrade T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-07T14:26:37Z] <claime> depooled appserver, api_appserver, jobrunner, parsoid - T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-07T14:27:59Z] <jbond> enable puppet in codfw, ulsfo, esams post switch upgrade T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-07T14:32:20Z] <Emperor> pool ms-fe2009 (codfw as a whole still depooled) T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-07T14:36:25Z] <claime> repooled appserver, api_appserver, jobrunner, parsoid - T327925

Change 886984 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/dns@master] Revert "admin_state: depool codfw"

https://gerrit.wikimedia.org/r/886984

Mentioned in SAL (#wikimedia-operations) [2023-02-07T14:46:49Z] <volans@cumin2002> START - Cookbook sre.discovery.datacenter-route pool all active/active services in codfw: T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-07T15:00:53Z] <vgutierrez> restart pybal in lvs2009 - T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-07T15:04:37Z] <vgutierrez> restart pybal in lvs2010 - T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-07T15:07:56Z] <volans@cumin2002> END (PASS) - Cookbook sre.discovery.datacenter-route (exit_code=0) pool all active/active services in codfw: T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-07T15:09:02Z] <volans@cumin2002> START - Cookbook sre.discovery.service-route depool restbase-async in eqiad: T327925

Change 886984 merged by Vgutierrez:

[operations/dns@master] Revert "admin_state: depool codfw"

https://gerrit.wikimedia.org/r/886984

Mentioned in SAL (#wikimedia-operations) [2023-02-07T15:12:21Z] <vgutierrez> repool codfw edge site - T327925

Mentioned in SAL (#wikimedia-operations) [2023-02-07T15:14:06Z] <volans@cumin2002> END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in eqiad: T327925

ayounsi claimed this task.

The upgrade was smooth, with ~15 min of hard downtime.
No user impact; all the depools did their job. There were some paging alerts that got discussed on IRC; not sure what the conclusion was.

Thanks everybody!

Change 886812 merged by Jcrespo:

[operations/puppet@production] Revert "dbbackups: Delay codfw es (db content) backups by one day"

https://gerrit.wikimedia.org/r/886812

Mentioned in SAL (#wikimedia-operations) [2023-02-07T17:55:48Z] <inflatador> bking@cumin1001 repooling elastic and wdqs hosts post-maintenance T327925