eqiad row B switches upgrade
Closed, Resolved, Public

Description

For the reasons detailed in T327248 (eqiad/codfw virtual-chassis upgrades), we're going to upgrade the eqiad row B switches during the scheduled DC switchover.

This has been rescheduled to March 28th, 14:00-16:00 UTC (one week later than originally planned, to avoid conflicting with Sprint week). Please let us know if there is any issue with the scheduled time.
Expect a hard downtime of about 30 minutes for the whole row if everything goes well (realistically closer to 15 minutes). It is also a good opportunity to test the host depool mechanisms and the row redundancy of services.

The impacted servers and teams for this row are listed below.
The "action needed" fields are free-form:

  • please write NONE if no action is needed,
  • the cookbook/command to run if it can be done by a 3rd party
  • who will be around to take care of the depool
  • Link to the relevant doc
  • etc

The two main types of action needed are depooling and monitoring downtime.

NOTE: If the servers can handle a longer depool, it's preferred to depool them many hours or the day before (and mark None in the table) so there are fewer moving parts closer to the maintenance window.

All servers will be downtimed with sudo cookbook sre.hosts.downtime --hours 2 -r "eqiad row B upgrade" -t T330165 'P{P:netbox::host%location ~ "B.*eqiad"}' but specific services might need specific downtimes.
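
The quoting in that command is easy to get wrong, since the host query itself contains double quotes. As a minimal sketch, the Python below assembles the same invocation as an argument list and renders it as a shell-safe string. The helper name `build_downtime_command` is hypothetical; the cookbook name, flags, and query are copied from the line above, and nothing here actually runs the cookbook.

```python
import shlex


def build_downtime_command(hours, reason, task, query):
    """Return the argv for the downtime cookbook invocation quoted above.

    Hypothetical helper for illustration only: it assembles the argument
    list and does not execute anything.
    """
    return [
        "sudo", "cookbook", "sre.hosts.downtime",
        "--hours", str(hours),
        "-r", reason,
        "-t", task,
        query,
    ]


argv = build_downtime_command(
    hours=2,
    reason="eqiad row B upgrade",
    task="T330165",
    query='P{P:netbox::host%location ~ "B.*eqiad"}',
)

# shlex.join quotes each argument so the string can be pasted into a shell;
# shlex.split is its inverse, so the round trip returns the original argv.
command_line = shlex.join(argv)
print(command_line)
```

Keeping the query as a single list element means its inner double quotes never need manual escaping.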

Traffic

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| dns1003 (formerly authdns1001) | disable puppet and stop bird | | done |
| cp[1079-1082] | depool | | eqiad will be depooled, no action |
| durum1002 | disable puppet and stop bird | | done |
| lvs1014 | N/A | | |
| lvs1018 | disable puppet & stop pybal | | eqiad will be depooled, no action |

ServiceOps-Collab

collaboration-services

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| contint1002 | NONE | NONE | |
| gitlab1004 | NONE | NONE | NONE |
| gitlab-runner1002 | pause in admin interface | unpause in admin interface | unpaused/pooled again |
| otrs1001 | NONE | NONE | downtime will be announced |
| phab1004 | NONE | NONE | downtime announced in wikitech-l |

Infrastructure Foundations

Infrastructure-Foundations

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| debmonitor1002 | none | none | |
| failoid1002 | none | none | |
| ganeti[1013-1018] | none | none | |
| idp1002 | failover idp.w.o | | ready to proceed |
| ldap-replica1003 | depool | | depooled |
| mirror1001 | none | none | |
| netflow1002 | none | none | |
| puppetdb1003 | none | none | |
| puppetmaster[1001,1003] | disable puppet fleet-wide (as puppet.wikimedia.org goes to eqiad) | | puppet disabled |

Infrastructure Foundations and Observability

Infrastructure-Foundations SRE Observability

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| netmon1003 | None | None | |

Observability

SRE Observability

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| arclamp1001 | none | none | downtime scheduled |
| centrallog1002 | none | none | downtime scheduled |
| graphite1005 | failover to codfw | fail back to eqiad | moved back to eqiad |
| kafka-logging1001 | | | downtime scheduled |
| logstash[1011,1027,1032] | drain shards 1011,1027 depool 1032 | allocate shards 1011,1027 repool 1032 | shards allocating, hosts pooled |
| prometheus1006 | depool and remove from AM | pool and put back in AM | repool completed |

Observability and Data Persistence

SRE Observability Data-Persistence

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| thanos-fe1002 | depool | pool | completed |

Core Platform

Platform Engineering

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| dumpsdata1001 | none | | |
| maps[1007-1008] | depool | pool | |
| restbase[1017,1022-1024,1029,1032] | depool | pool | |
| snapshot[1008,1010,1013] | none | | |
| thumbor[1001-1002] | depool | pool | |

Search Platform

Discovery-Search
contact: @bking (inflatador on IRC)

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| cloudelastic[1002,1006] | no action needed | | |
| elastic[1055-1056,1074-1079,1085-1086] | no action needed | | |
| relforge1004 | no action needed | | |
| wcqs1002 | no action needed | | |
| wdqs[1007,1009,1012] | no action needed | | |

Data Persistence

Data-Persistence

ServersDepool action neededRepool action neededStatus
backup[1003,1005]Jaime to make sure they are idle during downtime
db[1104,1112-1113,1118-1119,1124,1130,1132,1139,1143-1144,1152,1155,1162-1165,1178-1179,1183,1187-1188,1206]db1183 needs to be switched over (T330847) , db1164 needs to be switched over T331510db1183 and db1164 are no longer masters (T330847 T331510)
dbprov1002Jaime to make sure they are idle during downtime
dbproxy[1014-1015]@Marostegui will failover dbproxy1014 and dbproxy1015Reload the proxiesBoth proxies have been failed over and are not active
es[1021,1025,1029-1030]Jaime: may require an es backup retry after downtimeNothing on MW side as eqiad is depooled
ms-be[1041,1047,1052-1053,1058,1061,1065]nothing needed
ms-fe1010depoolpooldepooled
pc1012Nothing to do, eqiad is depooled
thanos-be1002nothing needed

Machine Learning

Machine-Learning-Team

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| ml-etcd1001 | none | none | none |
| ml-serve1002 | none | none | none |
| ml-serve-ctrl1001 | none | none | none |
| ores[1003-1004] | sudo -i depool | sudo -i pool | |

Data Engineering

Data-Engineering
Announce downtime for other teams' pipelines + announce downtime for Hive, Presto + Superset limited functionality

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| an-conf1001 | NONE | NONE | |
| an-coord1001 | Failover hive to an-coord1002 (n.b. We will lose MariaDB, therefore superset, some Druid functionality, Hive, DataHub) | Fail back Hive to an-coord1001 | Failed over hive |
| an-druid1004 | NONE | NONE | |
| an-launcher1002 | Disable gobblin ingestion at 12:50 UTC | Re-enable ingestion | patch for gobblin; Gobblin jobs absented |
| an-master1002 | Putting into safe mode + Disabling YARN | Taking out of safe mode + Enabling YARN | Scheduled for 13:30 UTC: patch for YARN; YARN queues stopped, safe mode entered |
| an-presto1004 | NONE | NONE | |
| an-test-coord1002 | NONE | NONE | |
| an-test-ui1001 | NONE | NONE | |
| an-tool[1008-1009] | NONE + Announce downtime for Hue | NONE | Done |
| an-web1001 | NONE + Announce downtime for wikistats and analytics.wikimedia.org | NONE | Done |
| an-worker[1083-1087,1097-1098,1117,1124-1128,1130] | Putting HDFS into safe mode | Taking HDFS out of safe mode | Safe mode entered |
| analytics[1061-1063,1072-1073] | Putting HDFS into safe mode | Taking HDFS out of safe mode | Safe mode entered |
| aqs[1011,1017] | NONE | NONE | |
| datahubsearch1002 | NONE | NONE | |
| druid[1005,1007] | NONE | NONE | |
| kafka-jumbo1003 | NONE | NONE | |
| schema1003 | NONE | NONE | |
| stat[1007,1009] | Announce downtime for these two stats servers | | Done |

Data Engineering and Machine Learning

Data-Engineering Machine-Learning-Team

| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| dse-k8s-ctrl1002 | NONE | NONE | |
| dse-k8s-etcd1002 | NONE | NONE | |
| dse-k8s-worker1002 | NONE | NONE | |

WMCS

cloud-services-team

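| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| dbproxy1019 | | | |
| cloudbackup[1001-1002]-dev | | | |
| cloudcephmon1001 | | | |
| cloudcephosd1003 | | | |
| cloudcontrol1006 | | | |
| clouddb[1015-1016] | | | |
| clouddumps1001 | | | |
| cloudrabbit1001 | | | |
| cloudservices1005 | | | |
| cloudvirt[1017,1019-1024] | | | |
| cloudvirt-wdqs[1001-1003] | | | |
| cloudweb1003 | | | |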

ServiceOps

serviceops

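| Servers | Depool action needed | Repool action needed | Status |
| --- | --- | --- | --- |
| conf1008 | | | |
| dragonfly-supernode1001 | | | |
| kafka-main1002 | | | |
| kubernetes[1009-1010,1015,1019,1022] | | | |
| kubestage1003 | | | |
| kubestagemaster1001 | | | |
| kubestagetcd1005 | | | |
| kubetcd1006 | | | |
| mc[1041-1044] | | | |
| mc-wf1001 | | | |
| mw[1393-1404,1423-1433,1466-1481] | | | |
| mwmaint1002 | | | |
| parse[1007-1012,1017] | | | |
| rdb1009 | | | |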

Event Timeline

Change 901322 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Temporarily disable xenon/excimer for switch maintenance

https://gerrit.wikimedia.org/r/901322

Change 900238 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus1006: depool from alertmanager

https://gerrit.wikimedia.org/r/900238

Change 903185 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wmnet: move reads to graphite2004

https://gerrit.wikimedia.org/r/903185

Change 903206 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: check graphite2004

https://gerrit.wikimedia.org/r/903206

Change 903207 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsd: move writes to graphite2004

https://gerrit.wikimedia.org/r/903207

Change 903208 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wmnet: move writes to graphite2004

https://gerrit.wikimedia.org/r/903208

Change 903209 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/mediawiki-config@master] Failover statsd to graphite2004

https://gerrit.wikimedia.org/r/903209

Change 900238 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus1006: depool from alertmanager

https://gerrit.wikimedia.org/r/900238

Change 903246 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: temporarily removed dns1003 from authdns_servers

https://gerrit.wikimedia.org/r/903246

Change 903249 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] clouddumps: make clouddumps1002 the primary during switch maintenance

https://gerrit.wikimedia.org/r/903249

Change 903249 merged by Andrew Bogott:

[operations/puppet@production] clouddumps: make clouddumps1002 the primary during switch maintenance

https://gerrit.wikimedia.org/r/903249

Mentioned in SAL (#wikimedia-operations) [2023-03-27T21:45:34Z] <ryankemper> T330165 Depooled relevant search platform hosts: sudo -E cumin 'elastic[1055-1056,1074-1079,1085-1086]*,cloudelastic100[2,6]*,wcqs1002*,wdqs[1007,1012]*' 'sudo depool'

> Mentioned in SAL (#wikimedia-operations) [2023-03-27T21:45:34Z] <ryankemper> T330165 Depooled relevant search platform hosts: sudo -E cumin 'elastic[1055-1056,1074-1079,1085-1086]*,cloudelastic100[2,6]*,wcqs1002*,wdqs[1007,1012]*' 'sudo depool'

Isn't this missing wdqs1009?
FYI you can also use a query like:

'P{elastic1*,cloudelastic1*,wcqs1*,wdqs1*} and P{P:netbox::host%location ~ "B.*eqiad"}'
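
A gap like this can also be spotted mechanically by expanding both host expressions and diffing the resulting sets. The sketch below is only an illustration: `expand_hosts` is a hypothetical, simplified helper (one bracket group per item, trailing `*` globs dropped), not the parser cumin itself uses. The two expressions are the wdqs entries from the depool command and from the task's table.

```python
import re


def expand_hosts(expr):
    """Expand bracket range notation such as 'wdqs[1007,1012]' into a set of names.

    Simplified, hypothetical helper: it handles one bracket group per
    comma-separated item and drops any trailing '*' glob.
    """
    hosts = set()
    # Split on commas that are not inside a [...] group.
    for item in re.split(r",(?![^\[]*\])", expr):
        match = re.fullmatch(r"([^\[\]]+)\[([^\]]+)\](.*)", item)
        if match is None:
            hosts.add(item.rstrip("*"))
            continue
        prefix, ranges, suffix = match.groups()
        for part in ranges.split(","):
            start, _, end = part.partition("-")
            width = len(start)  # preserve zero padding, if any
            for number in range(int(start), int(end or start) + 1):
                hosts.add(f"{prefix}{number:0{width}d}{suffix.rstrip('*')}")
    return hosts


depooled = expand_hosts("wdqs[1007,1012]*")      # from the depool command
in_row_b = expand_hosts("wdqs[1007,1009,1012]")  # from the task description
print(sorted(in_row_b - depooled))  # ['wdqs1009']
```

Selecting hosts by location, as in the query above, avoids the problem altogether because the list is never typed by hand.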

Change 903185 merged by Filippo Giunchedi:

[operations/dns@master] wmnet: move reads to graphite2004

https://gerrit.wikimedia.org/r/903185

Change 903206 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: check graphite2004

https://gerrit.wikimedia.org/r/903206

Mentioned in SAL (#wikimedia-operations) [2023-03-28T08:00:06Z] <godog> move graphite reads to codfw - T330165

Change 903207 merged by Filippo Giunchedi:

[operations/puppet@production] statsd: move writes to graphite2004

https://gerrit.wikimedia.org/r/903207

Change 903208 merged by Filippo Giunchedi:

[operations/dns@master] wmnet: move writes to graphite2004

https://gerrit.wikimedia.org/r/903208

Change 903209 merged by jenkins-bot:

[operations/mediawiki-config@master] Failover statsd to graphite2004

https://gerrit.wikimedia.org/r/903209

Mentioned in SAL (#wikimedia-operations) [2023-03-28T08:02:36Z] <oblivian@deploy2002> Started scap: Backport for [[gerrit:903209|Failover statsd to graphite2004 (T330165)]]

Mentioned in SAL (#wikimedia-operations) [2023-03-28T08:04:11Z] <oblivian@deploy2002> oblivian and filippo: Backport for [[gerrit:903209|Failover statsd to graphite2004 (T330165)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-03-28T08:11:25Z] <oblivian@deploy2002> Finished scap: Backport for [[gerrit:903209|Failover statsd to graphite2004 (T330165)]] (duration: 08m 48s)

Change 903610 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Disable the gobblin timers temporarily for switch maintenance

https://gerrit.wikimedia.org/r/903610

Change 903621 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Failover hive services to the standby coordinator

https://gerrit.wikimedia.org/r/903621

Change 903621 merged by Btullis:

[operations/dns@master] Failover hive services to the standby coordinator

https://gerrit.wikimedia.org/r/903621

Change 903627 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Disable job submission to YARN queues to faciliatate maintenance

https://gerrit.wikimedia.org/r/903627

I "depooled" dbproxy1019 by following the procedure at https://wikitech.wikimedia.org/w/index.php?title=Portal:Data_Services/Admin/Runbooks/Depool_wikireplicas#Hardware_proxies

I modified the Prefix Puppet configuration, so HAProxy will now route all traffic to 208.80.154.242, which is mapped to dbproxy1018 and is not affected by the switch upgrade.

Please note that LVS will likely trigger a few alerts when dbproxy1019 goes down... I added a note to the wiki page above asking if maybe we should modify the procedure.

Change 903610 merged by Btullis:

[operations/puppet@production] Disable the gobblin timers temporarily for switch maintenance

https://gerrit.wikimedia.org/r/903610

Change 903642 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] Depool eqiad frontends for network maintenance

https://gerrit.wikimedia.org/r/903642

akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row B switches upgrade - T330165 started.

Mentioned in SAL (#wikimedia-operations) [2023-03-28T12:58:05Z] <akosiaris@cumin1001> START - Cookbook sre.discovery.datacenter depool all active/active services in eqiad: eqiad row B switches upgrade - T330165

Change 903642 merged by Ayounsi:

[operations/dns@master] Depool eqiad frontends for network maintenance

https://gerrit.wikimedia.org/r/903642

Mentioned in SAL (#wikimedia-operations) [2023-03-28T12:59:49Z] <XioNoX> depool eqiad for network maintenance - T330165

akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row B switches upgrade - T330165 completed.

Mentioned in SAL (#wikimedia-operations) [2023-03-28T13:17:37Z] <akosiaris@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all active/active services in eqiad: eqiad row B switches upgrade - T330165
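
For planning future rows, it can help to know how long the datacenter depool took and how much margin was left before the window opened. The sketch below computes both from the START and END timestamps logged above and the 14:00 UTC window start given in the description. `parse_sal_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_sal_timestamp(stamp):
    """Parse a SAL timestamp such as '2023-03-28T12:58:05Z' as an aware UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


start = parse_sal_timestamp("2023-03-28T12:58:05Z")         # cookbook START entry
end = parse_sal_timestamp("2023-03-28T13:17:37Z")           # cookbook END (PASS) entry
window_start = parse_sal_timestamp("2023-03-28T14:00:00Z")  # start of maintenance window

elapsed = end - start
margin = window_start - end
print(elapsed)  # 0:19:32
print(margin)   # 0:42:23
```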

Change 903246 merged by Ssingh:

[operations/puppet@production] hiera: temporarily removed dns1003 from authdns_servers

https://gerrit.wikimedia.org/r/903246

Change 903627 merged by Btullis:

[operations/puppet@production] Disable job submission to YARN queues to faciliatate maintenance

https://gerrit.wikimedia.org/r/903627

Mentioned in SAL (#wikimedia-analytics) [2023-03-28T13:37:03Z] <btullis> refreshed YARN queues with: sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -refreshQueues on both an-master100[1-2] - T330165

Icinga downtime and Alertmanager silence (ID=4c1e12e1-9d5e-4447-880a-f0ec09133a64) set by ayounsi@cumin1001 for 2:00:00 on 249 host(s) and their services with reason: eqiad row B upgrade

an-airflow1002.eqiad.wmnet,an-conf1001.eqiad.wmnet,an-coord1001.eqiad.wmnet,an-druid1004.eqiad.wmnet,an-launcher1002.eqiad.wmnet,an-master1002.eqiad.wmnet,an-presto1004.eqiad.wmnet,an-test-coord1002.eqiad.wmnet,an-test-ui1001.eqiad.wmnet,an-tool[1008-1009].eqiad.wmnet,an-web1001.eqiad.wmnet,an-worker[1083-1087,1097-1098,1117,1124-1128,1130].eqiad.wmnet,analytics[1061-1063,1072-1073].eqiad.wmnet,aqs[1011,1017].eqiad.wmnet,arclamp1001.eqiad.wmnet,backup[1003,1005].eqiad.wmnet,centrallog1002.eqiad.wmnet,cloudbackup[1001-1002]-dev.eqiad.wmnet,cloudcephmon1001.eqiad.wmnet,cloudcontrol1006.wikimedia.org,clouddb[1015-1016].eqiad.wmnet,clouddumps1001.wikimedia.org,cloudelastic[1002,1006].wikimedia.org,cloudrabbit1001.wikimedia.org,cloudservices1005.wikimedia.org,cloudvirt[1019-1020,1023-1024].eqiad.wmnet,cloudvirt-wdqs[1001-1003].eqiad.wmnet,cloudweb1003.wikimedia.org,conf1008.eqiad.wmnet,contint1002.wikimedia.org,cp[1079-1082].eqiad.wmnet,datahubsearch1002.eqiad.wmnet,db[1104,1112-1113,1118-1119,1124,1130,1132,1139,1143-1144,1152,1155,1162-1165,1178-1179,1183,1187-1188,1206].eqiad.wmnet,dbprov1002.eqiad.wmnet,dbproxy[1014-1015,1019].eqiad.wmnet,debmonitor1002.eqiad.wmnet,dns1003.wikimedia.org,dragonfly-supernode1001.eqiad.wmnet,druid[1005,1007].eqiad.wmnet,dse-k8s-ctrl1002.eqiad.wmnet,dse-k8s-etcd1002.eqiad.wmnet,dse-k8s-worker1002.eqiad.wmnet,dumpsdata1001.eqiad.wmnet,durum1002.eqiad.wmnet,elastic[1055-1056,1074-1079,1085-1086].eqiad.wmnet,es[1021,1025,1029-1030].eqiad.wmnet,failoid1002.eqiad.wmnet,ganeti[1013-1018].eqiad.wmnet,gerrit1001.wikimedia.org,gitlab1004.wikimedia.org,gitlab-runner1002.eqiad.wmnet,graphite1005.eqiad.wmnet,idp1002.wikimedia.org,kafka-jumbo1003.eqiad.wmnet,kafka-logging1001.eqiad.wmnet,kafka-main1002.eqiad.wmnet,kafka-test[1006-1010].eqiad.wmnet,kubernetes[1009-1010,1015,1019,1022].eqiad.wmnet,kubestage1003.eqiad.wmnet,kubestagemaster1001.eqiad.wmnet,kubestagetcd1005.eqiad.wmnet,kubetcd1006.eqiad.wmnet,ldap-replica1003.wikimedia.org,logstash[1011,1027,1032].eqiad.wmnet,lvs[1014,1018].eqiad.wmnet,maps[1007-1008].eqiad.wmnet,mc[1041-1044].eqiad.wmnet,mc-wf1001.eqiad.wmnet,mirror1001.wikimedia.org,ml-etcd1001.eqiad.wmnet,ml-serve1002.eqiad.wmnet,ml-serve-ctrl1001.eqiad.wmnet,ms-be[1041,1047,1052-1053,1058,1061,1065].eqiad.wmnet,ms-fe1010.eqiad.wmnet,mw[1393-1404,1423-1433,1466-1481].eqiad.wmnet,mwmaint1002.eqiad.wmnet,netflow1002.eqiad.wmnet,netmon1003.wikimedia.org,ores[1003-1004].eqiad.wmnet,otrs1001.eqiad.wmnet,parse[1007-1012,1017].eqiad.wmnet,pc1012.eqiad.wmnet,phab1004.eqiad.wmnet,prometheus1006.eqiad.wmnet,puppetdb1003.eqiad.wmnet,puppetmaster[1001,1003].eqiad.wmnet,rdb1009.eqiad.wmnet,relforge1004.eqiad.wmnet,restbase[1017,1022-1024,1029,1032].eqiad.wmnet,schema1003.eqiad.wmnet,snapshot[1008,1010,1013].eqiad.wmnet,stat[1007,1009].eqiad.wmnet,thanos-be1002.eqiad.wmnet,thanos-fe1002.eqiad.wmnet,thumbor[1001-1002].eqiad.wmnet,wcqs1002.eqiad.wmnet,wdqs[1007,1009,1012].eqiad.wmnet,zookeeper-test1002.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-03-28T13:54:49Z] <Emperor> depool ms-fe1010 before switch work T330165

Mentioned in SAL (#wikimedia-analytics) [2023-03-28T13:54:56Z] <btullis> entering safe mode for analytics-hadoop cluster: T330165

Mentioned in SAL (#wikimedia-operations) [2023-03-28T14:05:58Z] <XioNoX> reboot eqiad row B for upgrade - T330165

Change 903666 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] Revert "Depool eqiad frontends for network maintenance"

https://gerrit.wikimedia.org/r/903666

Change 903666 merged by Ssingh:

[operations/dns@master] Revert "Depool eqiad frontends for network maintenance"

https://gerrit.wikimedia.org/r/903666

akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row B switches upgrade done - T330165 started.

Mentioned in SAL (#wikimedia-operations) [2023-03-28T14:32:27Z] <akosiaris@cumin1001> START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: eqiad row B switches upgrade done - T330165

akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row B switches upgrade done - T330165 failed.

Mentioned in SAL (#wikimedia-operations) [2023-03-28T14:51:55Z] <akosiaris@cumin1001> END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in eqiad: eqiad row B switches upgrade done - T330165

The switch upgrade itself went smoothly, as it did for the other rows.

One issue was that gerrit1001 was missing from the list. This is because the host didn't have any owner at the time I collected the data. It was fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/892587

I'll make sure it doesn't happen again for the future rows.

Mentioned in SAL (#wikimedia-operations) [2023-03-28T16:00:32Z] <inflatador> bking@cumin1001 unban elastic and cloudelastic nodes post maintenance T330165

ayounsi claimed this task.

Thanks again everybody!