
eqiad row A switches upgrade
Closed, Resolved · Public

Description


For the reasons detailed in T327248: eqiad/codfw virtual-chassis upgrades, we're going to upgrade the eqiad row A switches during the scheduled DC switchover.

This is scheduled for March 7th, 14:00-16:00 UTC; please let us know if there is any issue with the scheduled time.
It means a 30-minute hard downtime for the whole row if everything goes well. It is also a good opportunity to test the hosts' depool mechanisms and the row redundancy of services.

The impacted servers and teams for this row are listed below.
The actions needed are quite free-form:

  • please write NONE if no action is needed,
  • the cookbook/command to run if it can be done by a third party,
  • who will be around to take care of the depool,
  • a link to the relevant doc,
  • etc.

The two main types of actions needed are depool and monitoring downtime.

NOTE: If the servers can handle a longer depool, it's preferred to depool them many hours or the day before (and mark None in the table) so there are fewer moving parts closer to the maintenance window.

All servers will be downtimed with sudo cookbook sre.hosts.downtime --hours 2 -r "eqiad row A upgrade" -t T329073 'P{P:netbox::host%location ~ "A.*eqiad"}', but specific services might need specific downtimes.
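For reference, a minimal sketch of the two mechanisms mentioned above (both commands appear in this task; the comments are explanatory only):

  # Fleet-wide downtime; the Cumin/Netbox query selects all row A eqiad hosts:
  sudo cookbook sre.hosts.downtime --hours 2 -r "eqiad row A upgrade" -t T329073 'P{P:netbox::host%location ~ "A.*eqiad"}'

  # Per-host depool/repool, as used in the tables below:
  sudo -i depool   # remove the host from its pooled services
  sudo -i pool     # re-add the host after the maintenance window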

ServiceOps-Collab

collaboration-services

Servers | Depool action needed | Repool action needed | Status
gitlab1003 | None | None
moscovium | None | None
planet1002 | None | None

Observability

SRE Observability

Servers | Depool action needed | Repool action needed | Status
dispatch-be1001 | n/a | n/a | not in production
grafana1002 | fail over to codfw ahead of the maintenance window | fail back to eqiad | depooled (failed over to codfw)
kafkamon1002 | set downtime | n/a | downtime scheduled
logstash[1010,1023-1024,1026,1033] | drain shards (1010, 1026, 1033), depool (1023, 1024), set downtime; see the allocation sketch below this table | allocate shards, pool | shards allocating, hosts re-pooled
prometheus1005 | depool on the host | pool on the host | depooled
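For the logstash hosts, the shard drain can be sketched with the standard Elasticsearch allocation-filtering API; the setting name is standard Elasticsearch, while the node-name globs and localhost endpoint below are illustrative assumptions:

  # Exclude the three hosts from shard allocation so shards drain off them:
  curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
    -d '{"transient": {"cluster.routing.allocation.exclude._name": "logstash1010*,logstash1026*,logstash1033*"}}'

  # After the window, clear the exclusion so shards allocate back:
  curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
    -d '{"transient": {"cluster.routing.allocation.exclude._name": null}}'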

Observability and Data Persistence

SRE Observability Data-Persistence

Servers | Depool action needed | Repool action needed | Status
thanos-fe1001 | depool and make sure another host in eqiad is pooled for the thanos-web service (see the confctl sketch below this table) | pool and put back thanos-fe1001 as the single host pooled for thanos-web | depooled
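A sketch of the thanos-fe depool with conftool (confctl is the standard WMF pooling tool; the cluster selector value below is an assumption and would need checking):

  # Depool thanos-fe1001:
  sudo confctl select 'name=thanos-fe1001.eqiad.wmnet' set/pooled=no
  # Verify at least one other eqiad host still shows pooled=yes for thanos-web:
  sudo confctl select 'dc=eqiad,cluster=thanos-fe' get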

WMCS

cloud-services-team

Servers | Depool action needed | Repool action needed | Status
clouddb[1013-1014,1021] | Likely yes? | Likely yes? | Please coordinate with @Marostegui
cloudmetrics1003 | No | No | Contact: @aborrero
cloudservices1004 | No | No | Contact: @Andrew

Machine Learning

Machine-Learning-Team

Servers | Depool action needed | Repool action needed | Status
ml-serve1001 | none | none
ores[1001-1002] | sudo -i depool | sudo -i pool | depooled
orespoolcounter1003 | none | none

Search Platform

The Search team has already done the depool/downtime/etc. for the relevant hosts, so all of these hosts are ready for the operation.

Discovery-Search

Servers | Depool action needed | Repool action needed | Status
cloudelastic[1001,1005] | no action needed
elastic[1053-1054,1068-1073,1084] | no action needed
relforge1003 | no action needed
wcqs1001 | no action needed
wdqs[1003-1004,1006,1011] | no action needed

ServiceOps

serviceops

Servers | Depool action needed | Repool action needed | Status
conf1007
kafka-main1001
kubemaster1001
kubernetes[1005,1007-1008,1017-1018]
kubestagetcd1004
kubetcd1005
mc[1037-1040]
mc-gp1001
mw[1385-1392,1414-1422,1448-1465]
mwdebug1002
parse[1001-1006]
poolcounter1004
rdb1011
registry1003

Traffic

Traffic

Servers | Depool action needed | Repool action needed | Status
cp[1075-1078] | nothing to be done, eqiad depooled
dns1001 | removed from authdns_servers | | depooled
lvs[1013,1017] | nothing to be done, eqiad depooled
ncredir1002 | nothing to be done, eqiad depooled

Data Engineering

Data-Engineering

We will likely stop a number of pipelines, e.g. Oozie and Airflow.

Servers | Depool action needed | Repool action needed | Status
an-db1001 | None | None
an-druid[1001,1003] | None | None
an-master1001 | Fail over to an-master1002 (who: DE SREs) | fail back to an-master1001 | Failed over to standby
an-presto[1002,1005] | None | None
an-test-client1001 | None | None
an-test-master1001 | Fail over to an-test-master1002 (who: DE SREs) | fail back to an-test-master1001 | Failed over to standby
an-test-worker1001 | None | None
an-tool1011 | None | None
an-worker[1078-1082,1096,1102-1103,1118-1123,1129,1139-1141] | Stop ingestion via gobblin & put HDFS into safe mode (who: DE SREs; see the safe-mode sketch below this table) | Re-enable gobblin jobs & take HDFS out of safe mode (who: DE SREs) | completed
analytics[1058-1060,1070-1071] | Stop ingestion via gobblin & put HDFS into safe mode | Re-enable gobblin jobs & take HDFS out of safe mode | completed
aqs[1010,1016] | sudo -i depool | sudo -i pool | completed
archiva1002 | None (archiva.wikimedia.org will be unavailable) | None
datahubsearch1001 | sudo -i depool | sudo -i pool | completed
dbstore1003 | None (s1, s5, s7 will be unavailable) | None
druid1004 | sudo -i depool | sudo -i pool | completed
kafka-jumbo[1001-1002] | None (to be checked) | None (to be checked)
karapace1001 | Stop DataHub ingestion (who: DE SREs) | Restart DataHub ingestion
stat[1004,1008] | Announce downtime via analytics-announce@lists and the Data-Engineering Slack channel | None
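A sketch of the HDFS safe-mode step referenced in the an-worker/analytics rows (hdfs dfsadmin is standard Hadoop; running it as the hdfs user on the active master is an assumption, and the gobblin timers were actually toggled via the puppet patches in the timeline below):

  # Block HDFS writes for the duration of the window:
  sudo -u hdfs hdfs dfsadmin -safemode enter
  sudo -u hdfs hdfs dfsadmin -safemode get    # expect "Safe mode is ON"

  # Once the switches are back, allow writes again:
  sudo -u hdfs hdfs dfsadmin -safemode leave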

Data Engineering and Machine Learning

Data-Engineering Machine-Learning-Team

Servers | Depool action needed | Repool action needed | Status
dse-k8s-ctrl1001 | none | none
dse-k8s-etcd1001 | None
dse-k8s-worker1001 | None

Infrastructure Foundations

Infrastructure-Foundations

Servers | Depool action needed | Repool action needed | Status
apt1001 | none | none | done
aux-k8s-ctrl[1001-1002] | None
aux-k8s-etcd[1001-1003] | None
aux-k8s-worker[1001-1002] | None
ganeti[1023,1025-1026,1029-1032] | none | none | done
install1003 | none | none | done
netbox1002 | None - announce unavailable | none | none
netboxdb1002 | None | none | none
pki1001 | merge and deploy 895225 | no action; let's leave it in codfw for the dc switchover
puppetmaster1004 | Disable puppet in eqiad/drmrs/esams (see the cumin sketch below this table) | Puppet re-enabled | done
urldownloader1001 | failed over | no need, can remain on 1002 | done
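A sketch of the fleet-wide Puppet disable for the puppetmaster1004 row (disable-puppet/enable-puppet are the standard wrapper scripts on WMF hosts; the cumin host expressions are illustrative assumptions):

  # Disable Puppet agents in the affected sites ahead of the window:
  sudo cumin '*.eqiad.wmnet or *.drmrs.wmnet or *.esams.wmnet' 'disable-puppet "switch maintenance T329073"'

  # Re-enable once the maintenance completes:
  sudo cumin '*.eqiad.wmnet or *.drmrs.wmnet or *.esams.wmnet' 'enable-puppet "switch maintenance T329073"'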

Infrastructure Foundations and Data Engineering

Infrastructure-Foundations Data-Engineering

Servers | Depool action needed | Repool action needed | Status
krb1001 | none | none | done

Data Persistence

Data-Persistence

Servers | Depool action needed | Repool action needed | Status
backup[1004,1008]
db[1103,1107,1111,1115-1117,1126-1129,1141-1142,1151,1154,1156-1161,1176-1177,1185-1186] | db1176 (m1 master) needs to be switched over; db1159 (m5 master) needs to be switched over (T331384) | | db1176 is no longer a master after finishing T329259; db1159 is no longer a master (T331384)
dbprov1001
dbproxy[1012-1013] | | | dbproxy1013 (m2 master) has been switched over
es[1020,1024,1026-1028] | Nothing to be done, eqiad will be depooled
moss-fe1001 | sudo depool | sudo pool | depooled
ms-backup1001
ms-be[1040,1044-1046,1051,1057,1060,1064] | no action required
ms-fe1009 | sudo depool | sudo pool | depooled
pc1011 | Nothing to be done, eqiad will be depooled
thanos-be1001 | no action required

Core Platform

Platform Engineering

Servers | Depool action needed | Repool action needed | Status
dumpsdata1004
htmldumper1001
maps[1005-1006] | None | None
restbase[1016,1019-1021,1028,1031] | sudo -i depool | sudo -i pool
sessionstore1001 | None | None
snapshot[1011-1012]
thumbor1005 | None | None


Event Timeline


Change 894056 merged by Herron:

[operations/puppet@production] grafana: serve grafana/grafana-rw from codfw

https://gerrit.wikimedia.org/r/894056

Mentioned in SAL (#wikimedia-operations) [2023-03-06T14:57:50Z] <herron> failing grafana over to codfw T329073

Mentioned in SAL (#wikimedia-operations) [2023-03-06T23:04:57Z] <inflatador> bking@cumin2002 'depool wcqs and wdqs row A hosts T329073'

Mentioned in SAL (#wikimedia-operations) [2023-03-06T23:05:23Z] <ryankemper> T329073 Pre-emptively depooled internal wdqs hosts wdqs10[03,11]

Mentioned in SAL (#wikimedia-operations) [2023-03-06T23:16:00Z] <inflatador> bking@cumin2002 ban row A cloudelastic hosts T329073

Icinga downtime and Alertmanager silence (ID=786ee8c7-4753-4e2d-96f9-8b55b691ff09) set by bking@cumin2002 for 1 day, 0:00:00 on 12 host(s) and their services with reason: switch maintenance

cloudelastic[1001,1005].wikimedia.org,elastic[1053-1054,1068-1073,1084].eqiad.wmnet,relforge1003.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=f9f1bd07-4af1-41e3-82b7-3ab0f2ff8672) set by bking@cumin2002 for 1 day, 0:00:00 on 5 host(s) and their services with reason: switch maintenance

wcqs1001.eqiad.wmnet,wdqs[1003-1004,1006,1011].eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-03-07T07:31:05Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Row A switch maintenance T329073

Mentioned in SAL (#wikimedia-operations) [2023-03-07T07:31:22Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Row A switch maintenance T329073

Mentioned in SAL (#wikimedia-operations) [2023-03-07T07:31:34Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 12:00:00 on db1115.eqiad.wmnet with reason: Row A switch maintenance T329073

Mentioned in SAL (#wikimedia-operations) [2023-03-07T07:31:48Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1115.eqiad.wmnet with reason: Row A switch maintenance T329073

Mentioned in SAL (#wikimedia-operations) [2023-03-07T07:32:37Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 12:00:00 on db[1151-1153].eqiad.wmnet with reason: Row A switch maintenance T329073

Mentioned in SAL (#wikimedia-operations) [2023-03-07T07:32:52Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db[1151-1153].eqiad.wmnet with reason: Row A switch maintenance T329073

Mentioned in SAL (#wikimedia-operations) [2023-03-07T07:33:05Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 12:00:00 on db[2142-2144].codfw.wmnet with reason: Row A switch maintenance T329073

Mentioned in SAL (#wikimedia-operations) [2023-03-07T07:33:31Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db[2142-2144].codfw.wmnet with reason: Row A switch maintenance T329073

Mentioned in SAL (#wikimedia-operations) [2023-03-07T07:34:05Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 12:00:00 on 15 hosts with reason: Row A switch maintenance T329073

Mentioned in SAL (#wikimedia-operations) [2023-03-07T07:34:12Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 15 hosts with reason: Row A switch maintenance T329073

aborrero updated the task description.
aborrero added subscribers: aborrero, Andrew.

Sent a ping to @Marostegui regarding clouddb[1013-1014,1021].

Also pinged @Andrew regarding the cloudservices host, but I think the host can be taken down at any time.

@aborrero regarding the clouddb* hosts, it is up to your team, but I think it would be nice if you could depool them. Better user experience for sure :)

Change 894537 merged by Btullis:

[operations/puppet@production] Disable all gobblin jobs to allow for HDFS maintenance

https://gerrit.wikimedia.org/r/894537

Change 895217 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] clouddb: depool clouddb[1013-1014]

https://gerrit.wikimedia.org/r/895217

@Marostegui @aborrero the patch above should depool clouddb1013 and clouddb1014.

I don't think clouddb1021 can be depooled easily, as it looks like a special host without a standby (if I understand correctly):

clouddb1021 is a special db host dedicated only to the Analytics team

Mentioned in SAL (#wikimedia-analytics) [2023-03-07T12:56:14Z] <btullis> depooled datahubsearch1001 for T329073

Change 894654 merged by Ssingh:

[operations/puppet@production] hiera: temporarily removed dns1001 from authdns_servers

https://gerrit.wikimedia.org/r/894654

Mentioned in SAL (#wikimedia-operations) [2023-03-07T12:57:05Z] <sukhe> removing dns1001 from authdns_servers for T329073

Change 895225 had a related patch set uploaded (by Jbond; author: jbond):

[operations/dns@master] pki: failover to codfw for switch reboot

https://gerrit.wikimedia.org/r/895225

Mentioned in SAL (#wikimedia-operations) [2023-03-07T13:50:35Z] <moritzm> disabling Puppet in eqiad/esams/drmrs for forthcoming Switch maintenance T329073

Change 895225 merged by Jbond:

[operations/dns@master] pki: failover to codfw for switch reboot

https://gerrit.wikimedia.org/r/895225

Mentioned in SAL (#wikimedia-operations) [2023-03-07T13:59:09Z] <jbond> failover pki.discovery.wmnet to codfw T329073

Change 895217 merged by FNegri:

[operations/puppet@production] clouddb: depool clouddb[1013-1014]

https://gerrit.wikimedia.org/r/895217

Icinga downtime and Alertmanager silence (ID=f4ffc353-a529-4620-994f-ae7b737f3c7a) set by cmooney@cumin1001 for 2:00:00 on 238 host(s) and their services with reason: eqiad row A upgrade

an-db1001.eqiad.wmnet,an-druid[1001,1003].eqiad.wmnet,an-master1001.eqiad.wmnet,an-presto[1002,1005].eqiad.wmnet,an-test-client1001.eqiad.wmnet,an-test-master1001.eqiad.wmnet,an-test-worker1001.eqiad.wmnet,an-tool1011.eqiad.wmnet,an-worker[1078-1082,1096,1102-1103,1118-1123,1129,1139-1141].eqiad.wmnet,analytics[1058-1060,1070-1071].eqiad.wmnet,apt1001.wikimedia.org,aqs[1010,1016].eqiad.wmnet,archiva1002.wikimedia.org,aux-k8s-ctrl[1001-1002].eqiad.wmnet,aux-k8s-etcd[1001-1003].eqiad.wmnet,aux-k8s-worker[1001-1002].eqiad.wmnet,backup[1004,1008].eqiad.wmnet,clouddb[1013-1014,1021].eqiad.wmnet,cloudelastic[1001,1005].wikimedia.org,cloudmetrics1003.eqiad.wmnet,cloudservices1004.wikimedia.org,conf1007.eqiad.wmnet,cp[1075-1078].eqiad.wmnet,datahubsearch1001.eqiad.wmnet,db[1103,1107,1111,1115-1117,1126-1129,1141-1142,1151,1154,1156-1161,1176-1177,1185-1186].eqiad.wmnet,dbprov1001.eqiad.wmnet,dbproxy[1012-1013].eqiad.wmnet,dbstore1003.eqiad.wmnet,dispatch-be1001.eqiad.wmnet,dns1001.wikimedia.org,druid1004.eqiad.wmnet,dse-k8s-ctrl1001.eqiad.wmnet,dse-k8s-etcd1001.eqiad.wmnet,dse-k8s-worker1001.eqiad.wmnet,dumpsdata1004.eqiad.wmnet,elastic[1053-1054,1068-1073,1084].eqiad.wmnet,es[1020,1024,1026-1028].eqiad.wmnet,ganeti[1023,1025-1026,1029-1032].eqiad.wmnet,gitlab1003.wikimedia.org,grafana1002.eqiad.wmnet,htmldumper1001.eqiad.wmnet,kafka-jumbo[1001-1002].eqiad.wmnet,kafka-main1001.eqiad.wmnet,kafkamon1002.eqiad.wmnet,karapace1001.eqiad.wmnet,krb1001.eqiad.wmnet,kubemaster1001.eqiad.wmnet,kubernetes[1005,1007-1008,1017-1018].eqiad.wmnet,kubestagetcd1004.eqiad.wmnet,kubetcd1005.eqiad.wmnet,lists1001.wikimedia.org,logstash[1010,1023-1024,1026,1033].eqiad.wmnet,lvs[1013,1017].eqiad.wmnet,maps[1005-1006].eqiad.wmnet,mc[1037-1040].eqiad.wmnet,mc-gp1001.eqiad.wmnet,ml-serve1001.eqiad.wmnet,moscovium.eqiad.wmnet,moss-fe1001.eqiad.wmnet,ms-backup1001.eqiad.wmnet,ms-be[1040,1044-1046,1051,1057,1060,1064].eqiad.wmnet,ms-fe1009.eqiad.wmnet,mw[1385-1392,1414-1422,1448-1465].eqiad.wmnet,mwdebug1002.eqiad.wmnet,ncredir1002.eqiad.wmnet,netbox1002.eqiad.wmnet,netboxdb1002.eqiad.wmnet,ores[1001-1002].eqiad.wmnet,orespoolcounter1003.eqiad.wmnet,parse[1001-1006].eqiad.wmnet,pc1011.eqiad.wmnet,people1003.eqiad.wmnet,pki1001.eqiad.wmnet,planet1002.eqiad.wmnet,poolcounter1004.eqiad.wmnet,prometheus1005.eqiad.wmnet,puppetmaster1004.eqiad.wmnet,rdb1011.eqiad.wmnet,registry1003.eqiad.wmnet,relforge1003.eqiad.wmnet,restbase[1016,1019-1021,1028,1031].eqiad.wmnet,sessionstore1001.eqiad.wmnet,snapshot[1011-1012].eqiad.wmnet,stat[1004,1008].eqiad.wmnet,thanos-be1001.eqiad.wmnet,thanos-fe1001.eqiad.wmnet,thumbor1005.eqiad.wmnet,urldownloader1001.wikimedia.org,wcqs1001.eqiad.wmnet,wdqs[1003-1004,1006,1011].eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-03-07T14:13:55Z] <akosiaris> kubectl cordon kubernetes{1005,1007,1008,1017,1018}.eqiad.wmnet T329073
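(For context: kubectl cordon marks the nodes unschedulable while leaving already-running pods in place; the inverse after the window would be, per node from the cordon list above:)

  kubectl uncordon kubernetes1005.eqiad.wmnet   # repeat for each cordoned node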

Icinga downtime and Alertmanager silence (ID=0a07bba2-0f50-4eec-9718-0c768add34f3) set by cmooney@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: eqiad row A upgrade

mr1-eqiad

Mentioned in SAL (#wikimedia-operations) [2023-03-07T14:40:59Z] <moritzm> enabling Puppet in eqiad/esams/drmrs after completed Switch maintenance T329073

Happy to say the upgrade went as expected, with no issues encountered. All devices are now back online running 21.4R3-S1.5.

The following hosts paged during this maintenance:

NodeDown wmcs cloudvirt1023:9100 (node eqiad)
NodeDown wmcs cloudvirt1024:9100 (node eqiad)
NodeDown wmcs cloudvirt1030:9100 (node eqiad)
NodeDown wmcs cloudvirt1055:9100 (node eqiad)
NodeDown wmcs cloudvirt1045:9100 (node eqiad)
NodeDown wmcs cloudvirt1029:9100 (node eqiad)
NodeDown wmcs cloudvirt1059:9100 (node eqiad)
NodeDown wmcs cloudvirt1051:9100 (node eqiad)
NodeDown wmcs cloudvirt1049:9100 (node eqiad)
NodeDown wmcs cloudvirt1017:9100 (node eqiad)
NodeDown wmcs cloudvirt1019:9100 (node eqiad)

Are those hosts coupled to this switch somehow?

Mentioned in SAL (#wikimedia-operations) [2023-03-07T14:52:30Z] <inflatador> bking@cumin2002 unban row A cloudelastic nodes T329073

Mentioned in SAL (#wikimedia-operations) [2023-03-07T14:56:39Z] <inflatador> bking@cumin2002 unban production row A elastic nodes from all clusters T329073

Change 895239 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Reenable the gobblin timers after switch maintenance

https://gerrit.wikimedia.org/r/895239

> The following hosts paged during this maintenance: cloudvirt[1017,1019,1023-1024,1029-1030,1045,1049,1051,1055,1059]. Are those hosts coupled to this switch somehow?

No, they're all in rows B/C/E and F, so they should not have been affected. Looking at Grafana, I don't see any sign of a network interruption during the window either:

https://grafana.wikimedia.org/d/Bv-Zik-Vz/cathal-cloudvirt-network-usage?orgId=1&refresh=5m

So I'd be more inclined to think this is a monitoring issue rather than an actual interruption to comms.

Change 895239 merged by Btullis:

[operations/puppet@production] Reenable the gobblin timers after switch maintenance

https://gerrit.wikimedia.org/r/895239

Mentioned in SAL (#wikimedia-operations) [2023-03-07T17:51:50Z] <inflatador> bking@cumin2002 repool wdqs hosts post-maintenance T329073