Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad
Closed, Resolved (Public)

Description

2024-06-27, 15:00 UTC
Hosts on this rack: https://netbox.wikimedia.org/dcim/racks/87/

  • an-worker1163
  • an-worker1164
  • an-worker1165
  • es1037 - es6
  • ms-be1078

Teams involved: Data Platform, Data Persistence

Expected outage: 15-30 minutes

Please use the sheet below to detail any actions required in advance of the work:

https://docs.google.com/spreadsheets/d/1pLPpzGBmdExXxQ_0_eGXpO0VlUU5oPKZy-_KViMSwuM

Details

Other Assignee
ABran-WMF

Event Timeline

ABran-WMF renamed this task from Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad to Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad. May 27 2024, 11:59 AM
ABran-WMF created this task.
ABran-WMF updated the task description.

From the swift POV, this is just checking the cluster is happy afterwards. I'm on annual leave when this is happening, though, so we'll need to find someone to cover once we know what time the work is planned for.

cmooney triaged this task as Medium priority.
cmooney updated the task description.
cmooney updated the task description.
cmooney added a subscriber: MatthewVernon.

Thanks for the input, Matthew. I've provisionally scheduled it for 15:00 UTC. Do you have any suggestions for someone else on your team who might be able to check the status of the swift cluster afterwards? If nobody is available we can also reschedule. Cheers.

@Eevans are you OK to do this, please? Should just be a case of checking swift-dispersion-report and swift-recon -r both look good after the work is complete...

Sure.
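
The Swift check requested above could be scripted roughly as follows. This is only a sketch: it runs the two commands named in the request (`swift-dispersion-report` and `swift-recon -r`) and prints their output for a human to read. The helper name `run_swift_checks` and the injectable `runner` argument are hypothetical, and a zero exit code is not assumed to mean the cluster "looks good"; the output still needs to be reviewed.

```python
import subprocess
from typing import Callable, Dict, Optional, Tuple

# The two commands named in the request; no additional flags are assumed.
SWIFT_CHECKS: Tuple[Tuple[str, ...], ...] = (
    ("swift-dispersion-report",),
    ("swift-recon", "-r"),
)


def run_swift_checks(
    checks: Tuple[Tuple[str, ...], ...] = SWIFT_CHECKS,
    runner: Callable = subprocess.run,
) -> Dict[str, Optional[int]]:
    """Run each check, print its output, and return exit codes by command.

    A value of None means the command was not found on this host.
    """
    results: Dict[str, Optional[int]] = {}
    for cmd in checks:
        label = " ".join(cmd)
        try:
            proc = runner(list(cmd), capture_output=True, text=True)
        except FileNotFoundError:
            # Command is not installed here; record that and move on.
            results[label] = None
            continue
        # Show the raw output so an operator can judge cluster health.
        print(f"$ {label}\n{proc.stdout}")
        results[label] = proc.returncode
    return results
```

Calling `run_swift_checks()` on a host with the Swift tooling installed would print both reports in turn; the returned mapping is only a convenience for spotting commands that failed to run.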

Mentioned in SAL (#wikimedia-operations) [2024-06-27T14:37:42Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'T365988 - depool es1037', diff saved to https://phabricator.wikimedia.org/P65531 and previous config saved to /var/cache/conftool/dbconfig/20240627-143741-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-06-27T14:38:02Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on es1037.eqiad.wmnet with reason: T365988

Mentioned in SAL (#wikimedia-operations) [2024-06-27T14:38:15Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1037.eqiad.wmnet with reason: T365988

Icinga downtime and Alertmanager silence (ID=66810f76-0e2d-43f3-8c96-bbfe4e6a7aee) set by cmooney@cumin1002 for 0:50:00 on 1 host(s) and their services with reason: prep JunOS upgrade lsw1-e7-eqiad

lsw1-e7-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=2863d158-d71c-4317-a811-4dd3cb8e6e72) set by cmooney@cumin1002 for 0:40:00 on 4 host(s) and their services with reason: JunOS upgrade lsw1-e7-eqiad

lsw1-e7-eqiad,lsw1-e7-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=bd008f08-7b85-4b69-ba4e-5d84a9307d79) set by cmooney@cumin1002 for 0:40:00 on 5 host(s) and their services with reason: JunOS upgrade lsw1-e7-eqiad

an-worker[1163-1165].eqiad.wmnet,es1037.eqiad.wmnet,ms-be1078.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-27T15:00:31Z] <topranks> rebooting lsw1-e7-eqiad to upgrade JunOS on switch T365988

Upgrade completed, all looking good network-wise.

Mentioned in SAL (#wikimedia-operations) [2024-06-27T15:21:08Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1037 (re)pooling @ 5%: post T365988 repool', diff saved to https://phabricator.wikimedia.org/P65532 and previous config saved to /var/cache/conftool/dbconfig/20240627-152107-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-06-27T15:36:13Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: post T365988 repool', diff saved to https://phabricator.wikimedia.org/P65533 and previous config saved to /var/cache/conftool/dbconfig/20240627-153613-arnaudb.json

Thanks @cmooney. For posterity's sake: all looks good with Swift.

Mentioned in SAL (#wikimedia-operations) [2024-06-27T15:51:19Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1037 (re)pooling @ 25%: post T365988 repool', diff saved to https://phabricator.wikimedia.org/P65534 and previous config saved to /var/cache/conftool/dbconfig/20240627-155118-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-06-27T16:06:24Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1037 (re)pooling @ 50%: post T365988 repool', diff saved to https://phabricator.wikimedia.org/P65535 and previous config saved to /var/cache/conftool/dbconfig/20240627-160624-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-06-27T16:21:30Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: post T365988 repool', diff saved to https://phabricator.wikimedia.org/P65536 and previous config saved to /var/cache/conftool/dbconfig/20240627-162129-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-06-27T16:36:35Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: post T365988 repool', diff saved to https://phabricator.wikimedia.org/P65537 and previous config saved to /var/cache/conftool/dbconfig/20240627-163635-arnaudb.json
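
The repool entries above step es1037 through 5%, 10%, 25%, 50%, 75% and 100%, roughly 15 minutes apart. A minimal sketch of that schedule is below. The function name `repool_schedule` is hypothetical, the 15-minute interval is inferred from the SAL timestamps rather than taken from the tooling itself, and the code only computes times; it does not call dbctl.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Sequence, Tuple


def repool_schedule(
    start: datetime,
    percentages: Sequence[int] = (5, 10, 25, 50, 75, 100),
    interval: timedelta = timedelta(minutes=15),
) -> List[Tuple[datetime, int]]:
    """Return (time, pooled percentage) pairs for a staged repool."""
    return [(start + i * interval, pct) for i, pct in enumerate(percentages)]


# First repool step in the log, truncated to the minute (assumption).
start = datetime(2024, 6, 27, 15, 21, tzinfo=timezone.utc)
for when, pct in repool_schedule(start):
    print(f"{when:%H:%M} UTC -> {pct}%")
```

With these inputs the final step lands at 16:36 UTC, which matches the minute of the 100% entry in the log.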

Thanks all for the help with this one!