
Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad
Closed, Resolved, Public

Description

Scheduled: 2024-06-18 15:00 UTC
Hosts on this rack: https://netbox.wikimedia.org/dcim/racks/95/

  • an-worker1172
  • an-worker1173
  • an-worker1174
  • es1040 - es7 replica
  • ms-be1081

Teams involved: Data Platform, Data Persistence

Expected outage: 15-30 minutes

Please use the sheet below to detail any actions that are required in advance of the work:

https://docs.google.com/spreadsheets/d/1pLPpzGBmdExXxQ_0_eGXpO0VlUU5oPKZy-_KViMSwuM

Details

Other Assignee
ABran-WMF

Event Timeline

ABran-WMF updated the task description.
ABran-WMF renamed this task from "T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad" to "Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad". May 27 2024, 1:16 PM

[Swift-wise, we just need to check that the cluster is OK afterwards.]

cmooney updated the task description.
cmooney added a subscriber: MatthewVernon.
cmooney triaged this task as Medium priority. Mon, Jun 10, 2:25 PM

Mentioned in SAL (#wikimedia-operations) [2024-06-18T14:39:12Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on es1040.eqiad.wmnet with reason: T365984

Mentioned in SAL (#wikimedia-operations) [2024-06-18T14:39:26Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: T365984

Mentioned in SAL (#wikimedia-operations) [2024-06-18T14:39:53Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1040 depool - T365984', diff saved to https://phabricator.wikimedia.org/P65156 and previous config saved to /var/cache/conftool/dbconfig/20240618-143951-arnaudb.json

Icinga downtime and Alertmanager silence (ID=0039bfdd-84ad-4638-9b4c-c0c23984e401) set by cmooney@cumin1002 for 1:40:00 on 1 host(s) and their services with reason: prep JunOS upgrade lsw1-f7-eqiad

lsw1-f7-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=b16e0477-5d40-4e59-950e-09e82271c822) set by cmooney@cumin1002 for 0:40:00 on 4 host(s) and their services with reason: JunOS upgrade lsw1-f7-eqiad

lsw1-f7-eqiad,lsw1-f7-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=80e189d2-8757-4138-ad14-1e0cf5cfbbdb) set by cmooney@cumin1002 for 0:35:00 on 5 host(s) and their services with reason: JunOS upgrade lsw1-f7-eqiad

an-worker[1172-1174].eqiad.wmnet,es1040.eqiad.wmnet,ms-be1081.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-18T15:00:21Z] <topranks> rebooting lsw1-f7-eqiad to upgrade JunOS on switch T365984

The switch is back online after the upgrade; everything looks good at first glance.