Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad
Closed, Resolved | Public

Description

2024-06-11, 15:00 UTC

Hosts on this rack: https://netbox.wikimedia.org/dcim/racks/93/

  • an-worker1166
  • an-worker1167
  • an-worker1168
  • elastic1107
  • es1038
  • ms-be1079

Teams involved: Data Platform, Data Persistence, Search

Expected outage: 15-30 minutes

Please use the sheet below to detail any actions required in advance of the work:

https://docs.google.com/spreadsheets/d/1pLPpzGBmdExXxQ_0_eGXpO0VlUU5oPKZy-_KViMSwuM

Details

Other Assignee
ABran-WMF

Event Timeline

From a Swift POV, this should just be "check cluster is happy afterwards".

cmooney triaged this task as Medium priority.
cmooney added a subscriber: MatthewVernon.

Mentioned in SAL (#wikimedia-operations) [2024-06-10T14:43:13Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic1107.eqiad.wmnet with reason: T365982

Mentioned in SAL (#wikimedia-operations) [2024-06-10T14:43:28Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic1107.eqiad.wmnet with reason: T365982

Icinga downtime and Alertmanager silence (ID=adbdaf29-9da2-42ea-b64e-fc6d141eaf9e) set by cmooney@cumin1002 for 1:20:00 on 1 host(s) and their services with reason: prep upgrade of device

lsw1-f5-eqiad.mgmt

Mentioned in SAL (#wikimedia-operations) [2024-06-11T14:30:45Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on es1038.eqiad.wmnet with reason: T365982

Mentioned in SAL (#wikimedia-operations) [2024-06-11T14:30:58Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1038.eqiad.wmnet with reason: T365982

Mentioned in SAL (#wikimedia-operations) [2024-06-11T14:46:25Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1038 depool T365982', diff saved to https://phabricator.wikimedia.org/P64631 and previous config saved to /var/cache/conftool/dbconfig/20240611-144624-arnaudb.json

Icinga downtime and Alertmanager silence (ID=22e81c7a-3dde-4cd2-9376-bd003c744dc6) set by cmooney@cumin1002 for 0:40:00 on 4 host(s) and their services with reason: prep upgrade of device

lsw1-f5-eqiad,lsw1-f5-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=d67744a2-77a0-40dc-aff6-4af804b0b5ce) set by cmooney@cumin1002 for 0:35:00 on 6 host(s) and their services with reason: upgrade lsw1-f5-eqiad

an-worker[1166-1168].eqiad.wmnet,elastic1107.eqiad.wmnet,es1038.eqiad.wmnet,ms-be1079.eqiad.wmnet

The switch has reloaded on the new version and initial checks look OK.