Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad
Closed, Resolved (Public)

Description

Upgrade window: 2024-06-25 15:00 UTC
Hosts on this rack (https://netbox.wikimedia.org/dcim/racks/85/):

  • es1035 - es7 master
  • ms-be1076
  • an-worker1157
  • an-worker1158
  • an-worker1159
  • elastic1105
  • kafka-main1010

Teams involved: Service Ops, Data Persistence, Search, Data Platform

Expected outage: 15-30 minutes

Please use the sheet below to detail any actions required in advance of the work:

https://docs.google.com/spreadsheets/d/1pLPpzGBmdExXxQ_0_eGXpO0VlUU5oPKZy-_KViMSwuM

Details

Other Assignee
ABran-WMF

Event Timeline

ABran-WMF updated the task description.

[swift-wise, cluster health just needs checking afterwards]

cmooney triaged this task as Medium priority.
cmooney updated the task description.
cmooney added a subscriber: MatthewVernon.

Icinga downtime and Alertmanager silence (ID=7a21c2a6-e267-4150-8111-b348788c4a9b) set by cmooney@cumin1002 for 0:50:00 on 1 host(s) and their services with reason: prep JunOS upgrade lsw1-e5-eqiad

lsw1-e5-eqiad.mgmt

Mentioned in SAL (#wikimedia-operations) [2024-06-25T14:55:58Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'T365986 - depool es1035', diff saved to https://phabricator.wikimedia.org/P65413 and previous config saved to /var/cache/conftool/dbconfig/20240625-145558-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-06-25T14:56:20Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 0:45:00 on es1035.eqiad.wmnet with reason: T365986

Mentioned in SAL (#wikimedia-operations) [2024-06-25T14:56:33Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on es1035.eqiad.wmnet with reason: T365986

Icinga downtime and Alertmanager silence (ID=01b84d43-d6d0-4f45-bc2e-375ff79e21f8) set by cmooney@cumin1002 for 0:40:00 on 4 host(s) and their services with reason: JunOS upgrade lsw1-e5-eqiad

lsw1-e5-eqiad,lsw1-e5-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=65c438b1-9725-4de3-9a45-8318edea15f1) set by cmooney@cumin1002 for 0:40:00 on 7 host(s) and their services with reason: JunOS upgrade lsw1-e5-eqiad

an-worker[1157-1159].eqiad.wmnet,elastic1105.eqiad.wmnet,es1035.eqiad.wmnet,kafka-main1010.eqiad.wmnet,ms-be1076.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-25T15:00:09Z] <topranks> rebooting lsw1-e5-eqiad to upgrade JunOS on switch T365986

Switch upgrade is complete; everything looks good on the network side.

Mentioned in SAL (#wikimedia-operations) [2024-06-25T15:18:03Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1035 (re)pooling @ 5%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65414 and previous config saved to /var/cache/conftool/dbconfig/20240625-151802-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-06-25T15:33:08Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65415 and previous config saved to /var/cache/conftool/dbconfig/20240625-153307-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-06-25T16:03:19Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1035 (re)pooling @ 50%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65417 and previous config saved to /var/cache/conftool/dbconfig/20240625-160318-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-06-25T16:18:25Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65418 and previous config saved to /var/cache/conftool/dbconfig/20240625-161824-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-06-25T16:33:30Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: post T365986 repool', diff saved to https://phabricator.wikimedia.org/P65419 and previous config saved to /var/cache/conftool/dbconfig/20240625-163330-arnaudb.json
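
The repool of es1035 above proceeded in steps (5%, 10%, 50%, 75%, 100%). As a rough aid for reviewing that ramp, the sketch below pulls the timestamp and percentage out of SAL lines like the ones logged here and reports the gap between consecutive steps. The function names, the regular expression, and the shortened sample lines are illustrative assumptions; this is not part of dbctl or any cookbook.

```python
import re
from datetime import datetime

# Hypothetical pattern: matches the UTC timestamp and the "(re)pooling @ N%"
# fragment as they appear in the SAL entries on this task.
SAL_PATTERN = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\].*?"
    r"\(re\)pooling @ (?P<pct>\d+)%"
)


def parse_repool_steps(lines):
    """Return (timestamp, percentage) tuples for every repool line found."""
    steps = []
    for line in lines:
        match = SAL_PATTERN.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
            steps.append((ts, int(match.group("pct"))))
    return steps


def step_gaps_seconds(steps):
    """Return the elapsed seconds between each pair of consecutive steps."""
    return [
        int((later[0] - earlier[0]).total_seconds())
        for earlier, later in zip(steps, steps[1:])
    ]


# Sample input: the repool entries from this task, trimmed after the message.
sample_lines = [
    "[2024-06-25T15:18:03Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1035 (re)pooling @ 5%: post T365986 repool'",
    "[2024-06-25T15:33:08Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: post T365986 repool'",
    "[2024-06-25T16:03:19Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1035 (re)pooling @ 50%: post T365986 repool'",
    "[2024-06-25T16:18:25Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: post T365986 repool'",
    "[2024-06-25T16:33:30Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: post T365986 repool'",
]

steps = parse_repool_steps(sample_lines)
gaps = step_gaps_seconds(steps)

for (ts, pct), gap in zip(steps, [None] + gaps):
    suffix = "" if gap is None else f" (+{gap}s since previous step)"
    print(f"{ts.isoformat()}Z  {pct:>3}%{suffix}")
```

Run against the entries on this task, the gaps come out at roughly 15 minutes each, apart from a roughly 30-minute gap between the 10% and 50% steps.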