Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f2-eqiad
Closed, Resolved · Public

Description

Scheduled: 2024-07-16, 15:00 UTC
Hosts on this rack (https://netbox.wikimedia.org/dcim/racks/90/):

  • db1194 - s7
  • db1200 - s5
  • db1201 - s6
  • dbstore1009
  • ms-be1071
  • an-presto1013
  • an-presto1014
  • an-worker1145
  • cephosd1005
  • druid1011
  • kafka-jumbo1014
  • elastic1098
  • elastic1099
  • wdqs1019
  • wdqs1021
  • aqs1021
  • ml-serve1007
  • kafka-logging1005
  • kubernetes1062
  • mw1494
  • mw1495

Teams Involved: Data Persistence, Data Platform, Search, Machine Learning, Observability, Service Ops

Expected outage: 15-30 minutes

Please use the sheet below to detail any actions required in advance of the work:

https://docs.google.com/spreadsheets/d/1pLPpzGBmdExXxQ_0_eGXpO0VlUU5oPKZy-_KViMSwuM

Details

Other Assignee
ABran-WMF

Event Timeline

Swift-wise, we just need to check that the cluster is OK afterwards.

cmooney triaged this task as Medium priority.
cmooney updated Other Assignee, added: ABran-WMF.
cmooney updated the task description.

Mentioned in SAL (#wikimedia-operations) [2024-07-16T14:33:07Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'T365997 - depool db1194-s7,db1200-s5,db1201-s6', diff saved to https://phabricator.wikimedia.org/P66634 and previous config saved to /var/cache/conftool/dbconfig/20240716-143306-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T14:33:21Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db[1194,1200-1201].eqiad.wmnet,dbstore1009.eqiad.wmnet with reason: T365997

Mentioned in SAL (#wikimedia-operations) [2024-07-16T14:33:24Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[1194,1200-1201].eqiad.wmnet,dbstore1009.eqiad.wmnet with reason: T365997

Mentioned in SAL (#wikimedia-operations) [2024-07-16T14:34:07Z] <claime> Cordoning kubernetes1062.eqiad.wmnet mw1494.eqiad.wmnet mw1495.eqiad.wmnet - T365997

Icinga downtime and Alertmanager silence (ID=36afd2cf-508d-4c02-a8cc-afb66ea29242) set by cmooney@cumin1002 for 0:50:00 on 1 host(s) and their services with reason: prep JunOS upgrade lsw1-f2-eqiad

lsw1-f2-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=81c0aaa1-44d2-4d05-942a-66bcdfb90d2d) set by cmooney@cumin1002 for 0:30:00 on 4 host(s) and their services with reason: JunOS upgrade lsw1-f2-eqiad

lsw1-f2-eqiad,lsw1-f2-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=58bc700a-b84d-4058-9776-9f6510239089) set by cmooney@cumin1002 for 0:30:00 on 21 host(s) and their services with reason: JunOS upgrade lsw1-f2-eqiad

an-presto[1013-1014].eqiad.wmnet,an-worker1145.eqiad.wmnet,aqs1021.eqiad.wmnet,cephosd1005.eqiad.wmnet,db[1194,1200-1201].eqiad.wmnet,dbstore1009.eqiad.wmnet,druid1011.eqiad.wmnet,elastic[1098-1099].eqiad.wmnet,kafka-jumbo1014.eqiad.wmnet,kafka-logging1005.eqiad.wmnet,kubernetes1062.eqiad.wmnet,ml-serve1007.eqiad.wmnet,ms-be1071.eqiad.wmnet,mw[1494-1495].eqiad.wmnet,wdqs[1019,1021].eqiad.wmnet
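
The downtime entries above use a compact bracket notation for host groups (for example `db[1194,1200-1201].eqiad.wmnet`). As a rough illustration of how such an expression maps to individual hosts, here is a minimal Python sketch. The `expand_hosts` helper is hypothetical and only handles the simple form seen in this task (one bracket group per name, comma-separated numbers and ranges); it is not the parser the tooling itself uses.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand e.g. 'db[1194,1200-1201].eqiad.wmnet' into individual hostnames.

    Hypothetical helper: supports at most one [..] group per hostname.
    """
    hosts = []
    # Split on commas that are not inside square brackets.
    for part in re.split(r",(?![^\[]*\])", expr):
        m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", part)
        if m is None:
            # Plain hostname without a bracket group.
            hosts.append(part)
            continue
        prefix, body, suffix = m.groups()
        for item in body.split(","):
            lo, _, hi = item.partition("-")
            # Keep the zero-padding width of the lower bound.
            width = len(lo)
            for n in range(int(lo), int(hi or lo) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    for host in expand_hosts("db[1194,1200-1201].eqiad.wmnet,dbstore1009.eqiad.wmnet"):
        print(host)
```

Applied to the full host expression in the entry above, the sketch yields 21 names, which matches the "21 host(s)" reported by the downtime message.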

Mentioned in SAL (#wikimedia-operations) [2024-07-16T15:08:32Z] <topranks> Rebooting lsw1-f2-eqiad to complete JunOS upgrade T365997

Upgrade completed, all hosts back online and pinging ok. Thanks all for the assistance!

Mentioned in SAL (#wikimedia-operations) [2024-07-16T15:27:12Z] <claime> Uncordoning kubernetes1062.eqiad.wmnet mw1494.eqiad.wmnet mw1495.eqiad.wmnet - T365997

Mentioned in SAL (#wikimedia-operations) [2024-07-16T15:28:56Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1194 (re)pooling @ 5%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66643 and previous config saved to /var/cache/conftool/dbconfig/20240716-152855-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T15:29:10Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1200 (re)pooling @ 5%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66644 and previous config saved to /var/cache/conftool/dbconfig/20240716-152910-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T15:29:19Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1201 (re)pooling @ 5%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66645 and previous config saved to /var/cache/conftool/dbconfig/20240716-152918-arnaudb.json

dbstore1009 has replication up to date on all 3 instances

all 3 other nodes are repooling ↑

Mentioned in SAL (#wikimedia-operations) [2024-07-16T15:44:01Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1194 (re)pooling @ 10%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66647 and previous config saved to /var/cache/conftool/dbconfig/20240716-154401-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T15:44:16Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1200 (re)pooling @ 10%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66648 and previous config saved to /var/cache/conftool/dbconfig/20240716-154415-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T15:44:24Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1201 (re)pooling @ 10%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66649 and previous config saved to /var/cache/conftool/dbconfig/20240716-154424-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T15:59:06Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1194 (re)pooling @ 25%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66652 and previous config saved to /var/cache/conftool/dbconfig/20240716-155905-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T15:59:21Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1200 (re)pooling @ 25%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66653 and previous config saved to /var/cache/conftool/dbconfig/20240716-155920-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T15:59:30Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1201 (re)pooling @ 25%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66654 and previous config saved to /var/cache/conftool/dbconfig/20240716-155930-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T16:14:12Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1194 (re)pooling @ 50%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66656 and previous config saved to /var/cache/conftool/dbconfig/20240716-161411-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T16:14:27Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1200 (re)pooling @ 50%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66657 and previous config saved to /var/cache/conftool/dbconfig/20240716-161426-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T16:14:35Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1201 (re)pooling @ 50%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66658 and previous config saved to /var/cache/conftool/dbconfig/20240716-161435-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T16:29:17Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1194 (re)pooling @ 75%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66660 and previous config saved to /var/cache/conftool/dbconfig/20240716-162916-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T16:29:32Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1200 (re)pooling @ 75%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66661 and previous config saved to /var/cache/conftool/dbconfig/20240716-162931-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T16:29:41Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1201 (re)pooling @ 75%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66662 and previous config saved to /var/cache/conftool/dbconfig/20240716-162940-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T16:44:22Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1194 (re)pooling @ 100%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66664 and previous config saved to /var/cache/conftool/dbconfig/20240716-164422-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T16:44:37Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1200 (re)pooling @ 100%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66665 and previous config saved to /var/cache/conftool/dbconfig/20240716-164437-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-16T16:44:47Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1201 (re)pooling @ 100%: post T365997 repool', diff saved to https://phabricator.wikimedia.org/P66666 and previous config saved to /var/cache/conftool/dbconfig/20240716-164446-arnaudb.json
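
The repool entries above step each database through 5%, 10%, 25%, 50%, 75% and 100%, roughly 15 minutes apart. The staged schedule can be sketched in Python as below. This is an illustration only: `repool_plan` and `STEPS` are hypothetical names, the 15-minute interval is inferred from the SAL timestamps rather than taken from the actual tooling, and the sketch does not call dbctl.

```python
from datetime import datetime, timedelta, timezone

# Percentages observed in the SAL entries for this task.
STEPS = (5, 10, 25, 50, 75, 100)


def repool_plan(start, interval=timedelta(minutes=15), steps=STEPS):
    """Return (time, percentage) pairs for a staged repool.

    The interval is an assumption inferred from the log timestamps.
    """
    return [(start + i * interval, pct) for i, pct in enumerate(steps)]


if __name__ == "__main__":
    # First repool step for db1194, as logged in SAL.
    start = datetime(2024, 7, 16, 15, 28, 56, tzinfo=timezone.utc)
    for when, pct in repool_plan(start):
        print(f"{when:%H:%M:%S}Z  pool at {pct}%")
```

With an exact 15-minute interval the plan spans 75 minutes from the first step to 100%, in line with the 15:28 to 16:44 window recorded above (the logged steps drift by a few seconds each).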