Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad
Closed, Resolved (Public)

Description

2024-07-03, 14:00 UTC
Hosts on this rack: https://netbox.wikimedia.org/dcim/racks/82/

  • db1191 - s7
  • db1196 - s1 sanitarium master
  • db1197 - s2
  • dbstore1008
  • ms-be1069
  • an-presto1008
  • an-presto1009
  • an-worker1143
  • cephosd1002
  • druid1009
  • kafka-jumbo1011
  • elastic1091
  • elastic1092
  • wdqs1020
  • wdqs1018
  • aqs1020
  • ml-serve1005
  • kafka-logging1004
  • kubernetes1060
  • wikikube-worker1007
  • wikikube-worker1021
  • lvs1016

Teams Involved: Data Persistence, Data Platform, Search, Machine Learning, Observability, Service Ops, Traffic

Expected outage: 15-30 minutes

Please use the sheet below to detail any actions required in advance of the work:

https://docs.google.com/spreadsheets/d/1pLPpzGBmdExXxQ_0_eGXpO0VlUU5oPKZy-_KViMSwuM

Details

Other Assignee: MatthewVernon

Event Timeline

ABran-WMF updated Other Assignee, added: MatthewVernon.
MatthewVernon subscribed.

Swift-wise, this should just be a case of checking the cluster is healthy afterwards. I'm on annual leave when this is going on, though, so we'll have to find someone to do so once we know when the work is planned to take place.

cmooney triaged this task as Medium priority.
cmooney updated the task description.

@Eevans you OK to handle this, please? Should just be a quick cluster health check afterwards.

Sure.

Folks, just FYI: I've pushed the time here back an hour, if that's OK. That seems to suit most people best.

Mentioned in SAL (#wikimedia-operations) [2024-07-03T13:17:16Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'T365994 - depool db1191,db1196,db1197', diff saved to https://phabricator.wikimedia.org/P65721 and previous config saved to /var/cache/conftool/dbconfig/20240703-131715-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T13:18:10Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db[1191,1196-1197].eqiad.wmnet with reason: T365994

Mentioned in SAL (#wikimedia-operations) [2024-07-03T13:18:25Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[1191,1196-1197].eqiad.wmnet with reason: T365994

Mentioned in SAL (#wikimedia-operations) [2024-07-03T13:44:04Z] <jayme> draining wikikube-worker1007.eqiad.wmnet wikikube-worker1021.eqiad.wmnet kubernetes1060.eqiad.wmnet for T365994

Icinga downtime and Alertmanager silence (ID=c8dbb89d-640c-4078-bc10-bbbe9c30f3ef) set by cmooney@cumin1002 for 0:50:00 on 1 host(s) and their services with reason: prep JunOS upgrade lsw1-e2-eqiad

lsw1-e2-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=753739a5-e1fb-44b6-9174-f7b3a8c4b73b) set by jayme@cumin1002 for 1:20:00 on 3 host(s) and their services with reason: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2

kubernetes1060.eqiad.wmnet,wikikube-worker[1007,1021].eqiad.wmnet

!log jayme@cumin1002 conftool action : set/pooled=no; selector: name=(wikikube-worker1007.eqiad.wmnet|wikikube-worker1021.eqiad.wmnet|kubernetes1060.eqiad.wmnet)

Icinga downtime and Alertmanager silence (ID=185956f6-b0e6-4a89-9e32-6a8223f5678e) set by cmooney@cumin1002 for 0:40:00 on 4 host(s) and their services with reason: JunOS upgrade lsw1-e2-eqiad

lsw1-e2-eqiad,lsw1-e2-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=11036a9f-0b48-4b07-9e63-571b4f67c201) set by cmooney@cumin1002 for 0:40:00 on 22 host(s) and their services with reason: JunOS upgrade lsw1-e2-eqiad

an-presto[1008-1009].eqiad.wmnet,an-worker1143.eqiad.wmnet,aqs1020.eqiad.wmnet,cephosd1002.eqiad.wmnet,db[1191,1196-1197].eqiad.wmnet,dbstore1008.eqiad.wmnet,druid1009.eqiad.wmnet,elastic[1091-1092].eqiad.wmnet,kafka-jumbo1011.eqiad.wmnet,kafka-logging1004.eqiad.wmnet,kubernetes1060.eqiad.wmnet,lvs1016.eqiad.wmnet,ml-serve1005.eqiad.wmnet,ms-be1069.eqiad.wmnet,wdqs[1018,1020].eqiad.wmnet,wikikube-worker[1007,1021].eqiad.wmnet
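For anyone cross-checking the downtime entries against the rack list in the description, the bracketed host notation above can be expanded with a few lines of Python. This is only a hand-rolled sketch: `expand_hosts` is a hypothetical helper written for this note, not part of the downtime tooling, and it assumes the only syntax in play is comma-separated items and `low-high` numeric ranges inside a single pair of brackets per host expression.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand e.g. 'db[1191,1196-1197].eqiad.wmnet' into individual host names.

    Hypothetical helper; assumes at most one [...] group per host expression.
    """
    hosts = []
    # Split on commas that are not inside a [...] group.
    for part in re.split(r",(?![^\[]*\])", expr):
        m = re.fullmatch(r"([^\[\]]*)\[([^\]]+)\]([^\[\]]*)", part)
        if not m:
            # Plain host name with no bracket group.
            hosts.append(part)
            continue
        prefix, body, suffix = m.groups()
        for item in body.split(","):
            low, _, high = item.partition("-")
            high = high or low  # single number, not a range
            width = len(low)    # keep any zero padding
            for n in range(int(low), int(high) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


downtimed = expand_hosts(
    "an-presto[1008-1009].eqiad.wmnet,an-worker1143.eqiad.wmnet,"
    "aqs1020.eqiad.wmnet,cephosd1002.eqiad.wmnet,"
    "db[1191,1196-1197].eqiad.wmnet,dbstore1008.eqiad.wmnet,"
    "druid1009.eqiad.wmnet,elastic[1091-1092].eqiad.wmnet,"
    "kafka-jumbo1011.eqiad.wmnet,kafka-logging1004.eqiad.wmnet,"
    "kubernetes1060.eqiad.wmnet,lvs1016.eqiad.wmnet,"
    "ml-serve1005.eqiad.wmnet,ms-be1069.eqiad.wmnet,"
    "wdqs[1018,1020].eqiad.wmnet,wikikube-worker[1007,1021].eqiad.wmnet"
)
```

Expanding the expression from the entry above gives 22 names, which matches the "22 host(s)" count reported by the downtime.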

Mentioned in SAL (#wikimedia-operations) [2024-07-03T14:04:32Z] <topranks> rebooting lsw1-e2-eqiad to install updated JunOS version T365994

Mentioned in SAL (#wikimedia-operations) [2024-07-03T14:16:50Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 0:45:00 on db1154.eqiad.wmnet with reason: T365994

Mentioned in SAL (#wikimedia-operations) [2024-07-03T14:17:02Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on db1154.eqiad.wmnet with reason: T365994

Mentioned in SAL (#wikimedia-operations) [2024-07-03T14:17:42Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 0:45:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017,1021].eqiad.wmnet with reason: T365994

Mentioned in SAL (#wikimedia-operations) [2024-07-03T14:17:58Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017,1021].eqiad.wmnet with reason: T365994

Switch is back up, all looks good at first glance from the network side.

Mentioned in SAL (#wikimedia-operations) [2024-07-03T14:25:41Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1197 (re)pooling @ 5%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65723 and previous config saved to /var/cache/conftool/dbconfig/20240703-142541-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T14:25:54Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1196 (re)pooling @ 5%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65724 and previous config saved to /var/cache/conftool/dbconfig/20240703-142553-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T14:26:14Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1191 (re)pooling @ 5%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65725 and previous config saved to /var/cache/conftool/dbconfig/20240703-142614-arnaudb.json

!log jayme@cumin1002 conftool action : set/pooled=yes; selector: name=(wikikube-worker1007.eqiad.wmnet|wikikube-worker1021.eqiad.wmnet|kubernetes1060.eqiad.wmnet)

repooled, uncordoned, downtime removed

Mentioned in SAL (#wikimedia-operations) [2024-07-03T14:40:47Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1197 (re)pooling @ 10%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65727 and previous config saved to /var/cache/conftool/dbconfig/20240703-144046-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T14:40:59Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1196 (re)pooling @ 10%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65728 and previous config saved to /var/cache/conftool/dbconfig/20240703-144059-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T14:41:20Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1191 (re)pooling @ 10%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65729 and previous config saved to /var/cache/conftool/dbconfig/20240703-144119-arnaudb.json

Swift checks out OK ✅

Mentioned in SAL (#wikimedia-operations) [2024-07-03T14:55:52Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1197 (re)pooling @ 25%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65731 and previous config saved to /var/cache/conftool/dbconfig/20240703-145552-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T14:56:05Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65732 and previous config saved to /var/cache/conftool/dbconfig/20240703-145604-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T14:56:26Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1191 (re)pooling @ 25%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65733 and previous config saved to /var/cache/conftool/dbconfig/20240703-145625-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T15:10:58Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1197 (re)pooling @ 50%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65737 and previous config saved to /var/cache/conftool/dbconfig/20240703-151057-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T15:11:11Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65738 and previous config saved to /var/cache/conftool/dbconfig/20240703-151110-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T15:11:31Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1191 (re)pooling @ 50%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65739 and previous config saved to /var/cache/conftool/dbconfig/20240703-151131-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T15:26:04Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1197 (re)pooling @ 75%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65741 and previous config saved to /var/cache/conftool/dbconfig/20240703-152603-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T15:26:16Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65742 and previous config saved to /var/cache/conftool/dbconfig/20240703-152616-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T15:26:37Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1191 (re)pooling @ 75%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65743 and previous config saved to /var/cache/conftool/dbconfig/20240703-152636-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T15:41:09Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1197 (re)pooling @ 100%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65746 and previous config saved to /var/cache/conftool/dbconfig/20240703-154109-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T15:41:22Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65747 and previous config saved to /var/cache/conftool/dbconfig/20240703-154121-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T15:41:42Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1191 (re)pooling @ 100%: post T365994 repool', diff saved to https://phabricator.wikimedia.org/P65748 and previous config saved to /var/cache/conftool/dbconfig/20240703-154142-arnaudb.json
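The repool entries above step each database through 5%, 10%, 25%, 50%, 75% and 100%, roughly 15 minutes apart. To pull that timeline out of the SAL lines for review, a small parser along these lines could work. It is a sketch only: `repool_steps` and the regular expression are assumptions based on the message format seen in this task, not an existing tool, and the two sample lines are abbreviated copies of entries above.

```python
import re
from datetime import datetime

# Matches the timestamp, host and percentage in a "(re)pooling" SAL entry.
SAL_REPOOL = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\].*?"
    r"'(?P<host>db\d+) \(re\)pooling @ (?P<pct>\d+)%"
)


def repool_steps(lines):
    """Return {host: [(timestamp, percentage), ...]} in the order seen."""
    steps = {}
    for line in lines:
        m = SAL_REPOOL.search(line)
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S")
            steps.setdefault(m["host"], []).append((ts, int(m["pct"])))
    return steps


sample = [
    "Mentioned in SAL (#wikimedia-operations) [2024-07-03T14:25:41Z] "
    "<arnaudb@cumin1002> dbctl commit (dc=all): "
    "'db1197 (re)pooling @ 5%: post T365994 repool'",
    "Mentioned in SAL (#wikimedia-operations) [2024-07-03T14:40:47Z] "
    "<arnaudb@cumin1002> dbctl commit (dc=all): "
    "'db1197 (re)pooling @ 10%: post T365994 repool'",
]

steps = repool_steps(sample)
(first_ts, first_pct), (second_ts, second_pct) = steps["db1197"]
gap_seconds = (second_ts - first_ts).total_seconds()
```

Lines that are not repool entries (downtimes, the depool commit, comments) are simply skipped, so the whole timeline can be fed in unfiltered.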