
Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad
Closed, Resolved · Public

Description

2024-07-11, 14:00 UTC
on this rack: https://netbox.wikimedia.org/dcim/racks/89/

  • backup1011
  • db1193 - s8
  • dbproxy1027
  • ms-be1070
  • ms-fe1014
  • thanos-fe1004
  • an-coord1004
  • an-mariadb1002
  • an-presto1011
  • an-presto1012
  • an-worker1148
  • an-worker1155
  • an-worker1144
  • cephosd1004
  • dumpsdata1007
  • kafka-jumbo1013
  • elastic1096
  • elastic1097
  • elastic1106
  • dse-k8s-worker1007
  • ml-cache1003
  • logstash1037
  • titan1001

Teams Involved: Data Persistence, Data Platform, Search, Machine Learning, Observability

Expected outage: 15-30 minutes

Please use the sheet below to detail any actions required in advance of the work:

https://docs.google.com/spreadsheets/d/1pLPpzGBmdExXxQ_0_eGXpO0VlUU5oPKZy-_KViMSwuM

Details

Other Assignee
ABran-WMF

Event Timeline

ms-fe1014 will need depooling before this work is done (and then repooling afterwards).

There's a staff meeting 15:00-16:30 UTC that day, which it would be nice to avoid clashing with if poss.

cmooney triaged this task as Medium priority.
cmooney updated Other Assignee, added: ABran-WMF.
cmooney updated the task description.

backup1011 is a mediabackups storage server. Ideally, mediabackups are paused during the maintenance to avoid backup errors.

Mentioned in SAL (#wikimedia-operations) [2024-07-11T13:04:38Z] <claime> Cordoning and depooling kubernetes1062.eqiad.wmnet mw1494.eqiad.wmnet mw1495.eqiad.wmnet for T365996

Mentioned in SAL (#wikimedia-operations) [2024-07-11T13:14:02Z] <claime> Uncordoning and depooling kubernetes1062.eqiad.wmnet mw1494.eqiad.wmnet mw1495.eqiad.wmnet that were actually not concerned by T365996

Mentioned in SAL (#wikimedia-analytics) [2024-07-11T13:17:46Z] <btullis> draining dse-k8s-worker1007 ready for T365996

Mentioned in SAL (#wikimedia-analytics) [2024-07-11T13:18:03Z] <btullis> setting cephosd cluster to noout mode for T365996

Mentioned in SAL (#wikimedia-operations) [2024-07-11T13:50:07Z] <Emperor> depool ms-fe1014 and thanos-fe1004 before switch work T365996

ms and thanos frontends depooled, you're good to go from a swift POV.

Icinga downtime and Alertmanager silence (ID=9abb3472-bf69-45f5-8c93-e3c8cfbe9e4e) set by cmooney@cumin1002 for 0:50:00 on 1 host(s) and their services with reason: prep JunOS upgrade lsw1-f1-eqiad

lsw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=d7f08b17-a319-4077-a271-a0ef15a438a3) set by cmooney@cumin1002 for 0:30:00 on 4 host(s) and their services with reason: JunOS upgrade lsw1-f1-eqiad

lsw1-f1-eqiad,lsw1-f1-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=1d5a6d4b-345e-4f18-8342-05572d6411e7) set by cmooney@cumin1002 for 0:30:00 on 23 host(s) and their services with reason: JunOS upgrade lsw1-f1-eqiad

an-coord1004.eqiad.wmnet,an-mariadb1002.eqiad.wmnet,an-presto[1011-1012].eqiad.wmnet,an-worker[1144,1148,1155].eqiad.wmnet,backup1011.eqiad.wmnet,cephosd1004.eqiad.wmnet,db1193.eqiad.wmnet,dbproxy1027.eqiad.wmnet,dse-k8s-worker1007.eqiad.wmnet,dumpsdata1007.eqiad.wmnet,elastic[1096-1097,1106].eqiad.wmnet,kafka-jumbo1013.eqiad.wmnet,logstash1037.eqiad.wmnet,ml-cache1003.eqiad.wmnet,ms-be1070.eqiad.wmnet,ms-fe1014.eqiad.wmnet,thanos-fe1004.eqiad.wmnet,titan1001.eqiad.wmnet
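The downtime entry above reports 23 hosts but lists them in a compact bracket notation. As a rough cross-check against the rack list in the description, the expression can be expanded and counted. This is a minimal sketch only: `expand_hosts` is a hypothetical helper written for this note, not part of the tooling that produced the log, and it handles only the simple `prefix[a-b,c]suffix` form that appears here.

```python
import re

# Host expression copied from the downtime entry above.
EXPR = (
    "an-coord1004.eqiad.wmnet,an-mariadb1002.eqiad.wmnet,"
    "an-presto[1011-1012].eqiad.wmnet,an-worker[1144,1148,1155].eqiad.wmnet,"
    "backup1011.eqiad.wmnet,cephosd1004.eqiad.wmnet,db1193.eqiad.wmnet,"
    "dbproxy1027.eqiad.wmnet,dse-k8s-worker1007.eqiad.wmnet,"
    "dumpsdata1007.eqiad.wmnet,elastic[1096-1097,1106].eqiad.wmnet,"
    "kafka-jumbo1013.eqiad.wmnet,logstash1037.eqiad.wmnet,"
    "ml-cache1003.eqiad.wmnet,ms-be1070.eqiad.wmnet,ms-fe1014.eqiad.wmnet,"
    "thanos-fe1004.eqiad.wmnet,titan1001.eqiad.wmnet"
)


def expand_hosts(expr: str) -> list[str]:
    """Expand a comma-separated host expression with [a-b,c] numeric ranges.

    Hypothetical helper: supports at most one bracket group per entry.
    """
    # Split on commas that are not inside square brackets.
    parts = re.split(r",(?![^\[]*\])", expr)
    hosts = []
    for part in parts:
        m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", part)
        if m is None:
            # No bracket group: the entry is already a single hostname.
            hosts.append(part)
            continue
        prefix, body, suffix = m.groups()
        for item in body.split(","):
            lo, _, hi = item.partition("-")
            hi = hi or lo  # a single number is a range of one
            width = len(lo)  # preserve zero padding, if any
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    hosts = expand_hosts(EXPR)
    print(len(hosts))
    for host in hosts:
        print(host)
```

The expanded list can then be compared by eye, or with a set difference, against the hosts named in the task description.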

Mentioned in SAL (#wikimedia-operations) [2024-07-11T14:12:57Z] <godog> depool titan1001 for switch work T365996

Mentioned in SAL (#wikimedia-operations) [2024-07-11T14:15:06Z] <topranks> rebooting lsw1-f1-eqiad to install updated JunOS version T365996

Icinga downtime and Alertmanager silence (ID=de50ae5f-fec9-4347-b2ef-225a3af373f6) set by cmooney@cumin1002 for 0:30:00 on 23 host(s) and their services with reason: JunOS upgrade lsw1-f1-eqiad

an-coord1004.eqiad.wmnet,an-mariadb1002.eqiad.wmnet,an-presto[1011-1012].eqiad.wmnet,an-worker[1144,1148,1155].eqiad.wmnet,backup1011.eqiad.wmnet,cephosd1004.eqiad.wmnet,db1193.eqiad.wmnet,dbproxy1027.eqiad.wmnet,dse-k8s-worker1007.eqiad.wmnet,dumpsdata1007.eqiad.wmnet,elastic[1096-1097,1106].eqiad.wmnet,kafka-jumbo1013.eqiad.wmnet,logstash1037.eqiad.wmnet,ml-cache1003.eqiad.wmnet,ms-be1070.eqiad.wmnet,ms-fe1014.eqiad.wmnet,thanos-fe1004.eqiad.wmnet,titan1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-07-11T14:25:44Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'T365996 - depool db1193 - s8', diff saved to https://phabricator.wikimedia.org/P66293 and previous config saved to /var/cache/conftool/dbconfig/20240711-142544-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-11T14:25:48Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 1:30:00 on backup1011.eqiad.wmnet,db1193.eqiad.wmnet,dbproxy1027.eqiad.wmnet with reason: T365996

Mentioned in SAL (#wikimedia-operations) [2024-07-11T14:25:51Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on backup1011.eqiad.wmnet,db1193.eqiad.wmnet,dbproxy1027.eqiad.wmnet with reason: T365996

Switch upgrade complete, all looks good: hosts are online and responding to ping again. Thanks for the assistance!

Mentioned in SAL (#wikimedia-operations) [2024-07-11T14:35:05Z] <godog> pool titan1001 for switch work T365996

Mentioned in SAL (#wikimedia-operations) [2024-07-11T14:35:42Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1193 (re)pooling @ 5%: post T365996 repool', diff saved to https://phabricator.wikimedia.org/P66294 and previous config saved to /var/cache/conftool/dbconfig/20240711-143541-arnaudb.json

dbhost repooling
dbproxy reloaded
backuphost checked and looks green

Mentioned in SAL (#wikimedia-operations) [2024-07-11T14:50:48Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1193 (re)pooling @ 10%: post T365996 repool', diff saved to https://phabricator.wikimedia.org/P66297 and previous config saved to /var/cache/conftool/dbconfig/20240711-145047-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-11T14:55:56Z] <Emperor> repool ms-fe1014 and thanos-fe1004 before switch work T365996

Swift and thanos frontends repooled, all seems OK.

Mentioned in SAL (#wikimedia-operations) [2024-07-11T15:05:53Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1193 (re)pooling @ 25%: post T365996 repool', diff saved to https://phabricator.wikimedia.org/P66299 and previous config saved to /var/cache/conftool/dbconfig/20240711-150553-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-11T15:20:59Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1193 (re)pooling @ 50%: post T365996 repool', diff saved to https://phabricator.wikimedia.org/P66301 and previous config saved to /var/cache/conftool/dbconfig/20240711-152058-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-11T15:36:04Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1193 (re)pooling @ 75%: post T365996 repool', diff saved to https://phabricator.wikimedia.org/P66306 and previous config saved to /var/cache/conftool/dbconfig/20240711-153604-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-11T15:51:10Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1193 (re)pooling @ 100%: post T365996 repool', diff saved to https://phabricator.wikimedia.org/P66308 and previous config saved to /var/cache/conftool/dbconfig/20240711-155109-arnaudb.json
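The db1193 repool above proceeds in six steps (5%, 10%, 25%, 50%, 75%, 100%). As a small sketch for reading the timeline, the snippet below takes the SAL timestamps of those six dbctl commits and computes the gap between consecutive steps. The timestamps and percentages are copied from the log entries; the function name `step_intervals` is hypothetical, and the SAL times are when the commits were logged, which is assumed here to be close to when each weight took effect.

```python
from datetime import datetime

# (SAL timestamp, pooled percentage) for each db1193 repool commit in the log.
STEPS = [
    ("2024-07-11T14:35:42Z", 5),
    ("2024-07-11T14:50:48Z", 10),
    ("2024-07-11T15:05:53Z", 25),
    ("2024-07-11T15:20:59Z", 50),
    ("2024-07-11T15:36:04Z", 75),
    ("2024-07-11T15:51:10Z", 100),
]


def step_intervals(steps):
    """Return the seconds elapsed between consecutive repool steps."""
    times = [datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ") for ts, _ in steps]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    gaps = step_intervals(STEPS)
    for (_, pct_from), (_, pct_to), gap in zip(STEPS, STEPS[1:], gaps):
        print(f"{pct_from:>3}% -> {pct_to:>3}%: {gap / 60:.1f} min")
    print(f"total: {sum(gaps) / 60:.1f} min")
```

On these timestamps each step follows the previous one by roughly 15 minutes, so the host went from 5% to fully pooled in about 75 minutes.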