
Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad
Closed, Resolved (Public)

Description

Postponed until Wed 2024-07-10, 15:00 UTC
Hosts in this rack (https://netbox.wikimedia.org/dcim/racks/81/):

  • backup1010
  • db1190 - s4
  • dbproxy1026 - m3
  • ms-be1068
  • ms-fe1012
  • an-coord1003
  • an-mariadb1001
  • an-presto1006
  • an-presto1007
  • an-worker1147
  • an-worker1153
  • an-worker1142
  • cephosd1001
  • dumpsdata1006
  • kafka-jumbo1010
  • stat1010
  • elastic1104
  • elastic1089
  • elastic1090
  • dse-k8s-worker1005
  • ml-cache1001
  • logstash1036
  • kubernetes1059
  • lvs1013
  • lvs1014
  • lvs1015

Teams Involved: Data Persistence, Data Platform, Machine Learning, Search, Observability, Service Ops, Traffic

Expected outage: 15-30 minutes

Please use the sheet below to detail any actions required in advance of the work:

https://docs.google.com/spreadsheets/d/1pLPpzGBmdExXxQ_0_eGXpO0VlUU5oPKZy-_KViMSwuM


Event Timeline

ABran-WMF updated Other Assignee, added: MatthewVernon.
MatthewVernon subscribed.

I'm on annual leave that day, so someone else will have to handle the ms frontend, which needs depooling beforehand and repooling afterwards. Once we know the planned time, we can find someone suitable.

cmooney triaged this task as Medium priority.
cmooney updated the task description.

@Eevans would you be OK to handle this as well, please? It's a bit more involved as you'll need to run sudo depool on the ms-fe node beforehand (and then sudo pool afterwards) as well as a quick cluster health check for the backend node.

Sure.
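
For reference, a rough sketch of what that handling looks like (depool/pool are the conftool wrappers named above; the swift-recon health check is an assumption about how backend cluster health is verified here):

  # On ms-fe1012, before the upgrade window
  $ sudo depool
  # On ms-fe1012, once the switch is back and the host is reachable
  $ sudo pool
  # Quick cluster health check covering the backend node (ms-be1068);
  # swift-recon with ring-consistency and replication checks is an
  # assumption, not the documented procedure for this cluster
  $ sudo swift-recon --md5 --replication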

backup1010 is used intermittently to support mediabackups disk space, but is mostly idle at the moment, so unless the situation changes by July and it finally gets pooled for bacula, it will require no action.

No action will be needed for backup1010 in the end.

Change #1052965 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Fail over hive and presto services to the standby coordinator

https://gerrit.wikimedia.org/r/1052965

Change #1052965 merged by Btullis:

[operations/dns@master] Fail over hive and presto services to the standby coordinator

https://gerrit.wikimedia.org/r/1052965
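
The failover change is a DNS-level switch of the service names onto the standby coordinator, so clients stop depending on an-coord1003 (in rack E1) during the window. Conceptually it is a one-record change per service; the record name and the standby host below are illustrative assumptions, not the actual patch contents:

  ; before: the service name points at the coordinator in rack E1
  analytics-hive   IN CNAME  an-coord1003.eqiad.wmnet.
  ; after: fail over to the standby coordinator (hostname assumed)
  analytics-hive   IN CNAME  an-coord1004.eqiad.wmnet.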

Mentioned in SAL (#wikimedia-operations) [2024-07-10T14:02:24Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'T365993 - depool db1190 - s4', diff saved to https://phabricator.wikimedia.org/P66129 and previous config saved to /var/cache/conftool/dbconfig/20240710-140224-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-10T14:02:56Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on backup1010.eqiad.wmnet,db1190.eqiad.wmnet,dbproxy1026.eqiad.wmnet with reason: T365993

Mentioned in SAL (#wikimedia-operations) [2024-07-10T14:03:11Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1010.eqiad.wmnet,db1190.eqiad.wmnet,dbproxy1026.eqiad.wmnet with reason: T365993
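
Reconstructed from the SAL entries above, the depool and downtime steps would look roughly like this (exact cookbook flag spellings are assumptions):

  # Depool db1190 from the s4 section and commit the change
  $ sudo dbctl instance db1190 depool
  $ sudo dbctl config commit -m 'T365993 - depool db1190 - s4'
  # Two hours of Icinga/Alertmanager downtime on the hosts being worked on
  $ sudo cookbook sre.hosts.downtime --hours 2 --reason 'T365993' \
      'backup1010.eqiad.wmnet,db1190.eqiad.wmnet,dbproxy1026.eqiad.wmnet'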

Icinga downtime and Alertmanager silence (ID=9ca0faf1-4b9d-4345-9bb8-9c7153e17163) set by cmooney@cumin1002 for 1:30:00 on 1 host(s) and their services with reason: prep JunOS upgrade lsw1-e1-eqiad

Host: lsw1-e1-eqiad.mgmt

Mentioned in SAL (#wikimedia-operations) [2024-07-10T14:34:10Z] <cmooney@cumin1002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1104,elastic1089,elastic1090 for ban elastic nodes before switch upgrade rack E1 - cmooney@cumin1002 - T365993

Mentioned in SAL (#wikimedia-operations) [2024-07-10T14:34:13Z] <cmooney@cumin1002> END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic1104,elastic1089,elastic1090 for ban elastic nodes before switch upgrade rack E1 - cmooney@cumin1002 - T365993

Mentioned in SAL (#wikimedia-operations) [2024-07-10T14:55:46Z] <cmooney@cumin1002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1104*,elastic1089*,elastic1090* for T365993 - cmooney@cumin1002

Mentioned in SAL (#wikimedia-operations) [2024-07-10T14:56:01Z] <cmooney@cumin1002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1104*,elastic1089*,elastic1090* for T365993 - cmooney@cumin1002
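
The first ban attempt with bare hostnames failed (exit code 99); retrying with wildcard patterns matched the nodes' full instance names and succeeded. Under the hood, banning nodes amounts to an Elasticsearch shard-allocation exclusion; a minimal illustration with the standard cluster settings API (the cookbook wraps this with extra safety checks, and the localhost:9200 endpoint is an assumption):

  # Exclude the three nodes from shard allocation so shards drain off them;
  # the patterns match full node names, which is why elastic1104* etc.
  # worked where the bare hostnames did not
  $ curl -s -XPUT 'http://localhost:9200/_cluster/settings' \
      -H 'Content-Type: application/json' -d '{
        "transient": {
          "cluster.routing.allocation.exclude._name": "elastic1104*,elastic1089*,elastic1090*"
        }
      }'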

Icinga downtime and Alertmanager silence (ID=5386f05e-734c-49b0-a4c5-1acbef4c187a) set by cmooney@cumin1002 for 0:30:00 on 4 host(s) and their services with reason: JunOS upgrade lsw1-e1-eqiad

Hosts: lsw1-e1-eqiad, lsw1-e1-eqiad IPv6, ssw1-e1-eqiad.mgmt, ssw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=9475b2b6-bc5f-41f8-97d1-970eb62b38bc) set by cmooney@cumin1002 for 0:30:00 on 26 host(s) and their services with reason: JunOS upgrade lsw1-e1-eqiad

Hosts: an-coord1003.eqiad.wmnet,an-mariadb1001.eqiad.wmnet,an-presto[1006-1007].eqiad.wmnet,an-worker[1142,1147,1153].eqiad.wmnet,backup1010.eqiad.wmnet,cephosd1001.eqiad.wmnet,db1190.eqiad.wmnet,dbproxy1026.eqiad.wmnet,dse-k8s-worker1005.eqiad.wmnet,dumpsdata1006.eqiad.wmnet,elastic[1089-1090,1104].eqiad.wmnet,kafka-jumbo1010.eqiad.wmnet,kubernetes1059.eqiad.wmnet,logstash1036.eqiad.wmnet,lvs[1013-1015].eqiad.wmnet,ml-cache1001.eqiad.wmnet,ms-be1068.eqiad.wmnet,ms-fe1012.eqiad.wmnet,stat1010.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-07-10T15:24:16Z] <topranks> rebooting lsw1-e1-eqiad to install updated JunOS version T365993

Switch upgraded successfully and all hosts back online/pinging. Thanks everyone for the assistance!
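
A quick way to confirm that from a cumin host is to run a no-op across the affected machines (host list truncated to three examples; the D{} direct-backend query syntax is an assumption about how the check was done here):

  # ssh to each host and run a no-op; any failure flags a host still down
  $ sudo cumin 'D{db1190.eqiad.wmnet,ms-be1068.eqiad.wmnet,ms-fe1012.eqiad.wmnet}' 'true'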

Mentioned in SAL (#wikimedia-operations) [2024-07-10T15:46:15Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1190 (re)pooling @ 5%: post T365993 repool', diff saved to https://phabricator.wikimedia.org/P66146 and previous config saved to /var/cache/conftool/dbconfig/20240710-154615-arnaudb.json

db1190 repooling, dbproxy reloaded.

Everything looks OK.

ms-fe1012 repooled, and everything looks good.

Mentioned in SAL (#wikimedia-operations) [2024-07-10T16:01:31Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1190 (re)pooling @ 10%: post T365993 repool', diff saved to https://phabricator.wikimedia.org/P66149 and previous config saved to /var/cache/conftool/dbconfig/20240710-160120-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-10T16:16:26Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1190 (re)pooling @ 25%: post T365993 repool', diff saved to https://phabricator.wikimedia.org/P66152 and previous config saved to /var/cache/conftool/dbconfig/20240710-161626-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-10T16:31:32Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1190 (re)pooling @ 50%: post T365993 repool', diff saved to https://phabricator.wikimedia.org/P66157 and previous config saved to /var/cache/conftool/dbconfig/20240710-163131-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-10T16:46:38Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1190 (re)pooling @ 75%: post T365993 repool', diff saved to https://phabricator.wikimedia.org/P66160 and previous config saved to /var/cache/conftool/dbconfig/20240710-164637-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-07-10T17:01:43Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1190 (re)pooling @ 100%: post T365993 repool', diff saved to https://phabricator.wikimedia.org/P66162 and previous config saved to /var/cache/conftool/dbconfig/20240710-170143-arnaudb.json
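
The staged repool recorded above steps db1190's pooled percentage up while watching for errors; a sketch of a single step (the percentage sequence and the roughly 15-minute spacing are from the SAL entries, and in practice a wrapper script typically drives this rather than manual commands):

  # Pool db1190 at 5% of its normal weight, then commit; repeat with
  # -p 10, 25, 50, 75 and 100 at roughly 15-minute intervals
  $ sudo dbctl instance db1190 pool -p 5
  $ sudo dbctl config commit -m 'db1190 (re)pooling @ 5%: post T365993 repool'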