Upgrade asw1-eqsin
Closed, Resolved (Public)

Description

asw1-eqsin generated a coredump for l2cpd

After investigation with JTAC, it matches an internal PR in which a specific LLDP frame received on em0 made the process crash. The bug is fixed in all recent Junos versions.

The main risk here is that l2cpd also controls LACP, so a crash could cause user-facing issues. However, the previous crash did not, so the priority is set to Medium.

Another reason to upgrade is to fix the mgmt_junos bug detailed in T327862.

Event Timeline

ayounsi triaged this task as Medium priority.
Restricted Application added a subscriber: Aklapper.

FYI, the mgmt_junos bug (also present on the fasw) might not be fixed by an upgrade, but it may be fixed by the workaround described in https://www.reddit.com/r/Juniper/comments/mvq8hf/comment/j7gd6hq/:
set interface em0.0 family inet address 10.XXX.XXX.XXX/XX master-only

ayounsi moved this task from Next quarter to This quarter on the netops board.

Change 987741 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Revert "Disable Telemetry on eqsin switches"

https://gerrit.wikimedia.org/r/987741

The latest recommended Junos image has been copied to /var/tmp/.
Next steps: downtime the site and proceed with the upgrade: https://wikitech.wikimedia.org/wiki/Juniper_switch_upgrade#Virtual_Chassis_switches

Change 988400 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] Depool eqsin for switch upgrade

https://gerrit.wikimedia.org/r/988400

Change 988400 merged by Ayounsi:

[operations/dns@master] Depool eqsin for switch upgrade

https://gerrit.wikimedia.org/r/988400

Mentioned in SAL (#wikimedia-operations) [2024-01-08T09:03:15Z] <XioNoX> depool eqsin for switch upgrade - T332395

Icinga downtime and Alertmanager silence (ID=6bec1528-7372-478d-856a-a08325eb04f0) set by ayounsi@cumin1002 for 2:00:00 on 35 host(s) and their services with reason: eqsin switch upgrade

bast5004.wikimedia.org,cp[5017-5032].eqsin.wmnet,dns[5003-5004].wikimedia.org,doh[5001-5002].wikimedia.org,durum[5001-5002].eqsin.wmnet,ganeti[5004-5007].eqsin.wmnet,install5002.wikimedia.org,lvs[5004-5006].eqsin.wmnet,ncredir[5001-5002].eqsin.wmnet,netflow5002.eqsin.wmnet,prometheus5002.eqsin.wmnet
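
As a sanity check on the "35 host(s)" figure, the host expression above can be expanded by hand. The following is a minimal Python sketch, not the parser used by the downtime tooling: `expand_hosts` is a hypothetical helper that assumes each comma-separated entry contains at most one `[start-end]` numeric range, which is all this particular expression uses.

```python
import re

# Host expression copied verbatim from the downtime entry above.
EXPR = (
    "bast5004.wikimedia.org,cp[5017-5032].eqsin.wmnet,"
    "dns[5003-5004].wikimedia.org,doh[5001-5002].wikimedia.org,"
    "durum[5001-5002].eqsin.wmnet,ganeti[5004-5007].eqsin.wmnet,"
    "install5002.wikimedia.org,lvs[5004-5006].eqsin.wmnet,"
    "ncredir[5001-5002].eqsin.wmnet,netflow5002.eqsin.wmnet,"
    "prometheus5002.eqsin.wmnet"
)

# Matches "prefix[start-end]suffix" with a single numeric range.
RANGE_RE = re.compile(r"([^\[\]]*)\[(\d+)-(\d+)\]([^\[\]]*)")


def expand_hosts(expr: str) -> list[str]:
    """Expand entries such as cp[5017-5032].eqsin.wmnet into hostnames.

    Hypothetical helper: handles only one [start-end] range per entry
    and no comma-separated lists inside the brackets.
    """
    hosts = []
    for part in expr.split(","):
        match = RANGE_RE.fullmatch(part)
        if match is None:
            # Plain hostname without a range.
            hosts.append(part)
            continue
        prefix, start, end, suffix = match.groups()
        width = len(start)  # preserve zero padding, if any
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    hosts = expand_hosts(EXPR)
    print(len(hosts))
    for host in hosts:
        print(host)
```

Counting the expanded list gives 35 entries, which agrees with the host count reported in the downtime message.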

Mentioned in SAL (#wikimedia-operations) [2024-01-08T09:24:46Z] <XioNoX> start install process on asw1-eqsin - T332395

Mentioned in SAL (#wikimedia-operations) [2024-01-08T09:54:53Z] <XioNoX> asw1-eqsin> request system reboot - T332395

Change 988248 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] Repool eqsin for switch upgrade

https://gerrit.wikimedia.org/r/988248

Change 988416 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Enable mgmt_junos on asw1-eqsin

https://gerrit.wikimedia.org/r/988416

Change 987741 merged by jenkins-bot:

[operations/homer/public@master] Revert "Disable Telemetry on eqsin switches"

https://gerrit.wikimedia.org/r/987741

Change 988416 merged by jenkins-bot:

[operations/homer/public@master] Enable mgmt_junos on asw1-eqsin

https://gerrit.wikimedia.org/r/988416

Change 988248 merged by Ayounsi:

[operations/dns@master] Repool eqsin for switch upgrade

https://gerrit.wikimedia.org/r/988248

All done. ~10min downtime.