Upgrade EVPN switches Eqiad row E-F to JunOS 22.2
Open, Medium, Public

Description

Currently we are running JunOS 20.4R3 on our EVPN switches in Eqiad rows E and F.

More recent releases contain several improvements and bugfixes for issues we are hitting, detailed in T306421, T358488 and T365204, as well as some security fixes. We therefore need to upgrade the devices in these rows over the coming weeks.

We are using the master Google Sheet below to co-ordinate with other SRE teams on depools and similar actions needed to enable this work:

https://docs.google.com/spreadsheets/d/1pLPpzGBmdExXxQ_0_eGXpO0VlUU5oPKZy-_KViMSwuM

The current proposed schedule is as follows:

| Device | Date | Tasks | Teams | Notes |
| --- | --- | --- | --- | --- |
| ssw1-e1-eqiad | Thur Jun 5th | T366361 | Traffic | |
| ssw1-f1-eqiad | Thur Jun 5th | T366361 | Traffic | |
| lsw1-f5-eqiad | Tue Jun 11th | T365982 | Data Platform, Search, Data Persistence | |
| lsw1-f6-eqiad | Thur Jun 13th | T365983 | Data Platform, Data Persistence | |
| lsw1-f7-eqiad | Tue Jun 18th | T365984 | Data Platform, Data Persistence | |
| lsw1-e6-eqiad | Tue Jun 20th | T365987 | Data Platform, Search, Data Persistence | swapped with T365986 |
| lsw1-e5-eqiad | Thur Jun 25th | T365986 | Data Platform, Search, Data Persistence, Service Ops | swapped with T365987 |
| lsw1-e7-eqiad | Thur Jun 27th | T365988 | Data Platform, Search, Data Persistence | |
| lsw1-e1-eqiad | Tue Jul 2nd | T365993 | Data Platform, Search, Data Persistence, Machine Learning, Observability, Service Ops | |
| lsw1-e2-eqiad | Wed Jul 3rd | T365994 | Data Platform, Search, Data Persistence, Machine Learning, Observability, Service Ops | |
| lsw1-e3-eqiad | Tue Jul 9th | T365995 | Data Platform, Search, Data Persistence, Machine Learning, Observability, Service Ops | |
| lsw1-f1-eqiad | Thu Jul 11th | T365996 | Data Platform, Search, Data Persistence, Machine Learning, Observability | |
| lsw1-f2-eqiad | Tue Jul 16th | T365997 | Data Platform, Search, Data Persistence, Machine Learning, Observability, Service Ops | |
| lsw1-f3-eqiad | Thu Jul 18th | T365998 | Data Platform, Search, Data Persistence, Machine Learning, Service Ops | |

Event Timeline

cmooney created this task.

Change 966195 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Change IP for new Eqiad switches to MGMT IP

https://gerrit.wikimedia.org/r/966195

Change 966195 abandoned by Cathal Mooney:

[operations/puppet@production] Change IP for new Eqiad switches to MGMT IP

Reason:

we'll just ack alerts

https://gerrit.wikimedia.org/r/966195

It seems this limitation does not apply to 22.2, which we are using in codfw.

An update on this. It seems that we do have this bug in 22.2, but we don't always trigger it, which is why I thought it had been fixed.

On a new switch, which has never had an IRB interface configured, pinging between the lo0.5000 interfaces in the routing-instance (and thus with VXLAN encap) works just fine. If you subsequently configure an IRB interface on the box without an UP port in the associated vlan (so the irb interface is in state up/down), pings between the loopbacks begin to fail. Further, once this state has been entered, pings continue to fail even if you remove the IRB interface and vlan completely.

It is possible the behaviour is exactly the same in 20.4, and we just didn't notice that things are OK if an IRB has never been configured on the box.

Anyway we should still upgrade, but it looks like we are stuck with this annoying bug.

Just an update here: the restriction still exists, however I think I know where I went wrong.

In order for the irb interface to be "up", the associated vlan also needs to be "up". However, it seems the vlan will come up if there are any MAC addresses in its table.

Specifically in the VXLAN/EVPN context it means the irb interface will be in state "up" if it belongs to a stretched-vlan, and any other port in that vlan is up across the network. This is probably something we can take advantage of to ensure we've an "up" irb on otherwise empty switches, and ensure we can ping/monitor the VRF loopbacks.

cmooney@lsw1-b5-codfw> show interfaces descriptions | match irb.2002 
irb.2002        up    up   Subnet public1-b-codfw

No local ports are up but it knows remote MACs:

cmooney@lsw1-b5-codfw> show ethernet-switching table vlan-id 2002 

MAC flags (S - static MAC, D - dynamic MAC, L - locally learned, P - Persistent static
           SE - statistics enabled, NM - non configured MAC, R - remote PE MAC, O - ovsdb MAC)


Ethernet switching table : 17 entries, 17 learned
Routing instance : default-switch
   Vlan                MAC                 MAC      Logical                SVLBNH/      Active
   name                address             flags    interface              VENH Index   source
   public1-b-codfw     14:23:f2:4d:cd:60   DR       vtep.32772                          10.192.252.11                 
   public1-b-codfw     14:23:f2:4e:fc:00   DR       vtep.32771                          10.192.252.4                  
   public1-b-codfw     14:23:f2:4f:0d:f1   DR       vtep.32769                          10.192.252.2                  
   public1-b-codfw     14:23:f2:4f:30:21   DR       vtep.32770                          10.192.252.1                  
   public1-b-codfw     64:87:88:f2:6d:b0   DR       vtep.32770                          10.192.252.1                  
   public1-b-codfw     a8:d0:e5:e3:81:b0   DR       vtep.32769                          10.192.252.2                  
   public1-b-codfw     aa:00:00:0e:a4:81   DR       esi.1827               1826         00:00:00:00:01:02:00:00:00:02 
   public1-b-codfw     aa:00:00:26:f2:33   DR       esi.1827               1826         00:00:00:00:01:02:00:00:00:02 
   public1-b-codfw     aa:00:00:31:50:4c   DR       esi.1827               1826         00:00:00:00:01:02:00:00:00:02 
   public1-b-codfw     aa:00:00:53:f8:01   DR       esi.1827               1826         00:00:00:00:01:02:00:00:00:02 
   public1-b-codfw     aa:00:00:71:23:8b   DR       esi.1827               1826         00:00:00:00:01:02:00:00:00:02 
   public1-b-codfw     aa:00:00:99:be:5a   DR       esi.1827               1826         00:00:00:00:01:02:00:00:00:02 
   public1-b-codfw     aa:00:00:d2:18:cf   DR       esi.1827               1826         00:00:00:00:01:02:00:00:00:02 
   public1-b-codfw     b0:4f:13:b9:89:6c   DR       esi.1827               1826         00:00:00:00:01:02:00:00:00:02 
   public1-b-codfw     b0:4f:13:bb:e6:08   DR       esi.1827               1826         00:00:00:00:01:02:00:00:00:02 
   public1-b-codfw     b0:4f:13:bc:e8:c2   DR       esi.1827               1826         00:00:00:00:01:02:00:00:00:02 
   public1-b-codfw     c4:5a:b1:a3:b6:f7   DR       esi.1827               1826         00:00:00:00:01:02:00:00:00:02
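If we want to check this condition from a script rather than by eye, something like the following could work. It is only a rough sketch: it assumes the `show ethernet-switching table` text output keeps the column layout shown above, and the `remote_macs` helper and the regex are illustrative names of my own, not part of any existing tooling.

```python
import re
from collections import Counter

# Matches a MAC table row: vlan name, MAC address, flags, logical interface.
# Header and legend lines do not match because their second column is not a MAC.
ROW = re.compile(
    r"^\s*(?P<vlan>\S+)\s+"
    r"(?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})\s+"
    r"(?P<flags>\S+)\s+"
    r"(?P<ifname>\S+)",
    re.IGNORECASE,
)


def remote_macs(table_text, vlan):
    """Count remote-PE MACs in the given vlan, keyed by logical interface."""
    counts = Counter()
    for line in table_text.splitlines():
        m = ROW.match(line)
        if not m or m.group("vlan") != vlan:
            continue
        # Per the legend in the CLI output, 'R' marks a remote PE MAC.
        if "R" in m.group("flags"):
            counts[m.group("ifname")] += 1
    return counts


# Three rows copied from the output above, used as sample input.
sample = """\
   public1-b-codfw     14:23:f2:4d:cd:60   DR       vtep.32772                          10.192.252.11
   public1-b-codfw     14:23:f2:4f:0d:f1   DR       vtep.32769                          10.192.252.2
   public1-b-codfw     aa:00:00:0e:a4:81   DR       esi.1827               1826         00:00:00:00:01:02:00:00:00:02
"""
counts = remote_macs(sample, "public1-b-codfw")
print(sum(counts.values()), dict(counts))
```

A non-zero total for a vlan with no local ports up would match the situation described above, where the irb stays up purely because of remotely learned MACs.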
cmooney raised the priority of this task from Low to Medium. Feb 26 2024, 7:23 PM

Actually, a different need to upgrade has now become clear, relating to the issue detailed in T358488.

The solution to that requires the switch to insert sub-option 5 into the DHCP option 82 information, using the forwarding-options dhcp-relay relay-option-82 link-selection statement. However, this statement was only introduced in 21.2R1.
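As a quick way to check which devices would accept that statement, a release comparison along these lines might help. This is a minimal sketch: the function names are hypothetical, and the parsing assumes release strings of the form `20.4R3` or `22.2` (only standard "R" releases are handled, and any service-release suffix is ignored).

```python
import re

# First release with the link-selection statement, per the comment above.
LINK_SELECTION_MIN = (21, 2, 1)


def parse_junos_release(version):
    """Turn '20.4R3' into (20, 4, 3); a missing R-number is treated as 0."""
    m = re.match(r"^(\d+)\.(\d+)(?:R(\d+))?", version)
    if not m:
        raise ValueError(f"unrecognised Junos release: {version!r}")
    major, minor, build = m.groups()
    return int(major), int(minor), int(build) if build else 0


def supports_link_selection(version):
    """True if the release is at or after 21.2R1."""
    return parse_junos_release(version) >= LINK_SELECTION_MIN


for release in ("20.4R3", "22.2"):
    print(release, supports_link_selection(release))
```

With the releases named in this task, 20.4R3 falls below the threshold and 22.2 is above it.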

Right now we have no public vlan stretched in eqiad, so there is no massive urgency, but it is another reason to get it done.

We also now have the issue from T365204, which we can resolve with a JunOS upgrade. It is not essential in eqiad, but I still think we need to stop procrastinating here.

I'd propose to tackle the devices in the following order, starting with the spines. They are route-reflectors, so I think it's best they not be behind any of their clients in terms of OS release. For those we also only need to co-ordinate with Traffic (for the LVS links that land there), rather than with multiple SRE teams. After that we tackle the racks that most recently came online. They have only a few devices for now, but devices are being added all the time, so it is best to do them while they are as empty as possible.

<moved to task description>
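The ordering logic described above can be sketched roughly as follows. This is illustrative only: it assumes the `ssw` name prefix identifies a spine, and the host counts in the sample list are made-up placeholders, not real inventory data.

```python
def upgrade_order(devices):
    """Sort spines first, then leaves by ascending connected-host count."""
    def key(dev):
        name, host_count = dev
        # Assumption: 'ssw' prefix means spine, anything else is a leaf.
        is_leaf = not name.startswith("ssw")
        return (is_leaf, host_count, name)

    return [name for name, _ in sorted(devices, key=key)]


# (device name, number of connected hosts); counts are hypothetical placeholders.
devices = [
    ("lsw1-e1-eqiad", 40),
    ("ssw1-f1-eqiad", 0),
    ("lsw1-f5-eqiad", 3),
    ("ssw1-e1-eqiad", 0),
    ("lsw1-e6-eqiad", 5),
]
print(upgrade_order(devices))
```

In practice the counts would need to come from the real inventory, and the dates still have to be agreed with the affected teams.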

@cmooney can you give me a time estimate for when you're going to be doing these, please? I'd like to put notes in my calendar.

@ABran-WMF thanks for creating all the tasks! Really appreciated, I did not expect to come back and see that :)

> @cmooney can you give me a time estimate for when you're going to be doing these, please? I'd like to put notes in my calendar.

Hi @MatthewVernon, sorry for the late response; I was on leave earlier in the week.

Thinking about it, I'd probably aim to start each upgrade at 15:00 UTC (so 4pm our time in the UK/Ireland) to maximise coverage in terms of people on both sides of the Atlantic. I can add entries in the SRE maintenance calendar for them, but if that time doesn't work for you, let me know.

Mentioned in SAL (#wikimedia-operations) [2024-06-10T13:57:30Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1107 for T348977 - bking@cumin2002

Mentioned in SAL (#wikimedia-operations) [2024-06-10T13:57:34Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic1107 for T348977 - bking@cumin2002

Mentioned in SAL (#wikimedia-operations) [2024-06-10T13:57:47Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1107.eqiad.wmnet for T348977 - bking@cumin2002

Mentioned in SAL (#wikimedia-operations) [2024-06-10T13:57:50Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1107.eqiad.wmnet for T348977 - bking@cumin2002

Mentioned in SAL (#wikimedia-operations) [2024-06-20T19:18:45Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1105 for T348977 - bking@cumin2002

Mentioned in SAL (#wikimedia-operations) [2024-06-20T19:18:50Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic1105 for T348977 - bking@cumin2002

Mentioned in SAL (#wikimedia-operations) [2024-06-20T19:18:54Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1105* for T348977 - bking@cumin2002

Mentioned in SAL (#wikimedia-operations) [2024-06-20T19:18:58Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1105* for T348977 - bking@cumin2002

Mentioned in SAL (#wikimedia-operations) [2024-06-20T20:53:02Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on elastic1105.eqiad.wmnet with reason: T348977

Mentioned in SAL (#wikimedia-operations) [2024-06-20T20:53:17Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on elastic1105.eqiad.wmnet with reason: T348977