
Upgrade cr1/cr2-eqiad JunOS
Closed, Resolved · Public

Description

To install the new SCBs and linecards (T140764), we first need to upgrade to a more recent JunOS. We are targeting JTAC's current recommended release, 13.3R9.

For redundancy reasons, we cannot do cr1-eqiad until the new eqiad-esams link is provisioned (T136717), so we will start with cr2-eqiad.

The process is roughly the following:

  1. Drain cr2-eqiad
    • Adjust VRRP priorities
    • Adjust OSPF priorities
    • Drain lvs* pair (stop pybal)
    • Disable transits/peering
    • Wait 5-10 minutes
  2. Upgrade cr2-eqiad (both REs) and reboot
  3. set chassis network-services enhanced-ip
  4. Reboot again
  5. Un-drain cr2-eqiad (i.e. reverse step 1)
  6. Undrain eqiad from gdnsd

(With the last two steps possibly happening after we also install the new SCBs, just to be on the safe side)
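
Roughly, the drain step maps to Junos configuration along these lines; a minimal sketch, with illustrative interface, address and BGP group names rather than the actual eqiad config:

# Lower the VRRP priority so cr1-eqiad takes over as master (names/addresses illustrative)
set interfaces ae1 unit 1001 family inet address 192.0.2.3/26 vrrp-group 1 priority 70
# Raise OSPF/OSPF3 metrics so internal traffic prefers paths that avoid cr2-eqiad
set protocols ospf area 0.0.0.0 interface ae0.0 metric 1000
set protocols ospf3 area 0.0.0.0 interface ae0.0 metric 1000
# Deactivate the eBGP sessions towards transits and peers
deactivate protocols bgp group Transit4
deactivate protocols bgp group Transit6
commit comment "drain cr2-eqiad for JunOS upgrade"

(Stopping pybal on the lvs* pair and depooling eqiad in gdnsd happen outside the router, of course.)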

Event Timeline

OK, today we upgraded JunOS on cr2-eqiad to 13.3R9 and swapped the two SCBs for the new SCBE2s.

The JunOS upgrade generally went without many issues and took about two hours. The SCB swap took another hour and a half and didn't go as smoothly.

We had the following issues during the JunOS upgrade:

  • lutetium, a frack host, stopped pinging. On investigation, it turned out that it was actually its NATed public IP, configured on pfw-eqiad, that did not work. I looked into this on and off during the whole window and found nothing wrong on either cr1/cr2-eqiad or pfw-eqiad. My theory was an SRX bug, so I tried various combinations (including removing and re-adding pieces of the config), to no avail. It's possible that T111463 would help, but we should definitely re-test (it should be easy: just turn the cr2-eqiad<->pfw-eqiad link down). lutetium only became reachable again after cr2-eqiad was back.
  • In the middle of the RE upgrade, we got a major alert stating Host 1 failed to mount /var off HDD, emergency /var created. Thankfully, this was an artifact of the upgrade rather than a real issue, and it went away when the upgrade was completed (as documented in PR1177571).
  • For a few minutes after the upgrade of each RE, we had significant IPv6 packet loss in eqiad. This turned out to be caused by both cr1 and cr2 believing they were VRRP masters for all IPv6 subnets, which in turn was caused by cr1's JunOS (11.4R6) and cr2's JunOS (13.3R9) implementing the protocol differently: 11.4R6's code was written before the RFC was finalized. Juniper's documentation on the subject recommends setting protocols vrrp checksum-without-pseudoheader, a hidden command, on the newer JunOS, which promptly fixed the issue.
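
For the record, that fix amounts to just this on the 13.3R9 side, plus a sanity check that mastership converged; a minimal sketch:

# Compute the VRRP checksum the pre-RFC way, matching cr1-eqiad's 11.4R6 behavior
set protocols vrrp checksum-without-pseudoheader
commit comment "VRRPv3 checksum compatibility with cr1-eqiad (11.4R6)"
# Verify that each VRRP group is back to exactly one master
run show vrrp summary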

After all of this was done, I proceeded to perform the SCB upgrade, following Juniper's procedure. Everything went well until I ran the final step of the process, request chassis fabric upgrade-bandwidth fpc all. This had the following effects:

  • First, by running fpc all instead of fpc 4 and fpc 5 separately, we very briefly lost all asw aggregate bundles. I did not expect this to matter since cr2-eqiad was drained, but it caused some major flaps and packet loss in eqiad (see the sketch after this list).
  • The JunOS kernel crashed and dumped core:
Jul 21 13:42:28  re0.cr2-eqiad /kernel: Live kernel core dump is triggered by: ksyncd pid: 3105
Jul 21 13:42:28  re0.cr2-eqiad /kernel: kern_request_live_dump_proc: Triggered dump
Jul 21 13:42:28  re0.cr2-eqiad /kernel: Dumping 344 MB: 329
Jul 21 13:42:42  re0.cr2-eqiad /kernel: Dump complete
  • We had flakiness across our whole network, including icinga losing reachability to cr1-eqord, BGP sessions resetting between ulsfo and cr1-eqiad(!), icinga losing IPv6 reachability to mr1-eqiad(?!), as well as puppet runs failing on various servers, including in esams (possibly DNS-related?). All but the puppet failures could possibly be explained by an OSPF metric misconfiguration while draining, which made some traffic route through cr2-eqiad; the puppet failures can't be easily explained and were perhaps caused by loops or other FPC misbehavior.
  • After the kernel crash, cr2-eqiad started printing messages such as fpc4 MQCHIP(2) FI Reorder cell timeout, for both FPC 4 and FPC 5, every few seconds and for many minutes.
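
In hindsight, the safer approach would have been to upgrade the fabric bandwidth one FPC at a time and let the fabric planes settle in between; something along these lines, assuming the per-slot form of the command behaves like the fpc all variant we used:

request chassis fabric upgrade-bandwidth fpc 4
# wait for FPC 4's fabric planes to come back online before touching FPC 5
show chassis fabric fpcs
request chassis fabric upgrade-bandwidth fpc 5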

After all this, we decided to go the nuclear route: we set disable on all asw-* interfaces as well as on all cross-DC transport links (eqord/codfw), effectively leaving only the link to cr1-eqiad up. We then halted both routing engines and had Chris physically power off the whole box and power it back on.
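
In Junos terms, the nuclear route was roughly the following (interface names are illustrative):

# Take everything down except the link to cr1-eqiad
set interfaces ae1 disable
set interfaces ae2 disable
set interfaces ae3 disable
set interfaces ae4 disable
set interfaces xe-5/0/0 disable
commit comment "isolate cr2-eqiad before power-cycle"
exit
# Halt both REs, then have the chassis physically powered off and back on
request system halt both-routing-engines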

The cleanly booted system didn't emit any errors, and we then gradually turned the interfaces and BGP sessions back on one by one. This was all successful, and cr2-eqiad is now the VRRP master again.
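
Bringing things back up was the reverse, one interface and BGP group at a time, checking state as we went; again a sketch with illustrative names:

delete interfaces ae1 disable
activate protocols bgp group Transit4
commit
run show bgp summary
run show vrrp summary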

The maintenance finished at 14:44 UTC, ~15 minutes after the originally scheduled window end; however, the window was interrupted for 22 minutes (between the JunOS upgrade and the SCB upgrade) by another, unrelated outage that kept Chris busy.

faidon claimed this task.

The cr1-eqiad upgrade was performed today and lasted from 10:51 UTC to 14:38 UTC, including all the preparatory and cleanup work. This task is now resolved!

This time a different process was devised, to avoid last time's issues. It's all in today's SAL, but in short the process was:

  1. Deprioritize VRRP (no-op, as cr1-eqiad should have been a backup anyway)
  2. Disable links to cr2-knams and cr1-codfw (no-op, both inactive)
  3. Deactivate BGP sessions to Pybal/LVS
  4. Deactivate transits, private peers and fundraising BGP peerings
  5. Disable transit/fundraising interfaces
  6. Disable all asw row interfaces
  7. This leaves only the links to cr2-eqiad up, so no traffic flows through cr1-eqiad at all and there is no possibility for e.g. a VRRP bug to make it reclaim traffic
  8. Remove the VRRPv3 compatibility statement from cr2-eqiad (set protocols vrrp checksum-without-pseudoheader)
  9. Upgrade JunOS: first upgrade re0, reboot re0, master switchover (re1->re0), upgrade re1, reboot re1
  10. Set chassis network-services enhanced-ip (in anticipation of the SCB upgrade) and reboot both REs concurrently, saving some time
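
Steps 9-10 roughly correspond to the following commands, run on each RE in turn; the package filename is illustrative (whatever 13.3R9 image was staged in /var/tmp), the rest is the standard dual-RE procedure:

# On re0 (the backup RE at this point): install the new image and reboot it
request system software add /var/tmp/jinstall-13.3R9-domestic-signed.tgz no-validate
request system reboot
# Once re0 is back on 13.3R9, move mastership onto it (re1 -> re0), then repeat on re1
request chassis routing-engine master switch
request system software add /var/tmp/jinstall-13.3R9-domestic-signed.tgz no-validate
request system reboot
# Finally (step 10): enable enhanced-ip and reboot both REs together
configure
set chassis network-services enhanced-ip
commit and-quit
request system reboot both-routing-engines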

This was an easier/safer plan; however, we had the following issues:

  • After deactivating the BGP sessions to Pybal, I noticed that cr1-eqiad was falling back to the static backup routes, instead of (i)BGP to cr2-eqiad, but only for two of our routes, probably selected randomly (208.80.154.224/32 & 2620:0:861:ed1a::2:d/128). This was investigated extensively (30 minutes or so), as it could have been catastrophic (the first IP is text-lb.eqiad's IPv4). After some experimentation, it was fixed by disabling all the static backup routes (see the sketch after this list). The current theory is:
<paravoid> so it's probably confused by some static->OSPF->BGP loop I think 
<paravoid> I think it's a race
<paravoid> cr1-eqiad has the static, advertises it over OSPF
<paravoid> cr2-eqiad receives it over OSPF
<paravoid> picks it over BGP and thus the BGP route becomes inactive
<paravoid> and never gets redistributed into iBGP
  • Early on, we detected that both routers considered themselves VRRP master for some of the row D subnets (all of the IPv4 ones, one IPv6 one). We never figured out why (and this didn't reoccur after the upgrade/reboot cycles), so the current theory is that it was a JunOS MX bug, as we have seen something similar before in esams.
  • We nevertheless proceeded despite the above. Because of this bug, when performing step 6 above (disable all asw row interfaces), I purposely deactivated row A first, followed by rows B/C, as these were drained of traffic at the time. Unfortunately, after doing that, both cr1-eqiad's loopback and (probably?) all row D->A/B/C communication was lost, causing issues in a number of different services (ElasticSearch, changeprop etc.) as well as 5xxs. We saw strange behavior at the time, such as ping 208.80.154.196 source 208.80.154.131 not working from cr2-eqiad, and ping bast1001 not working from cr1-eqiad. Due to the site issues we had to revert quickly, so unfortunately we couldn't debug this extensively. Later, we found that rpd had crashed during all this, so this may be a (or another) JunOS bug:
< mark> Aug 18 12:04:16  re1.cr1-eqiad /kernel: BAD_PAGE_FAULT: pid 2040 (rpd), uid 0: pc 0x8200343 got a read fault at 0x40024, x86 fault flags = 0x4
< mark> Aug 18 12:04:16  re1.cr1-eqiad /kernel: Trapframe Register Dump:
< mark> Aug 18 12:04:16  re1.cr1-eqiad /kernel: Ieax: 00040000Iecx: 090cc000Iedx: 00000000Iebx: 00000000
< mark> Aug 18 12:04:16  re1.cr1-eqiad /kernel: Iesp: bfbedaa0Iebp: bfbedab8Iesi: 00000000Iedi: 00040000
< mark> Aug 18 12:04:16  re1.cr1-eqiad /kernel: Ieip: 08200343Ieflags: 00010206
< mark> Aug 18 12:04:16  re1.cr1-eqiad /kernel: Ics: 0033Iss: 003bIds: bfbe003bIes: 930003b
< mark> Aug 18 12:04:16  re1.cr1-eqiad /kernel: Ifs: 003bItrapno: 0000000cIerr: 00000004
< mark> Aug 18 12:04:16  re1.cr1-eqiad /kernel: Page table info for PC address 0x8200343: PDE = 0x9e500067, PTE = ad964425
< mark> Aug 18 12:04:16  re1.cr1-eqiad /kernel: Dumping 16 bytes starting at PC address 0x8200343:
< mark> Aug 18 12:04:16  re1.cr1-eqiad /kernel:     8b 70 24 85 f6 0f 84 96 00 00 00 8b 56 08 8b 4e
  • Finally, a more minor issue that also happened during the cr2-eqiad upgrade: the reboot failed on one of the REs, which was then stuck with no console output or any other activity for quite some time, until the watchdog kicked in and rebooted it. This is another JunOS bug. It didn't happen in subsequent reboots, however, and planned reboots are quite rare and, by their nature, scheduled, so we won't spend more time on it.
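
The static backup route workaround from the first item above boils down to deactivating the relevant static routes; a minimal sketch covering just the two affected prefixes (the real change disabled all of the backup statics):

# Deactivate the static backup routes so the iBGP routes from cr2-eqiad become active again
deactivate routing-options static route 208.80.154.224/32
deactivate routing-options rib inet6.0 static route 2620:0:861:ed1a::2:d/128
commit comment "disable static backup routes (static->OSPF->BGP interaction)"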