
roll out sensible flow-table-sizes to Juniper core routers with sampling enabled
Closed, Resolved, Public

Description

After some back-and-forth with Juniper on https://my.juniper.net/#dashboard/srdetails/2020-0310-0320 I've verified that the default flow table size across the fleet is just 1024 entries for IPv4 and 1024 entries for IPv6:

cdanis@cr2-eqsin> show services accounting status inline-jflow fpc-slot 0
  Status information
    FPC Slot: 0
    IPV4 export format: Version-IPFIX, IPV6 export format: Version-IPFIX
    BRIDGE export format: Not set, MPLS export format: Not set
    IPv4 Route Record Count: 787045, IPv6 Route Record Count: 78759, MPLS Route Record Count: 0
    Route Record Count: 865804, AS Record Count: 554724
    Route-Records Set: Yes, Config Set: Yes
    Service Status: PFE-0: Steady 
    Using Extended Flow Memory?: PFE-0: No 
    Flex Flow Sizing ENABLED?: PFE-0: No 
    IPv4 MAX FLOW Count: 1024, IPv6 MAX FLOW Count: 1024
    BRIDGE MAX FLOW Count: 1024, MPLS MAX FLOW Count: 1024

This absurdly small sizing is apparently the default since Junos 15.1F2, per docs.

This means that we routinely overflow these tables -- a guess from manual polling is several hundred flows/second even at nadir:

cdanis@cr2-eqsin> show services accounting errors inline-jflow fpc-slot 0 | match "Flow Creation Failures"    
    Flow Creation Failures: 1146233714
    IPv4 Flow Creation Failures: 1111175982
    IPv6 Flow Creation Failures: 35057732

cdanis@cr2-eqsin> show services accounting errors inline-jflow fpc-slot 0 | match "Flow Creation Failures"    
    Flow Creation Failures: 1146234132
    IPv4 Flow Creation Failures: 1111176365
    IPv6 Flow Creation Failures: 35057767

This means that our netflow data is probably pretty unreliable, likely one of the causes of T246618 (and see also T246618#5946544 for some more background).
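
As a rough aid for repeating the manual polling above, here is a minimal sketch that parses two pasted outputs and reports how much each counter grew. The helper names (`parse_failures`, `failure_deltas`) are made up for illustration. The interval between the two polls is not recorded in this task, so the sketch reports raw deltas rather than a per-second rate.

```python
import re

# Matches lines such as "    IPv4 Flow Creation Failures: 1111175982".
COUNTER_RE = re.compile(
    r"^[ \t]*((?:IPv[46] )?Flow Creation Failures):[ \t]*(\d+)[ \t]*$", re.M
)

def parse_failures(cli_output: str) -> dict[str, int]:
    """Map each 'Flow Creation Failures' counter name to its value."""
    return {name: int(value) for name, value in COUNTER_RE.findall(cli_output)}

def failure_deltas(first: str, second: str) -> dict[str, int]:
    """Counter increase between two polls of the same FPC."""
    a, b = parse_failures(first), parse_failures(second)
    return {name: b[name] - a[name] for name in a}

# The two polls of cr2-eqsin quoted above.
POLL_1 = """\
    Flow Creation Failures: 1146233714
    IPv4 Flow Creation Failures: 1111175982
    IPv6 Flow Creation Failures: 35057732
"""
POLL_2 = """\
    Flow Creation Failures: 1146234132
    IPv4 Flow Creation Failures: 1111176365
    IPv6 Flow Creation Failures: 35057767
"""

deltas = failure_deltas(POLL_1, POLL_2)
print(deltas)
```

Dividing each delta by the number of seconds between polls gives the failure rate. For these two polls the total grew by 418 (383 IPv4 plus 35 IPv6).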

The MX series supports a flex-flow-sizing option that doesn't require a manual sizing between IPv4 tables and IPv6 tables; however, according to docs it is broken on the MX204, where manual sizing is required.

On non-MX204s, let's roll out flex-flow-sizing. On MX204s, let's roll out this split:

chassis fpc 0 inline-services flow-table-size ipv4-flow-table-size 11
chassis fpc 0 inline-services flow-table-size ipv6-flow-table-size 4

which provides 11*256K flows for IPv4 and 4*256K flows for IPv6. This lines up reasonably well with data gathered on flow allocation failure counters, where our MX204s show a ratio of about 93% IPv4 allocation failures. This should probably overprovision for each kind of active flow; we can check after we've run with the new sizes for a while.
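
The arithmetic behind the split can be sketched as follows. This assumes "256K" means 256 * 1024 entries per configured unit, which is an assumption read off the description and not checked against Juniper documentation. The status output captured later on this task reports different MAX FLOW counts, so treat the absolute numbers as approximate.

```python
UNIT = 256 * 1024  # assumption: one configured unit = 256 * 1024 flow entries

def table_entries(units: int) -> int:
    """Flow-table entries for a given flow-table-size value."""
    return units * UNIT

def ipv4_share(ipv4: int, ipv6: int) -> float:
    """Fraction of the combined total that is IPv4."""
    return ipv4 / (ipv4 + ipv6)

ipv4_units, ipv6_units = 11, 4

print(table_entries(ipv4_units))  # 2883584
print(table_entries(ipv6_units))  # 1048576

# Share of table space given to IPv4 by the 11/4 split.
split_share = ipv4_share(ipv4_units, ipv6_units)
print(round(split_share, 3))  # 0.733

# Share of flow creation failures that were IPv4 on cr2-eqsin,
# using the first poll quoted in the description.
observed_share = ipv4_share(1111175982, 35057732)
print(round(observed_share, 3))
```

Under these assumptions the split gives IPv6 a larger share of table space than its share of failures, which is consistent with the intent to overprovision both families.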

Deployment plan

Initial assumption (since disproven): changing these options requires an FPC restart, so we'd have to work at off-peak times and geodns-depool sites beforehand.
Update: as it turns out, no FPC/linecard/router restart is required; it is enough to commit the new configuration and wait a few minutes.

For ease of validation, you can issue a clear services accounting statistics inline-jflow fpc-slot N after the FPC is in the new state, to reset error counters to 0.
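
A minimal sketch of the per-model choice described above, emitting the commands to paste for one router. The function names and the `slot_path` parameter are hypothetical. The `set` prefix on the flow-table-size lines is an assumption based on the `set chassis fpc 5 inline-services flex-flow-sizing` form logged in SAL. The slot hierarchy varies per platform (SAL entries on this task show both `fpc 5` and `afeb slot 0`), so it is passed in and not guessed.

```python
def sizing_commands(model: str, slot_path: str = "fpc 0") -> list[str]:
    """Config-mode commands for one router, following the per-model plan."""
    base = f"set chassis {slot_path} inline-services"
    if model == "MX204":
        # Manual split, because flex-flow-sizing is reported broken on MX204.
        return [
            f"{base} flow-table-size ipv4-flow-table-size 11",
            f"{base} flow-table-size ipv6-flow-table-size 4",
        ]
    return [f"{base} flex-flow-sizing"]

def clear_counters_command(fpc_slot: int) -> str:
    """Operational-mode command to reset the error counters after the change."""
    return f"clear services accounting statistics inline-jflow fpc-slot {fpc_slot}"

for line in sizing_commands("MX204"):
    print(line)
print(clear_counters_command(0))
```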

Let's start with two routers at low-risk / low-edge-traffic sites, one MX204 and one MX480:

  • First: cr2-eqsin (MX204)

After success there:

  • cr1-codfw (MX480; has more transit traffic than cr2-codfw, but cr2 has the transport link to eqiad, which seems more disruptive to disable when depooled)

After reconfiguring each router, look for any obvious changes in netflow data at that site, check that the allocation failure counters reset, and check whether they continue incrementing. Also make sure the volume of netflow data stored in Druid is not increasing by too much.

Assuming all goes well, continue on with:

  • cr3-ulsfo
  • cr4-ulsfo
  • cr1-eqsin
  • cr2-codfw
  • cr3-knams
  • cr2-esams
  • cr3-esams
  • cr1-eqiad
  • cr2-eqiad

Grouped by site as we'll want to do just one maintenance window/site once we're confident.
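
To illustrate the one-window-per-site grouping, here is a small sketch that derives maintenance windows from the list above. It assumes the site is the part of the hostname after the first hyphen. `site_of` and `windows` are names invented for this example.

```python
from itertools import groupby

# Remaining routers, in the order listed above.
ROUTERS = [
    "cr3-ulsfo", "cr4-ulsfo", "cr1-eqsin", "cr2-codfw", "cr3-knams",
    "cr2-esams", "cr3-esams", "cr1-eqiad", "cr2-eqiad",
]

def site_of(router: str) -> str:
    """Assumed convention: 'cr3-ulsfo' belongs to site 'ulsfo'."""
    return router.split("-", 1)[1]

def windows(routers: list[str]) -> list[tuple[str, list[str]]]:
    """One (site, routers) entry per run of consecutive same-site routers."""
    return [(site, list(group)) for site, group in groupby(routers, key=site_of)]

plan = windows(ROUTERS)
for site, members in plan:
    print(site, members)
```

`groupby` only merges consecutive entries, so the list order is preserved. For the list above this yields six windows.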

No work needed on cr2-eqdfw and cr2-eqord at this time, because sampling isn't enabled.

Event Timeline


Discussed it with Chris on IRC, LGTM.

Change 583134 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] depool eqsin for router maintenance

https://gerrit.wikimedia.org/r/583134

Change 583134 merged by CDanis:
[operations/dns@master] depool eqsin for router maintenance

https://gerrit.wikimedia.org/r/583134

Mentioned in SAL (#wikimedia-operations) [2020-03-24T20:26:37Z] <cdanis> commit flow-table-size on cr2-eqsin T248394

Change 583140 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] Revert "depool eqsin for router maintenance"

https://gerrit.wikimedia.org/r/583140

Change 583140 merged by CDanis:
[operations/dns@master] Revert "depool eqsin for router maintenance"

https://gerrit.wikimedia.org/r/583140

Deployed on cr2-eqsin:

cdanis@cr2-eqsin> show services accounting status inline-jflow fpc-slot 0    
  Status information
    FPC Slot: 0
    IPV4 export format: Version-IPFIX, IPV6 export format: Version-IPFIX
    BRIDGE export format: Not set, MPLS export format: Not set
    IPv4 Route Record Count: 787796, IPv6 Route Record Count: 78793, MPLS Route Record Count: 0
    Route Record Count: 866589, AS Record Count: 204864
    Route-Records Set: Yes, Config Set: Yes
    Service Status: PFE-0: Steady 
    Using Extended Flow Memory?: PFE-0: No 
    Flex Flow Sizing ENABLED?: PFE-0: No 
    IPv4 MAX FLOW Count: 3843279, IPv6 MAX FLOW Count: 1397556
    BRIDGE MAX FLOW Count: 1024, MPLS MAX FLOW Count: 1024

cdanis@cr2-eqsin> show services accounting errors inline-jflow fpc-slot 0    
  Error information
    FPC Slot: 0
    Flow Creation Failures: 0
    Route Record Lookup Failures: 25, AS Lookup Failures: 25
    Export Packet Failures: 0
    Memory Overload: No, Memory Alloc Fail Count: 0

    IPv4:
    IPv4 Flow Creation Failures: 0
    IPv4 Route Record Lookup Failures: 0, IPv4 AS Lookup Failures: 0
    IPv4 Export Packet Failures: 0

    IPv6:
    IPv6 Flow Creation Failures: 0
    IPv6 Route Record Lookup Failures: 25, IPv6 AS Lookup Failures: 25
    IPv6 Export Packet Failures: 0

Flow creation failure counters still haven't increased from 0 since 20 minutes after the repool, so that's nice.
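
The manual check above could be sketched as a script along these lines. It assumes the field labels appear exactly as in the pasted output. `field` and `problems` are hypothetical helpers, and routers using flex-flow-sizing may report their table sizes differently.

```python
import re

DEFAULT_MAX_FLOWS = 1024  # table size reported before the change

def field(text: str, label: str) -> int:
    """Return the integer that follows 'label:' in pasted CLI output."""
    match = re.search(re.escape(label) + r":\s*(\d+)", text)
    if match is None:
        raise ValueError(f"{label!r} not found")
    return int(match.group(1))

def problems(status: str, errors: str) -> list[str]:
    """List anything suggesting the new sizing has not taken effect."""
    found = []
    for family in ("IPv4", "IPv6"):
        if field(status, f"{family} MAX FLOW Count") <= DEFAULT_MAX_FLOWS:
            found.append(f"{family} table still at default size")
        if field(errors, f"{family} Flow Creation Failures") != 0:
            found.append(f"{family} flow creation failures are non-zero")
    return found

# Relevant lines from the cr2-eqsin output above.
STATUS = "    IPv4 MAX FLOW Count: 3843279, IPv6 MAX FLOW Count: 1397556\n"
ERRORS = (
    "    IPv4 Flow Creation Failures: 0\n"
    "    IPv6 Flow Creation Failures: 0\n"
)

print(problems(STATUS, ERRORS))  # []
```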

Interestingly, post-maintenance traffic is only about 7/8ths of pre-maintenance (according to the frontend traffic ATS/Varnish stats) because we're deeper in the site's nadir, yet the number of bytes and packets reported by netflow have both increased by about 15-20%. So maybe it worked?

Change 583309 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] depool codfw for router maintenance

https://gerrit.wikimedia.org/r/583309

Change 583309 merged by CDanis:
[operations/dns@master] depool codfw for router maintenance

https://gerrit.wikimedia.org/r/583309

Mentioned in SAL (#wikimedia-operations) [2020-03-25T11:35:03Z] <cdanis> depool codfw for router maintenance T248394

Mentioned in SAL (#wikimedia-operations) [2020-03-25T11:50:22Z] <cdanis> cr1-codfw: set chassis fpc 5 inline-services flex-flow-sizing and request chassis fpc restart slot 5 T248394

Change 583354 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] Revert "depool codfw for router maintenance"

https://gerrit.wikimedia.org/r/583354

Change 583354 merged by CDanis:
[operations/dns@master] Revert "depool codfw for router maintenance"

https://gerrit.wikimedia.org/r/583354

Deployed to codfw but caused an outage. Incident report in progress.

Druid disk usage is not greatly increased, and the routers seem happy. Will reconfigure another router or two, and work on Homer-izing the change, today.

Oh, for posterity, it definitely worked -- here's bytes + packets reported by netflow with dst IP == any eqsin loadbalancer IP:

[image.png: graph of bytes and packets reported by netflow for eqsin load balancer destination IPs]

Change 583740 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/homer/public@master] phased rollout of sensible flow-table-sizes

https://gerrit.wikimedia.org/r/583740

Change 583748 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] depool ulsfo for router maintenance

https://gerrit.wikimedia.org/r/583748

Change 583748 merged by CDanis:
[operations/dns@master] depool ulsfo for router maintenance

https://gerrit.wikimedia.org/r/583748

Change 583755 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] Revert "depool ulsfo for router maintenance"

https://gerrit.wikimedia.org/r/583755

Mentioned in SAL (#wikimedia-operations) [2020-03-26T21:12:56Z] <cdanis> applied flow-table-size configuration to cr4-ulsfo which did not need a reboot to apply it T248394

Change 583755 merged by CDanis:
[operations/dns@master] Revert "depool ulsfo for router maintenance"

https://gerrit.wikimedia.org/r/583755

Mentioned in SAL (#wikimedia-operations) [2020-03-26T21:32:36Z] <cdanis> cdanis@re0.cr1-eqsin# set chassis afeb slot 0 inline-services flex-flow-sizing cdanis@re0.cr1-eqsin# commit comment "flex-flow-sizing T248394"

Mentioned in SAL (#wikimedia-operations) [2020-03-30T12:26:14Z] <cdanis> cdanis@re0.cr2-codfw# set chassis fpc 5 inline-services flex-flow-sizing cdanis@re0.cr2-codfw# commit comment "flex-flow-sizing T248394"

Mentioned in SAL (#wikimedia-operations) [2020-03-30T23:08:18Z] <cdanis> cdanis@cr3-knams# commit comment "sensible flow table sizes T248394"

Mentioned in SAL (#wikimedia-operations) [2020-03-30T23:16:41Z] <cdanis> cr2-esams: commit flex-flow-sizing T248394

Mentioned in SAL (#wikimedia-operations) [2020-03-30T23:30:11Z] <cdanis> cr3-esams: commit flex-flow-sizing T248394

Mentioned in SAL (#wikimedia-operations) [2020-03-31T15:01:09Z] <cdanis> cr2-eqiad: commit flex-flow-sizing T248394

Mentioned in SAL (#wikimedia-operations) [2020-03-31T15:05:49Z] <cdanis> cr1-eqiad: commit flex-flow-sizing T248394

Change 583740 merged by jenkins-bot:
[operations/homer/public@master] completed rollout of sensible flow-table-sizes

https://gerrit.wikimedia.org/r/583740

Homer-ized and done.

Mentioned in SAL (#wikimedia-operations) [2020-07-01T17:14:58Z] <XioNoX> set flex-flow-sizing to cr2-eqsin - T248394

flex-flow-sizing has been fixed in a more recent Junos version https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1356072
eqsin is now using it.

Change 609109 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Use flex-flow-sizing on MX204

https://gerrit.wikimedia.org/r/c/operations/homer/public/+/609109

Change 609109 merged by jenkins-bot:
[operations/homer/public@master] Use flex-flow-sizing on MX204

https://gerrit.wikimedia.org/r/c/operations/homer/public/+/609109

Mentioned in SAL (#wikimedia-operations) [2020-07-02T08:59:32Z] <XioNoX> deploy flex flow for MX204s - T248394