After some back-and-forth with Juniper on https://my.juniper.net/#dashboard/srdetails/2020-0310-0320 I've verified that the default flow table size across the fleet is just 1024 entries for IPv4 and 1024 entries for IPv6:
```
cdanis@cr2-eqsin> show services accounting status inline-jflow fpc-slot 0
Status information
FPC Slot: 0
IPV4 export format: Version-IPFIX, IPV6 export format: Version-IPFIX
BRIDGE export format: Not set, MPLS export format: Not set
IPv4 Route Record Count: 787045, IPv6 Route Record Count: 78759, MPLS Route Record Count: 0
Route Record Count: 865804, AS Record Count: 554724
Route-Records Set: Yes, Config Set: Yes
Service Status: PFE-0: Steady
Using Extended Flow Memory?: PFE-0: No
Flex Flow Sizing ENABLED?: PFE-0: No
IPv4 MAX FLOW Count: 1024, IPv6 MAX FLOW Count: 1024
BRIDGE MAX FLOW Count: 1024, MPLS MAX FLOW Count: 1024
```
This absurdly small sizing is apparently the default since Junos 15.1F2, per docs.
This means that we routinely overflow these tables -- a rough estimate from manual polling is several hundred flow creation failures per second, even at traffic nadir:
```
cdanis@cr2-eqsin> show services accounting errors inline-jflow fpc-slot 0 | match "Flow Creation Failures"
Flow Creation Failures: 1146233714
IPv4 Flow Creation Failures: 1111175982
IPv6 Flow Creation Failures: 35057732
cdanis@cr2-eqsin> show services accounting errors inline-jflow fpc-slot 0 | match "Flow Creation Failures"
Flow Creation Failures: 1146234132
IPv4 Flow Creation Failures: 1111176365
IPv6 Flow Creation Failures: 35057767
```
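As a sanity check on the "several hundred per second" guess, the deltas between the two counter samples above can be computed directly (a sketch; the exact polling interval wasn't recorded, so this gives the per-poll delta rather than a precise rate):

```python
# Two consecutive samples of the "Flow Creation Failures" counters on
# cr2-eqsin, taken a short interval apart during manual polling.
sample1 = {"total": 1146233714, "ipv4": 1111175982, "ipv6": 35057732}
sample2 = {"total": 1146234132, "ipv4": 1111176365, "ipv6": 35057767}

deltas = {k: sample2[k] - sample1[k] for k in sample1}
print(deltas)  # {'total': 418, 'ipv4': 383, 'ipv6': 35}
```

So roughly 400 failed flow creations between the two polls, the vast majority IPv4, consistent with the tables being chronically full.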
This means that our netflow data is probably pretty unreliable, likely one of the causes of T246618 (and see also T246618#5946544 for some more background).
The MX series supports a [[ https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/flex-flow-sizing-edit-chassis.html | flex-flow-sizing ]] option that avoids having to manually split the flow table between IPv4 and IPv6; however, per the docs it is broken on the MX204, where manual sizing is required.
On non-MX204s, let's roll out `flex-flow-sizing`. On MX204s, let's roll out this split:
```
chassis fpc 0 inline-services flow-table-size ipv4-flow-table-size 11
chassis fpc 0 inline-services flow-table-size ipv6-flow-table-size 4
```
which provides 11*256K flows for IPv4 and 4*256K flows for IPv6. This lines up reasonably well with [[ https://docs.google.com/spreadsheets/d/1PSxr24ZIjx7361NC7zxAf2lQ-4oSUl_gssccqwYvZi8/edit#gid=0 | data gathered on flow allocation failure counters ]], where about 93% of allocation failures on our MX204s are IPv4. This should probably overprovision for each kind of active flow; we can check after we've run with the new sizes for a while.
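A quick sketch of the arithmetic behind that split (the 256K-entries-per-unit multiplier is from the `flow-table-size` docs; the 0.73 figure is the table split we're choosing, for comparison against the ~93% IPv4 failure ratio observed):

```python
UNIT = 256 * 1024  # each flow-table-size unit is 256K flow entries

ipv4_units, ipv6_units = 11, 4
ipv4_flows = ipv4_units * UNIT  # 2,883,584 IPv4 flow entries
ipv6_flows = ipv6_units * UNIT  # 1,048,576 IPv6 flow entries

# Share of the table devoted to IPv4 under this split.
ipv4_share = ipv4_units / (ipv4_units + ipv6_units)
print(ipv4_flows, ipv6_flows, round(ipv4_share, 2))  # 2883584 1048576 0.73
```

The 73% IPv4 share is below the ~93% failure ratio, but since both tables should now be far larger than the active flow count, the exact split matters less than avoiding overflow.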
== Deployment plan
~~Pretty sure that changing these options requires an FPC restart, so we'll have to work at off-peak times and geodns-depool sites beforehand.~~
As it turns out, no FPC/linecard/router restart is required; committing the new configuration and waiting a few minutes suffices.
For ease of validation, you can issue a `clear services accounting statistics inline-jflow fpc-slot N` after the FPC is in the new state, to reset error counters to 0.
Let's start with two routers at low-risk / low-edge-traffic sites, one MX204 and one MX480:
[x] First: cr2-eqsin (MX204)
After success there:
[x] cr1-codfw (MX480; has more transit traffic than cr2-codfw, but cr2 has the transport link to eqiad, which seems more disruptive to disable when depooled)
After reconfiguring each router, look for any obvious changes in the netflow data for that site, verify that the allocation failure counters reset, and check whether they continue incrementing. Also make sure the volume of [[ https://grafana.wikimedia.org/d/000000538/druid?orgId=1&refresh=1m&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_analytics&var-druid_datasource=wmf_netflow&fullscreen&panelId=31&from=now-2d&to=now | netflow data stored in Druid ]] isn't increasing by too much.
Assuming all goes well, continue on with:
[x] cr3-ulsfo
[x] cr4-ulsfo
[x] cr1-eqsin
[x] cr2-codfw
[ ] cr3-knams
[ ] cr2-esams
[ ] cr3-esams
[ ] cr1-eqiad
[ ] cr2-eqiad
These are grouped by site, as we'll want to do just one maintenance window per site once we're confident.
No work needed on cr2-eqdfw and cr2-eqord at this time, because sampling isn't enabled.