After some back-and-forth with Juniper on https://my.juniper.net/#dashboard/srdetails/2020-0310-0320 I've verified that the default flow table size across the fleet is just 1024 entries for IPv4 and 1024 entries for IPv6:
```
cdanis@cr2-eqsin> show services accounting status inline-jflow fpc-slot 0
Status information
FPC Slot: 0
IPV4 export format: Version-IPFIX, IPV6 export format: Version-IPFIX
BRIDGE export format: Not set, MPLS export format: Not set
IPv4 Route Record Count: 787045, IPv6 Route Record Count: 78759, MPLS Route Record Count: 0
Route Record Count: 865804, AS Record Count: 554724
Route-Records Set: Yes, Config Set: Yes
Service Status: PFE-0: Steady
Using Extended Flow Memory?: PFE-0: No
Flex Flow Sizing ENABLED?: PFE-0: No
IPv4 MAX FLOW Count: 1024, IPv6 MAX FLOW Count: 1024
BRIDGE MAX FLOW Count: 1024, MPLS MAX FLOW Count: 1024
```
This absurdly small sizing is apparently the default since Junos 15.1F2, per docs.
This means that we routinely overflow these tables -- a rough estimate from manual polling is several hundred flow creation failures per second, even at traffic nadir:
```
cdanis@cr2-eqsin> show services accounting errors inline-jflow fpc-slot 0 | match "Flow Creation Failures"
Flow Creation Failures: 1146233714
IPv4 Flow Creation Failures: 1111175982
IPv6 Flow Creation Failures: 35057732
cdanis@cr2-eqsin> show services accounting errors inline-jflow fpc-slot 0 | match "Flow Creation Failures"
Flow Creation Failures: 1146234132
IPv4 Flow Creation Failures: 1111176365
IPv6 Flow Creation Failures: 35057767
```
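As a sanity check on the "several hundred per second" guess, the deltas between the two counter samples above can be computed directly (a sketch; the exact polling interval wasn't recorded, so this gives the per-poll delta rather than a precise rate):

```python
# Two consecutive samples of the "Flow Creation Failures" counters on
# cr2-eqsin, taken a short interval apart during manual polling.
sample1 = {"total": 1146233714, "ipv4": 1111175982, "ipv6": 35057732}
sample2 = {"total": 1146234132, "ipv4": 1111176365, "ipv6": 35057767}

deltas = {k: sample2[k] - sample1[k] for k in sample1}
print(deltas)  # {'total': 418, 'ipv4': 383, 'ipv6': 35}
```

So roughly 400 failed flow creations between the two polls, the vast majority IPv4, consistent with the tables being chronically full.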
This means that our netflow data is probably pretty unreliable, likely one of the causes of T246618 (and see also T246618#5946544 for some more background).
The MX series supports a [[ https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/flex-flow-sizing-edit-chassis.html | flex-flow-sizing ]] option that avoids having to manually split the flow table between IPv4 and IPv6; however, per the docs it is broken on the MX204, where manual sizing is required.
On non-MX204s, let's roll out `flex-flow-sizing`. On MX204s, let's roll out this split:
```
chassis fpc 0 inline-services flow-table-size ipv4-flow-table-size 11
chassis fpc 0 inline-services flow-table-size ipv6-flow-table-size 4
```
which provides 11*256K flows for IPv4 and 4*256K flows for IPv6. This lines up reasonably well with [[ https://docs.google.com/spreadsheets/d/1PSxr24ZIjx7361NC7zxAf2lQ-4oSUl_gssccqwYvZi8/edit#gid=0 | data gathered on flow allocation failure counters ]], where about 93% of allocation failures on our MX204s are IPv4. This should probably overprovision for each kind of active flow; we can check after we've run with the new sizes for a while.
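A quick sketch of the arithmetic behind that split (the 256K-entries-per-unit multiplier is from the `flow-table-size` docs; the 0.73 figure is the table split we're choosing, for comparison against the ~93% IPv4 failure ratio observed):

```python
UNIT = 256 * 1024  # each flow-table-size unit is 256K flow entries

ipv4_units, ipv6_units = 11, 4
ipv4_flows = ipv4_units * UNIT  # 2,883,584 IPv4 flow entries
ipv6_flows = ipv6_units * UNIT  # 1,048,576 IPv6 flow entries

# Share of the table devoted to IPv4 under this split.
ipv4_share = ipv4_units / (ipv4_units + ipv6_units)
print(ipv4_flows, ipv6_flows, round(ipv4_share, 2))  # 2883584 1048576 0.73
```

The 73% IPv4 share is below the ~93% failure ratio, but since both tables should now be far larger than the active flow count, the exact split matters less than avoiding overflow.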
== Deployment plan
~~Pretty sure that changing these options requires an FPC restart, so we'll have to work at off-peak times and geodns-depool sites beforehand.~~
As it turns out, no FPC/linecard/router restart is required; committing the new configuration and waiting a few minutes suffices.
For ease of validation, you can issue a `clear services accounting statistics inline-jflow fpc-slot N` after the FPC is in the new state, to reset error counters to 0.
Let's start with two routers at low-risk / low-edge-traffic sites, one MX204 and one MX480:
[x] First: cr2-eqsin (MX204)
After success there:
[x] cr1-codfw (MX480; has more transit traffic than cr2-codfw, but cr2 has the transport link to eqiad, which seems more disruptive to disable when depooled)
After reconfiguring each router, look for any obvious changes in the netflow data for that site, verify that the allocation failure counters reset, and check whether they continue incrementing. Also make sure the volume of [[ https://grafana.wikimedia.org/d/000000538/druid?orgId=1&refresh=1m&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_analytics&var-druid_datasource=wmf_netflow&fullscreen&panelId=31&from=now-2d&to=now | netflow data stored in Druid ]] isn't increasing by too much.
Assuming all goes well, continue on with:
[x] cr3-ulsfo
[x] cr4-ulsfo
[x] cr1-eqsin
[x] cr2-codfw
[ ] cr3-knams
[ ] cr2-esams
[ ] cr3-esams
[ ] cr1-eqiad
[ ] cr2-eqiad
These are grouped by site, as we'll want to do just one maintenance window per site once we're confident.
No work needed on cr2-eqdfw and cr2-eqord at this time, because sampling isn't enabled.