Page MenuHomePhabricator

Ingestion semantic for netflow data sent to kafka generates late-data
Closed, ResolvedPublic

Description

Here is an example of events received in kafka on 2020-06-05 around 11:20am:

{"event_type": "purge", "tag2": 0, "as_src": 13335, "as_dst": 2914, "as_path": "2914", "peer_as_src": 0, "peer_as_dst": 2914, "peer_ip_src": "103.102.166.130", "ip_src": "162.159.64.7", "ip_dst": "116.51.26.210", "port_src": 0, "port_dst": 0, "country_ip_src": "", "country_ip_dst": "SG", "tcp_flags": "0", "ip_proto": "gre", "stamp_inserted": "2020-06-04 22:25:00", "stamp_updated": "2020-06-05 11:22:01", "packets": 126000, "bytes": 12626000, "writer_id": "default_kafka/30598"}

{"event_type": "purge", "tag2": 0, "as_src": 13335, "as_dst": 2914, "as_path": "2914", "peer_as_src": 0, "peer_as_dst": 2914, "peer_ip_src": "103.102.166.130", "ip_src": "162.159.64.7", "ip_dst": "116.51.26.210", "port_src": 0, "port_dst": 0, "country_ip_src": "", "country_ip_dst": "SG", "tcp_flags": "0", "ip_proto": "gre", "stamp_inserted": "2020-06-04 22:25:00", "stamp_updated": "2020-06-05 11:23:01", "packets": 152000, "bytes": 13717000, "writer_id": "default_kafka/30652"}

We currently use the stamp_inserted field as ingestion-date for both Camus and Druid, and in that specific case this data is seen as belonging to day 2020-06-04 hour 22.
I can see examples of that pattern in other hours as well, for instance on day 2020-06-03 at hour 21

These late data are small but happen regularly: I computed the number of seconds difference between stamp_inserted and stamp_updated, and aggregated it using FLOOR(LOG10(diff)) per hour for the month of May 2020. This computation shows that hours containing at least one row where the difference is more than 10k seconds (more than 2hours and a half) are frequent (~10 per day).

2 solutions I can think of:

  • Use stamp_updated instead of stamp_inserted as main ingestion timestamp
  • Fix a limit for data late-acceptance (for instance, don't accept events having stamp_inserted later than X hours.

Event Timeline

@ayounsi is it ok to set the "Time" field to stamp_updated rather than stamp_inserted ? As Joseph pointed out we have some weird situations like the one depicted in the description:

"stamp_inserted": "2020-06-04 22:25:00", "stamp_updated": "2020-06-05 11:22:01"

Does it make a big change or is it ok? I don't recall the difference between the two fields..

Hopefully Arzhel understands this better than I do, but here's my rough understanding:

  • 'Normally' routers send netflow just on the end of a flow (e.g. TCP FIN).
  • If the flow is especially long-lived -- as these gre flows are! -- when a router decides to send netflow for a long-lived flow depends on the 'active [flow] timeout'. See also this PDF page 6 and flow-active-timeout config setting.
    • We set these values to 10 seconds, the minimum allowed.
  • In nfacctd's config we set kafka_history: 1m which buckets everything into 1 minute bins.
    • stamp_inserted is the first time the collector saw the flow, stamp_updated is the last timestamp at which it received data for it before exporting it.
    • I am pretty sure stamp_inserted comes from the start time recorded in the flow by the router, which per Juniper docs is preserved across "active timeouts" of the same flow (pg 6 of the same PDF above).

So I think we also want to set nfacctd_time_new, which looks like it will ignore the possibly-long-ago flow start time recorded by the router, and instead time-bucket based upon when the netflow packet arrived at nfacctd.

This ought to remove the spikes we see for GRE, while also being more temporally-accurate in general (for all flows that last longer than a minute).

Change 602718 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] pmacct: set nfacctd_time_new for accurate long-lived flow times

https://gerrit.wikimedia.org/r/602718

Change 602718 merged by CDanis:
[operations/puppet@production] pmacct: set nfacctd_time_new for accurate long-lived flow times

https://gerrit.wikimedia.org/r/602718

Mentioned in SAL (#wikimedia-operations) [2020-06-05T15:39:48Z] <cdanis> disabling puppet on netflow* and trying I6598d8f8 on netflow3001 first T254574

Mentioned in SAL (#wikimedia-operations) [2020-06-05T15:54:07Z] <cdanis> enabling & rerunning puppet on netflow* T254574

CDanis closed this task as Resolved.EditedJun 5 2020, 4:09 PM

The change looks effective: previously, GRE traffic was predominantly reported as a ginormous-to-the-point-of-physical-impossibility impulse function:

Screenshot_2020-06-05_12-02-56.png (829×1 px, 16 KB)

The largest of these spikes is 65.3Tbyte/minute =~ 8707 Gigabit/second, which is many orders of magnitude larger than our total ingress bandwidth.

I believe this was because of routers treating GRE as (mostly) one flow, preserving the start timestamp, and nfacctd passing along the start timestamp as stamp_inserted, which we then bucketed upon for reporting (and thus also causing the late-data condition as well, as that many bytes of traffic represents something like 9-10 days of GRE traffic at our usual rate).

nfacctd_time_new: true does seem to cause nfacctd to instead ignore the router's flow start timestamp, and report bytes/packets for each flow just in the minute in which the netflow packet was received -- we now see this GRE traffic smeared out over time, as it is in reality:

Screenshot_2020-06-05_12-07-28.png (953×1 px, 88 KB)

(The first increase in gre / temporary dip in tcp traffic is when the change is applied on netflow3001, and the second is when it was applied on the rest of the netflow collectors.)

As a last note: this should improve the accuracy of all long-lived flows (any with a duration longer than a minute); the GRE ones were just the most prominent such flows.

Great explanation @CDanis - This change should also resolve hour late-data issue - One stone two birds :)