
Add more dimensions in the netflow/pmacct/Druid pipeline
Open, High, Public

Description

For peering planning purposes it'd be useful to include a few more dimensions in our netflow/pmacct/Druid pipeline. Specifically, and in order of usefulness (a sketch of what an augmented flow event could look like follows this list):

  • BGP communities, so that we can build queries that answer the question "how much of traffic for ASN X flows through transit". Communities are essentially a tag-based system (each route can have multiple communities applied to it) that we can control on the routers, so that will be quite powerful. This raises the question of how we would best store this in Druid and query it with Turnilo. Druid's documentation mentions multi-value dimensions, which seems appropriate here, but I'm not sure whether this would work, and how :)
  • Region/site (eqiad, esams etc.): we currently have "exporter IP" which can be (ab)used for this purpose, but having the region/site is arguably more useful. If adding it to the pmacct pipeline is too much trouble, I wonder if we could use something like Druid's lookups? Perhaps too fragile and thus a terrible idea, though :)
  • AS names, e.g. coming from the MaxMind GeoIP ASN database. I think we've used that database before e.g. in the webrequest Druid database. Could we perhaps use Druid lookups for this to avoid adding another (identical) dimension to the data set?
  • Not sure if this is possible, but a dimension with the network prefix, in addition to (or instead of) the individual IP address, could be super useful as well.
  • Address family (IPv4 or IPv6)
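
To make the wish-list concrete, here is a rough sketch of what a single augmented flow event could look like. Every new field name below (as_name_src, region, net_src, ip_version, etc.) is hypothetical and only meant to illustrate the idea, not an actual schema proposal:

```python
# Hypothetical shape of one augmented netflow event; the new field names are
# illustrative only, not the actual schema.
augmented_flow_event = {
    "ip_src": "198.51.100.10",
    "ip_dst": "203.0.113.20",
    "as_src": 64496,
    "as_name_src": "Example Transit Ltd",        # from the MaxMind ASN database
    "comms": ["14907:0", "14907:2", "14907:3"],  # multi-value dimension in Druid
    "region": "eqsin",                           # derived from the exporter IP
    "net_src": "198.51.100.0/24",                # network prefix in CIDR form
    "ip_version": 4,                             # address family
}
```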

Event Timeline

faidon created this task.Jun 3 2020, 9:40 AM
Restricted Application added a project: Operations.Jun 3 2020, 9:40 AM
Restricted Application added a subscriber: Aklapper.

Some more on the Druid aspect of things:

  • We have used multi-value dimensions in Druid without problems - the data just needs to be an array and that's it.
  • We have not used Druid lookups, but if there is a not-too-big direct mapping between a set of IPs and a region, then it seems an appropriate use case.
Dzahn triaged this task as Medium priority.Jun 4 2020, 9:14 AM
Milimetric raised the priority of this task from Medium to High.Jun 4 2020, 4:18 PM
Milimetric moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

To add to the above, I'm also wondering how difficult it would be to also include AS *names*, e.g. coming from the MaxMind GeoIP ASN database. I think we've used that database before, maybe for pageview data? Could we perhaps use Druid lookups for this to avoid adding another (identical) dimension to the data set?

It's not a hard requirement or anything - it's just that for the less common ASNs, I find myself having to alt+tab to another tab or terminal to do the lookup by hand, and it'd be nice if that was viewable in Turnilo instead :)

faidon updated the task description. (Show Details)Jul 2 2020, 5:22 PM

So - how do we make progress here? Any thoughts on who/how? :) Some of these features could really make a tremendous amount of difference to our network operations and future planning, so I'm super excited about seeing these come to fruition!

elukey added a subscriber: elukey.Jul 3 2020, 6:52 AM

Adding my 2c :)

  • BGP communities - if pmacct supports adding them to the Kafka JSON message directly, it should be very easy to support from the Analytics point of view (as Joseph mentioned). We'd need to change some configs, but it shouldn't take much time.
  • Region/site/AS-names - I don't love the Druid lookups idea for two reasons: 1) the data would be augmented only in Druid, not in Hive, so if in the future we decide to use Spark for more complicated queries/reports/etc., then we'd have the same problem to solve; 2) lookups would be something "dynamic" to maintain for Druid, something that we can absolutely add but that would add extra maintenance and things to remember when making changes (nothing terrible of course, but worth mentioning among the cons). Maybe we could think about augmenting the data during Refine, like we do for webrequest (sort of). Joseph what do you think?
  • network prefix / address family - we need to check if pmacct can add this info to the Kafka JSON message; if so, it will be just like the BGP communities (some new config to update and that's it).

Region/site/AS-names - I don't love the Druid lookups idea for two reasons: 1) the data would be augmented only in Druid, not in Hive, so if in the future we decide to use Spark for more complicated queries/reports/etc., then we'd have the same problem to solve; 2) lookups would be something "dynamic" to maintain for Druid, something that we can absolutely add but that would add extra maintenance and things to remember when making changes (nothing terrible of course, but worth mentioning among the cons). Maybe we could think about augmenting the data during Refine, like we do for webrequest (sort of). Joseph what do you think?

I think it's a good idea to augment the data we store on HDFS. Whether we use lookups or not, it will allow us to keep historical data as accurate as we can (recomputing old IPs with a current MaxMind DB leads to imprecision).
The concern I see here is about lambda loading - we would have the augmented data only after realtime data is reindexed (I think it's daily). If we want realtime data to be augmented as well, we need to add a streaming job in between the current stream and the Druid ingestion.
Trying to summarize options with pros and cons:

  • lookups - a lookup table load and regular update jobs to create, imprecise for historical views, doesn't help on the HDFS side (data in HDFS not augmented)
  • HDFS data augmentation only - data is as precise as it can be (weekly MaxMind update already in place), same values in HDFS and in Druid, but realtime ingested data isn't augmented with the new fields
  • HDFS data augmentation + streaming job - the above, plus realtime data being augmented in the same way HDFS data is.
Nuria added a subscriber: Nuria.EditedAug 3 2020, 9:32 PM

@faidon says this is quite useful for DOS prevention/troubleshooting, so putting it in our Next Up kanban column for this quarter

Nuria assigned this task to fdans.Aug 25 2020, 4:26 PM
fdans added a comment.Aug 27 2020, 2:20 PM

@JAllemandou and I just had a chat about these changes. Before proceeding with any of the ways Joseph described above, @faidon: how important is it that this dataset remains real time? Nuria mentioned DOS prevention so presumably it's important to keep it real time. In any case this task will require adding a data augmentation step before ingesting to druid, so using Druid lookups to get the region/site dimension won't be necessary.

As Joseph also mentioned, we could add the augmentation job but leave the requested dimensions null until Refine runs and reloads the data into Druid. But knowing which option would be the most valuable is the first thing to figure out right now.

It's critical that this data remain real-time, even if some of the fields aren't available in the real-time data.

@CDanis that makes sense. In that case what we propose is adding an intermediate data augmentation step that adds these dimensions about 6-7 hours after the data is added in real time, with the intention of adding a streaming job that adds them in real time at a later stage.

Would this still provide value?

Yes, it would. There are two use cases here:

  • DoS attack analysis, for which real-time is essential. Here, the augmented data would be helpful, but it's not required or as important as real-time
  • Historical analysis of our traffic flows with other networks, so we can propose peering with them. Here the augmented data would be very helpful.

Does that make sense?

fdans added a comment.Aug 28 2020, 1:28 PM

Thanks for clarifying. A correction from my end: the extra dimensions would actually take significantly less than 6 hours, since they would not be added as part of Refine but as part of the augmentation job we would be adding, which would run as soon as the hour's events are available in Hive.

mforns moved this task from Next Up to Paused on the Analytics-Kanban board.Sep 14 2020, 4:17 PM
Nuria reassigned this task from fdans to mforns.Mon, Sep 21, 4:30 PM
Nuria added a subscriber: fdans.
mforns moved this task from Paused to In Progress on the Analytics-Kanban board.Mon, Oct 5, 2:48 PM
mforns added a comment.Tue, Oct 6, 8:38 PM

Hi all!

I believe we can use a Refine transform function to add the requested fields (except for BGP communities, IIUC) at refine time.
Plus, it seems to me the new dimensions are not introducing a data explosion, because they are already encoded within the existing fields (except for BGP again).

But I'd like to understand the context a bit better. Can @faidon or @CDanis help me please? :]

BGP communities, so that we can build queries that answer the question "how much of traffic for ASN X flows through transit". Communities are essentially a tag-based system (each route can have multiple communities applied to it) that we can control on the routers, so that will be quite powerful. This raises the question of how we would best store this in Druid and query it with Turnilo. Druid's documentation mentions multi-value dimensions, which seems appropriate here, but I'm not sure whether this would work, and how :)

This, I assume, needs to be added to the pmacct producer? Or is this already represented in the Hive data?

Region/site (eqiad, esams etc.): we currently have "exporter IP" which can be (ab)used for this purpose, but having the region/site is arguably more useful. If adding it to the pmacct pipeline is too much trouble, I wonder if we could use something like Druid's lookups? Perhaps too fragile and thus a terrible idea, though :)

Which field stores the "exporter ip" in the Hive netflow table (couldn't find it by name)? Is there a direct equivalent from a given exporter ip to a region (eqiad, esams, etc)? Or a set of rules that the transform function could implement? Are there any related docs that I can read? Sorry for my ignorance in this.

Thanks a lot!

Nuria added a comment.Tue, Oct 6, 8:40 PM

@mforns: it could also be a second job run after the refined one (similar to how we do virtual-pageviews) as we probably do not want to create special refine functions for just one dataset

Change 632603 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Pmacct add standard BGP community to flows

https://gerrit.wikimedia.org/r/632603

This, I assume, needs to be added to the pmacct producer?

That's correct, the issue is that a flow can have several BGP communities, which would be represented in the form:
"comms": "14907:0_14907:2_14907:3"
meaning that the flow has the three communities 14907:0, 14907:2 and 14907:3.
So we can easily add it to the pmacct producer (let me know when a good time to do so is). But I believe Faidon's question is about how to use it in Druid/Turnilo to, for example, filter only on 14907:2.

Which field stores the "exporter ip" in the Hive netflow table (couldn't find it by name)?

peer_ip_src

Is there a direct equivalent from a given exporter ip to a region (eqiad, esams, etc)?
Or a set of rules that the transform function could implement?

Yes, but I'm not sure what the most robust way to proceed is.
For example, map the above IP against a static list (e.g. if the exporter IP is in 103.102.166.0/24, mark it as eqsin). If that's the best option I can provide you with the full list.
Or an explicit per-IP mapping (less preferred, as it would be more to maintain).
Another option would be to resolve the IP and use the DC string from its hostname.
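
A minimal sketch of the static-list approach, using Python's ipaddress module. The prefix-to-site table is illustrative: only the 103.102.166.0/24 -> eqsin entry comes from the comment above, the rest would come from the full list (or the Puppet data mentioned further down):

```python
import ipaddress

# Example prefix-to-site table; 103.102.166.0/24 -> eqsin is from the comment
# above, the other entries would be filled in from the full list.
PREFIX_TO_SITE = {
    ipaddress.ip_network("103.102.166.0/24"): "eqsin",
    # ipaddress.ip_network("..."): "eqiad",
    # ...
}

def exporter_ip_to_site(exporter_ip):
    """Map an exporter IP (peer_ip_src) to a site name, or 'unknown'."""
    addr = ipaddress.ip_address(exporter_ip)
    for prefix, site in PREFIX_TO_SITE.items():
        if addr in prefix:
            return site
    return "unknown"

print(exporter_ip_to_site("103.102.166.7"))  # -> eqsin
```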

Are there any related docs that I can read? Sorry for my ignorance in this.

Not sure what you're looking for. If it doesn't exist we can probably create/update it.
Doc about Netflow: https://wikitech.wikimedia.org/wiki/Netflow
Netbox and Puppet will have some forms of IP prefixes to sites mapping.

Thanks @ayounsi

"comms": "14907:0_14907:2_14907:3"
meaning that the flow has the three communities 14907:0, 14907:2 and 14907:3.
So we can easily add it to the pmacct producer (let me know when a good time to do so is). But I believe Faidon's question is about how to use it in Druid/Turnilo to, for example, filter only on 14907:2.

Awesome. Yes, as you said, Druid allows for multi-value dimensions. Either the Refine job or a subsequent job can transform BGP strings like "14907:0_14907:2_14907:3" into a list like ["14907:0", "14907:2", "14907:3"] and that would be ingested by Druid easily. In Turnilo's UI you would just use the drop-down filter with check-boxes to select those communities that you want to see (1 or more). I saw your patch, and think that whenever that gets merged, the current Refine job will automagically add that field to the refined netflow table (@Nuria correct me if I'm wrong).
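
A sketch of that transformation in plain Python (the real job would do the equivalent split in the Refine/Spark code), under the assumption that an empty or missing comms value should map to an empty list:

```python
def parse_comms(comms):
    """Split pmacct's underscore-joined community string into a list
    suitable for a Druid multi-value dimension."""
    return comms.split("_") if comms else []

print(parse_comms("14907:0_14907:2_14907:3"))
# -> ['14907:0', '14907:2', '14907:3']
```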

For example, map the above IP against a static list (e.g. if the exporter IP is in 103.102.166.0/24, mark it as eqsin). If that's the best option I can provide you with the full list.

I think that would be the easiest! The only downside I can see is that we'd have to keep the mapping updated in our code. Do you think this mapping list is likely to change frequently?

Not sure what you're looking for. If it doesn't exist we can probably create/update it.
Doc about Netflow: https://wikitech.wikimedia.org/wiki/Netflow
Netbox and Puppet will have some forms of IP prefixes to sites mapping.

Thanks a lot for that, will read!


Also, @Nuria
I think your idea of having an extra job that expands the refined data into yet another netflow data set is a good one!
We'd have another data set to maintain (deletion job, etc.), but I think it would be less error prone.
Maybe... we could output the new "expanded" netflow data set inside the event database, thus allowing it to be sanitized with the same eventlogging whitelist?

Awesome. Yes, as you said, Druid allows for multi-value dimensions. Either the Refine job or a subsequent job can transform BGP strings like "14907:0_14907:2_14907:3" into a list like ["14907:0", "14907:2", "14907:3"] and that would be ingested by Druid easily. In Turnilo's UI you would just use the drop-down filter with check-boxes to select those communities that you want to see (1 or more). I saw your patch, and think that whenever that gets merged, the current Refine job will automagically add that field to the refined netflow table (@Nuria correct me if I'm wrong).

Ok to merge anytime or should I sync up with you?

I think that would be the easiest! The only downside I can see is that we'd have to keep the mapping updated in our code. Do you think this mapping list is likely to change frequently?

Could you base your list on https://github.com/wikimedia/puppet/blob/production/modules/network/data/data.yaml#L32
Which is then exposed with https://github.com/wikimedia/puppet/blob/production/modules/network/manifests/constants.pp#L11
And can be accessed like https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/flowspec.pp#L16
Otherwise yes, we would have to keep updating it manually, which doesn't happen often.

OK, after a very interesting chat with Joseph, here are our conclusions:

  • It would be cool to have the core of the required transformations (ASN to AS name via MaxMind, exporter IP to region, etc.) available as Java core methods. Those could be used from Spark SQL, from Hive (as UDFs) or from Flink (see the sketch after this list), which would be super interesting if we want to apply those transformations in a lambda architecture.
  • We are not sure whether to just expand the netflow data set in place with one Refine transform function that uses the mentioned core methods, or to have an extra Oozie job that expands the current netflow data set into another data set, i.e. netflow_expanded. Maybe leaning towards the transform function, because the code would be shorter and there would be fewer moving pieces?
  • We thought it would be OK to move the netflow data and table to the event database, so that it can be sanitized as part of the event sanitization (the sanitized version of netflow would be stored in event_sanitized).
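
As a rough illustration of the first point, here is how such transformations could be wired up for Spark SQL. This is a PySpark sketch with placeholder Python implementations and an assumed table name (wmf.netflow); the real core methods would live in refinery-core as Java/Scala code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName("netflow-augment-sketch").getOrCreate()

# Placeholder implementations; the real versions would be refinery-core methods.
def parse_comms(comms):
    return comms.split("_") if comms else []

def exporter_ip_to_site(ip):
    return "eqsin" if ip and ip.startswith("103.102.166.") else "unknown"

# Register them so they can be used from Spark SQL (and, analogously,
# exposed as Hive UDFs or called from a streaming job).
spark.udf.register("parse_comms", parse_comms, ArrayType(StringType()))
spark.udf.register("exporter_ip_to_site", exporter_ip_to_site, StringType())

augmented = spark.sql("""
    SELECT *,
           parse_comms(comms)               AS comms_array,
           exporter_ip_to_site(peer_ip_src) AS region
    FROM wmf.netflow
""")
```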
mforns added a comment.Wed, Oct 7, 1:18 PM

@ayounsi

Ok to merge anytime or should I sync up with you?

I believe it's OK to merge, and that Refine should identify the new field and automagically evolve netflow's Hive schema. But let me confirm later today!

Could you base your list on https://github.com/wikimedia/puppet/blob/production/modules/network/data/data.yaml#L32
Which is then exposed with https://github.com/wikimedia/puppet/blob/production/modules/network/manifests/constants.pp#L11
And can be accessed like https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/flowspec.pp#L16
Otherwise yes, we would have to keep updating it manually, which doesn't happen often.

OK, I will try to pass those as a parameter of the transform function from puppet itself.

Nuria added a comment.Wed, Oct 7, 2:58 PM

Maybe leaning towards the transform function, because the code would be shorter and there would be fewer moving pieces?

I think having very specific code in Refine that applies to just one job is an anti-pattern; it is (you are right) shorter, but in my opinion much more brittle.

But I believe Faidon's question is about how to use it in Druid/Turnilo to, for example, filter only on 14907:2

Yes, this would work (if that field is ingested as a multi-value dimension). This is similar to revision_tags in edit_hourly: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/edit/hourly/edit_hourly.hql (tags are ingested as an array of strings)

See: https://druid.apache.org/docs/latest/querying/multi-value-dimensions.html
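
For reference, a sketch of what a native Druid query filtering on a single community could look like once comms is ingested as a multi-value dimension: with multi-value dimensions, a plain selector filter matches a row if any of its values matches. The datasource, interval and metric names below are made up:

```python
import json

# Hypothetical native Druid query: total bytes for flows carrying community
# 14907:2. Datasource, interval and metric names are assumptions.
query = {
    "queryType": "timeseries",
    "dataSource": "wmf_netflow",
    "granularity": "hour",
    "intervals": ["2020-10-01/2020-10-02"],
    "filter": {"type": "selector", "dimension": "comms", "value": "14907:2"},
    "aggregations": [{"type": "longSum", "name": "bytes", "fieldName": "bytes"}],
}
print(json.dumps(query, indent=2))
```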

mforns added a comment.Wed, Oct 7, 5:32 PM

@ayounsi Confirmed that you can merge the changes that add BGP communities to pmacct!
We'll be monitoring the kafka topic. Thanks!

Change 632603 merged by Ayounsi:
[operations/puppet@production] Pmacct add standard BGP community to flows

https://gerrit.wikimedia.org/r/632603

Done! And confirmed with kafkacat, e.g.: "comms": "2914:420_2914:1008_2914:2000_2914:3000_14907:4"
There are also no drops in Turnilo.

Awesome!
The size of the events has increased by about 25-30%, which is considerable, but I believe sustainable for now.
When we sanitize this data set for long term retention, we'll have to think about the size of the remaining data.

Wow, that's more than expected indeed! If it's an issue down the road we could think of filtering out some communities (for example only keeping ours).

mforns added a comment.Thu, Oct 8, 1:37 PM

After discussing with the team, we think it's fine for now.
If we want to add more fields or increase the sampling ratio,
then we should indeed make some calculations to make sure we're ok :]

mforns added a comment.Fri, Oct 9, 2:08 PM

Hi @ayounsi, can you help me? I have some more questions:

  • What is the field that we want to extract the AS name for? I see as_src, as_dst, peer_as_src, peer_as_dst?
  • Regarding network prefix: I assume IPs can have different network prefix lengths, no? If so, how can we determine the length?
  • Regarding address family (IPv4 or IPv6), what ip field do we want that from?

Thanks!

  • What is the field that we want to extract the AS name for? I see as_src, as_dst, peer_as_src, peer_as_dst?

Ideally all of them, but at least as_src and as_dst. Note that because sampled traffic is to/from our network, if as_src is a public AS (that you can look up), as_dst will most likely be a private one (not present in the MaxMind DB), and vice versa. We could either keep them empty or feed them a static list of ASNs (see https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations#Private_AS).
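
A hedged sketch of how that could look, assuming we key the MaxMind lookup by the flow's IP (the GeoLite2-ASN database is IP-indexed) and fall back to a static table for our private ASNs; the database path and the private-ASN entries are placeholders:

```python
import geoip2.database
import geoip2.errors

# Static names for our private ASNs; placeholder entries, the real ones would
# come from the wikitech allocations page linked above.
PRIVATE_AS_NAMES = {
    64600: "example-private-as",
    # ...
}

def as_name_for(ip, asn, reader):
    """Resolve an AS name: private ASNs from the static table, public ones
    from the MaxMind GeoLite2-ASN database keyed by the flow's IP."""
    if asn in PRIVATE_AS_NAMES:
        return PRIVATE_AS_NAMES[asn]
    try:
        return reader.asn(ip).autonomous_system_organization or ""
    except geoip2.errors.AddressNotFoundError:
        return ""

# Path to the GeoLite2 ASN database is an assumption.
reader = geoip2.database.Reader("/usr/share/GeoIP/GeoLite2-ASN.mmdb")
print(as_name_for("198.51.100.10", 64496, reader))
```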

  • Regarding network prefix: I assume IPs can have different network prefix lengths, no? If so, how can we determine the length?

We would need to configure Pmacct to also export src_net + dst_net, and maybe src_mask + dst_mask (to be tested first). If the mask is needed we would probably need to do some transformations to have a single CIDR notation (xxxxx/yy). Let me know when it's fine to merge the relevant change (for src_net + dst_net at least).

  • Regarding address family (IPv4 or IPv6), what ip field do we want that from?

src_host or dst_host.

Thanks

mforns added a comment.Fri, Oct 9, 4:54 PM

What is the field that we want to extract the AS name for? I see as_src, as_dst, peer_as_src, peer_as_dst?

Ideally all of them, but at least as_src and as_dst. Note that because sampled traffic is to/from our network, if as_src is a public AS (that you can look up), as_dst will most likely be a private one (not present in the MaxMind DB), and vice versa. We could either keep them empty or feed them a static list of ASNs (see https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations#Private_AS).

OK, noted.

Regarding network prefix: I assume IPs can have different network prefix lengths, no? If so, how can we determine the length?

We would need to configure Pmacct to also export src_net + dst_net, and maybe src_mask + dst_mask (to be tested first). If the mask is needed we would probably need to do some transformations to have a single CIDR notation (xxxxx/yy). Let me know when it's fine to merge the relevant change (for src_net + dst_net at least).

IIUC src_net and dst_net already come in CIDR notation. Yes, let's add them to the events. We can merge them next Monday, so that we are able to react, in case we need to increase the Kafka partitions for that topic.

Regarding address family (IPv4 or IPv6), what ip field do we want that from?

src_host or dst_host.

I don't see those fields in Hive. Do we need to add them from pmacct as well?
Or maybe I misunderstood what needs to be done here... I assumed we want to determine whether the given IP is v4 or v6. But which IP would we be talking about? ip_src, ip_dst, both?

Or maybe I misunderstood what needs to be done here... I assumed we want to determine whether the given IP is v4 or v6. But which IP would we be talking about? ip_src, ip_dst, both?

Indeed, my bad: ip_src or ip_dst. Both should always be of the same version, so we could either look at just one of them, or set it to "mismatch/0/-1" if they're not the same.
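
A minimal sketch of that logic, assuming -1 is the chosen mismatch marker:

```python
import ipaddress

def address_family(ip_src, ip_dst):
    """Return 4 or 6 for the flow's address family, or -1 on a
    src/dst version mismatch (which should not normally happen)."""
    src_version = ipaddress.ip_address(ip_src).version
    dst_version = ipaddress.ip_address(ip_dst).version
    return src_version if src_version == dst_version else -1

print(address_family("198.51.100.10", "203.0.113.20"))  # -> 4
print(address_family("2001:db8::1", "2001:db8::2"))     # -> 6
print(address_family("198.51.100.10", "2001:db8::1"))   # -> -1
```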

Change 633510 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Nfacctd, add src_net, dst_net

https://gerrit.wikimedia.org/r/633510

@ayounsi

Let me know when it's fine to merge the relevant change (for src_net + dst_net at least).

Please, merge whenever you are ready. Thanks!

Change 633510 merged by Ayounsi:
[operations/puppet@production] Nfacctd, add src_net, dst_net

https://gerrit.wikimedia.org/r/633510

Merged. Note that it's not in CIDR notation, so src_mask + dst_mask would be needed to generate the CIDR form.
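
A minimal sketch of combining the two fields into a single CIDR string, assuming src_net/dst_net come as plain addresses and src_mask/dst_mask as integer prefix lengths:

```python
import ipaddress

def to_cidr(net, mask):
    """Combine pmacct's src_net/dst_net and src_mask/dst_mask into a single
    CIDR string, e.g. ('103.102.166.0', 24) -> '103.102.166.0/24'."""
    return str(ipaddress.ip_network(f"{net}/{mask}", strict=False))

print(to_cidr("103.102.166.0", 24))  # -> 103.102.166.0/24
```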

Change 633737 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Nfacct: add src_mask + dst_mask

https://gerrit.wikimedia.org/r/633737

Yes, please merge when ready, thanks!

Change 633737 merged by Ayounsi:
[operations/puppet@production] Nfacct: add src_mask + dst_mask

https://gerrit.wikimedia.org/r/633737

Change 634328 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] Add Refine transform function for Netflow data set

https://gerrit.wikimedia.org/r/634328

Change 634946 had a related patch set uploaded (by Faidon Liambotis; owner: Faidon Liambotis):
[operations/puppet@production] turnilo: add exporter hostname and region for netflow

https://gerrit.wikimedia.org/r/634946