
Add more dimensions in the netflow/pmacct/Druid pipeline
Closed, ResolvedPublic

Description

For peering planning purposes it'd be useful to include a few more dimensions in our netflow/pmacct/Druid pipeline. Specifically, and in order of usefulness:

  • BGP communities, so that we can build queries that answer the question "how much of the traffic for ASN X flows through transit?". Communities are essentially a tag-based system (each route can have multiple communities applied to it) that we can control on the routers, so this will be quite powerful. This raises the question of how best to store them in Druid and query them with Turnilo. Druid's documentation mentions multi-value dimensions, which seem appropriate here, but I'm not sure whether and how this would work :)
  • Region/site (eqiad, esams, etc.): we currently have "exporter IP", which can be (ab)used for this purpose, but having the region/site is arguably more useful. If adding it to the pmacct pipeline is too much trouble, I wonder if we could use something like Druid's lookups? Perhaps too fragile and thus a terrible idea, though :)
  • AS names, e.g. coming from the MaxMind GeoIP ASN database. I think we've used that database before e.g. in the webrequest Druid database. Could we perhaps use Druid lookups for this to avoid adding another (identical) dimension to the data set?
  • Not sure if this is possible, but a dimension with the network prefix, in addition to (or instead of) the individual IP address, could be super useful as well.
  • Address family (IPv4 or IPv6)
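To make the ask concrete, here is a hypothetical sketch of what one augmented flow record could look like. All field names and values below are illustrative only, not the final schema:

```python
# One raw netflow event (example addresses from the RFC 5737 documentation range,
# example ASN from the private documentation range).
raw_event = {
    "ip_src": "203.0.113.7",
    "as_src": 64496,
    "peer_ip_src": "103.102.166.1",  # exporter IP
}

# The augmentation step discussed in this task would derive something like:
augmented = dict(
    raw_event,
    comms=["14907:0", "14907:2"],    # BGP communities as a multi-value dimension
    region="eqsin",                  # derived from the exporter IP
    as_name_src="EXAMPLE-AS",        # from a MaxMind-style ASN lookup
    net_cidr_src="203.0.113.0/24",   # network prefix in addition to the bare IP
    ip_version="IPv4",               # address family
)
```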

Event Timeline


Region/site/AS-names - I don't love the Druid lookups idea for two reasons: 1) the data would be augmented only in Druid, not in Hive, so if in the future we decide to use Spark for more complicated queries/reports/etc., we'd have the same problem to solve; 2) lookups would be something "dynamic" to maintain for Druid, something we can absolutely add, but it would mean extra maintenance and more things to remember when making changes (nothing terrible of course, but mentioning it out loud among the cons). Maybe we could think about augmenting the data during Refine, like we do (sort of) for webrequest. Joseph, what do you think?

I think it's a good idea to augment the data we store on HDFS. Whether or not we use lookups, it will allow us to keep historical data as accurate as we can (recomputing old IPs with a current MaxMind DB leads to inaccuracies).
The concern I see here is about lambda loading: we would have the augmented data only after the realtime data is reindexed (I think that's daily). If we want the realtime data augmented as well, we need to add a streaming job between the current stream and the Druid ingestion.
Trying to summarize the options with pros and cons:

  • lookups - a lookup-table load job and regular update jobs to create; imprecise for historical views; doesn't help on the HDFS side (data in HDFS not augmented)
  • HDFS data augmentation only - data is as precise as we can make it (weekly MaxMind update already in place); same values in HDFS and in Druid; realtime-ingested data isn't augmented with the fields
  • HDFS data augmentation + streaming job - the above, plus realtime data being augmented the same way as the HDFS data.

@faidon says this is quite useful for DoS prevention/troubleshooting, so putting it on our next-up kanban for this quarter.

@JAllemandou and I just had a chat about these changes. Before proceeding with any of the ways Joseph described above, @faidon: how important is it that this dataset remains real time? Nuria mentioned DOS prevention so presumably it's important to keep it real time. In any case this task will require adding a data augmentation step before ingesting to druid, so using Druid lookups to get the region/site dimension won't be necessary.

As Joseph also mentioned, we could also add the augmentation job but leave the requested dimensions null until Refine runs and reloads the data into Druid. In any case, knowing what would be the most valuable option is the first thing to do right now.

It's critical that this data remain real-time, even if some of the fields aren't available in the real-time data.

@CDanis that makes sense. In that case, what we propose is adding an intermediate data augmentation step that adds these dimensions about 6-7 hours after they are added in real time, with the intention of adding a streaming job that adds them in real time at a later stage.

Would this still provide value?

Yes, it would. There are two use cases here:

  • DoS attack analysis, for which real-time is essential. Here, the augmented data would be helpful, but it's not required or as important as real-time
  • Historical analysis of our traffic flows with other networks, so we can propose peering with them. Here the augmented data would be very helpful.

Does that make sense?

Thanks for clarifying. A correction from my end: the extra dimensions would actually take significantly less than 6 hours, since they would not be added as part of Refine, but as part of the augmentation job we would be adding, which would run as soon as the hour's events are available in Hive.

Nuria added a subscriber: fdans.

Hi all!

I believe we can use a Refine transform function to add the requested fields (except for BGP communities IIUC) at refine time.
Plus, it seems to me the new dimensions are not introducing data explosion, because they are already encoded within the existing fields (except for BGP again).

But I'd like to understand the context a bit better. Can @faidon or @CDanis help me please? :]

BGP communities, so that we can build queries that answer the question "how much of the traffic for ASN X flows through transit?". Communities are essentially a tag-based system (each route can have multiple communities applied to it) that we can control on the routers, so this will be quite powerful. This raises the question of how best to store them in Druid and query them with Turnilo. Druid's documentation mentions multi-value dimensions, which seem appropriate here, but I'm not sure whether and how this would work :)

This, I assume, needs to be added to the pmacct producer? Or is this already represented in the Hive data?

Region/site (eqiad, esams, etc.): we currently have "exporter IP", which can be (ab)used for this purpose, but having the region/site is arguably more useful. If adding it to the pmacct pipeline is too much trouble, I wonder if we could use something like Druid's lookups? Perhaps too fragile and thus a terrible idea, though :)

Which field stores the "exporter ip" in the Hive netflow table (couldn't find it by name)? Is there a direct equivalent from a given exporter ip to a region (eqiad, esams, etc)? Or a set of rules that the transform function could implement? Are there any related docs that I can read? Sorry for my ignorance in this.

Thanks a lot!

@mforns: it could also be a second job run after the refine one (similar to how we do virtual-pageviews), as we probably do not want to create special Refine functions for just one dataset.

Change 632603 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Pmacct add standard BGP community to flows

https://gerrit.wikimedia.org/r/632603

This, I assume, needs to be added to the pmacct producer?

That's correct, the issue is that a flow can have several BGP communities, which would be represented in the form:
"comms": "14907:0_14907:2_14907:3"
That means the flow has the 3 communities 14907:0, 14907:2 and 14907:3.
So we can easily add it to the pmacct producer (let me know when is a good time to do so). But I believe Faidon's question is about how to use it in Druid/Turnilo, for example to filter only on 14907:2.

Which field stores the "exporter ip" in the Hive netflow table (couldn't find it by name)?

peer_ip_src

Is there a direct equivalent from a given exporter ip to a region (eqiad, esams, etc)?
Or a set of rules that the transform function could implement?

Yes, but not sure what the most robust way to proceed is.
For example map the above IP to a static list (for example if the exporter IP is in 103.102.166.0/24 mark it as eqsin). If it's the best option I can provide you with the full list.
Or an explicit per-IP mapping (less preferred).
Another option would be, for example, to resolve the IP and use the DC string from its hostname.
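A minimal sketch of the static-list approach, assuming a hypothetical prefix-to-site mapping (only the 103.102.166.0/24 -> eqsin pair comes from this comment; the other entry is a made-up placeholder, and the real list would come from puppet's network data):

```python
import ipaddress

# Hypothetical mapping from exporter prefixes to sites.
EXPORTER_PREFIXES = {
    "103.102.166.0/24": "eqsin",       # example from the comment above
    "198.51.100.0/24": "example-site",  # placeholder entry
}

# Pre-parse the prefixes once so lookups stay cheap.
_NETWORKS = [(ipaddress.ip_network(p), site)
             for p, site in EXPORTER_PREFIXES.items()]

def exporter_region(peer_ip_src):
    """Return the site whose prefix contains the exporter IP, or None."""
    ip = ipaddress.ip_address(peer_ip_src)
    for net, site in _NETWORKS:
        if ip in net:
            return site
    return None
```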

Are there any related docs that I can read? Sorry for my ignorance in this.

Not sure what you're looking for. If it doesn't exist we can probably create/update it.
Doc about Netflow: https://wikitech.wikimedia.org/wiki/Netflow
Netbox and Puppet will have some forms of IP prefixes to sites mapping.

Thanks @ayounsi

"comms": "14907:0_14907:2_14907:3"
That means the flow has the 3 communities 14907:0, 14907:2 and 14907:3.
So we can easily add it to the pmacct producer (let me know when is a good time to do so). But I believe Faidon's question is about how to use it in Druid/Turnilo, for example to filter only on 14907:2.

Awesome. Yes, as you said, Druid allows for multi-value dimensions. Either the Refine job or a subsequent job can transform BGP strings like "14907:0_14907:2_14907:3" into a list like ["14907:0", "14907:2", "14907:3"] and that would be ingested by Druid easily. In Turnilo's UI you would just use the drop-down filter with check-boxes to select those communities that you want to see (1 or more). I saw your patch, and think that whenever that gets merged, the current Refine job will automagically add that field to the refined netflow table (@Nuria correct me if I'm wrong).
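As a sketch, the transform described above is essentially a string split on the underscore separator (the function name is ours for illustration, not the actual Refine code):

```python
def parse_comms(comms):
    """Split pmacct's underscore-joined community string into a list
    suitable for a Druid multi-value dimension."""
    return comms.split("_") if comms else []

parse_comms("14907:0_14907:2_14907:3")
# → ["14907:0", "14907:2", "14907:3"]
```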

For example map the above IP to a static list (for example if the exporter IP is in 103.102.166.0/24 mark it as eqsin). If it's the best option I can provide you with the full list.

I think that would be the easiest! The only downside I can see is that we'd have to keep the mapping updated in our code. Do you think this mapping list is likely to change frequently?

Not sure what you're looking for. If it doesn't exist we can probably create/update it.
Doc about Netflow: https://wikitech.wikimedia.org/wiki/Netflow
Netbox and Puppet will have some forms of IP prefixes to sites mapping.

Thanks a lot for that, will read!


Also, @Nuria
I think your idea of having an extra job that expands the refined data into yet another netflow data set is a good one!
We'd have another data set to maintain (deletion job, etc.), but I think it would be less error prone.
Maybe... we could output the new "expanded" netflow data set inside the event database,
thus allowing it to be sanitized with the same eventlogging whitelist?

Awesome. Yes, as you said, Druid allows for multi-value dimensions. Either the Refine job or a subsequent job can transform BGP strings like "14907:0_14907:2_14907:3" into a list like ["14907:0", "14907:2", "14907:3"] and that would be ingested by Druid easily. In Turnilo's UI you would just use the drop-down filter with check-boxes to select those communities that you want to see (1 or more). I saw your patch, and think that whenever that gets merged, the current Refine job will automagically add that field to the refined netflow table (@Nuria correct me if I'm wrong).

Ok to merge anytime or should I sync up with you?

I think that would be the easiest! The only downside I can see is we'd have to maintain the mapping updated in our code. Do you think this mapping list is likely to change frequently?

Could you base your list from https://github.com/wikimedia/puppet/blob/production/modules/network/data/data.yaml#L32
Which is then exposed with https://github.com/wikimedia/puppet/blob/production/modules/network/manifests/constants.pp#L11
And can be accessed like https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/flowspec.pp#L16
Otherwise yes, we would have to keep updating it manually, which doesn't happen often.

OK, after a very interesting chat with Joseph, here's our conclusions:

  • It would be cool to have the core of the required transformations (ASN to AS name via MaxMind, exporter IP to region, etc.) available as Java core methods. Those could be used from sparkSQL, from Hive (as UDFs) or from Flink. Thus, super interesting if we want to apply those transformations in a lambda architecture.
  • We are not sure whether to just expand the netflow data set in place with one Refine transform function that uses the mentioned core methods, or to have an extra Oozie job that expands the current netflow dataset into another data set, i.e. netflow_expanded. Maybe leaning towards using a transform function, because the code would be shorter and there would be fewer moving pieces?
  • We thought it would be OK to move the netflow data and table to the event database, so that it can be sanitized as part of the event sanitization (the sanitized version of netflow would be stored in event_sanitized).

@ayounsi

Ok to merge anytime or should I sync up with you?

I believe it's OK to merge, and that Refine should identify the new field and automagically evolve netflow's Hive schema. But let me confirm later today!

Could you base your list from https://github.com/wikimedia/puppet/blob/production/modules/network/data/data.yaml#L32
Which is then exposed with https://github.com/wikimedia/puppet/blob/production/modules/network/manifests/constants.pp#L11
And can be accessed like https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/flowspec.pp#L16
Otherwise yes, we would have to keep updating it manually, which doesn't happen often.

OK, I will try to pass those as a parameter of the transform function from puppet itself.

Maybe leaning towards using a transform function, because the code would be shorter and there would be fewer moving pieces?

I think having very specific code in Refine that applies to just one job is an anti-pattern; it is (you are right) shorter, but in my opinion much more brittle.

But I believe Faidon's question is about how to use it in Druid/Turnilo to for example filter only on 14907:2

Yes, this would work (if that field is ingested as a multi value dimension). This is similar to revision_tags in edit_hourly. https://github.com/wikimedia/analytics-refinery/blob/master/oozie/edit/hourly/edit_hourly.hql (tags are ingested as an array of strings)

See: https://druid.apache.org/docs/latest/querying/multi-value-dimensions.html

@ayounsi Confirmed that you can merge the changes that add BGP communities to pmacct!
We'll be monitoring the kafka topic. Thanks!

Change 632603 merged by Ayounsi:
[operations/puppet@production] Pmacct add standard BGP community to flows

https://gerrit.wikimedia.org/r/632603

Done! And confirmed with kafkacat, eg: "comms": "2914:420_2914:1008_2914:2000_2914:3000_14907:4"
As well as no drops in Turnilo.

Awesome!
The size of the events has increased by about 25-30%, which is considerable, but I believe sustainable for now.
When we sanitize this data set for long term retention, we'll have to think about the size of the remaining data.

Wow, that's more than expected indeed! If it's an issue down the road we could think of filtering out some communities (for example only keeping ours).

After discussing with the team, we think it's fine for now.
If we want to add more fields or increase the sampling ratio,
then we should indeed make some calculations to make sure we're ok :]

Hi @ayounsi, can you help me? I have some more questions:

  • What is the field that we want to extract the AS name for? I see as_src, as_dst, peer_as_src, peer_as_dst?
  • Regarding network prefix: I assume IPs can have different network prefix lengths, no? If so, how can we determine the length?
  • Regarding address family (IPv4 or IPv6), what ip field do we want that from?

Thanks!

  • What is the field that we want to extract the AS name for? I see as_src, as_dst, peer_as_src, peer_as_dst?

Ideally all of them, but at least as_src and as_dst. Note that, because the sampled traffic is to/from our network, if as_src is a public AS (that you can look up), as_dst will most likely be a private one (not present in the MaxMind DB), and vice versa. We could either keep them empty or feed them a static list of ASNs (see https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations#Private_AS).

  • Regarding network prefix: I assume IPs can have different network prefix lengths, no? If so, how can we determine the length?

We would need to configure Pmacct to also export src_net + dst_net, and maybe src_mask + dst_mask (to be tested first). If the mask is needed we would probably need to do some transformations to have a single CIDR notation (xxxxx/yy). Let me know when it's fine to merge the relevant change (for src_net + dst_net at least).

  • Regarding address family (IPv4 or IPv6), what ip field do we want that from?

src_host or dst_host.

Thanks

What is the field that we want to extract the AS name for? I see as_src, as_dst, peer_as_src, peer_as_dst?

Ideally all of them, but at least as_src and as_dst. Note that, because the sampled traffic is to/from our network, if as_src is a public AS (that you can look up), as_dst will most likely be a private one (not present in the MaxMind DB), and vice versa. We could either keep them empty or feed them a static list of ASNs (see https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations#Private_AS).

OK, noted.

Regarding network prefix: I assume IPs can have different network prefix lengths, no? If so, how can we determine the length?

We would need to configure Pmacct to also export src_net + dst_net, and maybe src_mask + dst_mask (to be tested first). If the mask is needed we would probably need to do some transformations to have a single CIDR notation (xxxxx/yy). Let me know when it's fine to merge the relevant change (for src_net + dst_net at least).

IIUC src_net and dst_net already come in CIDR notation. Yes, let's add them to the events. We can merge them next Monday, so that we are able to react, in case we need to increase the Kafka partitions for that topic.

Regarding address family (IPv4 or IPv6), what ip field do we want that from?

src_host or dst_host.

I don't see those fields in Hive. Do we need to add them from pmacct as well?
Or maybe I misunderstood what needs to be done here... I assumed we want to determine whether the given IP is v4 or v6. But which IP would we be talking about? ip_src, ip_dst, both?

Or maybe I misunderstood what needs to be done here... I assumed we want to determine whether the given IP is v4 or v6. But which IP would we be talking about? ip_src, ip_dst, both?

Indeed, my bad: ip_src or ip_dst. Both should always be of the same version, so we could either look at just one of them, or set the field to "mismatch"/0/-1 if they're not the same.
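A sketch of that logic, using Python's ipaddress module and the -1 sentinel suggested above (the function name is illustrative):

```python
import ipaddress

def ip_version(ip_src, ip_dst):
    """Return 4 or 6 for the flow's address family,
    or -1 when the two addresses disagree (the 'mismatch' case)."""
    v_src = ipaddress.ip_address(ip_src).version
    v_dst = ipaddress.ip_address(ip_dst).version
    return v_src if v_src == v_dst else -1
```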

Change 633510 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Nfacctd, add src_net, dst_net

https://gerrit.wikimedia.org/r/633510

@ayounsi

Let me know when it's fine to merge the relevant change (for src_net + dst_net at least).

Please, merge whenever you are ready. Thanks!

Change 633510 merged by Ayounsi:
[operations/puppet@production] Nfacctd, add src_net, dst_net

https://gerrit.wikimedia.org/r/633510

Merged. Note that it's not in CIDR notation, so src_mask + dst_mask would be needed to generate the CIDR form.
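Combining the network address and the mask into a single CIDR string is straightforward; a sketch, assuming the mask arrives as a prefix length:

```python
import ipaddress

def to_cidr(net, mask):
    """Combine a network address and prefix length into CIDR notation.
    strict=False tolerates host bits being set in the address."""
    return str(ipaddress.ip_network(f"{net}/{mask}", strict=False))

to_cidr("203.0.113.0", 24)  # → "203.0.113.0/24"
to_cidr("2001:db8::", 32)   # → "2001:db8::/32"
```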

Change 633737 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Nfacct: add src_mask + dst_mask

https://gerrit.wikimedia.org/r/633737

Yes, please merge when ready, thanks!

Change 633737 merged by Ayounsi:
[operations/puppet@production] Nfacct: add src_mask + dst_mask

https://gerrit.wikimedia.org/r/633737

Change 634328 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] Add Refine transform function for Netflow data set

https://gerrit.wikimedia.org/r/634328

Change 634946 had a related patch set uploaded (by Faidon Liambotis; owner: Faidon Liambotis):
[operations/puppet@production] turnilo: add exporter hostname and region for netflow

https://gerrit.wikimedia.org/r/634946

Change 634946 merged by Elukey:
[operations/puppet@production] turnilo: add exporter hostname and region for netflow

https://gerrit.wikimedia.org/r/634946

Change 637559 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Add ::profile::analytics::refinery::network_infra_config

https://gerrit.wikimedia.org/r/637559

Change 637559 merged by Ottomata:
[operations/puppet@production] Add ::profile::analytics::refinery::network_region_config

https://gerrit.wikimedia.org/r/637559

Change 634328 merged by jenkins-bot:
[analytics/refinery/source@master] Add Refine transform function for Netflow data set

https://gerrit.wikimedia.org/r/634328

Change 641754 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics::refinery::job::refine.pp: Add transform function to netflow

https://gerrit.wikimedia.org/r/641754

Change 641754 merged by Ottomata:
[operations/puppet@production] analytics::refinery::job::refine.pp: Add transform function to netflow

https://gerrit.wikimedia.org/r/641754

Change 641819 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] refine netflow - put spark_extra_files as spark opts, not job config

https://gerrit.wikimedia.org/r/641819

Change 641819 merged by Ottomata:
[operations/puppet@production] refine netflow - put spark_extra_files as spark opts, not job config

https://gerrit.wikimedia.org/r/641819

Hi @ayounsi,

The new data is already in Hive's netflow table. And I'm about to enable Druid loading for the new fields.
I have one question, though. Of the new fields:

parsed_comms, net_cidr_src, net_cidr_dst, as_name_src, as_name_dst, ip_version, region

Which ones do we want to keep indefinitely in Druid?
Note this is independent from what we discussed in T231339.
Druid has the same privacy restrictions plus a stronger data-size restriction.

Could you tell me which ones you would like to keep in Druid after 90 days?
Thanks!

I don't understand the difference :) Why can't we do the same as T231339#6612105 ?

We could do the same as T231339#6612105 if necessary.

However, netflow's datasource in Druid is already big (1.9TB). It's bigger than the webrequest_sampled_128, pageviews_hourly and pageviews_daily datasources combined. Adding new fields for the 90-day period (before sanitization) doesn't bother me much, because they will be dropped eventually. But adding them indefinitely will make Druid's netflow datasource grow even more, at a quicker pace.

I just wanted to make sure that having the new fields both in Hive and in Druid is necessary, given that Hive and Druid have different use cases, IIUC. So, let me know if you also need to keep all the fields in T231339#6612105 indefinitely in Druid. If so, I will discuss with the team.

Thanks!

More thoughts...

Adding a field to a Druid datasource can have different effects on the data size.

  1. If the field is just adding more information to existing Druid aggregated rows, then the size increment will be proportional to the row size increment. That's the case for region, as_name_dst, as_name_src and ip_version; those will increase the size of the datasource by a factor between 1 and 2 (my guess would be around ~1.2). I think we'll be fine with this.
  2. If the field forces Druid to de-aggregate rows to accommodate the new information, then the size increment will be (worst case) the current size multiplied by the cardinality of the new dimension. That's the case (IIUC) for parsed_comms. It is possible that this field multiplies the size of the data set by a higher factor, especially because it's an array (high cardinality).

On another subject:
Currently 1 day of unsanitized (raw) netflow data in Druid occupies ~20GB. After we sanitize it (nullify some fields) it occupies ~0.5GB.
So, 90 days of unsanitized data makes 1.8TB (most of current datasource size), while 1 year of sanitized data makes <0.2TB.
This makes me think that we should also try to reduce the size of the unsanitized data, given that it's going to grow when we add the new fields...
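A quick back-of-envelope check of those figures:

```python
# Sizes taken from the comment above.
raw_gb_per_day = 20          # unsanitized netflow per day in Druid
sanitized_gb_per_day = 0.5   # after nullifying some fields

raw_90d_tb = raw_gb_per_day * 90 / 1000              # 1.8 TB
sanitized_1y_tb = sanitized_gb_per_day * 365 / 1000  # ~0.18 TB, i.e. <0.2 TB
```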

So, here's my suggestion:

  • Keep region, as_name_dst, as_name_src and ip_version indefinitely in Druid, and drop parsed_comms after the retention period (it would still be in Hive indefinitely).
  • Reduce the retention period of unsanitized Netflow data in Druid to 60 (or 45?) days. Hive's retention of unsanitized data would be still of 90 days.

What do you think @ayounsi? CC: @JAllemandou

And still more thoughts :)

After playing a bit more with the datasource, I think that, because the data is very granular, even a field like parsed_comms is not likely to add a lot of de-aggregation.
So maybe the way to go is to keep all the fields discussed in T231339#6612105, but apply the reduction of the Druid retention period (90 -> 60 or 45) to reduce the impact of the data size growth.

I think we can remove:
Everything that we remove from T231339#6612105 (which means net_cidr_src and net_cidr_dst as well),
plus as_name_dst, as_name_src and as_path.
Then only keep the parsed_comms values that start with our ASN (14907), so it's only a few values (see https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations#BGP_communities).
Then keep druid at 90 days, but I'm open to reducing it to 60 if really needed.
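A sketch of that filter for long-term retention (OUR_ASN and the function name are illustrative, not the actual sanitization code):

```python
OUR_ASN = "14907"

def our_comms(parsed_comms):
    """Keep only the communities set by our own ASN."""
    return [c for c in parsed_comms if c.startswith(OUR_ASN + ":")]

our_comms(["2914:420", "14907:2", "14907:4"])
# → ["14907:2", "14907:4"]
```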

OK, cool.
Knowing that you'd be open to reducing the retention period for Druid storage if necessary, what I'll do is:

  1. move on with Druid ingestion of new fields.
  2. Measure how much the size of the datasource grows.
  3. Implement selective sanitization and/or retention period reduction depending on results.

We as a team would like to meet with you guys and discuss other alternatives for querying and visualizing this data.
@JAllemandou has prepared a presentation that might be interesting to you guys. Maybe we can meet in the upcoming weeks?

Change 643342 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Add new fields to wmf_netflow Druid datasource

https://gerrit.wikimedia.org/r/643342

Change 643342 merged by Elukey:
[operations/puppet@production] analytics::refinery::job::druid_load.pp: Add fields to wmf_netflow

https://gerrit.wikimedia.org/r/643342

Change 643531 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] turnilo::templates::config.yaml.erb: add new fields to netflow config

https://gerrit.wikimedia.org/r/643531

Change 643531 merged by Elukey:
[operations/puppet@production] turnilo::templates::config.yaml.erb: add new fields to netflow config

https://gerrit.wikimedia.org/r/643531

@ayounsi
New fields are in Druid (starting 2020-11-25T03:00:00) :] I've checked that all looks OK, but please do check as well.
Note that the last couple hours do not yet have the new fields, because they come from the streaming job.
In a couple days we can evaluate the size increment and adjust.

Thanks, it looks great!

Note that the last couple hours do not yet have the new fields, because they come from the streaming job.

Is it going to be permanently like that, or will the backlog catch up at some point?

@mforns I had a chat with Arzhel, the el-to-druid job is configured like this:

--since $(date --date '-6hours' -u +'%Y-%m-%dT%H:00:00') --until $(date --date '-5hours' -u +'%Y-%m-%dT%H:00:00') "${@}"

Is there a reason to lag all these hours, or can we reduce the gap?

Change 643703 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] turnilo: add export mappings for network devices via query_resources

https://gerrit.wikimedia.org/r/643703

@ayounsi @elukey
The reason for the gap is that the streaming job that is ingesting the data into Druid does not have those fields yet.
The batch job (the one we just modified) is adding the new fields, but only after data has been collected from Kafka into HDFS, and then processed by Refine.
This whole process takes about 4 hours. That's why we have the 5 hour lag in batch Druid ingestion. Reducing the gap is a bit risky, because we might accidentally ingest incomplete data into Druid, and not notice it...

I believe the better solution would be to add the new fields to the streaming job as well. That would eliminate the gap completely.
Let's finish this addition to the batch job, and then go for the streaming job, no?

@ayounsi
Regarding netflow data size in Druid:

We Analytics took a look at the data size after adding the new fields, and it has grown roughly 50%.
This means that for the 90-day period where we store unsanitized data we'll need 1.8TB * 150% = 2.7TB Druid storage.
This would actually be feasible in terms of space, as Druid is not at its capacity limit yet.
But, we're worried that such amounts of data will generate query timeouts.

So, ideally, we'd like to reduce the retention period for unsanitized data to 60 days.
This way we'd go back to storing only about 1.8TB, which we know is not a problem.
Now, if you critically rely upon this 3rd month of unsanitized data in Druid, then we could keep it as is, it would be fine for now.
And if we see timeouts or have storage problems, we can revisit this later.

So, please let us know if you're OK with reducing to 60 or you'd rather keep the 90.
Keep in mind that unsanitized data will always be available for 90 days in Hive, available for querying and graphing (with jupyter notebooks or superset).

And as I mentioned before, we Analytics are available to explain all the details of the data pipeline and to help with any problem or question about using our data tools.
In fact, @JAllemandou put together an awesome presentation that we think you (@ayounsi, @faidon, @CDanis) would be interested in!
Would you guys be willing to arrange a meeting to go over it with some of us?

Cheers!

So, please let us know if you're OK with reducing to 60 or you'd rather keep the 90.

OK!

Would you guys be willing to arrange a meeting to go over it with some of us?

Of course!

Cool! Thanks :] Will do.
I'll let @JAllemandou coordinate with you on a good date and time for the team presentation!

Change 644569 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics::refinery::job::druid_load.pp: reduce netflow retention

https://gerrit.wikimedia.org/r/644569

Change 644569 merged by Elukey:
[operations/puppet@production] analytics::refinery::job::druid_load.pp: reduce netflow retention

https://gerrit.wikimedia.org/r/644569

Change 644862 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics::refinery::job::druid_load.pp: add fields to netflow long term

https://gerrit.wikimedia.org/r/644862

Change 643703 merged by Jbond:
[operations/puppet@production] turnilo: add export mappings for network devices via query_resources

https://gerrit.wikimedia.org/r/643703

Change 644862 merged by CDanis:
[operations/puppet@production] analytics::refinery::job::druid_load.pp: add fields to netflow long term

https://gerrit.wikimedia.org/r/644862