
Set up automatic deletion/sanitization for netflow data set in Hive
Closed, Resolved · Public

Description

After the latest additions to it, the netflow data set contains a pair of privacy-sensitive fields.
This means we should either delete or sanitize this data set 90 days after collection.

This task is to implement such deletion/sanitization for the netflow data in Hive/HDFS.
This task won't affect the netflow data in Druid.

Possible solutions:

  1. Delete the data completely after 90 days. This would be done by setting up a timer in puppet::data_purge.pp. Quite easy, but with a big downside: we'd lose the data very early (see the sketch after this list).
  2. Sanitize the data using a process similar to EventLoggingSanitization. This would nullify the privacy-sensitive fields after 90 days and keep the rest intact indefinitely. This approach involves a bit more work, but if we implement it generically, it would be useful for other data sets in the same database that need sanitization. I will comment on this task with ideas for a solution.
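
For illustration only, a minimal sketch of what option 1 amounts to, written as plain HDFS commands rather than the actual puppet timer or refinery tooling (the path and the year=/month=/day= partition layout match what is shown later in this task):

# Sketch: print the delete commands for partitions older than 90 days (range kept short
# for the sketch); run them for real only once verified. The matching Hive partitions
# would also need to be dropped.
for i in $(seq 90 120); do
  d=$(date -d "-${i} days" +'year=%Y/month=%-m/day=%-d')
  if hdfs dfs -test -d "/wmf/data/wmf/netflow/${d}"; then
    echo hdfs dfs -rm -r -skipTrash "/wmf/data/wmf/netflow/${d}"
  fi
done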

Event Timeline

Hi @ayounsi!

Please, review this task and see if it makes sense.
From what Luca told me, I assume (unless you tell me otherwise) that keeping the data in Hive/HDFS for only 3 months is not enough.

Thanks!

So, the idea for solution #2 (see task description) is the following:

Use WhitelistSanitization to sanitize the wmf database,
and have a white-list that, for now, includes the netflow table only.
We could create a Spark/Scala job analogous to EventLoggingSanitization (a wrapper of WhitelistSanitization).
The only caveat I see is that we'd need another database to copy the sanitized data to, like wmf_sanitized or a better name.
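
To make the shape of that idea concrete, here is a hedged sketch of how such a job might be launched, modelled on the spark2-submit invocations that appear later in this task. The class name and every option below are hypothetical placeholders; no such job exists yet:

# Hypothetical job and options, for illustration only:
sudo -u analytics kerberos-run-command analytics /usr/bin/spark2-submit \
  --master yarn \
  --deploy-mode client \
  --class org.wikimedia.analytics.refinery.job.refine.WmfDatabaseSanitization \
  /srv/deployment/analytics/refinery/artifacts/refinery-job.jar \
  --input_database wmf \
  --output_database wmf_sanitized \
  --whitelist /path/to/wmf_whitelist.yaml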

I would really like to avoid setting up custom sanitization; can we rather move the netflow data to the event database and use the event sanitization process?

Similar to T229674.

It would be great to have an aggregated/sanitized set of the data for as long as possible.

In addition to ip_src and ip_dst, it's fine to remove port_src, port_dst, tcp_flags and ip_proto, leaving AS information, direction, and future country information. Hopefully that helps keep the DB size reasonable.
Similarly, aggregating the data at a 5-minute or coarser granularity is fine.
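
For illustration only: the kind of coarser-granularity aggregate described above, written against the current Hive table (table, column and partition names are taken from later comments in this task; the partition filter is arbitrary and this is not an agreed-upon rollup):

sudo -u analytics kerberos-run-command analytics hive -e "
  SELECT year, month, day, hour,
         as_src, as_dst, country_ip_src, country_ip_dst,
         SUM(bytes) AS bytes, SUM(packets) AS packets
  FROM wmf.netflow
  WHERE year = 2020 AND month = 1 AND day = 28
  GROUP BY year, month, day, hour,
           as_src, as_dst, country_ip_src, country_ip_dst;
"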

@ayounsi Before we can sanitize this data it needs to abide by a schema (a schema is a JSON description of your data, like https://meta.wikimedia.org/wiki/Schema:NavigationTiming) and the intake of events needs to be moved to a different workflow. To be clear, this means that if the inflow of data changes, the schema needs to change. In the future you will be able to annotate the schema with fields to be sanitized or not. We are not quite there yet, so at this time sanitization is a whitelist over your schema; see the whitelist for NavigationTiming: https://github.com/wikimedia/analytics-refinery/blob/master/static_data/eventlogging/whitelist.yaml#L134
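
Purely as an illustration of the "schema" idea above (not the real netflow schema, which would live in the event schemas repository and will differ), a minimal JSON Schema fragment using field names that appear later in this task:

cat <<'EOF' > /tmp/netflow.schema.sketch.json
{
  "title": "netflow",
  "type": "object",
  "properties": {
    "ip_src":  { "type": "string" },
    "ip_dst":  { "type": "string" },
    "as_src":  { "type": "integer" },
    "as_dst":  { "type": "integer" },
    "bytes":   { "type": "integer" },
    "packets": { "type": "integer" }
  },
  "required": ["bytes", "packets"]
}
EOF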

@Nuria, who would be responsible for migrating the netflow data set to be ingested into the event pipeline?

@ayounsi, I understand that once the netflow ingestion is migrated to the event pipeline (this includes defining the schema, as Nuria mentions),
then sanitization will be quite straightforward: we'll only need to add a couple of lines of config to the mentioned white-list.

@mforns: We can do it but let's tackle that once we have done the dataset releases we have as high priority for this quarter. Does that sound good?
@ayounsi on your end this means that any modifications to the dataset need to be reflected in the schema, so once we create it, your task would be to modify it whenever the data is modified. Let us know if this makes sense. You can see examples of what schemas look like here as well: https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/resource_change

@Nuria

We can do it but let's tackle that once we have done the dataset releases we have as high priority for this quarter. Does that sound good?

Sure! Makes sense.

Makes sense as well! Out of curiosity, what would happen if the schema is incorrect (forgotten update, typo, etc.)?

If the schema and data do not match: unhappy customers, because there is no data.

Nuria triaged this task as Medium priority. Aug 29 2019, 5:07 PM
Nuria moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
elukey raised the priority of this task from Medium to High.

Netflow data is not being sanitized or dropped; we forgot to follow up on this task.

I had a chat with SRE and ideally they'd like to keep historical/sanitized data. By now a lot of routers in our infra are sending sampled traffic to Kafka, and we have reached a sizeable amount of daily data:

247.5 G  742.4 G  /wmf/data/wmf/netflow/year=2020

6.1 G  18.2 G  /wmf/data/wmf/netflow/year=2020/month=1/day=28

This means ~20G per day, ~500G per month. That seems like nothing to really worry about, but in the future SRE might want to change the sampling and send more data (we'd be asked beforehand of course; I mention it as an FYI).
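
As a quick cross-check of the per-day figure (editorial sketch; assumes the same paths and the size / size-with-replication / path column layout shown in the paste above):

# Average unreplicated size per day partition for January 2020:
hdfs dfs -du /wmf/data/wmf/netflow/year=2020/month=1 \
  | awk '{ bytes += $1; days += 1 } END { printf "%.1f GiB/day over %d day partitions\n", bytes / days / 1024^3, days }'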

What is the ideal next step for sanitization then?

Nuria renamed this task from Set up automatic deletion for netflow data set in Hive to Set up automatic deletion/snatization for netflow data set in Hive. Feb 14 2020, 5:25 PM

Per our conversation in standup, we will be deleting the raw data and planning to switch this pipeline to the eventgate/refine workflow: T245287: Setup refinment/sanitization on netflow data similar to how it happens for other event-based data

@elukey FYI that the filepath you listed above is not "raw" data but rather already refined data. The wmf.netflow table is on top of: hdfs://analytics-hadoop/wmf/data/wmf/netflow

We normally call "raw data" data that has not been refined. In this case this data is at: /wmf/data/raw/netflow/netflow

nuria@stat1007:~$ hdfs dfs -du -h /wmf/data/raw/netflow/netflow/hourly
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
6.9 G 20.7 G /wmf/data/raw/netflow/netflow/hourly/2018
309.1 G 927.2 G /wmf/data/raw/netflow/netflow/hourly/2019
536.2 G 1.6 T /wmf/data/raw/netflow/netflow/hourly/2020

That raw data is normally deleted after 30/60 days; it only needs to be available to ensure we can rerun jobs should there be an issue. So deleting it should free about 1 TB, I think.
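
For reference, the totals above can be summed with a single command (assuming the same path):

hdfs dfs -du -s -h /wmf/data/raw/netflow/netflow/hourly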

T248865 seems to be doable long term, but for the moment I'd proceed with dropping old data periodically, keeping netflow as "custom" for the time being.

Change 628895 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] [WIP] Drop /wmf/data/raw/mediawiki_job and /wmf/data/raw/netflow after 90 days

https://gerrit.wikimedia.org/r/628895

Nuria renamed this task from Set up automatic deletion/snatization for netflow data set in Hive to Set up automatic deletion/snitization for netflow data set in Hive. Sep 22 2020, 2:23 PM

Change 628895 merged by Elukey:
[operations/puppet@production] Drop /wmf/data/raw/mediawiki_job and /wmf/data/raw/netflow after 90 days

https://gerrit.wikimedia.org/r/628895

Raw data for mediawiki_job and netflow *older than 90 days* has been deleted with the script,
and periodic deletion jobs have been deployed.

A comment on sanitization:
I was looking at T254332 and one option is to move the netflow data and table to the event database.
This way, we could use the already existing sanitization scheme (event -> event_sanitized), just by adding a couple of lines to the sanitization include-list.

After discussing it with @Ottomata, it seems that moving netflow to the event database is feasible.
This means we can apply a per-field sanitization to netflow.
@ayounsi, can you list the fields that you would be interested in keeping indefinitely?
I will also review the data set for privacy on my side.
Thanks!

T231339#5442835 lists the ones we can drop.

The logic is that everything we only use for DDoS detection we can get rid of quite quickly, but what we need for traffic engineering we need to keep for as long as possible, but we can loosen its granularity as we will only use it for historical trends.


Thanks @ayounsi. Considering that, I looked at Netflow's schema in Hive and here's what I think we should do, and also some questions:


I think that the following fields are privacy-sensitive, and that they should not be kept.

ip_dst              	string
ip_src              	string              	                    
port_dst            	bigint              	                    
port_src            	bigint              	                    
net_dst             	string              	                    
net_src             	string              	                    
mask_dst            	bigint              	                    
mask_src            	bigint              	                    
net_cidr_src            string
net_cidr_dst            string

Please, can you confirm that you are OK not keeping those after 90 days?


I'm not sure whether the following fields are privacy-sensitive or not.

comms               	string
parsed_comms        	array<string>

Please, LMK if you'd like to keep them (probably just 1 of them - cause they are the same), and if so, if you think they're privacy-sensitive.


These ones seem unused, so I would prefer not to keep them.

export_proto_sysid  	bigint          (all NULLs)
sampling_direction  	string          (all NULLs)
tag                 	bigint          (all NULLs)
event_type          	string          (always=purge)
peer_as_src         	bigint          (always=0)

Please confirm whether or not it's OK to drop these after 90 days.
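
As an editorial aside, claims like "all NULLs" or "always=purge" can be spot-checked with a query along these lines (illustrative; restricted to one partition to keep it cheap):

sudo -u analytics kerberos-run-command analytics hive -e "
  SELECT COUNT(*)                     AS total_rows,
         COUNT(export_proto_sysid)    AS non_null_export_proto_sysid,
         COUNT(DISTINCT event_type)   AS distinct_event_types,
         COUNT(DISTINCT peer_as_src)  AS distinct_peer_as_src
  FROM wmf.netflow
  WHERE year = 2020 AND month = 11 AND day = 1;
"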


The following fields seem redundant; they are alternate formats of other fields. They seem non-sensitive, so we can keep them if we want! But considering Netflow has a high volume of data, by dropping them we could reduce the overall size of the data set and be able to keep the data we decide to retain for longer.

region                  string          derived from peer_ip_src
as_name_src             string          derived from as_src
as_name_dst             string          derived from as_dst

Please, can you confirm whether you want to keep those?


These ones are fine to keep I think, but not sure if you're interested in keeping them long term. Again, if we can save space, we'll be able to keep data for longer.

stamp_updated       	string          we keep stamp_inserted so not sure this needs to be kept?
ip_version              string
writer_id           	string
tcp_flags           	string
peer_as_dst         	bigint
ip_proto            	string
peer_ip_src         	string

Can you confirm what you'd like to do with these?


Finally, these are the fields I think should be kept for sure.

stamp_inserted      	string
tag2                	bigint        (used for direction in Turnilo)
country_ip_dst      	string
country_ip_src      	string
as_src              	bigint
as_dst              	bigint
as_path             	string
bytes               	bigint        (measure)
packets             	bigint        (measure)
year                	bigint        (partition)
month               	bigint        (partition)
day                 	bigint        (partition)
hour                	bigint        (partition)

Please, LMK if this looks good.


Thanks a lot @ayounsi! Sorry for the huge comment.

I think that the following fields are privacy-sensitive, and that they should not be kept.

ip_dst              	string
ip_src              	string              	                    
port_dst            	bigint              	                    
port_src            	bigint              	                    
net_dst             	string              	                    
net_src             	string              	                    
mask_dst            	bigint              	                    
mask_src            	bigint              	                    
net_cidr_src            string
net_cidr_dst            string

Please, can you confirm that you are OK not keeping those after 90 days?

Only the ip_ and potentially the net_cidr_ fields are privacy-sensitive, but I'm OK with not keeping any of the fields listed above.


I'm not sure whether the following fields are privacy-sensitive or not.

comms               	string
parsed_comms        	array<string>

Please, LMK if you'd like to keep them (probably just 1 of them - cause they are the same), and if so, if you think they're privacy-sensitive.

They're not privacy-sensitive, and I'd like to keep parsed_comms.


These ones seem unused, so I would prefer not to keep them.

export_proto_sysid  	bigint          (all NULLs)
sampling_direction  	string          (all NULLs)
tag                 	bigint          (all NULLs)
event_type          	string          (always=purge)
peer_as_src         	bigint          (always=0)

Please confirm whether or not it's OK to drop these after 90 days.

OK


The following fields seem redundant; they are alternate formats of other fields. They seem non-sensitive, so we can keep them if we want! But considering Netflow has a high volume of data, by dropping them we could reduce the overall size of the data set and be able to keep the data we decide to retain for longer.

region                  string          derived from peer_ip_src
as_name_src             string          derived from as_src
as_name_dst             string          derived from as_dst

Please, can you confirm whether you want to keep those?

Keep please, it will make it easier to dig through the data.


These ones are fine to keep I think, but not sure if you're interested in keeping them long term. Again, if we can save space, we'll be able to keep data for longer.

stamp_updated       	string          we keep stamp_inserted so not sure this needs to be kept?
ip_version              string
writer_id           	string
tcp_flags           	string
peer_as_dst         	bigint
ip_proto            	string
peer_ip_src         	string

Can you confirm what you'd like to do with these?

Keep peer_as_dst, ip_version and ip_proto.


Finally, these are the fields I think should be kept for sure.

stamp_inserted      	string
tag2                	bigint        (used for direction in Turnilo)
country_ip_dst      	string
country_ip_src      	string
as_src              	bigint
as_dst              	bigint
as_path             	string
bytes               	bigint        (measure)
packets             	bigint        (measure)
year                	bigint        (partition)
month               	bigint        (partition)
day                 	bigint        (partition)
hour                	bigint        (partition)

Please, LMK if this looks good.

LGTM!


Thanks for clarifying it.

Change 643748 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] Add datasource argument to HiveToDruid

https://gerrit.wikimedia.org/r/643748

Change 643765 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] [WIP] analytics::refinery::job: Move netflow to event database

https://gerrit.wikimedia.org/r/643765

@ayounsi

As discussed in earlier comments, we (Analytics) will migrate the netflow data in HDFS from /wmf/data/wmf/netflow to /wmf/data/event/netflow, and migrate the Hive table from wmf.netflow to event.netflow. This will allow us to apply our current sanitization scheme by just adding netflow to the sanitization include-list with the fields discussed in T231339#6612105.
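
For illustration, that "couple of lines" could look roughly like the entry below, listing the fields agreed on above. The file path and exact format are assumptions modelled on the existing eventlogging include-list; the real entry may nest or name things differently, and partition columns (year/month/day/hour) are assumed to be kept automatically:

# Hypothetical include-list entry (path and YAML layout are assumptions):
cat <<'EOF' >> static_data/eventlogging/whitelist.yaml
netflow:
    - as_dst
    - as_name_dst
    - as_name_src
    - as_path
    - as_src
    - bytes
    - country_ip_dst
    - country_ip_src
    - ip_proto
    - ip_version
    - packets
    - parsed_comms
    - peer_as_dst
    - region
    - stamp_inserted
    - tag2
EOF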

NOTE: If the migration goes well, it should not take longer than 30 minutes, as we're not really moving data, just altering meta-data.

What will be off during the migration?

  • The netflow Hive table will be moving.
  • The netflow HDFS data will be moving.
  • Refinement of netflow raw data might be delayed.
  • Druid batch ingestion might be delayed.

What will be normal during the migration?

  • Netflow data in Druid / Turnilo will be available and up-to-date.
  • Netflow direct ingestion from Kafka to Druid will be working.

What will change permanently

  • Netflow HDFS data will be at /wmf/data/event/netflow.
  • Netflow Hive table will be at event.netflow.

So, please, let us know if you guys have any periodic jobs that consume either of those 2 sources (HDFS data or Hive table), so we can arrange a good time to migrate.
Otherwise, it should be pretty transparent to you, and we'll migrate at some point during next week, pinging you when we do it.
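
For anyone querying the table by hand, the only visible change should be the database name, e.g. (hypothetical ad-hoc query):

sudo -u analytics kerberos-run-command analytics hive -e "
  -- previously: FROM wmf.netflow
  SELECT year, month, day, SUM(bytes) AS bytes
  FROM event.netflow
  WHERE year = 2020 AND month = 12
  GROUP BY year, month, day;
"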

Cheers!

@JAllemandou @elukey

Here's a sketch of the migration plan, mostly a reference for myself! Please raise flags if something is missing.


  1. Make sure https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/643748 has been deployed under v0.0.140.
  2. Ping stakeholders and !log.
  3. Ask an Analytics SRE to stop puppet agent in an-launcher1002.eqiad.wmnet.
  4. Stop netflow systemd timers in an-launcher1002.eqiad.wmnet.
systemctl stop eventlogging_to_druid_netflow_hourly.timer
systemctl stop eventlogging_to_druid_netflow_daily.timer
systemctl stop eventlogging_to_druid_netflow-sanitization_daily.timer
systemctl stop refine_netflow.timer
systemctl stop monitor_refine_netflow_failure_flags.timer
  5. Drop hive table.
sudo -u analytics kerberos-run-command analytics hive
drop table wmf.netflow;
  6. Move data.
sudo -u analytics kerberos-run-command analytics hdfs dfs -mv /wmf/data/wmf/netflow /wmf/data/event/
  7. Recreate table.
sudo -u analytics kerberos-run-command analytics hive
CREATE EXTERNAL TABLE `event.netflow`(
  `as_dst` bigint,
  `as_path` string,
  `bytes` bigint,
  `packets` bigint,
  `peer_as_dst` bigint,
  `stamp_inserted` string,
  `stamp_updated` string,
  `tag` bigint,
  `event_type` string,
  `writer_id` string,
  `as_src` bigint,
  `ip_dst` string,
  `ip_proto` string,
  `ip_src` string,
  `peer_as_src` bigint,
  `port_dst` bigint,
  `port_src` bigint,
  `sampling_direction` string,
  `tag2` bigint,
  `tcp_flags` string,
  `country_ip_dst` string,
  `country_ip_src` string,
  `peer_ip_src` string,
  `export_proto_sysid` bigint,
  `comms` string,
  `parsed_comms` array<string>,
  `net_dst` string,
  `net_src` string,
  `mask_dst` bigint,
  `mask_src` bigint,
  `as_name_dst` string,
  `as_name_src` string,
  `ip_version` string,
  `net_cidr_dst` string,
  `net_cidr_src` string,
  `region` string)
PARTITIONED BY (
  `year` bigint,
  `month` bigint,
  `day` bigint,
  `hour` bigint)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://analytics-hadoop/wmf/data/event/netflow'
;
  8. Repair table partitions.
msck repair table event.netflow;
  9. Ask an Analytics SRE to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/643765, to update the Refine and HiveToDruid jobs.
  10. Ask an Analytics SRE to enable puppet agent and force a puppet run (this will restart the timers).
  11. Check that both Refine and HiveToDruid are running fine (a possible check is sketched after this plan).
  12. Ping stakeholders and !log.
NOTE: If the migration takes long enough, it could be that Druid loading happens before Refine can populate the delayed data, thus ingesting incomplete data into Druid. If this happens, wait for Refine to catch up and then do the next step.

Backfill Druid with correct data (adapt since and until params).

sudo -u analytics kerberos-run-command analytics /usr/bin/spark2-submit \
--name backfill_netflow_to_druid_hourly \
--class org.wikimedia.analytics.refinery.job.HiveToDruid \
--master yarn \
--queue production \
--deploy-mode client \
--driver-memory 2G \
--conf spark.dynamicAllocation.maxExecutors=32 \
--conf spark.driver.extraClassPath=/usr/lib/hive/lib/hive-jdbc.jar:/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-common.jar:/usr/lib/hive/lib/hive-service.jar \
--files /etc/hive/conf/hive-site.xml,/etc/refinery/eventlogging_to_druid/eventlogging_to_druid_netflow_hourly.properties \
/srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-job-0.0.140.jar \
--config_file /etc/refinery/eventlogging_to_druid/eventlogging_to_druid_netflow_hourly.properties \
--since $(date --date '-6hours' -u +'%Y-%m-%dT%H:00:00') \
--until $(date --date '-5hours' -u +'%Y-%m-%dT%H:00:00')
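
Not part of the plan itself, but one possible way to carry out step 11 ("check that both Refine and HiveToDruid are running fine"), assuming each timer from step 4 has a matching .service unit:

systemctl list-timers | grep netflow
sudo journalctl -u refine_netflow.service --since '1 hour ago' | tail -n 50
sudo journalctl -u eventlogging_to_druid_netflow_hourly.service --since '1 hour ago' | tail -n 50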

So, please, let us know if you guys have any periodic jobs that consume either of those 2 sources

Nope.

Thanks for the very detailed comment though!

Here's a sketch of the migration plan, mostly a reference for myself! Please raise flags if something is missing.

@mforns This plan looks great to me :) Thanks for detailing!

Change 643748 merged by Mforns:
[analytics/refinery/source@master] Add datasource argument to HiveToDruid

https://gerrit.wikimedia.org/r/643748

Change 643765 merged by Elukey:
[operations/puppet@production] analytics::refinery::job: Move netflow to event database

https://gerrit.wikimedia.org/r/643765

Change 644889 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics::refinery::job::druid_load.pp: correct druid_datasource param

https://gerrit.wikimedia.org/r/644889

Change 644889 merged by Elukey:
[operations/puppet@production] analytics::refinery::job::druid_load.pp: correct druid_datasource param

https://gerrit.wikimedia.org/r/644889

Change 644893 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics::refinery::job: Bump up jar versions for netflow

https://gerrit.wikimedia.org/r/644893

Change 644893 merged by Elukey:
[operations/puppet@production] analytics::refinery::job: Bump up jar versions for netflow

https://gerrit.wikimedia.org/r/644893

Change 645419 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add netflow to eventlogging sanitization include-list

https://gerrit.wikimedia.org/r/645419

Change 645419 merged by Mforns:
[analytics/refinery@master] Add netflow to eventlogging sanitization include-list

https://gerrit.wikimedia.org/r/645419

Change 655120 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics:refinery:job:data_purge Activate netflow auto-purging

https://gerrit.wikimedia.org/r/655120

Change 655120 merged by Elukey:
[operations/puppet@production] analytics:refinery:job:data_purge Activate netflow auto-purging

https://gerrit.wikimedia.org/r/655120