In the parent task we were wondering how to set up automatic sanitization for netflow data, and we realized that we shouldn't really keep using an ad-hoc configuration when we could use something like Eventgate Analytics.
The current dataflow is the following:
routers -> pmacct -> Kafka -> HDFS (via Gobblin) -> Hive (via Refine job) -> Druid (via indexation job)
Once data is in Druid, it can be queried via Superset and Turnilo.
We should move to something like:
routers -> pmacct -> eventgate-analytics (via HTTP POST) -> Kafka -> HDFS (via Gobblin) -> Hive (via Refine job) -> Druid (via indexation job)
The bit that changes is the ingestion from pmacct to Kafka, which should go through Eventgate. This will allow us to re-use the same automation jobs that we'll use for all other analytics events, rather than keeping special configs for netflow.
If this is too hard, at the very least we could create a schema, declare a stream, and then use the same ingestion automation we use for other streams (a rough sketch of what such a schema could look like follows).
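Purely as an illustration, here is a minimal sketch of what such a schema could look like in the Event schemas repository, written as draft-07 JSON Schema in YAML. The title, $id and flow fields (named after typical pmacct JSON output) are assumptions; the real schema would need to follow the repository's conventions for the shared meta fragment and versioning:

```yaml
# Illustrative sketch only: ids and field names are assumptions, not a real schema.
title: netflow/flow
description: A single flow record exported by pmacct
$id: /netflow/flow/1.0.0
$schema: https://json-schema.org/draft-07/schema#
type: object
properties:
  # In the real repository, meta would come from the shared common fragment.
  meta:
    type: object
    properties:
      stream:
        type: string
      dt:
        type: string
        format: date-time
  ip_src:
    type: string
  ip_dst:
    type: string
  as_src:
    type: integer
  as_dst:
    type: integer
  packets:
    type: integer
  bytes:
    type: integer
required:
  - meta
```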
Data in fact needs to be sanitized (to comply with our retention guidelines) in multiple places (see the sketch after this list):
- raw data (un-refined, as imported from Kafka to HDFS)
- refined data (data present in Hive, etc.)
- indexed data on Druid
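To make that more concrete, here is a rough sketch of what a purge could involve in each of the three places, assuming for the sake of the example that sanitization means dropping data older than a retention threshold. All paths, table names, datasource names and hosts below are made up, and a real job would loop over every partition older than the cutoff rather than a single day:

```python
"""Rough sketch only: paths, table/datasource names and hosts are illustrative assumptions."""
import subprocess
from datetime import datetime, timedelta, timezone

import requests

RETENTION_DAYS = 90
cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)

# 1. Raw data: remove the old partition directory from HDFS (hypothetical layout).
raw_partition = f"/wmf/data/raw/netflow/hourly/{cutoff:%Y/%m/%d}"
subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", raw_partition], check=True)

# 2. Refined data: drop the matching Hive partition (hypothetical table and JDBC URL).
hive_query = (
    "ALTER TABLE wmf.netflow DROP IF EXISTS "
    f"PARTITION (year={cutoff.year}, month={cutoff.month}, day={cutoff.day})"
)
subprocess.run(
    ["beeline", "-u", "jdbc:hive2://hive.example:10000", "-e", hive_query], check=True
)

# 3. Indexed data: ask the Druid coordinator to mark old segments as unused
#    ('/' in the interval is replaced by '_' in the URL path).
interval = f"1970-01-01_{cutoff:%Y-%m-%d}"
resp = requests.delete(
    "http://druid-coordinator.example:8081/druid/coordinator/v1/"
    f"datasources/netflow/intervals/{interval}"
)
resp.raise_for_status()
```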
A few notes about Eventgate:
- the event needs to have a fixed JSON schema committed to the Event schemas repository
- the event will be POSTed to Eventgate as a well-formed JSON document, which will be validated against that schema before being ingested into Kafka (a rough example follows these notes).
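To illustrate the second point, producing a single netflow event to eventgate-analytics could look roughly like this; the endpoint URL, stream name, schema URI and flow values are all assumptions matching the schema sketch above, not the real configuration:

```python
"""Sketch of POSTing one netflow event to eventgate-analytics; names and values are assumed."""
from datetime import datetime, timezone

import requests

EVENTGATE_URL = "https://eventgate-analytics.example.org/v1/events"  # hypothetical endpoint

event = {
    # $schema and meta.stream are what let Eventgate validate and route the event.
    "$schema": "/netflow/flow/1.0.0",  # hypothetical schema URI
    "meta": {
        "stream": "netflow",           # hypothetical stream name
        "dt": datetime.now(timezone.utc).isoformat(),
    },
    # Flow fields as pmacct would report them (illustrative documentation values).
    "ip_src": "192.0.2.1",
    "ip_dst": "198.51.100.2",
    "as_src": 64496,
    "as_dst": 64511,
    "packets": 12,
    "bytes": 9001,
}

resp = requests.post(EVENTGATE_URL, json=[event], timeout=5)
resp.raise_for_status()  # events that fail schema validation come back as a non-2xx response
```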
I can definitely help with the Analytics part (it should be relatively simple), but from a quick chat with Arzhel it seems that our dear pmacct doesn't support JSON over HTTP, so we need to figure out what to do there.