This task is a generic parent for tracking all the work done by Infra Foundations / Analytics to ingest, process, and publish Netflow data.
The overall data flow is the following:
- The data is sent from the routers to pmacct in each datacenter, and then forwarded to a Kafka topic on Kafka-Jumbo (via TLS); a consumer sketch follows this list.
- The netflow topic is periodically pulled onto HDFS by Camus. We call this "raw" data.
- The data is then "refined" (a step where changes can be applied to the data) into a new dataset that is also exposed via Hive (see the Hive query sketch after this list).
- The refined data is periodically indexed into Druid, where it can be queried from Turnilo, Superset, etc. (see the Druid query sketch after this list).
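
As a rough illustration of the first hop, the sketch below consumes a few records from the netflow topic over TLS with kafka-python. The topic name, broker hostname/port, CA bundle path, and JSON encoding are assumptions for illustration, not values confirmed by this task.

```python
# Minimal sketch: peek at the netflow topic on Kafka-Jumbo over TLS.
# Topic name, broker address, and CA path are illustrative assumptions.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "netflow",                                                # assumed topic name
    bootstrap_servers=["kafka-jumbo1001.eqiad.wmnet:9093"],   # assumed broker/port
    security_protocol="SSL",
    ssl_cafile="/etc/ssl/certs/ca-certificates.crt",          # assumed CA bundle
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),  # assumes JSON flow records
    auto_offset_reset="latest",
    consumer_timeout_ms=10000,                                # stop iterating after 10s idle
)

# Print a handful of flow records as emitted by pmacct, then stop.
for i, message in enumerate(consumer):
    print(message.value)
    if i >= 4:
        break

consumer.close()
```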
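Likewise, a minimal sketch of reading the refined dataset through the Hive metastore with PySpark. The wmf.netflow table name and the hourly year/month/day/hour partitioning are assumptions used only to make the example concrete.

```python
# Minimal sketch: query the refined netflow dataset registered in Hive.
# Table name and partition columns are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("netflow-refined-example")
    .enableHiveSupport()            # read tables from the Hive metastore
    .getOrCreate()
)

df = spark.sql("""
    SELECT *
    FROM wmf.netflow                -- assumed table name for the refined data
    WHERE year = 2020 AND month = 1 AND day = 1 AND hour = 0
    LIMIT 10
""")
df.show(truncate=False)
```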
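Finally, a sketch of querying the Druid datasource directly through the broker's native query endpoint (the same data that Turnilo and Superset expose). The broker URL, datasource name, and metric column are assumptions.

```python
# Minimal sketch: run a native Druid timeseries query against the broker.
# Broker URL, datasource name, and metric column are illustrative assumptions.
import json
import requests

DRUID_BROKER = "http://druid-broker.example.wmnet:8082/druid/v2/"  # assumed broker URL

query = {
    "queryType": "timeseries",
    "dataSource": "netflow",                        # assumed datasource name
    "granularity": "hour",
    "intervals": ["2020-01-01T00:00/2020-01-01T06:00"],
    "aggregations": [
        {"type": "longSum", "name": "total_bytes", "fieldName": "bytes"}  # assumed metric
    ],
}

response = requests.post(DRUID_BROKER, json=query, timeout=30)
response.raise_for_status()
print(json.dumps(response.json(), indent=2))
```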