
Netflow data pipeline
Open, Medium, Public

Description

This task is meant to be a generic parent to track all the work done by Infra Foundations / Analytics to ingest/process/publish Netflow data.

The overall data flow is the following:

  • The data is sent from the routers to pmacct in each datacenter, and then forwarded to a Kafka topic in Kafka-Jumbo (via TLS).
  • The netflow topic is periodically pulled onto HDFS by Camus. We call this "raw" data.
  • The data is then "refined" (a step in which we can apply transformations, fix-ups, etc.) into a new dataset that is also exposed via Hive; see the first sketch after this list.
  • The refined data is periodically indexed into Druid, where it can be queried from Turnilo, Superset, etc.; see the second sketch after this list.
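For illustration, a minimal PySpark sketch of the raw → refined step: it reads the Camus output from HDFS, applies a small transformation, and writes a Hive-exposed table. The HDFS path, column names, and target table name (wmf.netflow) are assumptions for the example, not the actual Refine job.

```python
# Minimal sketch of the raw -> refined step.
# Paths, columns and the table name are assumptions, for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("netflow-refine")
    .enableHiveSupport()
    .getOrCreate()
)

# Raw records as pulled from the Kafka topic by Camus (hypothetical path).
raw = spark.read.json("hdfs:///wmf/data/raw/netflow/hourly/2024/01/01/00")

# Example "refinement": normalise a couple of field names and derive a
# date partition column from the (assumed) pmacct timestamp field.
refined = (
    raw
    .withColumnRenamed("as_src", "src_as")
    .withColumnRenamed("as_dst", "dst_as")
    .withColumn("dt", F.to_date(F.col("stamp_inserted")))
)

# Expose the result through Hive so it can be queried with SQL (hypothetical table).
refined.write.mode("overwrite").partitionBy("dt").saveAsTable("wmf.netflow")
```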
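Similarly, a small sketch of querying the indexed data directly against the Druid broker (Turnilo and Superset issue equivalent queries under the hood). The broker URL, datasource name (wmf_netflow), and metric name are assumptions.

```python
# Hypothetical native query against the Druid broker: total bytes per hour.
# The endpoint, datasource and metric names are assumptions.
import json
import requests

query = {
    "queryType": "timeseries",
    "dataSource": "wmf_netflow",           # assumed datasource name
    "granularity": "hour",
    "intervals": ["2024-01-01/2024-01-02"],
    "aggregations": [
        {"type": "longSum", "name": "bytes", "fieldName": "bytes"}
    ],
}

resp = requests.post(
    "http://druid-broker.example.org:8082/druid/v2/",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
    timeout=30,
)
resp.raise_for_status()

for row in resp.json():
    print(row["timestamp"], row["result"]["bytes"])
```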