This task is a generic parent for tracking all the work done by Infra Foundations / Analytics to ingest, process, and publish Netflow data.
The overall data flow is the following:
- The data is sent from the routers to pmacct in each datacenter, and then forwarded to a Kafka topic on Kafka-Jumbo (via TLS); a consumer sketch follows this list.
- The netflow topic is periodically pulled onto HDFS by Camus. We call this "raw" data.
- The data is then "refined" (a step where changes can be applied to the data) into a new dataset that is also exposed via Hive (see the Hive query sketch after this list).
- The refined data is periodically indexed into Druid, where it can be queried from Turnilo, Superset, etc. (see the Druid query sketch after this list).
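
As a rough illustration of the first hop, the sketch below consumes a few records from the netflow topic over TLS with kafka-python. The topic name, broker hostname/port, CA bundle path, and JSON encoding are assumptions for illustration, not values confirmed by this task.

```python
# Minimal sketch: peek at the netflow topic on Kafka-Jumbo over TLS.
# Topic name, broker address, and CA path are illustrative assumptions.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "netflow",                                                # assumed topic name
    bootstrap_servers=["kafka-jumbo1001.eqiad.wmnet:9093"],   # assumed broker/port
    security_protocol="SSL",
    ssl_cafile="/etc/ssl/certs/ca-certificates.crt",          # assumed CA bundle
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),  # assumes JSON flow records
    auto_offset_reset="latest",
    consumer_timeout_ms=10000,                                # stop iterating after 10s idle
)

# Print a handful of flow records as emitted by pmacct, then stop.
for i, message in enumerate(consumer):
    print(message.value)
    if i >= 4:
        break

consumer.close()
```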
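Likewise, a minimal sketch of reading the refined dataset through the Hive metastore with PySpark. The wmf.netflow table name and the hourly year/month/day/hour partitioning are assumptions used only to make the example concrete.

```python
# Minimal sketch: query the refined netflow dataset registered in Hive.
# Table name and partition columns are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("netflow-refined-example")
    .enableHiveSupport()            # read tables from the Hive metastore
    .getOrCreate()
)

df = spark.sql("""
    SELECT *
    FROM wmf.netflow                -- assumed table name for the refined data
    WHERE year = 2020 AND month = 1 AND day = 1 AND hour = 0
    LIMIT 10
""")
df.show(truncate=False)
```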
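Finally, a sketch of querying the Druid datasource directly through the broker's native query endpoint (the same data that Turnilo and Superset expose). The broker URL, datasource name, and metric column are assumptions.

```python
# Minimal sketch: run a native Druid timeseries query against the broker.
# Broker URL, datasource name, and metric column are illustrative assumptions.
import json
import requests

DRUID_BROKER = "http://druid-broker.example.wmnet:8082/druid/v2/"  # assumed broker URL

query = {
    "queryType": "timeseries",
    "dataSource": "netflow",                        # assumed datasource name
    "granularity": "hour",
    "intervals": ["2020-01-01T00:00/2020-01-01T06:00"],
    "aggregations": [
        {"type": "longSum", "name": "total_bytes", "fieldName": "bytes"}  # assumed metric
    ],
}

response = requests.post(DRUID_BROKER, json=query, timeout=30)
response.raise_for_status()
print(json.dumps(response.json(), indent=2))
```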