Page MenuHomePhabricator

Move netflow data to Eventgate Analytics
Open, MediumPublic

Description

In the parent task we were wondering how to set up automatic sanitization for netflow data, and we realized that we shouldn't really keep using an ad-hoc configuration when we could use something like Eventgate Analytics.

The current dataflow is the following:

routers -> pmacct -> kafka -> hdfs (via camus) -> Hive (via Refine job) -> Druid (via indexation job)

When data is on Druid, then is can be queried by Superset and Turnilo.

We should move to something like:

routers -> pmacct -> eventgate-analytics (via HTTP POST) -> kafka -> hdfs (via camus) -> Hive (via Refine job) -> Druid (via indexation job)

The bit that changes is the ingestion from pmacct to kafka, that should go under Eventgate. This will allow us to re-use the same automation jobs that we'll use for all other analytics events, rather than keeping special configs for netflow.

Data in fact needs to be sanitized (to comply with our retention guidelines) in multiple places:

  • raw data (un-refined, from kafka to hdfs)
  • refined data (data present in Hive etc..)
  • indexed data on Druid

Few notes about Eventgate:

  • the event needs to have a fixed json schema committed to the Event schemas repository
  • the event will be POSTed to Eventgate via a well formed JSON event, that will be validated before being ingested in Kafka.

I can definitely help with the Analytics part, it should be relatively simple, but from a quick chat with Arzhel it seems that our dear pmacct doesn't support json over HTTP, so we should figure out what to do.

Event Timeline

So the options are:

  • add a plugin to pmacct (eg. https://github.com/pierky/pmacct-to-elasticsearch)
  • replace pmacct with something that can do HTTP POST
  • insert something between pmacct and eventgate-analytics that can do HTTP POST (eg. using pmacct json print to stdout)
  • patch eventgate-analytics to be able to receive one of the existing pmacct export plugins

Did I forget something?

For information https://github.com/pmacct/pmacct/blob/master/QUICKSTART#L71 are all the default plugins.

Hm, if you just want to get the Refine side working, all you really need is for the data in Kafka to look right. EventGate gets you schema validation (and maybe a couple of other things, but not much else). If you are the only producer of this data, producing directly to Kafka is totally fine too.
If it is easier to POST from pmacct than use a Kafka Producer, then by all means use eventgate-analytics. But, for the rest of the pipeline, all you need is a proper schema and the data in Kafka.

Hm, if you just want to get the Refine side working, all you really need is for the data in Kafka to look right. EventGate gets you schema validation (and maybe a couple of other things, but not much else). If you are the only producer of this data, producing directly to Kafka is totally fine too.
If it is easier to POST from pmacct than use a Kafka Producer, then by all means use eventgate-analytics. But, for the rest of the pipeline, all you need is a proper schema and the data in Kafka.

Andrew can you expand a bit how things should look like in kafka to be processed by the same pipeline (refine, I guess event database, druid, sanitization etc..) ? It would be really nice to avoid the HTTP POST part indeed.

Indeed I need some better docs on this, and will be writing them as part of the EventLogging migration. Devs will need to be able to figure this out easily. But, some starters:

https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas
https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines

Basically:

  • Add a schema to the schemas/event/secondary repository using those instructions
  • Produce your data to Kafka conforming to your schema
  • Configure Camus to import the topic(s) and Refine to refine the data to Hive's event db

Likely you should also prefix the topics with their source datacenter so we can use the same Refine job configuration we do for e.g. data from EventBus. EventGate does this part for you, but if you are producing with Kafka directly you will have to set the topic name explicitly.

Got it, so basically do all the process of creating the schemas etc.. but push data from pmacct directly to the kafka topic bypassing Eventgate. The fact that we'll have to have something ad-hoc is not appealing to me, the goal of this task was/is to avoid any special settings for netflow, but it is probably the best compromise (and then later on it should be easy to move to Eventgate validation / kafka topic creation if we see any issue).

Next steps:

  1. I'll check the docs that Andrew pointed out and I'll try to come up with some schemas for netflow (I needed to read that doc anyway so a good excuse :)
  2. Once it is reviwed/merged, we'll come up with kafka topics names and we'll configure pmacct accordingly to produce to the new topics.
  3. We'll need to move current hive data to the new location in the event database, and remove old Druid config (the new one should be automagic in theory, or something similar)

Likely you should also prefix the topics with their source datacenter so we can use the same Refine job configuration we do for e.g. data from EventBus. EventGate does this part for you, but if you are producing with Kafka directly you will have to set the topic name explicitly.

This part is interesting, since we currently have a netflow vm/host on every data center that pushes to kafka-jumbo (single topic). What should we do in this case? Have a separate netflow kafka topic for every datacenter, and push data to Jumbo? Or should we use Kafka/Evengate main and aggregate dcs for eqiad/codfw?

We'll need to move current hive data to the new location in the event database

You might end up just producing to a new topic name anyway. The topic name will eventually map to a Hive table name.

Druid config (the new one should be automagic in theory

This could get tricky. IIRC the automated Druid ingestion jobs only work with analytics EventLogging (i.e. EventCapsule) style data right now. @mforns maybe can say more?

This part is interesting, since we currently have a netflow vm/host on every data center that pushes to kafka-jumbo (single topic). What should we do in this case?

  • Have a separate netflow kafka topic for every datacenter, and push data to Jumbo?

I think is correct. This is actually what eventgate-analytics does. Even thought it pushes to Kafka jumbo only in eqiad, its topics are prefixed with the source DC. Events from codfw still use the codfw topic.

(You actually have just stumbled across an ambiguity with our multi DC kafka topic setup. We topic created prefixes to handle multi DC Kafka MirrorMaker, but actually create the prefixes based on the location of the producer (...except for logging cluster Kafka?) So far this is ok, but it will probably bite us in the future. For now we should just be consistent. We'll revisit this if/when we decide to upgrade to MirrorMaker 2.0, which handles multi DC replication more transparently.)

@Ottomata first n00b question - I am trying to think about where to add the netflow schema to the secondary repository, and I have some doubts about the dir structure. Should it be something like jsonschema -> network -> netflow ? Or jsonschema -> sre -> netflow? Or something else :)

Yeah we don't have a great convention for namepacing. For analytics/instrumentation schemas, we decided to keep things simple and keep the hierarchy mostly flat, e.g. analytics/session_tick, analytics/button_click, etc.

If possible, you should try to name and model the schema something like entity+verb. What is a netflow? What does the event represent? E.g. mediawiki/revision/create, resource-change, etc. (Again, we don't have a repo wide hierarchy convention, but we have been trying to at least conceptually represent entity+verb).

Also, more docs have been written for instrumentation devs, which is kinda like what you are doing:
https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To

@CDanis @ayounsi any suggestion? netflow/flow/something ?

The context is that we are trying to move netflow ingestion to a more "standardized" eventgate structure, and we are trying to come up with a schema for the data. The schema name should be something like entity/verb :)

Ya if possible, the schema should be named and modeled after what the event represents. In this case it sounds like it is something like a 'network connection summary event'.

netflow/flow/record or netflow/flow/observe could be ok?

I'd guess I'd still ask what is a "netflow"? Or a "netflow/flow"? I guess if you could defined 'netflow' as a noun in the description of the schema, that would work! network/netflow/observation

BTW, perhaps noun+verb isn't quite right, I think we mean noun+action. The composite name will be a noun, like 'mediawiki/revision/create' event, 'resource-change' event, 'netflow/observation' event, etc.

Keep in mind that the schema name does not have to match the stream name. The schema is like a data type, whereas the stream is an instance of that datatype.

observe could be ok too, sorry didn't mean to make observation sound better or worse. netflow/observe event sounds a little weird but does seem consistent with e.g. revision/create. Fine with either!

ALSO these are just ideas and thoughts! Schemas in secondary repo SHOULD require less bikeshedding than those in primary :)

Netflow in theory is the name of the technology/protocol (see https://tools.ietf.org/html/rfc3954), and IIUC it defines a "flow" as the bytes/packets exchanged between two parties (in this case, a client and us). What pmacct does is to aggregate multiple flows into one, sum all the values and ship the details to kafka. So instead of network I still use netflow, since it should more precisely indicate what we are talking about, but not sure :(

Ah, makes more sense! Great. If pmacct is aggregating, perhaps summary is good in the name?

analytics/network/netflow/flowset?
analytics/netflow/flowset?

Flowset is how an "event" is called on the protocol linked by @elukey above

I don't think this needs to go in 'analytics', but flowset sounds nice if it is accurate.

Change 608077 had a related patch set uploaded (by Elukey; owner: Elukey):
[schemas/event/secondary@master] WIP - Add netflow/flowset schema

https://gerrit.wikimedia.org/r/608077

Some high level issues that came up while talking about netflow on eventgate:

  • pmacct (the daemon that collects netflow data from routers) seems not very flexible. It doesn't allow to set custom fields in the json message to kafka (for example, it might be difficult to add the common fragment discussed in https://gerrit.wikimedia.org/r/608077) and also it doesn't support emitting HTTP POST to an endpoint (like eventgate).
  • Arzhel asked if possible to augment netflow data using GeoIP, so having a more custom refine, but this seems currently not possible yet (since there will be a single generic refine for all eventgate streams IIUC).

We are not in a hurry in moving netflow to eventgate, maybe we can use this example to figure out what needs to be custom and what can be moved over to eventgate. @Ottomata what do you think about this use case? How should we proceed?

doesn't support emitting HTTP POST to an endpoint (like eventgate).

Well if it doesn't support HTTP POST then you won't be moving it to EventGate :p

As for Refine, unless there is some way that the event can encode its $schema URI and its stream (AKA dataset) name, we can't use the general purpose event Refine job to ingest it into Hive. And if we aren't using the general purpose event Refine job, you might as well make a custom one. If you have a custom one, you could probably even get away with not even using an official JSONSchema, kind of like we do with webrequest. I'd like for webrequest to one day have a canonical JSONSchema too though, so having the schema doesn't hurt, and you should be still able to use a custom Refine job with a custom EventSchemaLoader class that always loads your netflow schema.

For this quarter I'd propose to stop the work on moving netflow to eventgate/mep, keeping the current 'ad-hoc' configuration, and then re-evaluate possibly in Q2? What do you think?

Change 608077 abandoned by Elukey:
[schemas/event/secondary@master] WIP - Add netflow/flowset schema

Reason:

https://gerrit.wikimedia.org/r/608077