
Also intake Network Error Logging events into the Analytics Data Lake
Open, Low, Public, 3 Estimated Story Points

Description

Currently we receive NEL data via the EventGate instance eventgate-logging-external. That emits events to Kafka for consumption by Logstash and visualization/analysis in Kibana.

It would also be nice to have this data available for analysis in Hive/Spark/Jupyter etc., with the same standard data retention of 90 days.

How hard would it be to get these events available in both places?
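
For context, an individual NEL report (as defined by the W3C Network Error Logging spec) looks roughly like the sketch below. The values are invented and the EventGate/Event Platform envelope around the report is omitted.

```python
# A made-up example of a single W3C Network Error Logging report,
# per https://www.w3.org/TR/network-error-logging/ -- the EventGate/Event
# Platform envelope (meta, $schema, etc.) that wraps it is not shown.
nel_report = {
    "age": 0,
    "type": "network-error",
    "url": "https://en.wikipedia.org/wiki/Main_Page",
    "body": {
        "referrer": "",
        "sampling_fraction": 1.0,
        "server_ip": "203.0.113.10",   # documentation-range IP, not real
        "protocol": "h2",
        "method": "GET",
        "status_code": 0,
        "elapsed_time": 823,           # milliseconds
        "phase": "connection",
        "type": "tcp.timed_out",
    },
}
```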

Event Timeline

This would be easier if T276972: Set up cross DC topic mirroring for Kafka logging clusters was done, but it doesn't look like there's enthusiasm for it.

I'd love to be able to automate ingestion from kafka logging topics in the same way we do everywhere else. For ingestion into Hadoop, we generally only ingest from the Kafka jumbo-eqiad cluster, which has all (relevant) topics mirrored to it.

Also relevant: T291645: Integrate Event Platform and ECS logs

Assuming we aren't going to do T276972: Set up cross DC topic mirroring for Kafka logging clusters, I think our 2 options are:

  1. Set up special MirrorMaker instances from Kafka logging-eqiad and logging-codfw into Kafka jumbo-eqiad, and mirror only topics that are prefixed with a DC name. We'd just need to figure out where to run this MirrorMaker. Hm, actually, there is no reason this shouldn't run in k8s! It is stateless!
  2. Configure a special Gobblin job (our Kafka->Hadoop ingestor) to consume directly from the Kafka logging clusters.

Option 2 is simpler to set up, but creates a direct dependency between jobs in analytics Hadoop and 'production' (i.e. non-aggregate) Kafka clusters. I'd prefer to do option 1. Option 1 would also help us more with ingesting other topics from Kafka logging, like we'd want for T291645: Integrate Event Platform and ECS logs.
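
As a rough sketch of the DC-prefix filtering in option 1, the MirrorMaker whitelist could be a regex that only matches topics carrying a datacenter prefix. The topic names below are assumptions for illustration, not the actual stream config:

```python
import re

# Hypothetical whitelist: only mirror topics whose names start with a
# datacenter prefix, as discussed for option 1. The concrete pattern and
# topic names would come from the stream config, not from this sketch.
DC_PREFIXED = re.compile(r"^(eqiad|codfw)\..+")

topics_on_logging_cluster = [
    "eqiad.w3c.reportingapi.network_error",   # assumed NEL topic name
    "codfw.w3c.reportingapi.network_error",   # assumed NEL topic name
    "logback-info",                           # not DC-prefixed, would be skipped
]

to_mirror = [t for t in topics_on_logging_cluster if DC_PREFIXED.match(t)]
print(to_mirror)  # -> only the DC-prefixed topics
```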

jbond triaged this task as Medium priority. Mar 22 2022, 4:15 PM

(My two cents) Agreed, option 1 seems preferable. To clarify my position on T276972: I'm not against it per se, I am questioning the "dc prefix" pattern for the kafka-logging use case. But at any rate, definitely +1 to option 1.

How hard is option 1?

I'm starting to think up use cases for NEL data like comparing the ratio of reports/time vs webrequests/time for a given locale or AS number.

It would be awesome to be able to do this inside of Spark / Jupyter.

(I guess something else that *could* work is if there was a way to do Elasticsearch queries from inside Jupyter? But having the data already in Hive makes a bit more sense to me.)
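
For illustration, a query along these lines is what having the data in Hive would enable. The table names and AS-number columns below are assumptions, not the real schemas:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("nel-vs-webrequest").getOrCreate()

# Assumed table and column names -- the real Hive table produced by
# ingestion and the webrequest AS-number field may differ.
nel = (
    spark.table("event.w3c_reportingapi_network_error")          # hypothetical table
    .where("year = 2023 AND month = 3 AND day = 14")
    .groupBy(F.col("client_asn").alias("asn"))                   # hypothetical column
    .agg(F.count("*").alias("nel_reports"))
)

web = (
    spark.table("wmf.webrequest")
    .where("year = 2023 AND month = 3 AND day = 14 AND webrequest_source = 'text'")
    .groupBy(F.col("isp_data.autonomous_system_number").alias("asn"))  # field name may differ
    .agg(F.count("*").alias("webrequests"))
)

# Ratio of NEL reports to webrequests per AS number, highest first.
(nel.join(web, "asn")
    .withColumn("report_ratio", F.col("nel_reports") / F.col("webrequests"))
    .orderBy(F.desc("report_ratio"))
    .show(20, truncate=False))
```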

Not hard at all; there is plenty of Puppet code to support it. We just need to run it somewhere. We currently colocate MirrorMaker on the target cluster's brokers, so we could probably get away with running these 2 new MirrorMaker instances on the Kafka jumbo brokers.

We could make it harder and say, oh! Let's use MirrorMaker 2, it's better! And we could run that in k8s. That would be really nice, but it is definitely not a requirement.

I like the look of this task, so I'm going to claim it if no one minds.
Predictably enough, I think that we should use MirrorMaker 2 and run it in k8s on the wikikube clusters :-)

> Predictably enough, I think that we should use MirrorMaker 2 and run it in k8s on the wikikube clusters :-)

This would be awesome. I'd be reluctant to set up MirrorMaker 2 and not have a plan to replace MirrorMaker 1, though. And... I *think* a single MirrorMaker 2 instance (running via Kafka Connect) can handle replication between all the clusters via config, so it may be possible to replace all the MirrorMaker 1 instances with just one MirrorMaker 2 instance with the appropriate config.

So, we could start by setting up MirrorMaker 2 for both logging-eqiad -> jumbo-eqiad and for logging-codfw -> jumbo-eqiad, with the intention of replacing the other ones eventually too.

Or, um... we could just set up MirrorMaker 1 in k8s now. This would be pretty easy, I think.
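
For what it's worth, a single MirrorMaker 2 deployment covering both logging->jumbo flows might be described with a config roughly like the sketch below. The cluster aliases and broker addresses are placeholders; the property names follow the upstream MirrorMaker 2 (Kafka Connect) configuration.

```python
# Sketch of an mm2.properties covering both logging->jumbo flows with a
# single MirrorMaker 2 deployment (run via connect-mirror-maker.sh).
# Broker addresses are placeholders, not real hostnames.
mm2_properties = r"""
clusters = logging-eqiad, logging-codfw, jumbo-eqiad

logging-eqiad.bootstrap.servers = kafka-logging-eqiad.example.org:9092
logging-codfw.bootstrap.servers = kafka-logging-codfw.example.org:9092
jumbo-eqiad.bootstrap.servers = kafka-jumbo-eqiad.example.org:9092

# Only the two flows we care about; every other cluster pair stays disabled.
logging-eqiad->jumbo-eqiad.enabled = true
logging-eqiad->jumbo-eqiad.topics = (eqiad|codfw)\..*

logging-codfw->jumbo-eqiad.enabled = true
logging-codfw->jumbo-eqiad.topics = (eqiad|codfw)\..*

# Note: MM2's default replication policy prefixes replicated topic names
# with the source cluster alias; keeping the original names (matching what
# MirrorMaker 1 does today) would need a different replication policy,
# which is not shown here.
"""

with open("mm2.properties", "w") as f:
    f.write(mm2_properties)
```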

> I like the look of this task, so I'm going to claim it if no one minds.

Please go right ahead!

I am happy with whatever implementation everyone else is happy with. I simply want to mess with this data in Jupyter+pyspark :)

@BTullis are you still interested in this? Asking because it came up again in some discussion with Traffic about better user latency mapping.

Yes I am still interested. Adding it to our planning board for discussion.

Happy quarterly planning season! I was wondering if there were any updated estimates on when this might happen?

Thanks!

Hi @Ottomata @odimitrijevic @EChetty @lmata @KOfori --

Apologies for the escalation but it has been several months since any update was posted on this task.

I don't think this task will be that much work to resolve.

But if this work doesn't land on someone's OKRs in the next quarter or two, I'm worried that the lack of it is going to wind up blocking upcoming work on both Traffic SLO measurements and future (D)DoS prevention measures.

Can we get this prioritized soon?

Thanks very much for your time and attention :)

@CDanis thanks for bubbling this up. We'll discuss when we get back in January to understand what the effort entails. We may have some additional questions about your specific use case to understand how to prioritize against the many other requests.

> @CDanis thanks for bubbling this up. We'll discuss when we get back in January to understand what the effort entails. We may have some additional questions about your specific use case to understand how to prioritize against the many other requests.

Hi Olja, happy almost-February! Just wanted to check in about this again :)

Hey @CDanis!
Sorry to respond super late; the team and I have been trying to figure out a path forward with this.

Unfortunately, it doesn't look like we will be ready to deploy an instance of MirrorMaker to the DSE anytime this quarter. How urgent is this? The simplest path forward is probably to deploy it to a Ganeti VM instead. @Ottomata could help guide/deploy these instances of MirrorMaker, and ideally @BTullis could help commission a couple of VMs for us (but help on getting the VMs commissioned from your end would be greatly appreciated). Or would you prefer to wait until Q4/Q1, when we may be ready to move it to the DSE?

I think the best path forward right now is:

  1. 4 new Ganeti instances in eqiad, kafka-mirror100[1234] (I don't think we need more than 2 per role).
  2. New Puppet roles: role::kafka::mirror::logging_eqiad_to_jumbo_eqiad and role::kafka::mirror::logging_codfw_to_jumbo_eqiad, each using profile::kafka::mirror. See these Hiera params for an example usage of profile::kafka::mirror.
  3. Each of those roles applied to 2 of the new Ganeti hosts.

If we wanted to spend a little bit of extra time, we could make profile::kafka::mirror multi-instance, and run multiple MirrorMaker instances on a single host. If we did that, then I think only 2 new ganeti VMs and only one new puppet role would be needed.
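
Whichever option lands, a quick smoke test from an analytics host could just list the topics visible on jumbo-eqiad and check that the DC-prefixed NEL topics arrived. The broker address and topic names below are placeholders:

```python
from kafka import KafkaConsumer  # kafka-python

# Placeholder broker address; use the real jumbo-eqiad bootstrap brokers.
consumer = KafkaConsumer(bootstrap_servers="kafka-jumbo-eqiad.example.org:9092")

all_topics = consumer.topics()  # set of topic names visible on the cluster

# Assumed NEL topic names -- adjust to the real stream/topic naming.
expected = {
    "eqiad.w3c.reportingapi.network_error",
    "codfw.w3c.reportingapi.network_error",
}
missing = expected - all_topics
print("missing topics:", missing or "none -- mirroring looks healthy")
```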

Thanks! I think a Ganeti VM would be fine. Can I ask, what's the issue with deploying to the DSE cluster? We could also consider the new k8s 'aux' cluster that SRE Infrastructure Foundations recently created.

> Can I ask, what's the issue with deploying to the DSE cluster? We could also consider the new k8s 'aux' cluster that SRE Infrastructure Foundations recently created.

No issue at all; that would be great. It would just require Helm chart dev work, whereas setting it up in Ganeti via Puppet would be as simple as including an existing profile class.

> Can I ask, what's the issue with deploying to the DSE cluster?

It's capacity. Right now we only really have one SRE across all of DE, and they are spread quite thin as it is. That said, we could provision a namespace for you on the DSE to deploy it to, if that's the route you wish to go down. Getting through the Helm + dev work here is especially time-consuming, since we share so much with wikikube.

JArguello-WMF lowered the priority of this task from Medium to Low. Tue, Mar 14, 6:00 PM