
Also intake Network Error Logging events into the Analytics Data Lake
Closed, Resolved · Public · 3 Estimated Story Points

Description

Currently we receive NEL data via the EventGate instance eventgate-logging-external. That emits events to Kafka for consumption by Logstash and visualization/analysis in Kibana.

It would also be nice to have this data available for analysis in Hive/Spark/Jupyter etc. Same standard data retention of 90 days.

How hard would it be to get these events available in both places?

Event Timeline

There are a very large number of changes, so older changes are hidden.

Adding @brouberol because I bet he'd be interested! :)

BTullis raised the priority of this task from Low to High. Mar 26 2025, 3:37 PM

Increasing the priority of this, because red tickets go faster.

This request has been discussed in various other channels and we all want to make it happen. I will start working on the mirror-maker part as soon as possible.

Any updates on this work, or any guesses on when there might be time?

FTR: This is data that might potentially be useful for SDS1.3.

Hi @CDanis ! I'm going to take a stab at it. Given that this ticket has been sitting here for a long time, I'm going to attempt to do this "right":

  • use a modern version of kafka / mirrormaker
  • running in our dse-k8s-eqiad kubernetes cluster
  • using external services to define network policies

Just to clarify, what would be the topic names we'd need to mirror?

Change #1192109 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] kafka-mirrormaker: initial scaffolding

https://gerrit.wikimedia.org/r/1192109

Change #1192110 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] kafka-mirrormaker: define business logic

https://gerrit.wikimedia.org/r/1192110

Change #1192111 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] kafka-mirrormaker: define helmfile

https://gerrit.wikimedia.org/r/1192111

Change #1192117 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Define the kafka-mirrromaker kubeconfigs in dse-k8s-eqiad

https://gerrit.wikimedia.org/r/1192117

Change #1192118 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] kafka-mirrormaker: define the namespace in dse-k8s-eqiad

https://gerrit.wikimedia.org/r/1192118

@brouberol it would be useful to mirror all Event Platform compatible topics, e.g. those with streams declared in EventStreamConfig. In this case, those are the streams that have destination_event_service=eventgate-logging-external.

E.g.

curl 'https://meta.wikimedia.org/w/api.php?action=streamconfigs&format=json&constraints=destination_event_service=eventgate-logging-external' | jq '.streams[].topics'

[
  "eqiad.mediawiki.client.error",
  "codfw.mediawiki.client.error"
]
[
  "eqiad.kaios_app.error",
  "codfw.kaios_app.error"
]
[
  "eqiad.w3c.reportingapi.network_error",
  "codfw.w3c.reportingapi.network_error"
]
[
  "eqiad.eventgate-logging-external.test.event",
  "codfw.eventgate-logging-external.test.event"
]
[
  "eqiad.eventgate-logging-external.error.validation",
  "codfw.eventgate-logging-external.error.validation"
]

Once they are mirrored to kafka jumbo-eqiad, we can set consumers.analytics_hive_ingestion.enabled: true for them, and they will be auto ingested to Hive via Refine just like all other event streams.

Hm, we probably should also set canary_events_enabled: true for these, but that does mean that artificial canary events would be produced into the streams and show up in logstash. Should be okay, but we should notify any potential users of the streams there.

Cool, I can inject the topics via a script we could commit to deployment-charts to maintain that topic list over time:

#!/usr/bin/env python3

import requests
import yaml
from collections import defaultdict

r = requests.get(
    "https://meta.wikimedia.org/w/api.php?action=streamconfigs&format=json&constraints=destination_event_service=eventgate-logging-external",
    headers={"User-Agent": "Data Platform SRE <data-platform-alerts@wikimedia.org>"},
)
r.raise_for_status()

# Group topic names by datacenter, stripping the "<site>." prefix.
topics = defaultdict(list)
for stream, stream_config in r.json()['streams'].items():
    for topic in stream_config['topics']:
        for site in ('eqiad', 'codfw'):
            if topic.startswith(f'{site}.'):
                # removeprefix (Python 3.9+) only strips the leading prefix,
                # unlike str.replace, which would also rewrite later occurrences.
                topics[site].append(topic.removeprefix(f'{site}.'))

for site in ('eqiad', 'codfw'):
    config_filename = f'values-logging-{site}-to-jumbo-eqiad.yaml'
    data = {
        "config": {
            "mm2": {
                "name": f"logging-{site}-to-jumbo-eqiad",
                "clusters": f"logging-{site},jumbo-eqiad",
                "source.cluster.alias": f"logging-{site}",
                "source.bootstrap.servers": f"kafka-logging-{site}.external-services.svc.cluster.local:9093",
                "source->destination.topics": site + '.(' + '|'.join(sorted(topics[site])) + ')',
                "destination.cluster.alias": "jumbo-eqiad",
                "destination.bootstrap.servers": "kafka-jumbo-eqiad.external-services.svc.cluster.local:9093",
            }
        },
        "external_services": {
            "kafka": [f"logging-{site}", "jumbo-eqiad"]
        }
    }
    with open(config_filename, 'w') as vf:
        yaml.dump(data, vf)
helmfile.d/dse-k8s-services/kafka-mirrormaker/values-logging-codfw-to-jumbo-eqiad.yaml
config:
  mm2:
    clusters: logging-codfw,jumbo-eqiad
    destination.bootstrap.servers: kafka-jumbo-eqiad.external-services.svc.cluster.local:9093
    destination.cluster.alias: jumbo-eqiad
    name: logging-codfw-to-jumbo-eqiad
    source->destination.topics: codfw.(eventgate-logging-external.error.validation|eventgate-logging-external.test.event|kaios_app.error|mediawiki.client.error|w3c.reportingapi.network_error)
    source.bootstrap.servers: kafka-logging-codfw.external-services.svc.cluster.local:9093
    source.cluster.alias: logging-codfw
external_services:
  kafka:
  - logging-codfw
  - jumbo-eqiad
helmfile.d/dse-k8s-services/kafka-mirrormaker/values-logging-eqiad-to-jumbo-eqiad.yaml
config:
  mm2:
    clusters: logging-eqiad,jumbo-eqiad
    destination.bootstrap.servers: kafka-jumbo-eqiad.external-services.svc.cluster.local:9093
    destination.cluster.alias: jumbo-eqiad
    name: logging-eqiad-to-jumbo-eqiad
    source->destination.topics: eqiad.(eventgate-logging-external.error.validation|eventgate-logging-external.test.event|kaios_app.error|mediawiki.client.error|w3c.reportingapi.network_error)
    source.bootstrap.servers: kafka-logging-eqiad.external-services.svc.cluster.local:9093
    source.cluster.alias: logging-eqiad
external_services:
  kafka:
  - logging-eqiad
  - jumbo-eqiad

I had a look in grafana, and the overall size of the topics is very small.

Screenshot 2025-09-29 at 16.24.44.png (834×2 px, 169 KB)

Now that we have a k8s cluster in codfw, we could deploy a logging->jumbo mirrormaker instance in each cluster and have them run continuously. This way, we would be able to tolerate DC switchovers. That being said, Mirrormaker prefixes mirrored topics with the source cluster alias (ex: logging-eqiad.eqiad.eventgate-logging-external). We will probably be able to tweak the final name, but I want to emphasize that we would otherwise end up with two topics whose names differ from what is in the logging clusters.

Change #1192117 merged by Brouberol:

[operations/puppet@production] Define the kafka-mirrromaker kubeconfigs in all dse-k8s clusters

https://gerrit.wikimedia.org/r/1192117

That being said, Mirrormaker prefixes mirrored topics with the cluster source name

This is MirrorMaker 2, I suppose, ya?
See also T277467: Upgrade to Kafka MirrorMaker 2 and revisit Kafka topic prefix convention and https://wikitech.wikimedia.org/wiki/Kafka#Data_center_topic_prefixing_design_flaw.

I want to emphasize that we will end up with 2 topics which name will differ from what is in the logging clusters.

This will be a problem for automated ingestion. The gobblin+Refine stuff relies on the fact that topics are prefixed with datacenter name.

Ok so in that case I should disable topic prefixing. I'll see how I can do that.

Oh and yes, this is mirrormaker2.

According to https://stackoverflow.com/a/70994290, I can do this by setting

replication.policy.class=org.apache.kafka.connect.mirror.IdentityReplicationPolicy
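To illustrate what that setting changes, here is a quick sketch (illustrative Python, not Kafka's actual Java classes) of how the two replication policies name a mirrored topic:

```python
# Sketch of how MirrorMaker 2 replication policies name mirrored topics.
# These mimic DefaultReplicationPolicy and IdentityReplicationPolicy;
# they are illustrative, not the real implementations.

def default_policy_topic(source_alias: str, topic: str, separator: str = ".") -> str:
    """DefaultReplicationPolicy prefixes the source cluster alias."""
    return f"{source_alias}{separator}{topic}"

def identity_policy_topic(source_alias: str, topic: str) -> str:
    """IdentityReplicationPolicy keeps the topic name unchanged."""
    return topic

print(default_policy_topic("logging-eqiad", "eqiad.w3c.reportingapi.network_error"))
# -> logging-eqiad.eqiad.w3c.reportingapi.network_error
print(identity_policy_topic("logging-eqiad", "eqiad.w3c.reportingapi.network_error"))
# -> eqiad.w3c.reportingapi.network_error
```

With the identity policy, the mirrored topic keeps its datacenter prefix and nothing else, which is what gobblin+Refine expect.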

I'll report back after having run some tests.

Great, thank you! And sorry about that. One magical day in the future we'll fix the flaw and do the cluster name prefixing :/

Change #1192118 merged by Brouberol:

[operations/deployment-charts@master] kafka-mirrormaker: define the namespace in dse-k8s-eqiad

https://gerrit.wikimedia.org/r/1192118

it's all good, no worries at all!

FWIW, 3 of these topics already exist on kafka-jumbo-eqiad (probably already handled by existing mirrormaker instances managed by puppet):

  • eqiad.w3c.reportingapi.network_error
  • codfw.w3c.reportingapi.network_error
  • eqiad.eventgate-logging-external.test.event
brouberol@kafka-jumbo1010:~$ kafka topics --describe --topic eqiad.mediawiki.client.error
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic eqiad.mediawiki.client.error
brouberol@kafka-jumbo1010:~$ kafka topics --describe --topic codfw.mediawiki.client.error
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic codfw.mediawiki.client.error
brouberol@kafka-jumbo1010:~$ kafka topics --describe --topic eqiad.kaios_app.error
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic eqiad.kaios_app.error
brouberol@kafka-jumbo1010:~$ kafka topics --describe --topic codfw.kaios_app.error
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic codfw.kaios_app.error
brouberol@kafka-jumbo1010:~$ kafka topics --describe --topic eqiad.w3c.reportingapi.network_error
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic eqiad.w3c.reportingapi.network_error
Topic:eqiad.w3c.reportingapi.network_error	PartitionCount:1	ReplicationFactor:3	Configs:
	Topic: eqiad.w3c.reportingapi.network_error	Partition: 0	Leader: 1014	Replicas: 1014,1018,1010	Isr: 1014,1018,1010
brouberol@kafka-jumbo1010:~$ kafka topics --describe --topic codfw.w3c.reportingapi.network_error
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic codfw.w3c.reportingapi.network_error
Topic:codfw.w3c.reportingapi.network_error	PartitionCount:1	ReplicationFactor:3	Configs:
	Topic: codfw.w3c.reportingapi.network_error	Partition: 0	Leader: 1016	Replicas: 1016,1010,1014	Isr: 1016,1014,1010
brouberol@kafka-jumbo1010:~$ kafka topics --describe --topic eqiad.eventgate-logging-external.test.event
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic eqiad.eventgate-logging-external.test.event
Topic:eqiad.eventgate-logging-external.test.event	PartitionCount:1	ReplicationFactor:3	Configs:
	Topic: eqiad.eventgate-logging-external.test.event	Partition: 0	Leader: 1012	Replicas: 1012,1018,1017	Isr: 1012,1018,1017
brouberol@kafka-jumbo1010:~$ kafka topics --describe --topic codfw.eventgate-logging-external.test.event
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic codfw.eventgate-logging-external.test.event
brouberol@kafka-jumbo1010:~$ kafka topics --describe --topic eqiad.eventgate-logging-external.error.validation
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic eqiad.eventgate-logging-external.error.validation
brouberol@kafka-jumbo1010:~$ kafka topics --describe --topic codfw.eventgate-logging-external.error.validation
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic codfw.eventgate-logging-external.error.validation
brouberol@kafka-jumbo1010:~$

I wasn't able to run mirrormaker2 against our Kafka 1.1 clusters. I don't have a firm answer as to the minimal broker version mm2 can work with, but 1.1 seems to be too old. I was able to make mirrormaker 1 work in Kubernetes though! Now that this is done, we can:

  • mirror the topics @CDanis is interested in
  • deprecate running MM1 on the kafka brokers, have all of those instances run in the same pod, and consolidate the topic regexes

WDYT?

Change #1192109 merged by Brouberol:

[operations/deployment-charts@master] kafka-mirrormaker: initial scaffolding

https://gerrit.wikimedia.org/r/1192109

Change #1192110 merged by jenkins-bot:

[operations/deployment-charts@master] kafka-mirrormaker: define business logic

https://gerrit.wikimedia.org/r/1192110

Change #1192111 merged by Brouberol:

[operations/deployment-charts@master] kafka-mirrormaker: define helmfile

https://gerrit.wikimedia.org/r/1192111

The mirrormaker pods are running in both dse-k8s-eqiad and dse-k8s-codfw:

brouberol@deploy2002:~$ kube_env kafka-mirrormaker dse-k8s-eqiad
brouberol@deploy2002:~$ kubectl get pod
NAME                                                              READY   STATUS    RESTARTS   AGE
kafka-mirrormaker-logging-eqiad-to-jumbo-eqiad-7f75b974c6-2ztsx   1/1     Running   0          8m29s
brouberol@deploy2002:~$ kube_env kafka-mirrormaker dse-k8s-codfw
brouberol@deploy2002:~$ kubectl get pod
NAME                                                            READY   STATUS    RESTARTS   AGE
kafka-mirrormaker-logging-codfw-to-jumbo-eqiad-d5fd8fc4-b6zrc   1/1     Running   0          4m38s

The MM pod in dse-k8s-codfw replicates kafka-logging-codfw to kafka-jumbo-eqiad, and conversely, the MM pod in dse-k8s-eqiad replicates kafka-logging-eqiad to kafka-jumbo-eqiad.

The topic regex I've configured is basically eqiad\..* for kafka-logging-eqiad, and codfw\..* for kafka-logging-codfw.
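A quick sketch of how that regex selects topics, using a hypothetical sample topic list:

```python
import re

# The per-instance topic regex described above; the sample topic names are made up.
pattern = re.compile(r"eqiad\..*")
topics = [
    "eqiad.w3c.reportingapi.network_error",
    "eqiad.mediawiki.client.error",
    "codfw.w3c.reportingapi.network_error",  # mirrored by the codfw instance, not this one
    "__consumer_offsets",                    # internal topic, never matched
]
mirrored = [t for t in topics if pattern.fullmatch(t)]
print(mirrored)
# -> ['eqiad.w3c.reportingapi.network_error', 'eqiad.mediawiki.client.error']
```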

We can see that the .*.w3c.reportingapi.network_error topics are replicated to kafka-jumbo-eqiad.

brouberol@kafka-jumbo1010:~$ kafka topics --list | grep w3c.reportingapi.network_error
codfw.w3c.reportingapi.network_error
eqiad.w3c.reportingapi.network_error
brouberol@kafka-jumbo1010:~$ kafka topics --describe --topic codfw.w3c.reportingapi.network_error
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic codfw.w3c.reportingapi.network_error
Topic:codfw.w3c.reportingapi.network_error	PartitionCount:1	ReplicationFactor:3	Configs:
	Topic: codfw.w3c.reportingapi.network_error	Partition: 0	Leader: 1016	Replicas: 1016,1010,1014	Isr: 1016,1014,1010
brouberol@kafka-jumbo1010:~$ kafka topics --describe --topic eqiad.w3c.reportingapi.network_error
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic eqiad.w3c.reportingapi.network_error
Topic:eqiad.w3c.reportingapi.network_error	PartitionCount:1	ReplicationFactor:3	Configs:
	Topic: eqiad.w3c.reportingapi.network_error	Partition: 0	Leader: 1014	Replicas: 1014,1018,1010	Isr: 1014,1018,1010

I can see that there's data in the codfw.w3c.reportingapi.network_error topic:

brouberol@kafka-jumbo1010:~$ kafkacat -C -b localhost:9092 -t codfw.w3c.reportingapi.network_error -c 1 | jq .
{
  "age": 24088,
  "body": {
    "elapsed_time": 46,
    "method": "GET",
    "phase": "application",
    "protocol": "h2",
    "referrer": "",
    "sampling_fraction": 0.05,
    "server_ip": "...",
    "status_code": 200,
    "type": "abandoned"
  },
  "type": "network-error",
  "url": "https://www.wikipedia.org/",
  "user_agent": "[REDACTED]",
  "$schema": "/w3c/reportingapi/network_error/1.0.0",
  "meta": {
    "stream": "w3c.reportingapi.network_error",
    "id": "...",
    "dt": "2025-09-24T21:24:05.388Z",
    "request_id": "..."
  },
  "http": {
    "request_headers": {
      "user-agent": "[REDACTED]",
      "x-geoip-isp": "JSC Severen-Telecom",
      "x-geoip-organization": "JSC Severen-Telecom",
      "x-geoip-as-number": "24739",
      "x-geoip-country": "RU",
      "x-geoip-subdivision": "SPE"
    },
    "client_ip": "[REDACTED]"
  }
}

The eqiad.w3c.reportingapi.network_error topic is empty as eqiad is no longer the primary DC.

brouberol@kafka-jumbo1010:~$ kafkacat -C -b localhost:9092 -t eqiad.w3c.reportingapi.network_error -c 1 | jq .
% Reached end of topic eqiad.w3c.reportingapi.network_error [0] at offset 0

@Ottomata Would you mind walking me through what needs to be done to ensure that data is then ingested by gobblin + refine, in order to turn it into a hive table?

The MM pod in dse-k8s-codfw replicates kafka-logging-codfw to kafka-jumbo-eqiad, and conversely

Will probably be fine, especially for these use cases, but: we've generally tried to run MirrorMaker instances (and producer clients) in the same DC as the Kafka cluster they are producing to, avoiding cross-DC produce requests.

E.g. kafka main-codfw -> kafka jumbo-eqiad runs in eqiad.

Consuming can always retry from offsets, but producing retries require local buffering in the client, and if it fails hard for some reason, that is harder to recover from. In MirrorMaker's case... maybe not so hard, because it is a simple mirror and has the source kafka cluster as a 'buffer' to retry from, but still.

Anyway, just food for thought, in case you want to consider running both MirrorMakers in eqiad.

what needs to be done to ensure that data is then ingested by gobblin + refine

@brouberol of course! Actually, I'll just update docs ;)

https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration#consumers.analytics_hive_ingestion_settings

So, analytics_hadoop_ingestion (gobblin) is already true for these streams (by default), so I'd expect that gobblin is already trying to consume this data into HDFS.

And indeed, I think it has started ;)

hdfs dfs -ls /wmf/data/raw/event/codfw.w3c.reportingapi.network_error/year=2025/month=09/day=25/hour=05
-rw-r-----   3 analytics analytics-privatedata-users    1934690 2025-09-30 15:26 /wmf/data/raw/event/codfw.w3c.reportingapi.network_error/year=2025/month=09/day=25/hour=05/part.task_event_default_1759245931007_457_7.txt.gz

So, you just need to remove the analytics_hive_ingestion overrides for these streams in EventStreamConfig. analytics_hive_ingestion also defaults to true, but it is set to false for these streams.

To do so, follow instructions here: https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration#Declaring_streams_and_editing_stream_config_settings

The relevant place to edit is https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/%2B/refs/heads/master/wmf-config/ext-EventStreamConfig.php#1188

Oh, and right. As the comment says, we should enable canary events (this will make hive ingestion work better).

Can you ask o11y (maybe @colewhite ?) to set up a logstash filter that drops all events where meta.domain == 'canary'?
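In other words, the filter just needs to drop any event whose meta.domain is 'canary'. A minimal Python sketch of that predicate (the real filter is Logstash configuration; these sample events are made up):

```python
def is_canary(event: dict) -> bool:
    """True when the event is an artificial canary event."""
    return event.get("meta", {}).get("domain") == "canary"

# Hypothetical events: one canary, one real.
events = [
    {"meta": {"domain": "canary"}},
    {"meta": {"domain": "www.wikipedia.org"}},
]
kept = [e for e in events if not is_canary(e)]
print(len(kept))
# -> 1
```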

@brouberol if you want, I'm happy to add this onto my existing patch against the same file: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1192572

Would be happy to deploy it with you if you haven't done a MW config change before :)

@CDanis yes please! Let's maybe wait for the topic to be caught up?

OK, sounds good to me! I'll merge and deploy my patch now then, and can help you with the other one later.

Change #1192914 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: w3creportingapi drop canary events

https://gerrit.wikimedia.org/r/1192914

Change #1192914 merged by Cwhite:

[operations/puppet@production] logstash: w3creportingapi drop canary events

https://gerrit.wikimedia.org/r/1192914

Can you ask o11y (maybe @colewhite ?) to set up a logstash filter that drops all events where meta.domain == 'canary'?

This filter is in place.

Change #1192950 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] EventStreamConfig - Enable hive ingestion for eventgate-logging-external based streams

https://gerrit.wikimedia.org/r/1192950

@brouberol Let's deploy together Monday, my morning/your afternoon?

Change #1192950 merged by jenkins-bot:

[operations/mediawiki-config@master] EventStreamConfig - Enable hive ingestion for eventgate-logging-external based streams

https://gerrit.wikimedia.org/r/1192950

Mentioned in SAL (#wikimedia-operations) [2025-10-06T13:39:57Z] <cdanis@deploy2002> Started scap sync-world: Backport for [[gerrit:1192950|EventStreamConfig - Enable hive ingestion for eventgate-logging-external based streams (T304373)]]

Mentioned in SAL (#wikimedia-operations) [2025-10-06T13:46:36Z] <cdanis@deploy2002> cdanis, otto: Backport for [[gerrit:1192950|EventStreamConfig - Enable hive ingestion for eventgate-logging-external based streams (T304373)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-10-06T13:52:22Z] <cdanis@deploy2002> Finished scap sync-world: Backport for [[gerrit:1192950|EventStreamConfig - Enable hive ingestion for eventgate-logging-external based streams (T304373)]] (duration: 12m 24s)

@CDanis Sorry, I was on PTO. I can see from SAL that you deployed? How did it go?

@CDanis Sorry, I was on PTO. I can see from SAL that you deployed? How did it go?

Very easy. Mediawiki config deploy (the quick flavor of scap backport), then rolling restart the eventgates with the usual helmfile --state-values mechanism.

select count(*), min(meta.dt), max(meta.dt)
from event.w3c_reportingapi_network_error 

107298993	2025-10-06T09:00:00.000Z	2025-10-17T12:59:59.992Z

Thanks again! So happy to finally have this in Hive

Screenshot 2025-10-20 at 13.48.12.png (1×3 px, 640 KB)
Nice! I can see it in Superset as well. Glad to have helped :)