User Details
- User Since
- Jun 21 2021, 2:34 PM (242 w, 12 h)
- Availability
- Available
- LDAP User
- TChin
- MediaWiki User
- TChin (WMF) [ Global Accounts ]
Thu, Jan 29
I chatted with @Ottomata about this a little bit, here's what I'm going to attempt:
Dec 3 2025
@fkaelin How urgent is the need for this stream? We're considering moving off of PyFlink, and this would be a good opportunity to spike on a Java pipeline instead of doing a quick implementation now and then dealing with the complexities of any migration pains later.
Nov 14 2025
Would we also need to explicitly create the topics in main? Is auto topic creation enabled there?
Nov 12 2025
What work is required to produce mediawiki.page_content_change.v1 to Kafka main? I'm expecting just some helmfile changes (for example, in mw-page-content-change-enrich/values-codfw.yaml) and no changes in the mediawiki event enrichment code, right?
Oct 28 2025
Talked with Andrew about this more. The main problem is that MediaWiki is active/passive, but eventgate is basically active/active. The external eventgate instances will expect traffic on both DCs, but the internal ones would see activity on the active DC (though they may still get events on the passive DC).
Oct 27 2025
Wow, this was harder than I thought. What we need to happen is to detect the active DC via mediawiki_wmf_master_datacenter, which is indicated by the datacenter label, and only alert on the active datacenter. All metrics have a site label, which is (from what I can tell) the datacenter the metric is exported from.
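A rough guess at how that could look as a PromQL expression, assuming the label names above; service_p99_latency_seconds, the threshold, and the == 1 convention on the indicator metric are all placeholders, not the real alert rule:

```promql
# Keep only latency series whose site matches the active datacenter.
# label_replace copies the datacenter label of the indicator metric into
# a site label so the "and on (site)" set operation can join on it.
service_p99_latency_seconds > 0.5
  and on (site)
label_replace(
  mediawiki_wmf_master_datacenter == 1,
  "site", "$1", "datacenter", "(.*)"
)
```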
Sep 26 2025
Hmmm, ok, everything is deployed now and it works fine, but I can't tell if the p99 performance got worse or if the express metrics are broken somehow (either broken beforehand and now fixed, or vice versa). What makes me suspicious is that before the deployment, the latency quantiles by HTTP method showed a GET and POST p99 of almost exactly 9.90ms for every deployment and every instance. After the deployment, p99 actually correlates with the number of events received. I'm assuming this means something was actually fixed somewhere, but because of this, alerts are being fired on the passive DC due to the bursty nature of events there and the latency increase that correlates with it.
In the logs I spotted another offender
{"@timestamp":"2025-09-26T16:47:14.571Z","ecs.version":"8.10.0","log.level":"info","message":"Overriding meta.dt in event b63f71b4-d6ff-4a1f-8544-e01d11df60c3 of schema at /sparql/query/1.3.0 destined to stream wdqs-external.sparql-query from 2025-09-26T16:47:14.499Z to 2025-09-26T16:47:14.571Z.","service":{"name":"eventgate-analytics"}}
Deployed to eventgate-analytics-external and it looks stable. Proceeding to deploy to the remaining instances.
Fully deployed to eventgate-logging-external. Logs seem fine. No log spam, the duplicate dropped fields are fixed, and it's fully ingested into logstash in ECS format now. Metrics also look good. It's a bit fuzzy because this happened during the DC switchover, but looking at the dashboard it seems like all metrics still match except the only one lost is the one I stated before that's under Memory usage (sum over all pods). That doesn't concern me that much since the service is also now being picked up by the new service-utils metrics dashboard so we still have memory reporting.
Sep 23 2025
Also, these three metrics stopped being reported, and I don't really know why, since from what I can tell they're Kubernetes metrics:
nodejs_process_heap_used_bytes
nodejs_process_heap_total_bytes
nodejs_process_heap_rss_bytes
Deployed to eventgate-logging-external for codfw. It works in the sense that it didn't blow up, but I'll have to fix some stuff before I deploy the rest of it. Logs export fine in ECS, but for some reason a lot of fields are being dropped. Metrics show up in the dashboards, but some need renaming, and I also forgot to add metrics for the express routes.
Sep 16 2025
I'm going to try to upgrade mw-content-history-reconcile-enrich-next to Flink 1.20 to see if it magically fixes the issue, but I won't do any work migrating off deprecated config and such in this ticket. If the issue doesn't get fixed, at least the update includes a feature that allows us to profile the JobManager using the Flink Web UI, which could be useful.
Aug 29 2025
oo very nice!! I wonder how it'd compare to a pure java version of the deequ code. Maybe if we switch to SQL we can take the opportunity to revamp the metrics table schema?
Aug 18 2025
@dcausse fyi I just deployed eventstreams with your patch
Jul 28 2025
assignments is now stringified in KafkaSSE, but in the logs I see that assignments is in normalized.dropped.no_such_field. Is there something I'm missing? @colewhite
Jul 25 2025
Yeah it can be closed out
Jul 21 2025
If you want to drive it, be my guest and I can help you out if needed. Or we can pair program together. Whichever you prefer
Jul 7 2025
I should also mention that metrics eventually recovered, probably due to T383977 unintentionally doing rolling restarts of all the pods, resetting the metrics. Taking a look at that ticket again since I'm already in the code.
Was digging through the logs and found a bunch of these InvalidAssignmentError requests, which are caused by setting a malformed last-event-id header. When set, this header takes precedence over the stream URL parameter (which can lead to another bug if the stream names differ from the topics in the header), and it passes all of eventstreams's checks until it gets handed to KafkaSSE, where it blows up. This somehow fires both the close and finish events.
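A sketch of the kind of pre-check eventstreams could do before handing assignments to KafkaSSE. This is Python pseudocode of the idea (eventstreams itself is Node), and the field names (`topic`, `partition`) are assumptions about the header format:

```python
import json

def validate_last_event_id(header, known_topics):
    """Hypothetical pre-check: parse the Last-Event-ID JSON and reject
    anything malformed before KafkaSSE blows up on it. Field names are
    assumptions about the assignment format."""
    try:
        assignments = json.loads(header)
    except ValueError:
        return False
    if not isinstance(assignments, list) or not assignments:
        return False
    for a in assignments:
        if not isinstance(a, dict):
            return False
        if 'topic' not in a or 'partition' not in a:
            return False
        # Reject topics that don't match the requested streams, so the
        # header can't silently override the url stream parameter.
        if a['topic'] not in known_topics:
            return False
    return True
```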
Jun 30 2025
Adjusted airflow variables to use the new conda artifact. Should be good to go now. Now the only question is how long the metrics computation will take...
Jun 26 2025
Just noticed that the metrics computation script deletes any duplicated metrics WHERE partition_ts = CAST('{args.min_timestamp}' AS TIMESTAMP) in case of reruns. However, for all-of-wiki-time, min_timestamp is always 2000-01-01T00:00:00. We need the partition_ts column to be the max_timestamp for this case.
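A minimal sketch of the proposed fix, with illustrative names (the function and constant are not the real script's): when min_timestamp is the all-of-wiki-time sentinel, key the dedup on max_timestamp instead:

```python
# Sentinel used by all-of-wiki-time runs (per the comment above).
ALL_TIME_SENTINEL = '2000-01-01T00:00:00'

def dedup_key_timestamp(min_timestamp, max_timestamp):
    """Pick which timestamp the dedup DELETE should compare partition_ts
    against: max_timestamp for all-of-wiki-time reruns (where min is the
    sentinel), min_timestamp otherwise."""
    if min_timestamp == ALL_TIME_SENTINEL:
        return max_timestamp
    return min_timestamp
```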
The metrics table has no unique index I can match on to update rows, so I had to match on almost every column, but it worked, I guess.
Jun 24 2025
Now that the metrics segregation is implemented, we should update the legacy metrics with the computation class before implementing the monthly metrics.
spark-sql (default)> SELECT COUNT(*) AS count
> FROM wmf_data_ops.data_quality_metrics
> WHERE tags['project'] = 'mediawiki_content_history'
> AND (tags['computation_class'] IS NULL OR tags['computation_class'] = '')
> ;
count
577920
May 16 2025
I think the solution is to make the code aware of both endpoints, and then pick the correct one inside the SparkSubmitOperator based on the launcher param before it sets the rest of the config. Right now the endpoint is set by a Jinja template, but by the time airflow templates the string it's probably too late?
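Something like the following could run in plain Python at DAG-definition time, before the operator renders any templates; the launcher names and endpoint URLs are placeholders, not the real config values:

```python
# Placeholder mapping: launcher param -> endpoint. The real values would
# come from the actual config, not hardcoded strings.
ENDPOINTS = {
    'skein': 'thrift://yarn-metastore.example.org:9083',
    'kubernetes': 'thrift://k8s-metastore.example.svc:9083',
}

def resolve_endpoint(launcher):
    """Resolve the endpoint from the launcher param in plain Python,
    so it is fixed before the operator builds the rest of its config."""
    try:
        return ENDPOINTS[launcher]
    except KeyError:
        raise ValueError('unknown launcher: %s' % launcher)
```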
May 11 2025
Deployed and updated airflow variables to use artifact v0.6.0
May 2 2025
On the service-utils side, that property should've been filled in by express:
Apr 25 2025
subject_id should be base64-decodable and its length prior to decoding should be at least 22 characters
Also, reading into this ticket more: any event that is sent with an X-Experiment-Enrollments header but doesn't have an experiment field in its schema gets dropped? Are there instances where there could be an X-Experiment-Enrollments header on non-experiment events that we want to keep? Would it be a no-op anyway?
If I'm reading this correctly, now we want in the stream config:
producers:
  eventgate:
    enrich_fields_from_http_headers:
      'x-experiment-enrollments': 'x-experiment-enrollments'
EventGate checks if this config exists; if it does, it adds an x-experiment-enrollments field to the event and does the processing listed on the ticket. Then, after the processing, it removes the x-experiment-enrollments field before the event gets validated by EventGate.
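An illustrative restatement of that flow (EventGate itself is Node; the function name and the `process` hook here are hypothetical, standing in for the enrollment logic on the ticket):

```python
HEADER = 'x-experiment-enrollments'

def enrich_and_strip(event, headers, process):
    """Copy the header into a temporary event field, run the enrollment
    processing, then strip the field again before schema validation.
    `process` is a stand-in for the processing described on the ticket."""
    if HEADER in headers:
        event = dict(event, **{HEADER: headers[HEADER]})
        event = process(event)       # enrollment processing per the ticket
        event.pop(HEADER, None)      # remove before schema validation
    return event
```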
Apr 15 2025
Throwing out a guess: since where the lineage code runs depends on the driver, it needs to somehow be aware of whether or not it's running on k8s and choose the correct Kafka bootstrap URL. I wonder if there's an easy way to figure this out? This will probably require some refactoring.
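One guess at an easy detection mechanism: inside a pod, Kubernetes injects `KUBERNETES_SERVICE_HOST` into the environment, so the lineage setup could branch on that. The URLs below are placeholders, not the real endpoints:

```python
import os

def kafka_bootstrap(env=os.environ):
    """Pick a bootstrap URL based on whether we appear to be running
    inside a Kubernetes pod (KUBERNETES_SERVICE_HOST is injected by
    the kubelet). Both URLs are placeholders."""
    if 'KUBERNETES_SERVICE_HOST' in env:
        return 'kafka.in-cluster.example:9092'
    return 'kafka.external.example:9092'
```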
Apr 8 2025
Ok so that wasn't it, now we get this error:
25/04/08 19:42:35 ERROR AsyncEventQueue: Listener DatahubSparkListener threw an exception
datahub.shaded.org.apache.kafka.common.KafkaException: Failed to construct kafka producer
	at datahub.shaded.org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:430)
	at datahub.shaded.org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:298)
	at datahub.client.kafka.KafkaEmitter.<init>(KafkaEmitter.java:55)
	at datahub.spark.DatahubEventEmitter.getEmitter(DatahubEventEmitter.java:85)
	at datahub.spark.DatahubEventEmitter.emitMcps(DatahubEventEmitter.java:401)
	at datahub.spark.DatahubEventEmitter.emitCoalesced(DatahubEventEmitter.java:190)
	at datahub.spark.DatahubSparkListener.onApplicationEnd(DatahubSparkListener.java:279)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:57)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
	at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
	at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
	at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
	at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1381)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
Caused by: datahub.shaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
	at datahub.shaded.org.apache.kafka.clients.ClientUtils.parseAndValidateAddresses(ClientUtils.java:84)
	at datahub.shaded.org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:408)
	... 20 more
Apr 4 2025
Seems to be working, webrequest_analyzer dag runs normally and I can see the columns being filled in the table:
Altered table:
ALTER TABLE wmf_data_ops.data_quality_alerts ADD COLUMNS (
  dataset_date BIGINT COMMENT 'AWS Deequ resultKey: key insertion time.',
  tags MAP<STRING,STRING> COMMENT 'AWS Deequ resultKey: key tags.'
);
Apr 3 2025
I think it actually is broken on main as well and it's just been silently failing. I opened a patch to add the port back in
Apr 2 2025
@brouberol would you happen to have some insight into this issue?
Apr 1 2025
Actually nevermind, the error we had last time was
Caused by: datahub.shaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
at datahub.shaded.org.apache.kafka.clients.ClientUtils.parseAndValidateAddresses(ClientUtils.java:84)
at datahub.shaded.org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:408)
... 20 more
Currently encountering this error with the search instance; I think it's the same issue we encountered when migrating the analytics instance to k8s:
Mar 27 2025
Or a real quick way is to not pass the new logger from eventstreams into KafkaSSE; that way it creates a bunyan logger and the logs won't appear as errors in logstash anymore.