tchin (Thomas)
Senior Software Engineer

User Details

User Since
Jun 21 2021, 2:34 PM (242 w, 12 h)
Availability
Available
LDAP User
TChin
MediaWiki User
TChin (WMF) [ Global Accounts ]

Recent Activity

Fri, Feb 6

tchin added a parent task for T416719: OpsWeek: Bump memory of refine job for product_metrics.web_base_with_ip to avoid recent OOMs: Unknown Object (Task).
Fri, Feb 6, 6:44 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Thu, Jan 29

tchin added a comment to T415549: Instance-level EventGate configuration to enable/disable functionality.

I chatted with @Ottomata about this a little bit, here's what I'm going to attempt:

Thu, Jan 29, 9:30 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Event-Platform

Mon, Jan 26

tchin updated the task description for T415549: Instance-level EventGate configuration to enable/disable functionality.
Mon, Jan 26, 9:45 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Event-Platform
tchin claimed T415549: Instance-level EventGate configuration to enable/disable functionality.
Mon, Jan 26, 3:00 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Event-Platform
tchin updated the task description for T415549: Instance-level EventGate configuration to enable/disable functionality.
Mon, Jan 26, 2:39 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Event-Platform
tchin added a subtask for T415549: Instance-level EventGate configuration to enable/disable functionality: Unknown Object (Task).
Mon, Jan 26, 2:37 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Event-Platform
tchin created T415549: Instance-level EventGate configuration to enable/disable functionality.
Mon, Jan 26, 2:37 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Event-Platform

Jan 6 2026

tchin added a member for Trusted-Contributors: BPiovesan-WMF.
Jan 6 2026, 4:14 PM

Dec 8 2025

tchin moved T411803: Fix reconcile bug where user_id is not being populated correctly. from Urgent to In progress on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 8 2025, 4:58 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
tchin moved T412035: Upgrade Airflow HdfsEmailOperator to take both a String or a List(String) email addresses. from Next Up to In progress on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 8 2025, 4:49 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
tchin moved T411876: Add new data-steward email to Human-Bot Alert email. from In progress to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 8 2025, 4:49 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
tchin moved T411378: Human vs Bot Alerting Email Upgrade from In progress to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 8 2025, 4:48 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Dec 5 2025

tchin added a project to T410266: Explore how to migrate PyFlink to Java/Scala: Spike.
Dec 5 2025, 3:09 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, Spike, Event-Platform

Dec 3 2025

tchin updated tchin.
Dec 3 2025, 10:33 PM
tchin added a comment to T360794: Implement stream of HTML content on mw.page_change event.

@fkaelin How urgent is the need for this stream? We're considering moving off of PyFlink and this would be a good opportunity to spike on a Java pipeline instead of a quick implementation now and then the complexities of dealing with any migration pains later

Dec 3 2025, 2:36 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, Research, Event-Platform

Dec 2 2025

tchin created T411457: NDA Access for tchin.
Dec 2 2025, 3:47 AM · Essential-Work, Release-Engineering-Team (Doing 😎), WMF-NDA-Requests

Nov 14 2025

tchin added a comment to T409469: Enable ChangeProp to consume mediawiki.page_content_change.v1.

Would we also need to explicitly create the topics in main? Is auto topic creation enabled there?

Nov 14 2025, 3:24 PM · Data-Engineering, serviceops, Machine-Learning-Team

Nov 12 2025

tchin added a comment to T409469: Enable ChangeProp to consume mediawiki.page_content_change.v1.

what work is required to produce mediawiki.page_content_change.v1 to Kafka main? I'm expecting just some helmfile changes, for example, in the mw-page-content-change-enrich/values-codfw.yaml, and not requiring changes in mediawiki event enrichment code, right?

Nov 12 2025, 5:23 PM · Data-Engineering, serviceops, Machine-Learning-Team

Nov 11 2025

tchin updated the task description for T404340: [EPIC] Upgrade flink jobs to java 17.
Nov 11 2025, 11:08 PM · Data-Engineering, Wikidata, Wikidata-Query-Service, Essential-Work, Discovery-Search, Epic

Nov 3 2025

tchin claimed T408918: Upgrade mediawiki-event-enrichment jobs to Flink 1.20.2 and Java 17.
Nov 3 2025, 3:56 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, Event-Platform, Essential-Work

Oct 31 2025

tchin created T408918: Upgrade mediawiki-event-enrichment jobs to Flink 1.20.2 and Java 17.
Oct 31 2025, 1:15 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, Event-Platform, Essential-Work

Oct 28 2025

tchin added a comment to T405952: EventgateProduceRateStop / EventGateProduceRateAnomaly alert should be active datacenter aware.

Talked with Andrew about this more. The main problem is that MediaWiki is active/passive, but eventgate is basically active/active. The external eventgate instances will expect traffic on both DCs, but the internal ones would see activity on the active DC (but may still get events on the passive DC).

Oct 28 2025, 5:23 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Observability-Alerting, Event-Platform

Oct 27 2025

tchin added a comment to T405952: EventgateProduceRateStop / EventGateProduceRateAnomaly alert should be active datacenter aware.

Wow, this was harder than I thought. What we need is to detect the active DC from mediawiki_wmf_master_datacenter, which indicates it through the datacenter label, and only alert on the active datacenter. All metrics have a site label which is (from what I can tell) the datacenter the metric is exported from.
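
A minimal sketch of that filtering idea in Python, not the actual alert rule. It assumes mediawiki_wmf_master_datacenter exposes one series per datacenter label with value 1 for the active one (an assumption, not confirmed here), and that produce-rate samples carry the site label. The function names and the sample representation are hypothetical.

```python
from typing import Dict, List, Tuple

# A sample is (labels, value), mimicking one instant-vector element.
Sample = Tuple[Dict[str, str], float]


def active_datacenter(master_dc_samples: List[Sample]) -> str:
    """Return the datacenter label of the series whose value is 1.

    Assumption: value 1 marks the active datacenter, and exactly one
    series has it at any time.
    """
    active = [labels["datacenter"] for labels, value in master_dc_samples if value == 1]
    if len(active) != 1:
        raise ValueError(f"expected exactly one active datacenter, got {active}")
    return active[0]


def alertable(rate_samples: List[Sample], active_dc: str) -> List[Sample]:
    """Keep only samples whose site label matches the active datacenter."""
    return [(labels, value) for labels, value in rate_samples if labels.get("site") == active_dc]
```

Only the samples returned by alertable would then be compared against the alert threshold, so a quiet passive DC would not fire.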

Oct 27 2025, 5:52 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Observability-Alerting, Event-Platform

Oct 9 2025

tchin updated the task description for T406872: Fix mediawiki event enrichment to work with newest version of Blubber.
Oct 9 2025, 2:15 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, Event-Platform
tchin created T406872: Fix mediawiki event enrichment to work with newest version of Blubber.
Oct 9 2025, 2:03 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, Event-Platform

Oct 8 2025

tchin created T406747: Merge eventgate and eventgate-wikimedia repos.
Oct 8 2025, 3:16 PM · Data-Engineering, Event-Platform
tchin added a comment to T397330: mediawiki.content_history: flink applications experiencing frequent restarts due to JobManager OOMs.

Looks like it's fixed! In the dashboard the Job Manager stopped OOM-ing

Oct 8 2025, 2:36 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Event-Platform

Sep 29 2025

tchin added a subtask for T403169: Migrate and re-deploy eventgate-wikimedia using new service-utils: T343342: eventgate logs field explosion.
Sep 29 2025, 4:56 PM · Data-Engineering, MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Patch-For-Review, Event-Platform, service-utils
tchin added a parent task for T343342: eventgate logs field explosion: T403169: Migrate and re-deploy eventgate-wikimedia using new service-utils.
Sep 29 2025, 4:56 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Patch-For-Review, Event-Platform, Observability-Logging

Sep 26 2025

tchin added a comment to T403169: Migrate and re-deploy eventgate-wikimedia using new service-utils.

Hmmm ok everything is deployed now and it works fine, but I can't tell if the p99 performance got worse, or the express metrics are broken somehow (either broken beforehand and now fixed or vice-versa). What makes me suspicious is that when you look at the latency quantiles by HTTP method from before the deployment, every deployment and every instance had a GET and POST p99 of almost exactly 9.90ms. After the deployment, it's actually correlated with the amount of events it's received. I'm assuming this means that something was actually fixed somewhere, but because of this, alerts are being fired on the passive DC because of the bursty nature of events there and the latency increase that's correlated with it.

Sep 26 2025, 7:15 PM · Data-Engineering, MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Patch-For-Review, Event-Platform, service-utils
tchin added a comment to T376026: Update event-producing tools to overwrite `meta.dt`.

In the logs I spotted another offender

{"@timestamp":"2025-09-26T16:47:14.571Z","ecs.version":"8.10.0","log.level":"info","message":"Overriding meta.dt in event b63f71b4-d6ff-4a1f-8544-e01d11df60c3 of schema at /sparql/query/1.3.0 destined to stream wdqs-external.sparql-query from 2025-09-26T16:47:14.499Z to 2025-09-26T16:47:14.571Z.","service":{"name":"eventgate-analytics"}}
Sep 26 2025, 6:27 PM · Data-Engineering, Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Event-Platform
tchin added a comment to T403169: Migrate and re-deploy eventgate-wikimedia using new service-utils.

Deployed to eventgate-analytics-external and it looks stable. Proceeding to deploy to the remaining instances.

Sep 26 2025, 4:11 PM · Data-Engineering, MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Patch-For-Review, Event-Platform, service-utils
tchin added a comment to T403169: Migrate and re-deploy eventgate-wikimedia using new service-utils.

Fully deployed to eventgate-logging-external. Logs seem fine. No log spam, the duplicate dropped fields are fixed, and it's fully ingested into logstash in ECS format now. Metrics also look good. It's a bit fuzzy because this happened during the DC switchover, but looking at the dashboard it seems like all metrics still match except the only one lost is the one I stated before that's under Memory usage (sum over all pods). That doesn't concern me that much since the service is also now being picked up by the new service-utils metrics dashboard so we still have memory reporting.

Sep 26 2025, 3:07 PM · Data-Engineering, MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Patch-For-Review, Event-Platform, service-utils

Sep 23 2025

tchin added a comment to T403169: Migrate and re-deploy eventgate-wikimedia using new service-utils.

Also these three metrics stopped being reported, and I don't really know why, since from what I can tell they're Kubernetes metrics:
nodejs_process_heap_used_bytes
nodejs_process_heap_total_bytes
nodejs_process_heap_rss_bytes

Sep 23 2025, 4:29 PM · Data-Engineering, MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Patch-For-Review, Event-Platform, service-utils
tchin added a comment to T403169: Migrate and re-deploy eventgate-wikimedia using new service-utils.

Deployed to eventgate-logging-external for codfw, it works in the sense that it didn't blow up, but will have to fix some stuff before I deploy the rest of it. Logs export fine in ECS, but for some reason a lot of fields are being dropped. Metrics show up in the dashboards but some need renaming, and I also forgot to add metrics for the express routes.

Sep 23 2025, 4:05 PM · Data-Engineering, MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Patch-For-Review, Event-Platform, service-utils

Sep 16 2025

tchin added a comment to T397330: mediawiki.content_history: flink applications experiencing frequent restarts due to JobManager OOMs.

I'm going to try to upgrade mw-content-history-reconcile-enrich-next to Flink 1.20 to see if it magically fixes the issue, but I won't do any work migrating from deprecated config and stuff in this ticket. If the issue doesn't get fixed, at least the update includes a feature that allows us to profile the JobManager using the Flink Web UI, which could be useful.

Sep 16 2025, 8:35 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Event-Platform

Aug 29 2025

tchin added a comment to T401010: Optimize metrics computation for the MW Content Pipeline.

oo very nice!! I wonder how it'd compare to a pure java version of the deequ code. Maybe if we switch to SQL we can take the opportunity to revamp the metrics table schema?

Aug 29 2025, 3:07 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review

Aug 28 2025

tchin updated the task description for T403171: Add user-agent to http calls from eventgate-wikimedia.
Aug 28 2025, 1:02 PM · MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform
tchin created T403171: Add user-agent to http calls from eventgate-wikimedia.
Aug 28 2025, 12:58 PM · MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform
tchin created T403169: Migrate and re-deploy eventgate-wikimedia using new service-utils.
Aug 28 2025, 12:52 PM · Data-Engineering, MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Patch-For-Review, Event-Platform, service-utils

Aug 27 2025

tchin added a project to T361768: Migrate and re-deploy eventgate using new service-utils: Event-Platform.
Aug 27 2025, 8:09 PM · Event-Platform, Data-Engineering (Q1 FY25/26 July 1st - September 30th), service-utils

Aug 26 2025

tchin claimed T402801: CI in schemas-event-secondary fails because the tests do not follow WMF's Robot policy.
Aug 26 2025, 1:09 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Patch-For-Review, Event-Platform

Aug 18 2025

tchin added a comment to T396564: EventStreams: duplicate events from double compute (wdqs/rdf) streams.

@dcausse fyi I just deployed eventstreams with your patch

Aug 18 2025, 4:47 PM · Discovery-Search (2025.07.25 - 2025.08.15), Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform, EventStreams

Aug 5 2025

tchin added a comment to T390140: Eventstreams 'assignments' logstash field type.

assignments is now stringified in KafkaSSE, but in the logs I see that assignments is in normalized.dropped.no_such_field. Is there something I'm missing? @colewhite

When this task was filed, eventstreams used the legacy logstash format. Now that eventstreams is writing ECS, the field is reaped because the field does not exist in the schema.

Within ECS, we can either rely on event.original, amend the schema, or use another field.

Aug 5 2025, 3:19 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform, SRE Observability, EventStreams

Jul 30 2025

tchin changed the status of T366487: Event Platform schemas should not support type changes to structs as array element or map value types, a subtask of T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation, from Open to In Progress.
Jul 30 2025, 3:55 AM · Data-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review
tchin changed the status of T366487: Event Platform schemas should not support type changes to structs as array element or map value types, a subtask of T259924: HiveExtensions.convertToSchema does not properly convert arrays of structs, from Open to In Progress.
Jul 30 2025, 3:55 AM · Data-Engineering-Icebox, Data-Engineering, Patch-Needs-Improvement
tchin changed the status of T366487: Event Platform schemas should not support type changes to structs as array element or map value types from Open to In Progress.
Jul 30 2025, 3:55 AM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform

Jul 28 2025

tchin added a comment to T390140: Eventstreams 'assignments' logstash field type.

assignments is now stringified in KafkaSSE, but in the logs I see that assignments is in normalized.dropped.no_such_field. Is there something I'm missing? @colewhite

Jul 28 2025, 3:30 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform, SRE Observability, EventStreams

Jul 25 2025

tchin closed T388439: Add metrics for monthly reconciles, a subtask of T366752: Dumps 2.0 Phase III: Production level dumps (SDS 1.2), as Resolved.
Jul 25 2025, 6:34 PM · Data-Engineering-Roadmap, DPE-Mediawiki-Content, Epic
tchin closed T388439: Add metrics for monthly reconciles as Resolved.
Jul 25 2025, 6:34 PM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin added a comment to T388439: Add metrics for monthly reconciles.

Yeah it can be closed out

Jul 25 2025, 6:33 PM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content

Jul 21 2025

tchin added a comment to T398922: EventGate: Add Prometheus metric for hoisting errors.

If you want to drive it, be my guest and I can help you out if needed. Or we can pair program together. Whichever you prefer

Jul 21 2025, 2:22 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform, Test Kitchen

Jul 11 2025

tchin added a comment to T361768: Migrate and re-deploy eventgate using new service-utils.

This includes upgrade to Nodejs 20.

Hi! Has this happened? Looking at the images currently deployed per deployments-charts repo


$ podman run --rm -it --entrypoint /bin/sh docker-registry.wikimedia.org/repos/data-engineering/eventgate-wikimedia:v1.11.0 -c "nodejs -v"
v20.5.1

and

$ podman run --rm -it --entrypoint /bin/sh docker-registry.wikimedia.org/repos/data-engineering/eventgate-wikimedia:v1.14.0 -c "nodejs -v"
v20.5.1

says yes, but I guess it doesn't hurt to double check that we are all on the same page.

Jul 11 2025, 3:30 PM · Event-Platform, Data-Engineering (Q1 FY25/26 July 1st - September 30th), service-utils

Jul 7 2025

tchin added a comment to T398325: Figure out how Eventstreams connected client metrics went negative.

I should also mention that metrics eventually recovered, probably due to T383977 unintentionally causing rolling restarts of all the pods, resetting metrics. Taking a look at that ticket again since I'm already in the code.

Jul 7 2025, 2:41 PM · Data-Engineering (Q4 2025 April 1st - June 30th), EventStreams
tchin added a comment to T398325: Figure out how Eventstreams connected client metrics went negative.

Was digging through the logs and found a bunch of these InvalidAssignmentError requests, which are caused by setting a malformed last-event-id header. This header, when set, takes precedence over the url stream parameter (which can lead to another bug if the stream names are different from the topics in the header), and passes all of eventstreams' checks until it gets handed to KafkaSSE, where it blows up. This somehow fires both the close and finish events.
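
One way to catch this earlier would be to validate the header before it is handed off. The sketch below is in Python for illustration only (the service itself is Node.js) and is not the eventstreams code. It assumes the header is a JSON array of assignment objects with topic, partition, and either offset or timestamp; parse_last_event_id and allowed_topics are hypothetical names.

```python
import json
from typing import Any, Dict, List, Set


def parse_last_event_id(header: str, allowed_topics: Set[str]) -> List[Dict[str, Any]]:
    """Parse a last-event-id header and reject malformed or mismatched assignments.

    allowed_topics is the set of topics backing the streams named in the URL,
    so a header pointing at other topics is rejected instead of taking precedence.
    """
    try:
        assignments = json.loads(header)
    except json.JSONDecodeError as err:
        raise ValueError(f"last-event-id is not valid JSON: {err}") from err
    if not isinstance(assignments, list) or not assignments:
        raise ValueError("last-event-id must be a non-empty JSON array")
    for assignment in assignments:
        if not isinstance(assignment, dict):
            raise ValueError("each assignment must be a JSON object")
        topic = assignment.get("topic")
        if not isinstance(topic, str) or topic not in allowed_topics:
            raise ValueError(f"topic {topic!r} does not match the requested streams")
        # type() check rather than isinstance() so that booleans are rejected.
        if type(assignment.get("partition")) is not int:
            raise ValueError("partition must be an integer")
        if "offset" not in assignment and "timestamp" not in assignment:
            raise ValueError("assignment needs an offset or a timestamp")
    return assignments
```

Rejecting the request at this point would return a client error before any connection metrics are touched.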

Jul 7 2025, 2:38 PM · Data-Engineering (Q4 2025 April 1st - June 30th), EventStreams

Jul 2 2025

tchin claimed T398325: Figure out how Eventstreams connected client metrics went negative.
Jul 2 2025, 2:38 PM · Data-Engineering (Q4 2025 April 1st - June 30th), EventStreams

Jul 1 2025

tchin created T398325: Figure out how Eventstreams connected client metrics went negative.
Jul 1 2025, 2:15 PM · Data-Engineering (Q4 2025 April 1st - June 30th), EventStreams

Jun 30 2025

tchin added a comment to T388439: Add metrics for monthly reconciles.

Adjusted airflow variables to use the new conda artifact. Should be good to go now. Now the only question is how long the metrics computation will take...

Jun 30 2025, 5:04 PM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin moved T390140: Eventstreams 'assignments' logstash field type from Next Up to In progress on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
Jun 30 2025, 4:44 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform, SRE Observability, EventStreams

Jun 26 2025

tchin added a comment to T388439: Add metrics for monthly reconciles.

Just noticed that in the metrics computation script, it deletes any duplicated metrics WHERE partition_ts = CAST('{args.min_timestamp}' AS TIMESTAMP) in case of reruns. However, for all-of-wiki-time, min_timestamp is always 2000-01-01T00:00:00. We need the partition_ts column to be the max_timestamp for this case.
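
A rough sketch of the change, assuming the script can be told explicitly that a run is all-of-wiki-time. The all_of_wiki_time flag and both function names are hypothetical; the DELETE only mirrors the WHERE clause quoted above.

```python
from datetime import datetime


def partition_ts_for(min_timestamp: datetime, max_timestamp: datetime, all_of_wiki_time: bool) -> datetime:
    """Pick the partition_ts used both for writing metrics and for rerun cleanup.

    All-of-wiki-time runs always share the same min_timestamp, so they are
    keyed by max_timestamp instead to keep each run distinct.
    """
    return max_timestamp if all_of_wiki_time else min_timestamp


def delete_statement(table: str, partition_ts: datetime) -> str:
    """Build the cleanup statement that removes metrics from a previous run."""
    return (
        f"DELETE FROM {table} "
        f"WHERE partition_ts = CAST('{partition_ts.isoformat()}' AS TIMESTAMP)"
    )
```

With this, a rerun of an all-of-wiki-time computation only deletes rows from the same max_timestamp, not every earlier all-of-wiki-time run.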

Jun 26 2025, 1:28 PM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin added a comment to T388439: Add metrics for monthly reconciles.

The metrics table has no unique index I can match on to update rows so I had to match on almost every column but it worked I guess

Jun 26 2025, 1:50 AM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content

Jun 24 2025

tchin added a comment to T388439: Add metrics for monthly reconciles.

Since implementing the metrics segregation, we should now update the legacy metrics with the computation class before implementing the monthly metrics

spark-sql (default)> SELECT COUNT(*) AS count
                   > FROM wmf_data_ops.data_quality_metrics
                   > WHERE tags['project'] = 'mediawiki_content_history'
                   >   AND (tags['computation_class'] IS NULL OR tags['computation_class'] = '')
                   > ;
count
577920
Jun 24 2025, 2:12 PM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content

May 16 2025

tchin added a comment to T386862: Enable Spark data lineage for all Airflow instances.

I think the solution is to make the code aware of both endpoints, and then pick the correct one inside the SparkSubmitOperator based off of the launcher param before it sets the rest of the config. Right now the endpoint is set by a jinja template, but by the time airflow templates the string it's probably too late?
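
Roughly what I have in mind, as a standalone sketch instead of the real SparkSubmitOperator code. The launcher values, the endpoint hostnames and the lineage.kafka.bootstrap key are all placeholders, not the actual config names.

```python
from typing import Dict


def resolve_lineage_conf(launcher: str, endpoints: Dict[str, str], conf: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of conf with the lineage Kafka endpoint chosen by launcher.

    Resolving this in plain Python, before the rest of the config is set,
    avoids depending on when the jinja template gets rendered.
    """
    if launcher not in endpoints:
        raise ValueError(f"no lineage endpoint configured for launcher {launcher!r}")
    resolved = dict(conf)
    # Placeholder key: the real lineage setting name is not shown in this ticket.
    resolved["lineage.kafka.bootstrap"] = endpoints[launcher]
    return resolved
```

For example, resolve_lineage_conf("kubernetes", {"yarn": "kafka-a.example:9092", "kubernetes": "kafka-b.example:9092"}, {}) would select the second endpoint.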

May 16 2025, 5:44 PM · Data-Engineering, Patch-For-Review

May 11 2025

tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
May 11 2025, 11:04 PM · DPE-Data-Quality, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin added a comment to T384962: Implement alerting for wmf_content.mediawiki_content_history_v1.

Deployed and updated airflow variables to use artifact v0.6.0

May 11 2025, 11:04 PM · Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review, DPE-Mediawiki-Content

May 5 2025

tchin moved T388439: Add metrics for monthly reconciles from In progress to In Review on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
May 5 2025, 3:28 PM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin moved T389162: [Data Quality] Add ability to add tags to alerts from In progress to In Review on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
May 5 2025, 3:28 PM · DPE-Data-Quality, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin moved T384962: Implement alerting for wmf_content.mediawiki_content_history_v1 from In progress to In Review on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
May 5 2025, 3:27 PM · Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review, DPE-Mediawiki-Content
tchin moved T391959: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: EventGate & Varnish: update to receive events from beacon event v2 from In progress to In Review on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
May 5 2025, 3:27 PM · Test Kitchen (Experiment Platform Sprint 6), Data-Engineering (Q4 2025 April 1st - June 30th)
tchin moved T389903: Analytics Cluster Dataset Usage Discovery Task from Next Up to In progress on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
May 5 2025, 3:14 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th)
tchin moved T391708: Duplicate revisions and excess reverts in 2025-03 MediaWiki History snapshot from In Review to Done on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
May 5 2025, 3:13 PM · Analytics-Data-Problem, Data-Engineering (Q4 2025 April 1st - June 30th)
tchin moved T392244: Facilitate automatic artifact cache warming for airflow-dags artifacts from Next Up to In progress on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
May 5 2025, 3:13 PM · Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th)
tchin moved T383931: Unify Airflow's datasets.yaml config files across instances from Next Up to In progress on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
May 5 2025, 3:13 PM · Data-Engineering (Q4 2025 April 1st - June 30th)

May 2 2025

tchin added a comment to T393130: handle large log response.

On the service-utils side, that property should've been filled in by express:

May 2 2025, 7:37 PM · Essential-Work, MW-1.45-notes (1.45.0-wmf.1; 2025-05-13), Patch-For-Review, Abstract Wikipedia team (25Q4 (Apr–Jun)), function-evaluator

Apr 25 2025

tchin added a comment to T391959: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: EventGate & Varnish: update to receive events from beacon event v2.

subject_id should be base64-decodable and its length prior to decoding should be at least 22 characters
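
A small sketch of that check in Python, just to pin down the rule (EventGate itself is Node.js). It assumes the standard base64 alphabet and that padding may be omitted; both are assumptions. For reference, 22 characters is the unpadded length of 16 encoded bytes. is_valid_subject_id is a hypothetical name.

```python
import base64
import binascii


def is_valid_subject_id(subject_id: str) -> bool:
    """Check the length before decoding, then check that the value decodes."""
    if len(subject_id) < 22:
        return False
    # Restore any omitted padding so that unpadded values can still be decoded.
    padded = subject_id + "=" * (-len(subject_id) % 4)
    try:
        base64.b64decode(padded, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True
```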

Apr 25 2025, 7:22 PM · Test Kitchen (Experiment Platform Sprint 6), Data-Engineering (Q4 2025 April 1st - June 30th)
tchin added a comment to T391959: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: EventGate & Varnish: update to receive events from beacon event v2.

Also reading into this ticket more, any event that is sent that has an X-Experiment-Enrollments header but doesn't have an experiment field in its schema gets dropped? Are there instances where there could be an X-Experiment-Enrollments header on non-experiment events that we want to keep? It would be a no-op anyways?

Apr 25 2025, 1:58 PM · Test Kitchen (Experiment Platform Sprint 6), Data-Engineering (Q4 2025 April 1st - June 30th)
tchin added a comment to T391959: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: EventGate & Varnish: update to receive events from beacon event v2.

If I'm reading this correctly, now we want in the stream config:

producers:
  eventgate:
    enrich_fields_from_http_headers:
      'x-experiment-enrollments': 'x-experiment-enrollments'

EventGate checks if this exists and, if it does, adds an x-experiment-enrollments field to the event and does the processing listed on the ticket. After the processing, it removes the x-experiment-enrollments field before the event gets validated by EventGate.
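
To check my understanding of the flow, here is a sketch in Python (not the EventGate code, which is Node.js). The processing listed on the ticket is not reproduced; it is passed in as handle_enrollments, and process_event is likewise a hypothetical name. It assumes the mapping keys are event fields and the values are lowercased header names.

```python
from typing import Any, Callable, Dict

ENROLLMENTS_FIELD = "x-experiment-enrollments"


def process_event(
    event: Dict[str, Any],
    headers: Dict[str, str],
    stream_config: Dict[str, Any],
    handle_enrollments: Callable[[Dict[str, Any], str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Copy configured headers into the event, process enrollments, drop the temp field."""
    mapping = (
        stream_config.get("producers", {})
        .get("eventgate", {})
        .get("enrich_fields_from_http_headers", {})
    )
    enriched = dict(event)
    for field, header in mapping.items():
        if header in headers:
            enriched[field] = headers[header]
    if ENROLLMENTS_FIELD in enriched:
        # Remove the temporary field so it never reaches schema validation.
        raw = enriched.pop(ENROLLMENTS_FIELD)
        enriched = handle_enrollments(enriched, raw)
    return enriched
```

Events without the header pass through unchanged, which is the no-op case.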

Apr 25 2025, 1:56 PM · Test Kitchen (Experiment Platform Sprint 6), Data-Engineering (Q4 2025 April 1st - June 30th)

Apr 18 2025

tchin moved T391959: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: EventGate & Varnish: update to receive events from beacon event v2 from Next Up to In progress on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
Apr 18 2025, 7:08 PM · Test Kitchen (Experiment Platform Sprint 6), Data-Engineering (Q4 2025 April 1st - June 30th)
tchin added a project to T391959: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: EventGate & Varnish: update to receive events from beacon event v2: Data-Engineering (Q4 2025 April 1st - June 30th).
Apr 18 2025, 7:08 PM · Test Kitchen (Experiment Platform Sprint 6), Data-Engineering (Q4 2025 April 1st - June 30th)

Apr 15 2025

tchin added a comment to T386862: Enable Spark data lineage for all Airflow instances.

Throwing out a guess, it seems like because where the lineage runs depends on the driver, it needs to somehow be aware of whether or not it's running on k8s and choose the correct kafka bootstrap url. I wonder if there's an easy way to figure this out? This will probably require some refactoring.

Apr 15 2025, 4:43 PM · Data-Engineering, Patch-For-Review

Apr 14 2025

tchin moved T390247: Migrate Gobblin job repository to GitLab from Done to In progress on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
Apr 14 2025, 3:26 PM · Data-Engineering (Q4 2025 April 1st - June 30th)
tchin moved T390247: Migrate Gobblin job repository to GitLab from Next Up to Done on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
Apr 14 2025, 3:26 PM · Data-Engineering (Q4 2025 April 1st - June 30th)
tchin moved T388439: Add metrics for monthly reconciles from Next Up to In progress on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
Apr 14 2025, 3:14 PM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin claimed T388439: Add metrics for monthly reconciles.
Apr 14 2025, 3:12 PM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin moved T388721: Support for FY2024-25 4.3.11 - webrequest based scraping detection from Next Up to In progress on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
Apr 14 2025, 3:11 PM · Data-Engineering, Essential-Work
tchin moved T370470: [CIM] Skewed ranking with the top Editors monthly API from In Review to Ready to Deploy on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
Apr 14 2025, 3:09 PM · Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review, Commons-Impact-Metrics

Apr 8 2025

tchin added a comment to T387033: Figure root cause of silent failures when computing metrics for mediawiki_content_history_v1.

Reopening as we had another instance of skein OOM:

xcollazo@an-launcher1002:~$ sudo -u analytics yarn logs -appOwner analytics -applicationId application_1741864027385_383042 | grep "Application driver failed" -B 2 -A 2
...
25/03/31 13:44:16 INFO skein.ApplicationMaster: Registering application with resource manager
25/03/31 13:44:16 INFO skein.ApplicationMaster: Starting application driver
25/03/31 18:00:10 INFO skein.ApplicationMaster: Shutting down: Application driver failed with exit code 143. This is often due to the application master memory limit being exceeded. See the diagnostics for more information.
25/03/31 18:00:10 INFO skein.ApplicationMaster: Unregistering application with status FAILED
25/03/31 18:00:10 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.

Verified that driver has 16GB as per Airflow task details.

Bumping to 24GB manually to rerun, although 24GB for a driver for metrics seems silly. I wonder what deequ is doing in there...

Apr 8 2025, 8:31 PM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin added a comment to T386862: Enable Spark data lineage for all Airflow instances.

Ok so that wasn't it, now we get this error:

25/04/08 19:42:35 ERROR AsyncEventQueue: Listener DatahubSparkListener threw an exception
datahub.shaded.org.apache.kafka.common.KafkaException: Failed to construct kafka producer
	at datahub.shaded.org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:430)
	at datahub.shaded.org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:298)
	at datahub.client.kafka.KafkaEmitter.<init>(KafkaEmitter.java:55)
	at datahub.spark.DatahubEventEmitter.getEmitter(DatahubEventEmitter.java:85)
	at datahub.spark.DatahubEventEmitter.emitMcps(DatahubEventEmitter.java:401)
	at datahub.spark.DatahubEventEmitter.emitCoalesced(DatahubEventEmitter.java:190)
	at datahub.spark.DatahubSparkListener.onApplicationEnd(DatahubSparkListener.java:279)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:57)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
	at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
	at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
	at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
	at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1381)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
Caused by: datahub.shaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
	at datahub.shaded.org.apache.kafka.clients.ClientUtils.parseAndValidateAddresses(ClientUtils.java:84)
	at datahub.shaded.org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:408)
	... 20 more
Apr 8 2025, 8:18 PM · Data-Engineering, Patch-For-Review

Apr 7 2025

tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
Apr 7 2025, 1:46 PM · DPE-Data-Quality, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content

Apr 4 2025

tchin added a comment to T389162: [Data Quality] Add ability to add tags to alerts.

Seems to be working, webrequest_analyzer dag runs normally and I can see the columns being filled in the table:

Apr 4 2025, 3:51 PM · DPE-Data-Quality, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
Apr 4 2025, 3:43 PM · DPE-Data-Quality, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
Apr 4 2025, 3:35 PM · DPE-Data-Quality, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
Apr 4 2025, 3:10 PM · DPE-Data-Quality, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin added a comment to T384962: Implement alerting for wmf_content.mediawiki_content_history_v1.

Altered table:

ALTER TABLE wmf_data_ops.data_quality_alerts ADD COLUMNS (
    dataset_date BIGINT COMMENT 'AWS Deequ resultKey: key insertion time.',
    tags MAP<STRING,STRING> COMMENT 'AWS Deequ resultKey: key tags.'
);
Apr 4 2025, 2:43 PM · Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review, DPE-Mediawiki-Content

Apr 3 2025

tchin added a comment to T386862: Enable Spark data lineage for all Airflow instances.

I think it's actually broken on main as well and has just been failing silently. I opened a patch to add the port back in.
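
For anyone checking their own config, here's a minimal sketch of a pre-flight check, assuming the setting is the usual comma-separated list of host:port entries that Kafka clients take in bootstrap.servers. The function name and the example hostnames are hypothetical, not from the patch, and it only catches entries with a missing or invalid port (it does not try to resolve hostnames or handle bare IPv6 literals):

```python
def find_entries_missing_port(bootstrap_servers: str) -> list[str]:
    """Return the entries of a comma-separated bootstrap.servers value
    that do not end in a valid ':<port>' suffix.

    Hypothetical helper for illustration; not part of Kafka or DataHub.
    """
    bad = []
    for entry in bootstrap_servers.split(","):
        entry = entry.strip()
        if not entry:
            # Skip empty items such as a trailing comma.
            continue
        # Split on the last colon so bracketed IPv6 hosts like [::1]:9092 work.
        host, sep, port = entry.rpartition(":")
        has_valid_port = (
            bool(sep)
            and bool(host)
            and port.isdigit()
            and 1 <= int(port) <= 65535
        )
        if not has_valid_port:
            bad.append(entry)
    return bad


if __name__ == "__main__":
    # Hypothetical broker names; the second one is missing its port.
    value = "kafka-a.example.org:9092, kafka-b.example.org"
    print(find_entries_missing_port(value))  # ['kafka-b.example.org']
```

Running something like this against the rendered Spark conf before submitting the job would flag a missing port up front instead of leaving it to the listener at application end.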

Apr 3 2025, 12:52 PM · Data-Engineering, Patch-For-Review

Apr 2 2025

tchin updated subscribers of T386862: Enable Spark data lineage for all Airflow instances.

@brouberol would you happen to have some insight into this issue?

Apr 2 2025, 4:18 AM · Data-Engineering, Patch-For-Review

Apr 1 2025

tchin added a comment to T386862: Enable Spark data lineage for all Airflow instances.

Actually, never mind, the error we had last time was:

Caused by: datahub.shaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
        at datahub.shaded.org.apache.kafka.clients.ClientUtils.parseAndValidateAddresses(ClientUtils.java:84)
        at datahub.shaded.org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:408)
        ... 20 more
Apr 1 2025, 8:54 PM · Data-Engineering, Patch-For-Review
tchin added a comment to T386862: Enable Spark data lineage for all Airflow instances.

Currently encountering this error with the search instance. I think it's the same issue we encountered when migrating the analytics instance to k8s:

Apr 1 2025, 4:46 PM · Data-Engineering, Patch-For-Review

Mar 27 2025

tchin added a comment to T390140: Eventstreams 'assignments' logstash field type.

Or a really quick way is to not pass the new logger from eventstreams into KafkaSSE. That way it creates its own bunyan logger, and the logs won't show up as errors in logstash anymore.

Mar 27 2025, 1:51 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform, SRE Observability, EventStreams
tchin claimed T390140: Eventstreams 'assignments' logstash field type.
Mar 27 2025, 1:47 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform, SRE Observability, EventStreams