tchin (Thomas)
Senior Software Engineer

User Details

User Since
Jun 21 2021, 2:34 PM (260 w, 3 h)
Availability
Available
LDAP User
TChin
MediaWiki User
TChin (WMF) [ Global Accounts ]

Recent Activity

Fri, Jun 12

tchin added a comment to T411771: Migrate PageViewInfo calls away from rest-gateway.

(Added the page-analytics port to Wikitech)

Fri, Jun 12, 3:11 PM · MW-1.47-notes (1.47.0-wmf.7; 2026-06-16), Patch-For-Review, Data-Engineering (Q4 FS25/26 April 1st - June 30st), ServiceOps-SharedInfra, ServiceOps new, PageViewInfo

Mon, Jun 8

tchin added a comment to T425029: mediawiki.page_change.v1 - add revision.editor.first_edit_dt field.

Since this requires bumping the page change schema, patches are blocked from merging until T421237 is resolved

Mon, Jun 8, 5:27 PM · Patch-For-Review, DPE-MediaWiki-Incremental-History, Event-Platform, Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Wed, Jun 3

tchin added a comment to T424706: DE3.1 - Logged-out Wikipedia 21-day retention on mobile web.

It's okay, I changed a few small things in the sql so I can just do it manually on my end

Wed, Jun 3, 11:42 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Metrics-Sprint-2026-2027

Tue, Jun 2

tchin added a comment to T424706: DE3.1 - Logged-out Wikipedia 21-day retention on mobile web.

The mediawiki_database field is missing for page_visit events, which means we cannot calculate retention rates at the wiki level.
Suggestion: Snapshot only the global retention baseline and skip the per-wiki retention baseline for Rounds 8–10.

Tue, Jun 2, 2:08 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Metrics-Sprint-2026-2027
tchin updated subscribers of T424706: DE3.1 - Logged-out Wikipedia 21-day retention on mobile web.

@amastilovic is there an easy way to selectively run modified dbt jobs in production for backfilling like what we might need above?

Tue, Jun 2, 1:42 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Metrics-Sprint-2026-2027

Mon, Jun 1

tchin added a comment to T411771: Migrate PageViewInfo calls away from rest-gateway.

I can take a look at this; what are the ports for the pageviews and unique-devices services?

Mon, Jun 1, 3:18 PM · MW-1.47-notes (1.47.0-wmf.7; 2026-06-16), Patch-For-Review, Data-Engineering (Q4 FS25/26 April 1st - June 30st), ServiceOps-SharedInfra, ServiceOps new, PageViewInfo
tchin claimed T411771: Migrate PageViewInfo calls away from rest-gateway.
Mon, Jun 1, 3:16 PM · MW-1.47-notes (1.47.0-wmf.7; 2026-06-16), Patch-For-Review, Data-Engineering (Q4 FS25/26 April 1st - June 30st), ServiceOps-SharedInfra, ServiceOps new, PageViewInfo
tchin claimed T424706: DE3.1 - Logged-out Wikipedia 21-day retention on mobile web.
Mon, Jun 1, 3:16 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Metrics-Sprint-2026-2027

Mon, May 18

tchin added a comment to T426369: analytics-refinery-source repo fails post-merge builds, releases succeed.

Just re-ran analytics-refinery-maven-release and it succeeded, so I guess my specific problem was transient

Mon, May 18, 1:22 PM · DPE-MediaWiki-Incremental-History

May 15 2026

tchin added a comment to T426369: analytics-refinery-source repo fails post-merge builds, releases succeed.

Just tried running analytics-refinery-maven-release, it failed with:

14:12:36 [ERROR] Failed to execute goal org.apache.maven.plugins:maven-release-plugin:3.0.1:prepare (default-cli) on project refinery: Unable to commit files
14:12:36 [ERROR] Provider message:
14:12:36 [ERROR] The git-push command failed.
14:12:36 [ERROR] Command output:
14:12:36 [ERROR] To https://gerrit.wikimedia.org/r/analytics/refinery/source
14:12:36 [ERROR]  ! [rejected]        master -> master (fetch first)
14:12:36 [ERROR] error: failed to push some refs to 'https://gerrit.wikimedia.org/r/analytics/refinery/source'
14:12:36 [ERROR] hint: Updates were rejected because the remote contains work that you do
14:12:36 [ERROR] hint: not have locally. This is usually caused by another repository pushing
14:12:36 [ERROR] hint: to the same ref. You may want to first integrate the remote changes
14:12:36 [ERROR] hint: (e.g., 'git pull ...') before pushing again.
14:12:36 [ERROR] hint: See the 'Note about fast-forwards' in 'git push --help' for details.
14:12:36 [ERROR] -> [Help 1]
May 15 2026, 6:45 PM · DPE-MediaWiki-Incremental-History

May 11 2026

tchin added a comment to T419569: Attribution Research: Instrument Donation Attempts.

Took a look at this, here's what I found:

  • Main menu sidebar: Has a class n-sitesupport on the <li>. This is the default on wikis with the WikimediaMessages extension.
  • Top links when logged out: This is specific to the Vector 2022 skin. Has the class pt-sitesupport-2 on the <li>, but when overflowed and put into a hamburger menu it has class pt-sitesupport. If it exists, it removes the n-sitesupport button.
  • Mobile Web "Hamburger" Menu: This is from the Minerva Neue skin. The donate button is generated by the skin and bypasses the normal sidebar behavior, so it doesn't have any sitesupport id. But it does have a data-event-name="menu.donate" already implemented to track clicks.
  • Contact Us Portal: I have no idea where this comes from. Is it just a regular wiki page that only admins can edit?
May 11 2026, 3:22 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), MW-1.46-notes (1.46.0-wmf.20; 2026-03-17), Epic
tchin updated Other Assignee for T419569: Attribution Research: Instrument Donation Attempts, added: tchin.
May 11 2026, 2:58 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), MW-1.46-notes (1.46.0-wmf.20; 2026-03-17), Epic

May 8 2026

tchin added a comment to T409462: mediawiki.page_change.v1 event - add a page namespace_is_content field.

We could do this by adding page_type enum field, or a boolean is_content_namespace field

I don't think an enum would work, a content namespace is more like an abstract concept that exists outside of the normal MW-defined namespaces and could technically be any namespace, so I think using a boolean is probably simplest. An enum would only be useful as an array for CONTENT and then the actual namespace prefix itself

May 8 2026, 3:13 PM · MW-1.47-notes (1.47.0-wmf.3; 2026-05-19), Data-Engineering (Q4 FS25/26 April 1st - June 30st), Event-Platform

May 7 2026

tchin added a comment to T420296: Upgrade eventgate-* to node24.

A simple version bump to node 24 in the blubber file for eventgate-wikimedia failed due to some C++ compiling error from node-rdkafka. Probably we'd have to update that in node-rdkafka-factory and eventgate first before being able to update eventgate-wikimedia.

May 7 2026, 2:09 PM · Data-Engineering, Event-Platform

Apr 27 2026

tchin added a comment to T420621: Logged in reader retention logging.

Forgot to add an update from the Dublin offsite: there's now a client_platform_family column, so the dataset can be split by desktop/mobile. Because we can't backfill from the beginning of time, the first week or so of the dataset doesn't have it, but afterwards it should be there

Apr 27 2026, 3:12 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), MW-1.46-notes (1.46.0-wmf.22; 2026-03-31), Reader Experience Team, Test Kitchen

Apr 7 2026

tchin added a comment to T420621: Logged in reader retention logging.

Realized that it would be useful to have the domain where the pageview happened, so recreated the table and backfilled.

Apr 7 2026, 8:40 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), MW-1.46-notes (1.46.0-wmf.22; 2026-03-31), Reader Experience Team, Test Kitchen

Apr 3 2026

tchin added a comment to T420621: Logged in reader retention logging.

Data is now available in the data lake under wmf_readership.active_reader_baseline.

spark-sql (default)> select count(1) from wmf_readership.active_reader_baseline;
count(1)
894215
Time taken: 13.804 seconds, Fetched 1 row(s)
Apr 3 2026, 6:31 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), MW-1.46-notes (1.46.0-wmf.22; 2026-03-31), Reader Experience Team, Test Kitchen

Mar 24 2026

tchin added a comment to T418804: table_maintenance_iceberg_monthly permission issue fails task due to permission on Ivy cache artifact.

I got the same error running an airflow devenv while developing a Spark 3.3.2 DAG.

Mar 24 2026, 1:55 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Mar 23 2026

tchin moved T419882: Consider updating our heuristics for media type classification in AQS / wikistats from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mar 23 2026, 3:28 PM · Patch-For-Review, Data-Engineering (Q4 FS25/26 April 1st - June 30st), AQS2.0
tchin added a comment to T420621: Logged in reader retention logging.

we should cover as many wikis as feasible

That itself could be its own task, but I'm assuming that 100% sampling on every wiki is *probably* fine since this instrument will only apply to logged in users.

Mar 23 2026, 3:22 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), MW-1.46-notes (1.46.0-wmf.22; 2026-03-31), Reader Experience Team, Test Kitchen
tchin moved T420621: Logged in reader retention logging from Urgent to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mar 23 2026, 3:18 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), MW-1.46-notes (1.46.0-wmf.22; 2026-03-31), Reader Experience Team, Test Kitchen
tchin moved T420787: Visualizing inconsistencies and reconciles via Superset from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mar 23 2026, 3:17 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Mar 19 2026

tchin updated the task description for T420621: Logged in reader retention logging.
Mar 19 2026, 8:51 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), MW-1.46-notes (1.46.0-wmf.22; 2026-03-31), Reader Experience Team, Test Kitchen

Mar 16 2026

tchin added a parent task for T420257: Upgrade eventstreams and eventstreams-internal to node24 (or node22): Unknown Object (Task).
Mar 16 2026, 8:15 PM · Patch-For-Review, Data-Engineering, Event-Platform

Mar 9 2026

tchin updated subscribers of T416756: Release OpenTelemetry integration for service-utils.

cc: @Ottomata as a very interesting read

Mar 9 2026, 3:41 PM · service-utils, ServiceOps-SharedInfra, ServiceOps new

Mar 6 2026

tchin added a comment to T409106: X-Experiment-Enrollments EventGate handling reinforcement for MalformedHeaderError cases.

Eventgate v1.28.0 is now deployed

Mar 6 2026, 3:39 PM · Patch-For-Review, Test Kitchen (Experiment Platform Sprint 20), Essential-Work, Data-Engineering-Radar, Data-Engineering, Event-Platform

Feb 6 2026

tchin added a parent task for T416719: OpsWeek: Bump memory of refine job for product_metrics.web_base_with_ip to avoid recent OOMs: Unknown Object (Task).
Feb 6 2026, 6:44 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Jan 29 2026

tchin added a comment to T415549: Instance-level EventGate configuration to enable/disable functionality.

I chatted with @Ottomata about this a little bit, here's what I'm going to attempt:

Jan 29 2026, 9:30 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Event-Platform

Jan 26 2026

tchin updated the task description for T415549: Instance-level EventGate configuration to enable/disable functionality.
Jan 26 2026, 9:45 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Event-Platform
tchin claimed T415549: Instance-level EventGate configuration to enable/disable functionality.
Jan 26 2026, 3:00 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Event-Platform
tchin updated the task description for T415549: Instance-level EventGate configuration to enable/disable functionality.
Jan 26 2026, 2:39 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Event-Platform
tchin added a subtask for T415549: Instance-level EventGate configuration to enable/disable functionality: Unknown Object (Task).
Jan 26 2026, 2:37 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Event-Platform
tchin created T415549: Instance-level EventGate configuration to enable/disable functionality.
Jan 26 2026, 2:37 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Event-Platform

Jan 6 2026

tchin added a member for Trusted-Contributors: BPiovesan-WMF.
Jan 6 2026, 4:14 PM

Dec 8 2025

tchin moved T411803: Fix reconcile bug where user_id is not being populated correctly. from Urgent to In progress on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 8 2025, 4:58 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
tchin moved T412035: Upgrade Airflow HdfsEmailOperator to take both a String or a List(String) email addresses. from Next Up to In progress on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 8 2025, 4:49 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
tchin moved T411876: Add new data-steward email to Human-Bot Alert email. from In progress to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 8 2025, 4:49 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
tchin moved T411378: Human vs Bot Alerting Email Upgrade from In progress to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 8 2025, 4:48 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Dec 5 2025

tchin added a project to T410266: Explore how to migrate PyFlink to Java/Scala: Spike.
Dec 5 2025, 3:09 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Spike, Event-Platform

Dec 3 2025

tchin updated tchin.
Dec 3 2025, 10:33 PM
tchin added a comment to T360794: Event stream with latest revision HTML & parent revision HTML diff.

@fkaelin How urgent is the need for this stream? We're considering moving off of PyFlink and this would be a good opportunity to spike on a Java pipeline instead of a quick implementation now and then the complexities of dealing with any migration pains later

Dec 3 2025, 2:36 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Research, Event-Platform

Dec 2 2025

tchin created T411457: NDA Access for tchin.
Dec 2 2025, 3:47 AM · Essential-Work, Release-Engineering-Team (Doing 😎), WMF-NDA-Requests

Nov 14 2025

tchin added a comment to T409469: Enable ChangeProp to consume mediawiki.page_content_change.v1.

Would we also need to explicitly create the topics in main? Is auto topic creation enabled there?

Nov 14 2025, 3:24 PM · Data-Engineering, serviceops-deprecated, Machine-Learning-Team

Nov 12 2025

tchin added a comment to T409469: Enable ChangeProp to consume mediawiki.page_content_change.v1.

what work is required to produce mediawiki.page_content_change.v1 to Kafka main? I'm expecting just some helmfile changes, for example, in the mw-page-content-change-enrich/values-codfw.yaml, and not requiring changes in mediawiki event enrichment code, right?

Nov 12 2025, 5:23 PM · Data-Engineering, serviceops-deprecated, Machine-Learning-Team

Nov 11 2025

tchin updated the task description for T404340: [EPIC] Upgrade flink jobs to java 17.
Nov 11 2025, 11:08 PM · Data-Engineering, Wikidata, Wikidata-Query-Service, Essential-Work, Discovery-Search, Epic

Nov 3 2025

tchin claimed T408918: Upgrade mediawiki-event-enrichment jobs to >= Flink 1.20.3 and Java 17.
Nov 3 2025, 3:56 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, Event-Platform, Essential-Work

Oct 31 2025

tchin created T408918: Upgrade mediawiki-event-enrichment jobs to >= Flink 1.20.3 and Java 17.
Oct 31 2025, 1:15 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, Event-Platform, Essential-Work

Oct 28 2025

tchin added a comment to T405952: EventgateProduceRateStop / EventGateProduceRateAnomaly alert should be active datacenter aware.

Talked with Andrew about this more. The main problem is that MediaWiki is active/passive, but eventgate is basically active/active. The external eventgate instances will expect traffic on both DCs, but the internal ones would see activity on the active DC (but may still get events on the passive DC).

Oct 28 2025, 5:23 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Observability-Alerting, Event-Platform

Oct 27 2025

tchin added a comment to T405952: EventgateProduceRateStop / EventGateProduceRateAnomaly alert should be active datacenter aware.

Wow, this was harder than I thought. So what we need to happen is to detect the active DC using mediawiki_wmf_master_datacenter, which is indicated by the datacenter label, and only alert on the active datacenter. All metrics have a site label which is (from what I can tell) the datacenter the metric is exported from.
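
The filtering described above could be sketched roughly as follows. This is only an illustration in Python, not the actual alert rule: the function name, the dict of per-site rates, and the min_rate threshold are all hypothetical, and it assumes the active DC has already been read from the datacenter label of mediawiki_wmf_master_datacenter.

```python
def sites_to_alert(
    active_dc: str,
    produce_rate_by_site: dict[str, float],
    min_rate: float = 0.0,
) -> list[str]:
    """Return sites whose produce rate has stopped, restricted to the active DC.

    active_dc: value taken from the datacenter label (assumed already looked up).
    produce_rate_by_site: hypothetical mapping of each metric's site label to its rate.
    """
    return sorted(
        site
        for site, rate in produce_rate_by_site.items()
        # Only the active datacenter is allowed to raise the alert.
        if site == active_dc and rate <= min_rate
    )
```

With this shape, a passive DC that receives no events never fires, while the same silence on the active DC does.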

Oct 27 2025, 5:52 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Observability-Alerting, Event-Platform

Oct 9 2025

tchin updated the task description for T406872: Fix mediawiki event enrichment to work with newest version of Blubber.
Oct 9 2025, 2:15 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, Event-Platform
tchin created T406872: Fix mediawiki event enrichment to work with newest version of Blubber.
Oct 9 2025, 2:03 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, Event-Platform

Oct 8 2025

tchin created T406747: Merge eventgate and eventgate-wikimedia repos.
Oct 8 2025, 3:16 PM · Data-Engineering, Event-Platform
tchin added a comment to T397330: mediawiki.content_history: flink applications experiencing frequent restarts due to JobManager OOMs.

Looks like it's fixed! In the dashboard the Job Manager stopped OOM-ing

Oct 8 2025, 2:36 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Event-Platform

Sep 29 2025

tchin added a subtask for T403169: Migrate and re-deploy eventgate-wikimedia using new service-utils: T343342: eventgate logs field explosion.
Sep 29 2025, 4:56 PM · Data-Engineering, MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Patch-For-Review, Event-Platform, service-utils
tchin added a parent task for T343342: eventgate logs field explosion: T403169: Migrate and re-deploy eventgate-wikimedia using new service-utils.
Sep 29 2025, 4:56 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Patch-For-Review, Event-Platform, Observability-Logging

Sep 26 2025

tchin added a comment to T403169: Migrate and re-deploy eventgate-wikimedia using new service-utils.

Hmmm ok everything is deployed now and it works fine, but I can't tell if the p99 performance got worse, or the express metrics are broken somehow (either broken beforehand and now fixed or vice-versa). What makes me suspicious is that when you look at the latency quantiles by HTTP method from before the deployment, every deployment and every instance had a GET and POST p99 of almost exactly 9.90ms. After the deployment, it's actually correlated with the amount of events it's received. I'm assuming this means that something was actually fixed somewhere, but because of this, alerts are being fired on the passive DC because of the bursty nature of events there and the latency increase that's correlated with it.

Sep 26 2025, 7:15 PM · Data-Engineering, MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Patch-For-Review, Event-Platform, service-utils
tchin added a comment to T376026: Update event-producing tools to overwrite `meta.dt`.

In the logs I spotted another offender

{"@timestamp":"2025-09-26T16:47:14.571Z","ecs.version":"8.10.0","log.level":"info","message":"Overriding meta.dt in event b63f71b4-d6ff-4a1f-8544-e01d11df60c3 of schema at /sparql/query/1.3.0 destined to stream wdqs-external.sparql-query from 2025-09-26T16:47:14.499Z to 2025-09-26T16:47:14.571Z.","service":{"name":"eventgate-analytics"}}
Sep 26 2025, 6:27 PM · Data-Engineering, Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Event-Platform
tchin added a comment to T403169: Migrate and re-deploy eventgate-wikimedia using new service-utils.

Deployed to eventgate-analytics-external and it looks stable. Proceeding to deploy to the remaining instances.

Sep 26 2025, 4:11 PM · Data-Engineering, MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Patch-For-Review, Event-Platform, service-utils
tchin added a comment to T403169: Migrate and re-deploy eventgate-wikimedia using new service-utils.

Fully deployed to eventgate-logging-external. Logs seem fine. No log spam, the duplicate dropped fields are fixed, and it's fully ingested into logstash in ECS format now. Metrics also look good. It's a bit fuzzy because this happened during the DC switchover, but looking at the dashboard it seems like all metrics still match except the only one lost is the one I stated before that's under Memory usage (sum over all pods). That doesn't concern me that much since the service is also now being picked up by the new service-utils metrics dashboard so we still have memory reporting.

Sep 26 2025, 3:07 PM · Data-Engineering, MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Patch-For-Review, Event-Platform, service-utils

Sep 23 2025

tchin added a comment to T403169: Migrate and re-deploy eventgate-wikimedia using new service-utils.

Also, these three metrics stopped being reported. I don't really know why, since from what I can tell they're Kubernetes metrics
nodejs_process_heap_used_bytes
nodejs_process_heap_total_bytes
nodejs_process_heap_rss_bytes

Sep 23 2025, 4:29 PM · Data-Engineering, MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Patch-For-Review, Event-Platform, service-utils
tchin added a comment to T403169: Migrate and re-deploy eventgate-wikimedia using new service-utils.

Deployed to eventgate-logging-external for codfw, it works in the sense that it didn't blow up, but will have to fix some stuff before I deploy the rest of it. Logs export fine in ECS, but for some reason a lot of fields are being dropped. Metrics show up in the dashboards but some need renaming, and I also forgot to add metrics for the express routes.

Sep 23 2025, 4:05 PM · Data-Engineering, MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Patch-For-Review, Event-Platform, service-utils

Sep 16 2025

tchin added a comment to T397330: mediawiki.content_history: flink applications experiencing frequent restarts due to JobManager OOMs.

I'm going to try to upgrade mw-content-history-reconcile-enrich-next to Flink 1.20 to see if it magically fixes the issue, but I won't do any work migrating from deprecated config and stuff in this ticket. If the issue doesn't get fixed, at least the update includes a feature that allows us to profile the JobManager using the Flink Web UI, which could be useful.

Sep 16 2025, 8:35 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Event-Platform

Aug 29 2025

tchin added a comment to T401010: Sunset in-pipeline metrics computation for the MW Content Pipelines.

oo very nice!! I wonder how it'd compare to a pure java version of the deequ code. Maybe if we switch to SQL we can take the opportunity to revamp the metrics table schema?

Aug 29 2025, 3:07 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review

Aug 28 2025

tchin updated the task description for T403171: Add user-agent to http calls from eventgate-wikimedia.
Aug 28 2025, 1:02 PM · MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform
tchin created T403171: Add user-agent to http calls from eventgate-wikimedia.
Aug 28 2025, 12:58 PM · MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform
tchin created T403169: Migrate and re-deploy eventgate-wikimedia using new service-utils.
Aug 28 2025, 12:52 PM · Data-Engineering, MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Patch-For-Review, Event-Platform, service-utils

Aug 27 2025

tchin added a project to T361768: Migrate and re-deploy eventgate using new service-utils: Event-Platform.
Aug 27 2025, 8:09 PM · Event-Platform, Data-Engineering (Q1 FY25/26 July 1st - September 30th), service-utils

Aug 26 2025

tchin claimed T402801: CI in schemas-event-secondary fails because the tests do not follow WMF's Robot policy.
Aug 26 2025, 1:09 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Patch-For-Review, Event-Platform

Aug 18 2025

tchin added a comment to T396564: EventStreams: duplicate events from double compute (wdqs/rdf) streams.

@dcausse fyi I just deployed eventstreams with your patch

Aug 18 2025, 4:47 PM · Discovery-Search (2025.07.25 - 2025.08.15), Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform, EventStreams

Aug 5 2025

tchin added a comment to T390140: Eventstreams 'assignments' logstash field type.

assignments is now stringified in KafkaSSE, but in the logs I see that assignments is in normalized.dropped.no_such_field. Is there something I'm missing? @colewhite

When this task was filed, eventstreams used the legacy logstash format. Now that eventstreams is writing ECS, the field is reaped because the field does not exist in the schema.

Within ECS, we can either rely on event.original, amend the schema, or use another field.

Aug 5 2025, 3:19 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform, SRE Observability, EventStreams

Jul 30 2025

tchin changed the status of T366487: Event Platform schemas should not support type changes to structs as array element or map value types, a subtask of T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation, from Open to In Progress.
Jul 30 2025, 3:55 AM · Data-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review
tchin changed the status of T366487: Event Platform schemas should not support type changes to structs as array element or map value types, a subtask of T259924: HiveExtensions.convertToSchema does not properly convert arrays of structs, from Open to In Progress.
Jul 30 2025, 3:55 AM · Data-Engineering-Icebox, Data-Engineering, Patch-Needs-Improvement
tchin changed the status of T366487: Event Platform schemas should not support type changes to structs as array element or map value types from Open to In Progress.
Jul 30 2025, 3:55 AM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform

Jul 28 2025

tchin added a comment to T390140: Eventstreams 'assignments' logstash field type.

assignments is now stringified in KafkaSSE, but in the logs I see that assignments is in normalized.dropped.no_such_field. Is there something I'm missing? @colewhite

Jul 28 2025, 3:30 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform, SRE Observability, EventStreams

Jul 25 2025

tchin closed T388439: Add metrics for monthly reconciles, a subtask of T366752: Dumps 2.0 Phase III: Production level dumps (SDS 1.2), as Resolved.
Jul 25 2025, 6:34 PM · Data-Engineering-Roadmap, DPE-Mediawiki-Content, Epic
tchin closed T388439: Add metrics for monthly reconciles as Resolved.
Jul 25 2025, 6:34 PM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin added a comment to T388439: Add metrics for monthly reconciles.

Yeah it can be closed out

Jul 25 2025, 6:33 PM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content

Jul 21 2025

tchin added a comment to T398922: EventGate: Add Prometheus metric for hoisting errors.

If you want to drive it, be my guest and I can help you out if needed. Or we can pair program together. Whichever you prefer

Jul 21 2025, 2:22 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform, Test Kitchen

Jul 11 2025

tchin added a comment to T361768: Migrate and re-deploy eventgate using new service-utils.

This includes upgrade to Nodejs 20.

Hi! Has this happened? Looking at the images currently deployed per deployments-charts repo


$ podman run --rm -it --entrypoint /bin/sh docker-registry.wikimedia.org/repos/data-engineering/eventgate-wikimedia:v1.11.0 -c "nodejs -v"
v20.5.1

and

podman run --rm -it --entrypoint /bin/sh docker-registry.wikimedia.org/repos/data-engineering/eventgate-wikimedia:v1.14.0 -c "nodejs -v"
v20.5.1

says yes, but I guess it doesn't hurt to double check that we are all on the same page.

Jul 11 2025, 3:30 PM · Event-Platform, Data-Engineering (Q1 FY25/26 July 1st - September 30th), service-utils

Jul 7 2025

tchin added a comment to T398325: Figure out how Eventstreams connected client metrics went negative.

I should also mention that metrics eventually recovered, probably due to T383977 unintentionally causing rolling restarts of all the pods, resetting metrics. Taking a look at that ticket again since I'm already in the code.

Jul 7 2025, 2:41 PM · Data-Engineering (Q4 2025 April 1st - June 30th), EventStreams
tchin added a comment to T398325: Figure out how Eventstreams connected client metrics went negative.

Was digging through the logs and found a bunch of these InvalidAssignmentError requests, which are caused by setting a malformed last-event-id header. When set, this header takes precedence over the url stream parameter (which can lead to another bug if the stream names are different from the topics in the header), and passes all of eventstreams' checks until it gets handed to KafkaSSE, where it blows up. This somehow fires both the close and finish event.
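
One way to catch this earlier would be to validate the header before it reaches KafkaSSE. The sketch below is in Python purely for illustration (eventstreams itself is a Node.js service). It assumes the header carries a JSON array of assignment objects with topic, partition, and either offset or timestamp. The function name and the allowed_topics parameter are hypothetical.

```python
import json


def parse_last_event_id(header_value: str, allowed_topics: set[str]) -> list[dict]:
    """Parse a last-event-id header, raising ValueError if it is malformed.

    allowed_topics: topics backing the streams named in the URL (hypothetical
    input), used to reject a header that points at different topics.
    """
    try:
        assignments = json.loads(header_value)
    except json.JSONDecodeError as exc:
        raise ValueError("last-event-id is not valid JSON") from exc

    if not isinstance(assignments, list) or not assignments:
        raise ValueError("last-event-id must be a non-empty JSON array")

    for item in assignments:
        if not isinstance(item, dict):
            raise ValueError("each assignment must be a JSON object")
        topic = item.get("topic")
        if not isinstance(topic, str) or topic not in allowed_topics:
            raise ValueError(f"unexpected topic in assignment: {topic!r}")
        partition = item.get("partition")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(partition, int) or isinstance(partition, bool):
            raise ValueError("assignment partition must be an integer")
        position = item.get("offset", item.get("timestamp"))
        if not isinstance(position, int) or isinstance(position, bool):
            raise ValueError("assignment needs an integer offset or timestamp")

    return assignments
```

Rejecting the request at this point would also surface the mismatch between the URL streams and the header topics, instead of letting the header silently win.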

Jul 7 2025, 2:38 PM · Data-Engineering (Q4 2025 April 1st - June 30th), EventStreams

Jul 2 2025

tchin claimed T398325: Figure out how Eventstreams connected client metrics went negative.
Jul 2 2025, 2:38 PM · Data-Engineering (Q4 2025 April 1st - June 30th), EventStreams

Jul 1 2025

tchin created T398325: Figure out how Eventstreams connected client metrics went negative.
Jul 1 2025, 2:15 PM · Data-Engineering (Q4 2025 April 1st - June 30th), EventStreams

Jun 30 2025

tchin added a comment to T388439: Add metrics for monthly reconciles.

Adjusted airflow variables to use the new conda artifact. Should be good to go now. Now the only question is how long the metrics computation will take...

Jun 30 2025, 5:04 PM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin moved T390140: Eventstreams 'assignments' logstash field type from Next Up to In progress on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
Jun 30 2025, 4:44 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform, SRE Observability, EventStreams

Jun 26 2025

tchin added a comment to T388439: Add metrics for monthly reconciles.

Just noticed that in the metrics computation script, it deletes any duplicated metrics WHERE partition_ts = CAST('{args.min_timestamp}' AS TIMESTAMP) in case of reruns. However, for all-of-wiki-time, min_timestamp is always 2000-01-01T00:00:00. We need the partition_ts column to be the max_timestamp for this case.
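
A minimal sketch of the fix described above, assuming the script can tell an all-of-wiki-time run apart by its fixed min_timestamp. The helper names and the table argument are hypothetical. Only the WHERE clause shape and the 2000-01-01T00:00:00 value come from the comment.

```python
from datetime import datetime

# min_timestamp used by all-of-wiki-time runs, per the comment above.
ALL_TIME_MIN = datetime(2000, 1, 1)


def partition_ts_for(min_timestamp: datetime, max_timestamp: datetime) -> datetime:
    """Pick the partition_ts value shared by the dedup DELETE and the insert."""
    if min_timestamp == ALL_TIME_MIN:
        # All-of-wiki-time runs always share the same min_timestamp, so key
        # them on max_timestamp to keep each run in a distinct partition.
        return max_timestamp
    return min_timestamp


def dedup_delete_sql(table: str, partition_ts: datetime) -> str:
    """Build the rerun cleanup statement for one partition (illustrative only)."""
    return (
        f"DELETE FROM {table} "
        f"WHERE partition_ts = CAST('{partition_ts.isoformat()}' AS TIMESTAMP)"
    )
```

Using one helper for both the delete and the write keeps the two from drifting apart, which is the failure mode the comment describes.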

Jun 26 2025, 1:28 PM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin added a comment to T388439: Add metrics for monthly reconciles.

The metrics table has no unique index I can match on to update rows so I had to match on almost every column but it worked I guess

Jun 26 2025, 1:50 AM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content

Jun 24 2025

tchin added a comment to T388439: Add metrics for monthly reconciles.

Since implementing the metrics segregation, we should now update the legacy metrics with the computation class before implementing the monthly metrics

spark-sql (default)> SELECT COUNT(*) AS count
                   > FROM wmf_data_ops.data_quality_metrics
                   > WHERE tags['project'] = 'mediawiki_content_history'
                   >   AND (tags['computation_class'] IS NULL OR tags['computation_class'] = '')
                   > ;
count
577920
Jun 24 2025, 2:12 PM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content

May 16 2025

tchin added a comment to T386862: Enable Spark data lineage for all Airflow instances.

I think the solution is to make the code aware of both endpoints, and then pick the correct one inside the SparkSubmitOperator based off of the launcher param before it sets the rest of the config. Right now the endpoint is set by a jinja template, but by the time airflow templates the string it's probably too late?
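
The idea could look roughly like the sketch below: resolve the endpoint in plain Python from the launcher value before the rest of the Spark config is assembled, rather than relying on template rendering. Everything named here is hypothetical, including the function, the endpoints mapping, and the example.lineage.endpoint key. It is not the real SparkSubmitOperator code.

```python
def conf_with_lineage_endpoint(
    launcher: str,
    endpoints: dict[str, str],
    base_conf: dict[str, str],
) -> dict[str, str]:
    """Return a copy of base_conf with the lineage endpoint chosen by launcher.

    endpoints: hypothetical mapping of launcher name to lineage endpoint URL.
    """
    if launcher not in endpoints:
        raise KeyError(f"no lineage endpoint configured for launcher {launcher!r}")
    conf = dict(base_conf)  # copy so the caller's dict is not mutated
    # Placeholder config key; the real key depends on the lineage integration.
    conf["example.lineage.endpoint"] = endpoints[launcher]
    return conf
```

Because the choice is made from an ordinary argument, it no longer matters when Airflow renders the Jinja template.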

May 16 2025, 5:44 PM · Data-Engineering, Patch-For-Review

May 11 2025

tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
May 11 2025, 11:04 PM · DPE-Data-Quality, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin added a comment to T384962: Implement alerting for wmf_content.mediawiki_content_history_v1.

Deployed and updated airflow variables to use artifact v0.6.0

May 11 2025, 11:04 PM · Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review, DPE-Mediawiki-Content

May 5 2025

tchin moved T388439: Add metrics for monthly reconciles from In progress to In Review on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
May 5 2025, 3:28 PM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin moved T389162: [Data Quality] Add ability to add tags to alerts from In progress to In Review on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
May 5 2025, 3:28 PM · DPE-Data-Quality, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin moved T384962: Implement alerting for wmf_content.mediawiki_content_history_v1 from In progress to In Review on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
May 5 2025, 3:27 PM · Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review, DPE-Mediawiki-Content
tchin moved T391959: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: EventGate & Varnish: update to receive events from beacon event v2 from In progress to In Review on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
May 5 2025, 3:27 PM · Test Kitchen (Experiment Platform Sprint 6), Data-Engineering (Q4 2025 April 1st - June 30th)
tchin moved T389903: Analytics Cluster Dataset Usage Discovery Task from Next Up to In progress on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
May 5 2025, 3:14 PM · Data-Engineering (Q1 FY25/26 July 1st - September 30th)
tchin moved T391708: Duplicate revisions and excess reverts in 2025-03 MediaWiki History snapshot from In Review to Done on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
May 5 2025, 3:13 PM · Analytics-Data-Problem, Data-Engineering (Q4 2025 April 1st - June 30th)
tchin moved T392244: Facilitate automatic artifact cache warming for airflow-dags artifacts from Next Up to In progress on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
May 5 2025, 3:13 PM · Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th)
tchin moved T383931: Unify Airflow's datasets.yaml config files across instances from Next Up to In progress on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
May 5 2025, 3:13 PM · Data-Engineering (Q4 2025 April 1st - June 30th)

May 2 2025

tchin added a comment to T393130: handle large log response.

On the service-utils side, that property should've been filled in by express:

May 2 2025, 7:37 PM · Essential-Work, MW-1.45-notes (1.45.0-wmf.1; 2025-05-13), Patch-For-Review, Abstract Wikipedia team (25Q4 (Apr–Jun)), function-evaluator