
tchin (Thomas)
Software Engineer

User Details

User Since
Jun 21 2021, 2:34 PM (199 w, 6 d)
Availability
Available
LDAP User
TChin
MediaWiki User
TChin (WMF)

Recent Activity

Fri, Apr 18

tchin moved T391959: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: EventGate & Varnish: update to receive events from beacon event v2 from Next Up to In progress on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
Fri, Apr 18, 7:08 PM · Data-Engineering (Q4 2025 April 1st - June 30th), Experimentation Lab
tchin added a project to T391959: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: EventGate & Varnish: update to receive events from beacon event v2: Data-Engineering (Q4 2025 April 1st - June 30th).
Fri, Apr 18, 7:08 PM · Data-Engineering (Q4 2025 April 1st - June 30th), Experimentation Lab

Tue, Apr 15

tchin added a comment to T386862: Enable Spark data lineage for all Airflow instances.

Throwing out a guess: because where the lineage runs depends on the driver, it seems the listener needs to be aware of whether or not it's running on k8s and choose the correct Kafka bootstrap URL. I wonder if there's an easy way to figure this out? This will probably require some refactoring.
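
One possible way to make that choice, as a minimal sketch only: Kubernetes sets the `KUBERNETES_SERVICE_HOST` environment variable in the containers it starts, so its presence can serve as a rough "am I on k8s" check. The broker addresses and the function names below are hypothetical placeholders, not the real configuration.

```python
import os

# Hypothetical broker lists; real values would come from deployment config.
BOOTSTRAP_SERVERS = {
    "k8s": "kafka.k8s.example.internal:9093",
    "yarn": "kafka.yarn.example.internal:9092",
}


def running_on_kubernetes(env=None):
    """Return True if the process appears to run inside a Kubernetes pod."""
    env = os.environ if env is None else env
    # Kubernetes injects this variable into the containers it starts.
    return "KUBERNETES_SERVICE_HOST" in env


def choose_bootstrap_servers(env=None):
    """Pick the (hypothetical) broker list matching the current environment."""
    key = "k8s" if running_on_kubernetes(env) else "yarn"
    return BOOTSTRAP_SERVERS[key]
```

The check only describes the process it runs in, so it would have to be evaluated in the Spark driver itself rather than in whatever launches the job.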

Tue, Apr 15, 4:43 PM · Data-Engineering (Q4 2025 April 1st - June 30th)

Mon, Apr 14

tchin moved T390247: Migrate Gobblin job repository to GitLab from Done to In progress on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
Mon, Apr 14, 3:26 PM · Data-Engineering (Q4 2025 April 1st - June 30th)
tchin moved T390247: Migrate Gobblin job repository to GitLab from Next Up to Done on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
Mon, Apr 14, 3:26 PM · Data-Engineering (Q4 2025 April 1st - June 30th)
tchin moved T388439: Add metrics for monthly reconciles from Next Up to In progress on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
Mon, Apr 14, 3:14 PM · Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review, DPE-Mediawiki-Content
tchin claimed T388439: Add metrics for monthly reconciles.
Mon, Apr 14, 3:12 PM · Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review, DPE-Mediawiki-Content
tchin moved T388721: Support for 4.3.11 - webrequest based scraping detection from Next Up to In progress on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
Mon, Apr 14, 3:11 PM · Data-Engineering (Q4 2025 April 1st - June 30th), Essential-Work
tchin moved T370470: [CIM] Skewed ranking with the top Editors monthly API from In Review to Ready to Deploy on the Data-Engineering (Q4 2025 April 1st - June 30th) board.
Mon, Apr 14, 3:09 PM · Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review, Commons-Impact-Metrics

Tue, Apr 8

tchin added a comment to T387033: Figure root cause of silent failures when computing metrics for mediawiki_content_history_v1.

Reopening as we had another instance of skein OOM:

xcollazo@an-launcher1002:~$ sudo -u analytics yarn logs -appOwner analytics -applicationId application_1741864027385_383042 | grep "Application driver failed" -B 2 -A 2
...
25/03/31 13:44:16 INFO skein.ApplicationMaster: Registering application with resource manager
25/03/31 13:44:16 INFO skein.ApplicationMaster: Starting application driver
25/03/31 18:00:10 INFO skein.ApplicationMaster: Shutting down: Application driver failed with exit code 143. This is often due to the application master memory limit being exceeded. See the diagnostics for more information.
25/03/31 18:00:10 INFO skein.ApplicationMaster: Unregistering application with status FAILED
25/03/31 18:00:10 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.

Verified that driver has 16GB as per Airflow task details.

Bumping to 24GB manually to rerun, although 24GB for a driver for metrics seems silly. I wonder what Deequ is doing in there...
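
As a side note on the log above: exit code 143 follows the common "128 + signal number" convention, which makes it SIGTERM (15). That fits a driver being terminated from outside, as happens when a memory limit is enforced. A small sketch for decoding such codes (the helper name is made up for illustration):

```python
import signal


def describe_exit_code(code):
    """Explain an exit code using the 128 + signal-number convention."""
    if code > 128:
        try:
            # e.g. 143 - 128 = 15, which is SIGTERM
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"killed by unknown signal {code - 128}"
    return "exited normally" if code == 0 else f"exited with error status {code}"
```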

Tue, Apr 8, 8:31 PM · Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin added a comment to T386862: Enable Spark data lineage for all Airflow instances.

Ok so that wasn't it, now we get this error:

25/04/08 19:42:35 ERROR AsyncEventQueue: Listener DatahubSparkListener threw an exception
datahub.shaded.org.apache.kafka.common.KafkaException: Failed to construct kafka producer
	at datahub.shaded.org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:430)
	at datahub.shaded.org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:298)
	at datahub.client.kafka.KafkaEmitter.<init>(KafkaEmitter.java:55)
	at datahub.spark.DatahubEventEmitter.getEmitter(DatahubEventEmitter.java:85)
	at datahub.spark.DatahubEventEmitter.emitMcps(DatahubEventEmitter.java:401)
	at datahub.spark.DatahubEventEmitter.emitCoalesced(DatahubEventEmitter.java:190)
	at datahub.spark.DatahubSparkListener.onApplicationEnd(DatahubSparkListener.java:279)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:57)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
	at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
	at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
	at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
	at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1381)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
Caused by: datahub.shaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
	at datahub.shaded.org.apache.kafka.clients.ClientUtils.parseAndValidateAddresses(ClientUtils.java:84)
	at datahub.shaded.org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:408)
	... 20 more
Tue, Apr 8, 8:18 PM · Data-Engineering (Q4 2025 April 1st - June 30th)

Mon, Apr 7

tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
Mon, Apr 7, 1:46 PM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content

Fri, Apr 4

xcollazo awarded T389162: [Data Quality] Add ability to add tags to alerts a Party Time token.
Fri, Apr 4, 3:59 PM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin added a comment to T389162: [Data Quality] Add ability to add tags to alerts.

Seems to be working, webrequest_analyzer dag runs normally and I can see the columns being filled in the table:

Fri, Apr 4, 3:51 PM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
Fri, Apr 4, 3:43 PM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
Fri, Apr 4, 3:35 PM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
Fri, Apr 4, 3:10 PM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin added a comment to T384962: Implement alerting for wmf_content.mediawiki_content_history_v1.

Altered table:

ALTER TABLE wmf_data_ops.data_quality_alerts ADD COLUMNS (
    dataset_date BIGINT COMMENT 'AWS Deequ resultKey: key insertion time.',
    tags MAP<STRING,STRING> COMMENT 'AWS Deequ resultKey: key tags.'
);
Fri, Apr 4, 2:43 PM · Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review, DPE-Mediawiki-Content

Thu, Apr 3

tchin added a comment to T386862: Enable Spark data lineage for all Airflow instances.

I think it actually is broken on main as well and it's just been silently failing. I opened a patch to add the port back in
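
A quick sanity check for this kind of problem, as a sketch under the assumption that the failure comes from `bootstrap.servers` entries that lack a port. The helper name and the example hosts are hypothetical; the function only looks at the `host:port` shape and does not attempt DNS resolution.

```python
def entries_missing_port(bootstrap_servers):
    """Return entries of a comma-separated bootstrap.servers string without a numeric port."""
    missing = []
    for entry in bootstrap_servers.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the last colon so bracketed IPv6 hosts like [::1]:9092 still work.
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            missing.append(entry)
    return missing
```

For example, `entries_missing_port("a.example:9092,b.example")` flags only `b.example`.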

Thu, Apr 3, 12:52 PM · Data-Engineering (Q4 2025 April 1st - June 30th)

Wed, Apr 2

tchin updated subscribers of T386862: Enable Spark data lineage for all Airflow instances.

@brouberol would you happen to have some insight into this issue?

Wed, Apr 2, 4:18 AM · Data-Engineering (Q4 2025 April 1st - June 30th)

Tue, Apr 1

tchin added a comment to T386862: Enable Spark data lineage for all Airflow instances.

Actually never mind, the error we had last time was:

Caused by: datahub.shaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
        at datahub.shaded.org.apache.kafka.clients.ClientUtils.parseAndValidateAddresses(ClientUtils.java:84)
        at datahub.shaded.org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:408)
        ... 20 more
Tue, Apr 1, 8:54 PM · Data-Engineering (Q4 2025 April 1st - June 30th)
tchin added a comment to T386862: Enable Spark data lineage for all Airflow instances.

Currently encountering this error with the search instance. I think it's the same issue we encountered when migrating the analytics instance to k8s:

Tue, Apr 1, 4:46 PM · Data-Engineering (Q4 2025 April 1st - June 30th)

Thu, Mar 27

tchin added a comment to T390140: Eventstreams 'assignments' logstash field type.

Or a really quick way is to not pass the new logger from eventstreams into KafkaSSE; that way it creates a bunyan logger and the logs won't appear as an error in logstash anymore

Thu, Mar 27, 1:51 PM · Data-Engineering (Q4 2025 April 1st - June 30th), SRE Observability, EventStreams
tchin claimed T390140: Eventstreams 'assignments' logstash field type.
Thu, Mar 27, 1:47 PM · Data-Engineering (Q4 2025 April 1st - June 30th), SRE Observability, EventStreams
tchin added a comment to T390140: Eventstreams 'assignments' logstash field type.

Ah, the issue is in KafkaSSE, which still assumes bunyan. I'll have to standardize the logging inside that lib

Thu, Mar 27, 1:46 PM · Data-Engineering (Q4 2025 April 1st - June 30th), SRE Observability, EventStreams

Tue, Mar 25

tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
Tue, Mar 25, 9:46 PM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
Tue, Mar 25, 9:41 PM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
Tue, Mar 25, 9:27 PM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
Tue, Mar 25, 9:26 PM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
Tue, Mar 25, 8:46 PM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
Tue, Mar 25, 8:43 PM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content

Mar 18 2025

tchin moved T389162: [Data Quality] Add ability to add tags to alerts from Next Up to In progress on the Data-Engineering (Q3 2025 January 1st - March 31th) board.
Mar 18 2025, 12:42 PM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin edited projects for T389162: [Data Quality] Add ability to add tags to alerts, added: Data-Engineering (Q3 2025 January 1st - March 31th); removed Epic.
Mar 18 2025, 6:35 AM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
Mar 18 2025, 5:33 AM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin updated the task description for T389162: [Data Quality] Add ability to add tags to alerts.
Mar 18 2025, 5:31 AM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin added a parent task for T384962: Implement alerting for wmf_content.mediawiki_content_history_v1: T389162: [Data Quality] Add ability to add tags to alerts.
Mar 18 2025, 5:28 AM · Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review, DPE-Mediawiki-Content
tchin added a subtask for T389162: [Data Quality] Add ability to add tags to alerts: T384962: Implement alerting for wmf_content.mediawiki_content_history_v1.
Mar 18 2025, 5:28 AM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content
tchin created T389162: [Data Quality] Add ability to add tags to alerts.
Mar 18 2025, 5:28 AM · DPE-Data-Quality, Patch-For-Review, Data-Engineering (Q4 2025 April 1st - June 30th), DPE-Mediawiki-Content

Mar 14 2025

tchin added a comment to T384962: Implement alerting for wmf_content.mediawiki_content_history_v1.

Here's an issue I currently see: the data_quality_ops.data_quality_alerts doesn't have a column to put in metadata like tags like the metrics table does. This doesn't affect the actual alerting part, but would affect any future analyses and dashboarding someone might want to do on the verification checks. For instance if we want to alert on T388439 there isn't a way currently to differentiate records in the table that are checking monthly vs daily reconciles. Even now, there's an open question whether the source_table column in the alerts table should refer to data_quality_ops.data_quality_metrics or the underlying table that the metrics were computed against.
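
To illustrate what such metadata would enable, here is a minimal sketch that assumes alert records carry a `tags` map. The tag key `reconcile_period` and the record layout are invented for the example and are not the actual schema.

```python
def alerts_with_tag(alerts, key, value):
    """Return the alert records whose tags map contains key == value."""
    # Records without tags (missing or None) simply never match.
    return [a for a in alerts if (a.get("tags") or {}).get(key) == value]


# Hypothetical records, for illustration only.
alerts = [
    {"check": "row_count", "tags": {"reconcile_period": "monthly"}},
    {"check": "row_count", "tags": {"reconcile_period": "daily"}},
    {"check": "row_count", "tags": None},
]
monthly = alerts_with_tag(alerts, "reconcile_period", "monthly")
```

Without a column like this, the three records above would be indistinguishable in the alerts table.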

Mar 14 2025, 8:05 PM · Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review, DPE-Mediawiki-Content

Mar 12 2025

tchin added a comment to T387360: Update wikifunctions Grafana dashboard for service-utils.

@tchin This is already done, right? Can we close this one?

Mar 12 2025, 5:53 PM · Abstract Wikipedia team (25Q3 (Jan–Mar)), Abstract Wikipedia Fix-It tasks, Data-Engineering (Q3 2025 January 1st - March 31th)

Mar 5 2025

tchin added a comment to T387824: Fix service-utils metrics routing naming discrepancy.

Reconstructing the path using only the req object does work, but only if the params belong to the local router. So basically it doesn't work. Which means I need to go the other direction and make the middleware router-aware. This may require a major version bump.

Mar 5 2025, 8:52 PM · Data-Engineering (Q3 2025 January 1st - March 31th)

Mar 4 2025

tchin moved T387824: Fix service-utils metrics routing naming discrepancy from Next Up to In progress on the Data-Engineering (Q3 2025 January 1st - March 31th) board.
Mar 4 2025, 10:36 PM · Data-Engineering (Q3 2025 January 1st - March 31th)
tchin changed the status of T387824: Fix service-utils metrics routing naming discrepancy from Open to In Progress.
Mar 4 2025, 10:35 PM · Data-Engineering (Q3 2025 January 1st - March 31th)
tchin changed the status of T387824: Fix service-utils metrics routing naming discrepancy, a subtask of T360924: Replace service runner with a simplified library to better support metrics and debugging: service-utils, from Open to In Progress.
Mar 4 2025, 10:35 PM · Data-Engineering (Q3 2025 January 1st - March 31th)
tchin added a comment to T387824: Fix service-utils metrics routing naming discrepancy.

Yeah, I'm going with option 3 as well. The main issue is that the middleware is not aware of the router or the path it's on, so I need to figure out a way to make it aware of the router, or reconstruct the path using only the req object. Some more investigation is needed

Mar 4 2025, 10:35 PM · Data-Engineering (Q3 2025 January 1st - March 31th)
tchin added a comment to T387850: NEL logs are missing geoip information.

Could it have something to do with T382173?

Mar 4 2025, 2:05 PM · Data-Engineering (Q3 2025 January 1st - March 31th), Event-Platform, SRE Observability
tchin added a comment to T387360: Update wikifunctions Grafana dashboard for service-utils.

@DSantamaria Sorry, the ticket took longer to make since it took a while to just do the investigation to write it. Here it is: T387824

Mar 4 2025, 7:52 AM · Abstract Wikipedia team (25Q3 (Jan–Mar)), Abstract Wikipedia Fix-It tasks, Data-Engineering (Q3 2025 January 1st - March 31th)
tchin created T387824: Fix service-utils metrics routing naming discrepancy.
Mar 4 2025, 7:50 AM · Data-Engineering (Q3 2025 January 1st - March 31th)

Feb 28 2025

tchin added a comment to T387360: Update wikifunctions Grafana dashboard for service-utils.

Something seems wrong here, I should be seeing 3 paths here, --domain/v1/evaluate, --domain/v1/supported-programming-languages, and _info, but only root is showing up. Either no one has gone to those paths since service-utils was deployed or some investigation is needed...

Feb 28 2025, 8:45 PM · Abstract Wikipedia team (25Q3 (Jan–Mar)), Abstract Wikipedia Fix-It tasks, Data-Engineering (Q3 2025 January 1st - March 31th)

Feb 26 2025

tchin moved T387360: Update wikifunctions Grafana dashboard for service-utils from Next Up to In Review on the Data-Engineering (Q3 2025 January 1st - March 31th) board.
Feb 26 2025, 11:37 PM · Abstract Wikipedia team (25Q3 (Jan–Mar)), Abstract Wikipedia Fix-It tasks, Data-Engineering (Q3 2025 January 1st - March 31th)
tchin added a comment to T387360: Update wikifunctions Grafana dashboard for service-utils.

Updated the dashboard.

NodeJS
Feb 26 2025, 11:36 PM · Abstract Wikipedia team (25Q3 (Jan–Mar)), Abstract Wikipedia Fix-It tasks, Data-Engineering (Q3 2025 January 1st - March 31th)
tchin added a comment to T387360: Update wikifunctions Grafana dashboard for service-utils.

Here are the current metrics I'm seeing for function-orchestrator when curling a pod

# HELP process_cpu_user_seconds_total Total user CPU time spent in seconds.
# TYPE process_cpu_user_seconds_total counter
process_cpu_user_seconds_total 68.77650600000004
Feb 26 2025, 9:20 PM · Abstract Wikipedia team (25Q3 (Jan–Mar)), Abstract Wikipedia Fix-It tasks, Data-Engineering (Q3 2025 January 1st - March 31th)
tchin created T387360: Update wikifunctions Grafana dashboard for service-utils.
Feb 26 2025, 4:08 PM · Abstract Wikipedia team (25Q3 (Jan–Mar)), Abstract Wikipedia Fix-It tasks, Data-Engineering (Q3 2025 January 1st - March 31th)

Feb 24 2025

tchin moved T384962: Implement alerting for wmf_content.mediawiki_content_history_v1 from Next Up to In progress on the Data-Engineering (Q3 2025 January 1st - March 31th) board.
Feb 24 2025, 8:57 PM · Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review, DPE-Mediawiki-Content
tchin moved T386709: Update eventstreams Grafana Dashboards to fix GC metrics from In progress to In Review on the Data-Engineering (Q3 2025 January 1st - March 31th) board.
Feb 24 2025, 8:56 PM · Essential-Work, Data-Engineering (Q3 2025 January 1st - March 31th)
tchin added a comment to T386709: Update eventstreams Grafana Dashboards to fix GC metrics.

There doesn't seem to be a service label on these GC metrics so I used kubernetes_namespace instead.

Feb 24 2025, 8:45 PM · Essential-Work, Data-Engineering (Q3 2025 January 1st - March 31th)

Feb 20 2025

tchin moved T386972: WF service logging seems to be partially missing from Next Up to In progress on the Data-Engineering (Q3 2025 January 1st - March 31th) board.
Feb 20 2025, 11:05 PM · Abstract Wikipedia team (25Q3 (Jan–Mar)), Essential-Work, Data-Engineering (Q3 2025 January 1st - March 31th), function-evaluator, function-orchestrator
tchin claimed T386972: WF service logging seems to be partially missing.
Feb 20 2025, 11:04 PM · Abstract Wikipedia team (25Q3 (Jan–Mar)), Essential-Work, Data-Engineering (Q3 2025 January 1st - March 31th), function-evaluator, function-orchestrator
tchin added a comment to T386972: WF service logging seems to be partially missing.

This was a wild combination of issues which resulted in me submitting an issue upstream.
So, a quick explanation working backwards:

  1. The library I'm using to deep merge when reformatting log messages doesn't merge Symbols (seems like a bug)
  2. Winston uses symbols for internal state
  3. If Symbol('level') is not in the logged message object, apparently Winston doesn't output anything at all
  4. Because the symbol was only erased when the logged message was reformatted, logging still looked like it worked
  5. The more we fixed T383448, the more logging looked like it didn't work
  6. Because the test was a unit test on the formatter and not an integration test on the entirety of Winston, it didn't catch it
Feb 20 2025, 10:59 PM · Abstract Wikipedia team (25Q3 (Jan–Mar)), Essential-Work, Data-Engineering (Q3 2025 January 1st - March 31th), function-evaluator, function-orchestrator

Feb 18 2025

tchin updated the task description for T364779: Migrate node-based services in production to node20.
Feb 18 2025, 4:34 PM · Platform Engineering, Recommendation-API, Wikifeeds, Push-Notification-Service, Mobile-Content-Service, Maps (Kartotherian), EventStreams, Citoid, Proton, ChangeProp
tchin updated the task description for T366612: Publish Data Engineering maintained NodeJS packages to GitLab and use them in depender code.
Feb 18 2025, 1:40 PM · Data-Engineering (Q3 2025 January 1st - March 31th), Patch-For-Review
tchin moved T386709: Update eventstreams Grafana Dashboards to fix GC metrics from Next Up to In progress on the Data-Engineering (Q3 2025 January 1st - March 31th) board.
Feb 18 2025, 1:40 PM · Essential-Work, Data-Engineering (Q3 2025 January 1st - March 31th)
tchin moved T373689: EventStreams: kafka key should be serialized as a string from Ready to Deploy to Done on the Data-Engineering (Q3 2025 January 1st - March 31th) board.
Feb 18 2025, 1:40 PM · Data-Engineering (Q3 2025 January 1st - March 31th), Event-Platform, EventStreams
tchin added a parent task for T386709: Update eventstreams Grafana Dashboards to fix GC metrics: T350180: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics.
Feb 18 2025, 1:15 PM · Essential-Work, Data-Engineering (Q3 2025 January 1st - March 31th)
tchin added a subtask for T350180: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics: T386709: Update eventstreams Grafana Dashboards to fix GC metrics.
Feb 18 2025, 1:15 PM · Data-Engineering, observability, ChangeProp, Event-Platform, service-runner
tchin created T386709: Update eventstreams Grafana Dashboards to fix GC metrics.
Feb 18 2025, 1:14 PM · Essential-Work, Data-Engineering (Q3 2025 January 1st - March 31th)

Feb 15 2025

tchin added a comment to T386114: DAG failing due to failure to acquire lock on wmf_data_ops.data_quality_metrics table.

Looks like it got locked again. I wonder if we should just split the metrics table to have one per pipeline to avoid this whole thing entirely

Feb 15 2025, 3:10 AM · Data-Platform-SRE (2025.03.01 - 2025.03.21), Essential-Work, Data-Engineering (Q3 2025 January 1st - March 31th), Patch-For-Review, DPE-Mediawiki-Content, Data-Platform (Data Platform Ops Week Working Group)

Feb 14 2025

tchin closed T386204: Update eventstreams Grafana Dashboards to use histogram for router metrics, a subtask of T361769: Migrate and re-deploy eventstreams using service-utils, as Resolved.
Feb 14 2025, 7:52 PM · Data-Engineering (Q3 2025 January 1st - March 31th), Patch-For-Review
tchin closed T386204: Update eventstreams Grafana Dashboards to use histogram for router metrics as Resolved.

Garbage collection metrics are enabled by default, but it seems that the default metrics are now also histograms rather than gauges, which is what the dashboard assumes

Feb 14 2025, 7:52 PM · Data-Engineering (Q3 2025 January 1st - March 31th)

Feb 13 2025

tchin moved T386204: Update eventstreams Grafana Dashboards to use histogram for router metrics from Next Up to Ready to Deploy on the Data-Engineering (Q3 2025 January 1st - March 31th) board.
Feb 13 2025, 7:02 PM · Data-Engineering (Q3 2025 January 1st - March 31th)
tchin moved T361769: Migrate and re-deploy eventstreams using service-utils from In Review to Ready to Deploy on the Data-Engineering (Q3 2025 January 1st - March 31th) board.
Feb 13 2025, 7:02 PM · Data-Engineering (Q3 2025 January 1st - March 31th), Patch-For-Review
tchin added a comment to T386204: Update eventstreams Grafana Dashboards to use histogram for router metrics.

Stream Connection Duration

- min(express_router_request_duration_seconds{service="$service", path=~"v2/stream/.*"})
+ min(rate(express_router_request_duration_seconds_sum{service="$service", path=~"stream/.*"}[5m])/rate(express_router_request_duration_seconds_count{service="$service", path=~"stream/.*"}[5m])>0)
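
The new expression relies on the usual histogram identity: the mean observed value over a window is the increase in `_sum` divided by the increase in `_count` (the per-second scaling applied by `rate()` cancels out). A small worked sketch of that arithmetic, with a made-up function name and sample values:

```python
def mean_duration(sum_start, sum_end, count_start, count_end):
    """Mean request duration over a window, from histogram _sum and _count samples."""
    observed = count_end - count_start
    if observed <= 0:
        # No requests in the window, so there is no meaningful average.
        return None
    return (sum_end - sum_start) / observed
```

For instance, if `_sum` grows from 10.0 to 25.0 seconds while `_count` grows from 4 to 9 requests, the mean duration in that window is 15.0 / 5 = 3.0 seconds.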
Feb 13 2025, 6:59 PM · Data-Engineering (Q3 2025 January 1st - March 31th)
tchin updated the task description for T386204: Update eventstreams Grafana Dashboards to use histogram for router metrics.
Feb 13 2025, 6:30 PM · Data-Engineering (Q3 2025 January 1st - March 31th)
tchin changed the status of T361769: Migrate and re-deploy eventstreams using service-utils from Open to In Progress.
Feb 13 2025, 6:20 PM · Data-Engineering (Q3 2025 January 1st - March 31th), Patch-For-Review
tchin changed the status of T361769: Migrate and re-deploy eventstreams using service-utils, a subtask of T360924: Replace service runner with a simplified library to better support metrics and debugging: service-utils, from Open to In Progress.
Feb 13 2025, 6:20 PM · Data-Engineering (Q3 2025 January 1st - March 31th)
tchin changed the status of T386204: Update eventstreams Grafana Dashboards to use histogram for router metrics, a subtask of T361769: Migrate and re-deploy eventstreams using service-utils, from Open to In Progress.
Feb 13 2025, 6:20 PM · Data-Engineering (Q3 2025 January 1st - March 31th), Patch-For-Review
tchin changed the status of T386204: Update eventstreams Grafana Dashboards to use histogram for router metrics from Open to In Progress.
Feb 13 2025, 6:20 PM · Data-Engineering (Q3 2025 January 1st - March 31th)
tchin added a comment to T362681: Provide nodejs20 base images for production.

Do we know what libraries are included in the node20 image vs the node18 one? I recently had to install libssl-dev to get a service working on node20.

Feb 13 2025, 12:52 PM · serviceops
tchin added a comment to T386092: Enable custom SSL certificate CA bundle to work with confluent-kafka > 2.6.2.

I don't know why but it seems like the kafka sink doesn't recognize schema_registry_config and it's currently failing:

Exception:
1 validation error for KafkaSinkConfig
schema_registry_config
extra fields not permitted (type=value_error.extra)
Feb 13 2025, 12:43 AM · Patch-For-Review, Data-Platform-SRE (2025.02.10 - 2025.02.28)

Feb 12 2025

tchin added a comment to T386114: DAG failing due to failure to acquire lock on wmf_data_ops.data_quality_metrics table.

There's another dag that fails but I turned it off since it runs hourly so it's super noisy: webrequest_analyzer

Feb 12 2025, 3:36 PM · Data-Platform-SRE (2025.03.01 - 2025.03.21), Essential-Work, Data-Engineering (Q3 2025 January 1st - March 31th), Patch-For-Review, DPE-Mediawiki-Content, Data-Platform (Data Platform Ops Week Working Group)
Ottomata awarded T386204: Update eventstreams Grafana Dashboards to use histogram for router metrics a Love token.
Feb 12 2025, 2:33 PM · Data-Engineering (Q3 2025 January 1st - March 31th)
tchin created T386204: Update eventstreams Grafana Dashboards to use histogram for router metrics.
Feb 12 2025, 2:27 PM · Data-Engineering (Q3 2025 January 1st - March 31th)

Feb 1 2025

tchin added a comment to T357684: Dashboard and alerting of data quality metrics for wmf_content.mediawiki_content_history_v1.

Summary

  • Created a python script and airflow dag for computing metrics
  • Dogfood refinery-python and therefore PyDeequ
    • refinery-python doesn't work with the latest version of PyDeequ. We're currently pinning it but it should be upgraded.
  • Discovered Deequ has some major quirks, or it's more like we're not using it for its intended purpose
    • Can't directly insert metrics. Metrics are always computed and therefore associated with an Analyzer.
    • Can't implement custom Analyzers in PyDeequ (GitHub Issue)
    • Can't compute metrics across tables. A workaround had to be used.
    • Doesn't output metrics on empty data (except for size). i.e. Asking it to give the Completeness of a column on a DataFrame of 0 records results in no metrics.
  • Created a Superset dashboard with metrics that were computed, turns out Superset also has some quirks
    • Superset expects a table to compute metrics over, not a table of already computed metrics. Some workarounds had to be used.
    • A dashboard with 500+ wikis is not that helpful, perhaps split it into multiple smaller dashboards
    • Superset does not handle time series data that well. No metrics on a specific day results in no data points. Resampling is only available for some graphs.
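
On the last point, one workaround outside Superset is to fill the gaps before the data reaches the dashboard. This is only a sketch of the idea with hypothetical names, not what the pipeline actually does:

```python
from datetime import date, timedelta


def fill_missing_days(points, start, end, fill=None):
    """Return one (day, value) pair for every day in [start, end], inclusive."""
    filled = []
    day = start
    while day <= end:
        # Days with no computed metric get the placeholder value.
        filled.append((day, points.get(day, fill)))
        day += timedelta(days=1)
    return filled


series = fill_missing_days(
    {date(2025, 1, 1): 5, date(2025, 1, 3): 7},
    date(2025, 1, 1),
    date(2025, 1, 3),
    fill=0,
)
```

Whether a gap should become 0, a null, or the previous value depends on the metric, so `fill` is left to the caller.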
Feb 1 2025, 10:06 PM · Data-Engineering (Q3 2025 January 1st - March 31th), Patch-For-Review, Dumps 2.0 (Kanban Board), Experimentation Lab

Jan 30 2025

tchin updated the task description for T357684: Dashboard and alerting of data quality metrics for wmf_content.mediawiki_content_history_v1.
Jan 30 2025, 4:25 PM · Data-Engineering (Q3 2025 January 1st - March 31th), Patch-For-Review, Dumps 2.0 (Kanban Board), Experimentation Lab

Jan 28 2025

tchin created T384962: Implement alerting for wmf_content.mediawiki_content_history_v1.
Jan 28 2025, 7:27 PM · Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review, DPE-Mediawiki-Content

Jan 27 2025

tchin added a comment to T357684: Dashboard and alerting of data quality metrics for wmf_content.mediawiki_content_history_v1.

I set up a Superset dashboard here

Jan 27 2025, 7:03 PM · Data-Engineering (Q3 2025 January 1st - March 31th), Patch-For-Review, Dumps 2.0 (Kanban Board), Experimentation Lab

Jan 24 2025

tchin added a comment to T357684: Dashboard and alerting of data quality metrics for wmf_content.mediawiki_content_history_v1.

Nice catch! Putting up a patch

Jan 24 2025, 9:04 PM · Data-Engineering (Q3 2025 January 1st - March 31th), Patch-For-Review, Dumps 2.0 (Kanban Board), Experimentation Lab

Jan 23 2025

tchin added a comment to T318269: Test and analyze Kuromoji & Sudachi Japanese language analyzers.

If you have any questions I could try to ask them on the sudachi slack?

Jan 23 2025, 5:33 PM · Discovery-Search (2025.03.01 - 2025.03.21)

Jan 21 2025

tchin added a project to T384364: Create an instance-level npm package registry in Gitlab: GitLab (Administration, Settings & Policy).
Jan 21 2025, 6:56 PM · User-brennen, Release-Engineering-Team (Seen), Data-Engineering, GitLab (Administration, Settings & Policy)
tchin added a parent task for T366614: [Epic] Migrate Data Engineering maintained NodeJS repositories to GitLab: T384364: Create an instance-level npm package registry in Gitlab.
Jan 21 2025, 6:54 PM · Data-Engineering-Roadmap, Epic
tchin added a subtask for T384364: Create an instance-level npm package registry in Gitlab: T366614: [Epic] Migrate Data Engineering maintained NodeJS repositories to GitLab.
Jan 21 2025, 6:54 PM · User-brennen, Release-Engineering-Team (Seen), Data-Engineering, GitLab (Administration, Settings & Policy)
tchin created T384364: Create an instance-level npm package registry in Gitlab.
Jan 21 2025, 6:53 PM · User-brennen, Release-Engineering-Team (Seen), Data-Engineering, GitLab (Administration, Settings & Policy)

Jan 14 2025

tchin added a comment to T375176: Enable HA for the mw-content-history-reconcile-enrich flink application.

@brouberol I guess I need to set egress? What's the CIDR of Ceph?

Jan 14 2025, 7:00 AM · Data-Engineering (Q3 2025 January 1st - March 31th), Patch-For-Review, Dumps 2.0 (Kanban Board)
tchin added a comment to T382065: Add support for active/active double compute streams in the EventStreams HTTP service.

Deployed v0.11.0 on beta and confirmed it's fixed

Jan 14 2025, 5:05 AM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service, EventStreams

Jan 9 2025

tchin added a comment to T375176: Enable HA for the mw-content-history-reconcile-enrich flink application.

Tried deploying to staging with @gmodena and got this error, but it doesn't show up in Logstash:

java.util.concurrent.CompletionException: com.amazonaws.SdkClientException: Unable to execute HTTP request: 
Connect to rgw.eqiad.dpe.anycast.wmnet:443 [rgw.eqiad.dpe.anycast.wmnet/10.3.0.8, 
rgw.eqiad.dpe.anycast.wmnet/2a02:ec80:ff00:101:0:0:0:8] failed: connect timed out
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Jan 9 2025, 3:59 PM · Data-Engineering (Q3 2025 January 1st - March 31th), Patch-For-Review, Dumps 2.0 (Kanban Board)
tchin added a comment to T382065: Add support for active/active double compute streams in the EventStreams HTTP service.

Deployed EventStreams v0.10.0 on beta, and it throws this error when listening to a stream:

{"message":"No topics available for consumption. This likely means that the configured allowedTopics do not currently exist.","origin":"KafkaSSE","name":"ConfigurationError","allowedTopics":[null],"statusCode":500}
Jan 9 2025, 2:23 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service, EventStreams

Jan 6 2025

tchin added a comment to T353939: [NEEDS GROOMING] deequ repo should be instantiated from Wikimedia's DQ metrics store.

It looks like the easiest approach would be to use Deequ's AnalysisResultSerde and add a new column to our metrics table to store the serialized result. We could then implement our own metrics repository, or, if we upgrade to Deequ 2.0.7, possibly extend its SparkTableMetricsRepository.

Jan 6 2025, 2:56 AM · Patch-For-Review, Data-Engineering

Dec 23 2024

tchin created T382703: refinery-python should be moved to the data-engineering namespace.
Dec 23 2024, 12:19 PM · DPE-Mediawiki-Content, Epic

Dec 19 2024

tchin moved T375176: Enable HA for the mw-content-history-reconcile-enrich flink application from Next Up to In Review on the Data-Engineering (Q2 2024 October 1st - December 31th) board.
Dec 19 2024, 11:58 AM · Data-Engineering (Q3 2025 January 1st - March 31th), Patch-For-Review, Dumps 2.0 (Kanban Board)
tchin renamed T375176: Enable HA for the mw-content-history-reconcile-enrich flink application from Enable HA for the mw-dump-rev-content-reconcile-enrich flink application to Enable HA for the mw-content-history-reconcile-enrich flink application.
Dec 19 2024, 9:36 AM · Data-Engineering (Q3 2025 January 1st - March 31th), Patch-For-Review, Dumps 2.0 (Kanban Board)