
Fri, Jan 16

xcollazo moved T414389: Publish Dumps 2 to dumps.wikimedia.org and provide only monthly dumps from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Fri, Jan 16, 7:51 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T414389: Publish Dumps 2 to dumps.wikimedia.org and provide only monthly dumps.

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1924

mw file export: remove mid-month file export DAG

Fri, Jan 16, 7:39 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo closed T406515: Add user_central_id to mediawiki_content_history_v1 (and mediawiki_content_current_v1), a subtask of T405039: Global Editor Metrics - Data Pipeline, as Resolved.
Fri, Jan 16, 5:26 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, OKR-Work, MediaWiki-Page-derived-data
xcollazo closed T406515: Add user_central_id to mediawiki_content_history_v1 (and mediawiki_content_current_v1) as Resolved.

I am closing this ticket as done, since we did incorporate user_central_id to the mediawiki_content_* tables.

Fri, Jan 16, 5:26 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
xcollazo created T414832: Backfill `user_central_id` on wmf_content.mediawiki_content_* tables.
Fri, Jan 16, 5:25 PM · Data-Engineering
xcollazo changed the status of T414389: Publish Dumps 2 to dumps.wikimedia.org and provide only monthly dumps from Open to In Progress.
Fri, Jan 16, 3:50 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo changed the status of T414389: Publish Dumps 2 to dumps.wikimedia.org and provide only monthly dumps, a subtask of T384382: Production-level file export (aka dump) of MW Content in XML, from Open to In Progress.
Fri, Jan 16, 3:50 PM · Data-Engineering, DPE-Mediawiki-Content, Epic
xcollazo moved T413888: duplicated page_title in mediawiki_content_current_v1 from In Review to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Fri, Jan 16, 3:36 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo moved T401892: Update MediaWiki Content History SLO draft for SRE review from Ready to Deploy to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Fri, Jan 16, 1:55 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T401892: Update MediaWiki Content History SLO draft for SRE review.

Final Asana status of this completed work at https://app.asana.com/0/1210776717300332/progress/1212800339589065.

Fri, Jan 16, 1:54 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo claimed T407237: Wait till november wmf_raw.mediawiki_slots sqoop table is available, and apply origin_rev_id fix to mw_content tables.
Fri, Jan 16, 1:09 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), DPE-Mediawiki-Content
xcollazo moved T411803: Fix reconcile bug where user_id is not being populated correctly. from Ready to Deploy to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Fri, Jan 16, 1:07 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

There are still reconcile events being ingested, but the majority are taken care of.

Fri, Jan 16, 1:07 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo moved T396031: Remove code and tables related to deprecated mediawiki_wikitext_history and mediawiki_wikitext_current from In Review to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Fri, Jan 16, 1:02 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), DPE-Mediawiki-Content

Thu, Jan 15

xcollazo added a comment to T414105: SDS 2.2.6 Improve experiment event data data lake management.

Summary of meeting with @mpopov:

Thu, Jan 15, 7:48 PM · OKR-Work, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo closed T404975: Another instance of duplicate rows on wmf_content.mediawiki_content_history_v1 as Resolved.

Follow up work being done on T410431. Closing this one.

Thu, Jan 15, 5:27 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), DPE-Mediawiki-Content
xcollazo closed T385112: Investigate reasons for remaining inconsistencies as Resolved.

I'm being bold and closing this long-standing ticket. We have found and fixed inconsistencies elsewhere, and we do not seem to have a need to track this globally here.

Thu, Jan 15, 5:25 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, DPE-Mediawiki-Content
xcollazo closed T385112: Investigate reasons for remaining inconsistencies, a subtask of T366752: Dumps 2.0 Phase III: Production level dumps (SDS 1.2), as Resolved.
Thu, Jan 15, 5:25 PM · Data-Engineering-Roadmap, DPE-Mediawiki-Content, Epic
xcollazo updated the task description for T396031: Remove code and tables related to deprecated mediawiki_wikitext_history and mediawiki_wikitext_current.
Thu, Jan 15, 5:21 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), DPE-Mediawiki-Content
xcollazo moved T396031: Remove code and tables related to deprecated mediawiki_wikitext_history and mediawiki_wikitext_current from Blocked/Paused to In Review on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Thu, Jan 15, 5:17 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), DPE-Mediawiki-Content
xcollazo added a comment to T396031: Remove code and tables related to deprecated mediawiki_wikitext_history and mediawiki_wikitext_current.

Ran the following to drop tables:

analytics@an-launcher1003:/mnt/hdfs/wmf/data/wmf$ hostname -f
an-launcher1003.eqiad.wmnet
Thu, Jan 15, 5:09 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), DPE-Mediawiki-Content
xcollazo moved T413888: duplicated page_title in mediawiki_content_current_v1 from In progress to In Review on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Thu, Jan 15, 3:51 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo changed the status of T413888: duplicated page_title in mediawiki_content_current_v1 from Open to In Progress.
Thu, Jan 15, 3:50 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T413888: duplicated page_title in mediawiki_content_current_v1.

A cursory look at the remaining 447 duplicate titles suggests that these are recently deleted and undeleted articles that have not been reconciled yet.

Thu, Jan 15, 3:50 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T413888: duplicated page_title in mediawiki_content_current_v1.

The fix has been deployed to production and ran via the 2026-01-14 run of spark_process_changes:

Thu, Jan 15, 3:25 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T414673: Missing blocks on analytics hadoop.

For completeness:

Thu, Jan 15, 1:27 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23)
xcollazo added a comment to P87547 HDFS Corrupted files - 2026-01-15.

(modified it to be sorted)

Thu, Jan 15, 12:30 PM
xcollazo edited P87547 HDFS Corrupted files - 2026-01-15.
Thu, Jan 15, 12:30 PM

Wed, Jan 14

xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

Attempting to repro the bug described on T411484#11432242, we can see that reconcile is now populating user_id, user_central_id, and user_text correctly (while before these details were NULL):

Wed, Jan 14, 7:39 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo moved T411803: Fix reconcile bug where user_id is not being populated correctly. from In progress to Ready to Deploy on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Wed, Jan 14, 7:18 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T413888: duplicated page_title in mediawiki_content_current_v1.

On MR 94 we work around the Spark bug found on T413888#11521376 by using more CTEs. Go figure... ¯\_(ツ)_/¯

Wed, Jan 14, 6:47 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T413888: duplicated page_title in mediawiki_content_current_v1.

Looks like we are hitting https://issues.apache.org/jira/browse/SPARK-41557. This particular issue seems fixed in Spark 3.3.3+.

Wed, Jan 14, 3:34 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T413888: duplicated page_title in mediawiki_content_current_v1.

Unfortunately, after deploying this fix, the MERGE INTO failed with:

pyspark.sql.utils.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 25 columns and the second table has 20 columns;
Wed, Jan 14, 2:37 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T413888: duplicated page_title in mediawiki_content_current_v1.

Deploying fix to production, and rerunning latest mw_content_merge_changes_to_mw_content_current_daily pipeline to pick up changes.

Wed, Jan 14, 1:58 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Tue, Jan 13

xcollazo moved T406864: Optimize table maintenance for wmf_content.mediawiki_content_current_v1 from In Review to Done on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Tue, Jan 13, 4:59 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T406864: Optimize table maintenance for wmf_content.mediawiki_content_current_v1.

Count takes 2 seconds now:

spark.sql("""
SELECT count(1) as count
FROM wmf_content.mediawiki_content_current_v1
""").show()
+---------+
|    count|
+---------+
|699518277|
+---------+
Tue, Jan 13, 4:59 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo moved T406864: Optimize table maintenance for wmf_content.mediawiki_content_current_v1 from In progress to In Review on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Tue, Jan 13, 12:56 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo moved T406864: Optimize table maintenance for wmf_content.mediawiki_content_current_v1 from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Tue, Jan 13, 12:42 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo changed the status of T406864: Optimize table maintenance for wmf_content.mediawiki_content_current_v1, a subtask of T366752: Dumps 2.0 Phase III: Production level dumps (SDS 1.2), from Open to In Progress.
Tue, Jan 13, 12:42 PM · Data-Engineering-Roadmap, DPE-Mediawiki-Content, Epic
xcollazo changed the status of T406864: Optimize table maintenance for wmf_content.mediawiki_content_current_v1 from Open to In Progress.
Tue, Jan 13, 12:42 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T406864: Optimize table maintenance for wmf_content.mediawiki_content_current_v1.

I suspect that the issue here is accumulation of delta files. Right now we have:

spark.sql("""
SELECT count(1) as count,
       content
FROM wmf_content.mediawiki_content_current_v1.files
GROUP BY content
""").show()
[Stage 23:===================================================>      (8 + 1) / 9]
+-----+-------+
|count|content|
+-----+-------+
| 1268|      1|
| 8939|      0|
+-----+-------+
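A note on reading the output above, assuming this is an Iceberg table (the `.files` metadata table suggests so): in that metadata table, `content` 0 marks data files, 1 marks position delete files, and 2 marks equality delete files. The sketch below only tallies the counts shown above; `summarize_files` is a hypothetical helper, not part of any pipeline.

```python
# Meaning of the `content` column in Iceberg's `files` metadata table.
CONTENT_KINDS = {0: "data", 1: "position_deletes", 2: "equality_deletes"}


def summarize_files(counts_by_content):
    """Return per-kind file counts plus the delete-to-data file ratio."""
    summary = {CONTENT_KINDS[code]: n for code, n in counts_by_content.items()}
    data = summary.get("data", 0)
    deletes = sum(n for kind, n in summary.items() if kind != "data")
    summary["delete_to_data_ratio"] = deletes / data if data else float("inf")
    return summary


# Counts copied from the query output above.
summary = summarize_files({1: 1268, 0: 8939})
print(summary)
```

Under that reading, roughly one delete file exists for every seven data files, which is consistent with the suspicion that accumulated delete files slow down reads.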
Tue, Jan 13, 12:40 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Mon, Jan 12

xcollazo added a comment to T401892: Update MediaWiki Content History SLO draft for SRE review.

🎉 🎉 🎉

Mon, Jan 12, 5:10 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T414266: Feature request: Export Wikidata JSON as JSONL.

@So9q: like @Pfps mentions, it seems we already provide what you requested. Are we missing something?

Mon, Jan 12, 2:54 PM · Data-Engineering, Dumps-Generation
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

After reconcile, hitting T410431 again.

Mon, Jan 12, 12:45 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Fri, Jan 9

xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

Now running mw_content_reconcile_mw_content_history_monthly to see where we're at.

Fri, Jan 9, 8:50 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

Now retrying spark_process_reconciliation_events for 2025-12-14 with pushdown_strategy=earliest_revision_dt.

It is currently 9h in, and I expect it to take about the same time as 2025-12-13.

Fri, Jan 9, 6:10 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T413888: duplicated page_title in mediawiki_content_current_v1.

Great find! I think I see the root cause now, but I will let you play with it more and come to your own conclusions!

Is it connected to the DELETE operation only being able to be used with the WHEN MATCHED clause?

Fri, Jan 9, 1:12 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Thu, Jan 8

xcollazo added a comment to T413888: duplicated page_title in mediawiki_content_current_v1.

...
In the specific case of the Skateboard page:

spark.sql("select * from changes where page_id in (9508983,10697151) ").show()
+-------+--------+-----------+-----------+
|wiki_id| page_id|revision_id|change_type|
+-------+--------+-----------+-----------+
| itwiki|10697151|  148212922|     delete|
+-------+--------+-----------+-----------+

The join with the changes returns an empty result set. This means that we have deletes to execute, but given the join logic the rows cannot match.
Here is the result:

spark.sql(
        f"""

  SELECT 
         mwch.page_id,
         mwch.page_namespace_id,
         mwch.page_title,
         mwch.page_redirect_target,
         mwch.user_id,
         mwch.user_central_id,
         mwch.user_text,
         mwch.user_is_visible,
         mwch.revision_id,
         mwch.revision_parent_id,
         mwch.revision_dt,
         mwch.revision_is_minor_edit,
         mwch.revision_comment,
         mwch.revision_comment_is_visible,
         mwch.revision_size,
         mwch.revision_content_slots,
         mwch.revision_content_is_visible,
         mwch.wiki_id,
         GREATEST(mwch.row_content_update_dt,
                  mwch.row_visibility_update_dt,
                  mwch.row_move_update_dt)        AS row_update_dt,
         changes.change_type
  FROM {source_table} mwch
  INNER JOIN changes
    ON (mwch.wiki_id = changes.wiki_id AND mwch.page_id = changes.page_id AND mwch.revision_id = changes.revision_id)

""").show()
+-------+-----------------+----------+--------------------+-------+---------------+---------+---------------+-----------+------------------+-----------+----------------------+----------------+---------------------------+-------------+----------------------+---------------------------+-------+-------------+-----------+
|page_id|page_namespace_id|page_title|page_redirect_target|user_id|user_central_id|user_text|user_is_visible|revision_id|revision_parent_id|revision_dt|revision_is_minor_edit|revision_comment|revision_comment_is_visible|revision_size|revision_content_slots|revision_content_is_visible|wiki_id|row_update_dt|change_type|
+-------+-----------------+----------+--------------------+-------+---------------+---------+---------------+-----------+------------------+-----------+----------------------+----------------+---------------------------+-------------+----------------------+---------------------------+-------+-------------+-----------+
+-------+-----------------+----------+--------------------+-------+---------------+---------+---------------+-----------+------------------+-----------+----------------------+----------------+---------------------------+-------------+----------------------+---------------------------+-------+-------------+-----------+

This is happening because the rows stored in the changes CTE are not present in the source_table. Therefore no delete is happening.
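The effect described above can be reproduced without Spark. This is a minimal pure-Python sketch: the `changes` row is copied from the output above, while the `source_rows` content (including revision_id 1 and the helper names) is hypothetical. It contrasts an inner join, which drops the delete, with a left join driven from `changes`, which keeps it. The left join is shown only as one possible way to retain such rows, not necessarily the fix that was deployed.

```python
KEY = ("wiki_id", "page_id", "revision_id")


def key_of(row):
    """Build the join key used in the ON clause above."""
    return tuple(row[k] for k in KEY)


def inner_join(source_rows, changes):
    """Keep only changes whose key also exists in source_rows."""
    source_by_key = {key_of(r): r for r in source_rows}
    return [
        {**source_by_key[key_of(c)], **c}
        for c in changes
        if key_of(c) in source_by_key
    ]


def left_join_from_changes(source_rows, changes):
    """Keep every change; source columns are absent when nothing matches."""
    source_by_key = {key_of(r): r for r in source_rows}
    return [{**source_by_key.get(key_of(c), {}), **c} for c in changes]


changes = [
    {"wiki_id": "itwiki", "page_id": 10697151,
     "revision_id": 148212922, "change_type": "delete"},
]
# Hypothetical source rows: none share the key of the delete above.
source_rows = [
    {"wiki_id": "itwiki", "page_id": 9508983,
     "revision_id": 1, "page_title": "Skateboard"},
]

inner = inner_join(source_rows, changes)
left = left_join_from_changes(source_rows, changes)
print(len(inner), len(left))
```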

Thu, Jan 8, 3:38 PM · Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

For posterity, this is the configuration that made 2025-12-13 succeed, and that is expected to make 2025-12-14 succeed as well:

{
  "start_date": "2025-03-18T00:00:00",
  "sla": "PT16H",
  "conda_env": "hdfs:///wmf/cache/artifacts/airflow/analytics/mediawiki-content-pipelines-0.19.0-v0.19.0.conda.tgz",
  "hive_mediawiki_content_history_v1_table": "wmf_content.mediawiki_content_history_v1",
  "hive_mediawiki_page_content_change_table": "event.mediawiki_page_content_change_v1",
  "hive_revision_visibility_change": "event.mediawiki_revision_visibility_change",
  "hive_mediawiki_content_history_reconcile_enriched_v1": "event.mediawiki_content_history_reconcile_enriched_v1",
  "driver_memory": "64G",
  "driver_cores": "4",
  "executor_memory": "40G",
  "executor_cores": "1",
  "spark_extraJavaOptions": "-Xss8m",
  "spark_conf": {
    "spark.driver.maxResultSize": "8G",
    "spark.dynamicAllocation.maxExecutors": "80",
    "spark.sql.shuffle.partitions": "1024",
    "spark.sql.iceberg.locality.enabled": "true",
    "spark.reducer.maxReqsInFlight": 1,
    "spark.shuffle.io.retryWait": "60s",
    "spark.shuffle.io.maxRetries": 20
  },
  "big_wikis": [
    "commonswiki",
    "wikidatawiki",
    "enwiki",
    "svwiki",
    "ruwiki",
    "frwiki",
    "dewiki",
    "eswiki",
    "zhwiki",
    "itwiki",
    "arwiki",
    "enwiktionary",
    "labswiki",
    "ptwiki",
    "idwiki"
  ],
  "reconcile_pushdown_strategy": "earliest_revision_dt"
}
Thu, Jan 8, 2:41 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

Now retrying spark_process_reconciliation_events for 2025-12-14 with pushdown_strategy=earliest_revision_dt.

Thu, Jan 8, 2:19 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

Retrying spark_process_reconciliation_events for 2025-12-13 with pushdown_strategy=earliest_revision_dt.

https://yarn.wikimedia.org/proxy/application_1764064841637_997699/

Thu, Jan 8, 2:18 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Wed, Jan 7

xcollazo added a comment to T413349: Update pingback MediaWiki versions to include new values.

(Airflow DAG MR has been updated with test fixes, and it is on the release train to be deployed in the next couple days)

Wed, Jan 7, 4:01 PM · MediaWiki-General, Patch-For-Review, Data-Engineering

Tue, Jan 6

xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

Retrying spark_process_reconciliation_events for 2025-12-13 with pushdown_strategy=earliest_revision_dt.

Tue, Jan 6, 2:14 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Mon, Jan 5

xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

From MR 1903:

Mon, Jan 5, 9:25 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T413767: Some wikimedia 20260101 dump files missing.

IIRC, the status page updating before the data is actually ready for download is a side effect of T388378. We do not rsync the data files as often as we used to before that migration.

Mon, Jan 5, 6:35 PM · Data-Engineering, Dumps-Generation
xcollazo added a comment to T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily.

@MGerlach identified another repro of duplicates, but this time found via page_title:

Mon, Jan 5, 2:55 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Fri, Dec 19

xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

2025-12-13 and 2025-12-14 ultimately failed, both with Driver OOM, even with driver-memory = 64GB, and even with spark.sql.adaptive.autoBroadcastJoinThreshold = -1.

Fri, Dec 19, 8:48 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Dec 18 2025

xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

2025-12-13 failed again, even with driver-memory = 64GB.

Dec 18 2025, 8:33 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

Ingesting day 2025-12-12 of spark_process_reconciliation_events was successful with the tuning of T411803#11466392.

Dec 18 2025, 2:21 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Dec 17 2025

xcollazo updated subscribers of T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily.

Then, during reconciliation (either daily or monthly), the process will determine from the MariaDB replica which one is correct.

Dec 17 2025, 2:48 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo updated subscribers of T411803: Fix reconcile bug where user_id is not being populated correctly..

On a rerun as per T411803#11466392 we were doing very well, but now failed on duplicate rows as in T410431!

Dec 17 2025, 2:41 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Dec 16 2025

xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

The strategy of wiki_id by wiki_id did not succeed, at least not for the mediawiki_revision_history_v1.

Dec 16 2025, 8:52 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Dec 15 2025

xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

OK, let's recap:

Dec 15 2025, 8:18 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo closed T409006: Update dump mirror rsync allowlist to reflect new IP address for Scatter as Resolved.
Dec 15 2025, 5:37 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Dumps-Generation

Dec 12 2025

xcollazo added a comment to T409105: mediawiki.page_change.v1 event stream - Investigate mistmatched meta.dt and dt (and rev_dt) fields.

I think our model definitely has a gap if we are to account for imports.

Dec 12 2025, 7:58 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), MW-Interfaces-Team, Event-Platform
xcollazo added a comment to T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily.

The optimization_predicates computed with the 'set_of_page_ids' pushdown_strategy lists the page_id 178775087 and since it will not match with the page_id 100282687 already stored in the table for the revision_id 1118294298, the insert clause will be performed.
This causes the duplication of the revision_id 1118294298.

Awesome finding @APizzata-WMF !!
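The duplication mechanism described above can be sketched in pure Python. This is a simplified model under stated assumptions, not the pipeline's code: `merge` is a hypothetical stand-in for a MERGE INTO keyed on revision_id only (wiki_id is omitted for brevity), and the predicate plays the role of the optimization predicate that limits which target rows the match can see.

```python
def merge(target, incoming, predicate=lambda row: True):
    """Simulate MERGE INTO keyed on revision_id.

    `predicate` narrows which target rows are visible to the match,
    the way a pushed-down optimization predicate would.
    """
    result = [dict(r) for r in target]
    visible = {r["revision_id"]: r for r in result if predicate(r)}
    for row in incoming:
        match = visible.get(row["revision_id"])
        if match is not None:
            match.update(row)         # WHEN MATCHED THEN UPDATE
        else:
            result.append(dict(row))  # WHEN NOT MATCHED THEN INSERT
    return result


# Values taken from the comment above.
target = [{"revision_id": 1118294298, "page_id": 100282687}]
incoming = [{"revision_id": 1118294298, "page_id": 178775087}]

# 'set_of_page_ids'-style predicate: only look at incoming page_ids.
page_ids = {r["page_id"] for r in incoming}
with_pushdown = merge(target, incoming, lambda r: r["page_id"] in page_ids)
without_pushdown = merge(target, incoming)

print(len(with_pushdown), len(without_pushdown))
```

With the page_id predicate the existing row is hidden from the match, so the incoming revision is inserted a second time. Without it the row is matched and updated in place.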

Dec 12 2025, 5:48 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo moved T409006: Update dump mirror rsync allowlist to reflect new IP address for Scatter from Next Up to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 12 2025, 3:54 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Dumps-Generation
xcollazo claimed T409006: Update dump mirror rsync allowlist to reflect new IP address for Scatter.
Dec 12 2025, 3:54 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Dumps-Generation
xcollazo changed the status of T409006: Update dump mirror rsync allowlist to reflect new IP address for Scatter from Open to In Progress.
Dec 12 2025, 3:53 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Dumps-Generation
xcollazo added a comment to T412443: Handle `network_flows_internal` data growth.

@xcollazo what's the reason for the failure? No disk space?

Dec 12 2025, 2:01 PM · Infrastructure-Foundations, netops, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

Resuming all MW Content DAGs.

Dec 12 2025, 1:36 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Dec 11 2025

xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

There are a lot of reconcile events to be ingested (see Flink app last 1h). All these events are legit, but did inadvertently come from an airflow-devenv instance. Will debug that issue separately.

Dec 11 2025, 10:58 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a subtask for T366752: Dumps 2.0 Phase III: Production level dumps (SDS 1.2): T412461: On reconcile, consider what happens when a restore and a delete happen on the same revision.
Dec 11 2025, 9:42 PM · Data-Engineering-Roadmap, DPE-Mediawiki-Content, Epic
xcollazo added a parent task for T412461: On reconcile, consider what happens when a restore and a delete happen on the same revision: T366752: Dumps 2.0 Phase III: Production level dumps (SDS 1.2).
Dec 11 2025, 9:42 PM · Data-Engineering, DPE-Mediawiki-Content
xcollazo created T412461: On reconcile, consider what happens when a restore and a delete happen on the same revision.
Dec 11 2025, 9:42 PM · Data-Engineering, DPE-Mediawiki-Content
xcollazo added a comment to T409006: Update dump mirror rsync allowlist to reflect new IP address for Scatter.

@Harej please confirm whether rsync works on your end.

Dec 11 2025, 9:25 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Dumps-Generation
xcollazo added a comment to T411598: Increase partitions of mediawiki.content_history_reconcile.v1.

(Side note: I just bumped TaskManagers to 20 temporarily due to T411803).

Dec 11 2025, 9:20 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T412428: Wikidata full .json.gz dumps not published since 20250625.

20251210 is available at https://dumps.wikimedia.org/other/wikibase/wikidatawiki/20251210/.

Dec 11 2025, 8:15 PM · Wikidata, Data-Engineering, Dumps-Generation, Wikidata data dumps
xcollazo added a comment to T412443: Handle `network_flows_internal` data growth.

( For now, we have paused the pipeline, as it is continually failing: https://airflow.wikimedia.org/dags/druid_load_network_flows_internal_daily/grid )

Dec 11 2025, 8:03 PM · Infrastructure-Foundations, netops, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T409006: Update dump mirror rsync allowlist to reflect new IP address for Scatter.

Confirmed via contact email that the request is legit.

Dec 11 2025, 5:40 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Dumps-Generation

Dec 10 2025

xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..

Copy pasting from MR 88, for completeness:

Dec 10 2025, 7:02 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T408819: When wikis cannot be exported due to SiteInfo, don't fail them.

Yes, Presto is even better!

Dec 10 2025, 2:59 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), Data-Engineering, DPE-Mediawiki-Content
xcollazo added a comment to T408819: When wikis cannot be exported due to SiteInfo, don't fail them.

Thanks for the effort @brouberol!

Dec 10 2025, 2:47 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), Data-Engineering, DPE-Mediawiki-Content

Dec 5 2025

xcollazo moved T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily from Urgent to Next Up on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 5 2025, 3:22 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo moved T411803: Fix reconcile bug where user_id is not being populated correctly. from Next Up to Urgent on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 5 2025, 3:21 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo moved T410688: Implement a new pipeline and table with reconciled historical revision data from In Review to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 5 2025, 3:19 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Dec 4 2025

xcollazo added a comment to T411598: Increase partitions of mediawiki.content_history_reconcile.v1.

In my opinion, in general, it looks potentially a bit dangerous to have a single partition on a topic; for example, if due to an issue, we need to send 500M records for reconcile to that topic, we'll need days to process that, and one broker will get a lot of data suddenly.
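A back-of-the-envelope sketch of the concern above. Only the 500M record count comes from the comment. The per-partition throughput and the partition count of 20 are hypothetical assumptions chosen for illustration, and the model assumes throughput scales linearly with partitions.

```python
def drain_time_hours(records, records_per_second_per_partition, partitions):
    """Hours needed to consume `records`, assuming linear scaling."""
    total_rate = records_per_second_per_partition * partitions
    return records / total_rate / 3600


RECORDS = 500_000_000  # from the comment above
ASSUMED_RATE = 2_000   # hypothetical records/second per partition

single = drain_time_hours(RECORDS, ASSUMED_RATE, 1)
many = drain_time_hours(RECORDS, ASSUMED_RATE, 20)  # hypothetical count

print(f"1 partition: {single:.1f} h, 20 partitions: {many:.1f} h")
```

Under these assumed numbers a single partition takes close to three days, while twenty partitions finish in a few hours. The real figures depend on the consumer and broker setup.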

Dec 4 2025, 6:34 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T411484: Create new Druid datasource based on the `mediawiki_revision_history_v1` table.

Opened T411803: Fix reconcile bug where user_id is not being populated correctly. to fix the reconcile issue.

Dec 4 2025, 4:54 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
xcollazo created T411803: Fix reconcile bug where user_id is not being populated correctly..
Dec 4 2025, 4:53 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
xcollazo added a comment to T411598: Increase partitions of mediawiki.content_history_reconcile.v1.

because this process runs only once a month.

Dec 4 2025, 4:37 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Dec 3 2025

xcollazo added a comment to T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.

I don't know if my other script runs have tidied all of those things up... But I doubt it, as they don't delete rows... But it will have fixed some of the lu_local_id is null entries as per T303590#11411628

Are those ocwiki dupes legit? ...

Dec 3 2025, 7:02 PM · MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
xcollazo added a comment to T360794: Implement stream of HTML content on mw.page_change event.

@fkaelin How urgent is the need for this stream? We're considering moving off of PyFlink, and this would be a good opportunity to spike on a Java pipeline instead of doing a quick implementation now and then dealing with the complexities of any migration pains later.

Dec 3 2025, 4:20 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, Research, Event-Platform

Dec 2 2025

xcollazo added a comment to T411484: Create new Druid datasource based on the `mediawiki_revision_history_v1` table.

I used joal.test_mediawiki_revision_history_druid, a test run of https://gerrit.wikimedia.org/r/1214023, to do some data validations.

Dec 2 2025, 9:39 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data

Dec 1 2025

xcollazo moved T406515: Add user_central_id to mediawiki_content_history_v1 (and mediawiki_content_current_v1) from Urgent to Next Up on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 1 2025, 3:58 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
xcollazo added a comment to T408819: When wikis cannot be exported due to SiteInfo, don't fail them.

If that works as expected, I'll try to integrate it into the dump DAGs.

Dec 1 2025, 3:30 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), Data-Engineering, DPE-Mediawiki-Content

Nov 26 2025

xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

@JAllemandou table wmf_content.mediawiki_revision_history has been backfilled, and the Airflow DAG is now live at https://airflow.wikimedia.org/dags/mw_revision_merge_events_to_mw_revision_history_daily/grid

Nov 26 2025, 10:16 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

Backfilling rerun from T410688#11410585 finished successfully, but it struggled with a couple of task failures:

Response code
Time taken: 10965.418 seconds
Nov 26 2025, 9:50 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
xcollazo added a comment to T408819: When wikis cannot be exported due to SiteInfo, don't fail them.

How often do we run into this case? Could this be handled by documentation / run book for the person on OpsWeek?

Nov 26 2025, 8:39 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), Data-Engineering, DPE-Mediawiki-Content
xcollazo merged T411079: `central_auth.local_user` table contains corrupted records into T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.
Nov 26 2025, 6:38 PM · MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth
xcollazo merged task T411079: `central_auth.local_user` table contains corrupted records into T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.
Nov 26 2025, 6:38 PM · MediaWiki-Platform-Team, MediaWiki-Core-AuthManager