In T414389#11529535, @CodeReviewBot wrote:
xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1924
mw file export: remove mid-month file export DAG
Fri, Jan 16
xcollazo added a comment to T414389: Publish Dumps 2 to dumps.wikimedia.org and provide only monthly dumps.
xcollazo closed T406515: Add user_central_id to mediawiki_content_history_v1 (and mediawiki_content_current_v1) as Resolved.
I am closing this ticket as done, since we did incorporate user_central_id into the mediawiki_content_* tables.
xcollazo changed the status of T414389: Publish Dumps 2 to dumps.wikimedia.org and provide only monthly dumps from Open to In Progress.
xcollazo changed the status of T414389: Publish Dumps 2 to dumps.wikimedia.org and provide only monthly dumps, a subtask of T384382: Production-level file export (aka dump) of MW Content in XML, from Open to In Progress.
Final Asana status of this completed work at https://app.asana.com/0/1210776717300332/progress/1212800339589065.
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
There are still reconcile events being ingested, but the majority are taken care of.
Thu, Jan 15
Summary of meeting with @mpopov:
xcollazo closed T404975: Another instance of duplicate rows on wmf_content.mediawiki_content_history_v1 as Resolved.
Follow up work being done on T410431. Closing this one.
I'm being bold and closing this long-standing ticket. We have found and fixed inconsistencies elsewhere, and we do not seem to have a need to track this globally here.
xcollazo closed T385112: Investigate reasons for remaining inconsistencies, a subtask of T366752: Dumps 2.0 Phase III: Production level dumps (SDS 1.2), as Resolved.
xcollazo updated the task description for T396031: Remove code and tables related to deprecated mediawiki_wikitext_history and mediawiki_wikitext_current.
xcollazo added a comment to T396031: Remove code and tables related to deprecated mediawiki_wikitext_history and mediawiki_wikitext_current.
Ran the following to drop tables:
analytics@an-launcher1003:/mnt/hdfs/wmf/data/wmf$ hostname -f
an-launcher1003.eqiad.wmnet
xcollazo changed the status of T413888: duplicated page_title in mediawiki_content_current_v1 from Open to In Progress.
A cursory look at the remaining 447 duplicate titles suggests that these are recently deleted and undeleted articles that have not been reconciled yet.
Fix has been deployed to production and run via the 2026-01-14 run of spark_process_changes:
For completeness:
(modified it to be sorted)
Wed, Jan 14
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
Attempting to reproduce the bug described in T411484#11432242, we can see that reconcile is now populating user_id, user_central_id, and user_text correctly (whereas before these details were NULL):
In MR 94 we work around the Spark bug found in T413888#11521376 by using more CTEs. Go figure... ¯\_(ツ)_/¯
Looks like we are hitting https://issues.apache.org/jira/browse/SPARK-41557. This particular issue appears to be fixed in Spark 3.3.3+.
Unfortunately, after deploying this fix, the MERGE INTO failed with:
pyspark.sql.utils.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 25 columns and the second table has 20 columns;
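This error is Spark's positional-union analysis check firing: both sides of a UNION must have the same number of columns. As a minimal illustration (pure Python with made-up column names, not the real pipeline schemas), the same invariant can be asserted up front, before a query is ever submitted:

```python
# Hypothetical helper: verify two SELECT column lists are union-compatible
# before emitting a UNION, mirroring the check Spark performs at analysis
# time. Column names below are illustrative only.

def assert_union_compatible(left_cols, right_cols):
    """Raise early if two column lists cannot be UNIONed positionally."""
    if len(left_cols) != len(right_cols):
        differing = sorted(set(left_cols) ^ set(right_cols))
        raise ValueError(
            f"Union requires the same column count: "
            f"{len(left_cols)} vs {len(right_cols)}; differing: {differing}"
        )

left = ["wiki_id", "page_id", "revision_id", "change_type"]
right = ["wiki_id", "page_id", "revision_id"]

try:
    assert_union_compatible(left, right)
except ValueError as e:
    print(e)  # reports the 4-vs-3 mismatch and names change_type
```

For DataFrame-level unions, PySpark's unionByName(allowMissingColumns=True) (Spark 3.1+) is another way to sidestep positional mismatches, though it silently fills missing columns with NULLs rather than failing fast.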
Deploying fix to production, and rerunning latest mw_content_merge_changes_to_mw_content_current_daily pipeline to pick up changes.
Tue, Jan 13
xcollazo added a comment to T406864: Optimize table maintenance for wmf_content.mediawiki_content_current_v1.
Count takes 2 seconds now:
spark.sql("""
SELECT count(1) as count
FROM wmf_content.mediawiki_content_current_v1
""").show()
+---------+
| count|
+---------+
|699518277|
+---------+

xcollazo changed the status of T406864: Optimize table maintenance for wmf_content.mediawiki_content_current_v1, a subtask of T366752: Dumps 2.0 Phase III: Production level dumps (SDS 1.2), from Open to In Progress.
xcollazo changed the status of T406864: Optimize table maintenance for wmf_content.mediawiki_content_current_v1 from Open to In Progress.
xcollazo added a comment to T406864: Optimize table maintenance for wmf_content.mediawiki_content_current_v1.
I suspect that the issue here is accumulation of delta files. Right now we have:
spark.sql("""
SELECT count(1) as count,
content
FROM wmf_content.mediawiki_content_current_v1.files
GROUP BY content
""").show()
[Stage 23:===================================================> (8 + 1) / 9]
+-----+-------+
|count|content|
+-----+-------+
| 1268| 1|
| 8939| 0|
+-----+-------+
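If delete (content = 1) files are indeed piling up, the usual remedy is compaction. Below is a sketch of the standard Iceberg Spark maintenance procedures; it only builds the SQL strings for inspection, since actually running them requires a live SparkSession with the Iceberg extensions enabled, and the catalog name spark_catalog is an assumption:

```python
# Sketch: Iceberg table maintenance to compact accumulated delete files.
# Assumes an Iceberg-enabled SparkSession `spark`; here we only construct
# the SQL so the statements can be reviewed before running.

TABLE = "wmf_content.mediawiki_content_current_v1"

def maintenance_statements(table: str) -> list[str]:
    """Iceberg Spark procedures that merge small files and delete files."""
    return [
        # Compact data files, applying positional deletes in the process.
        f"CALL spark_catalog.system.rewrite_data_files(table => '{table}')",
        # Compact/drop the now-redundant position delete files.
        f"CALL spark_catalog.system.rewrite_position_delete_files(table => '{table}')",
    ]

for stmt in maintenance_statements(TABLE):
    print(stmt)
    # spark.sql(stmt)  # uncomment inside a Spark session
```

Running these on a schedule (rather than once) is what keeps the delta-file count from growing back.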
Mon, Jan 12
🎉 🎉 🎉
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
After reconcile, hitting T410431 again.
Fri, Jan 9
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
Now running mw_content_reconcile_mw_content_history_monthly to see where we're at.
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
In T411803#11503785, @xcollazo wrote:
Now retrying spark_process_reconciliation_events for 2025-12-14 with pushdown_strategy=earliest_revision_dt.
It is currently 9h in, and I expect it to take about the same time as 2025-12-13.
In T413888#11504711, @APizzata-WMF wrote:
Great find! I think I see the root cause now, but I will let you play with it more and come to your own conclusions!
Is it connected to the DELETE operation only being able to be used with the WHEN MATCHED clause?
Thu, Jan 8
In T413888#11504019, @APizzata-WMF wrote:...
In the specific case of the Skateboard page:

spark.sql("select * from changes where page_id in (9508983,10697151) ").show()
+-------+--------+-----------+-----------+
|wiki_id| page_id|revision_id|change_type|
+-------+--------+-----------+-----------+
| itwiki|10697151| 148212922|     delete|
+-------+--------+-----------+-----------+

The join with the changes returns an empty result set; this means that we have deletes to be executed, but given the logic of the join, the rows cannot match.
Here is the result:

spark.sql(f"""
    SELECT mwch.page_id,
           mwch.page_namespace_id,
           mwch.page_title,
           mwch.page_redirect_target,
           mwch.user_id,
           mwch.user_central_id,
           mwch.user_text,
           mwch.user_is_visible,
           mwch.revision_id,
           mwch.revision_parent_id,
           mwch.revision_dt,
           mwch.revision_is_minor_edit,
           mwch.revision_comment,
           mwch.revision_comment_is_visible,
           mwch.revision_size,
           mwch.revision_content_slots,
           mwch.revision_content_is_visible,
           mwch.wiki_id,
           GREATEST(mwch.row_content_update_dt, mwch.row_visibility_update_dt, mwch.row_move_update_dt) AS row_update_dt,
           changes.change_type
    FROM {source_table} mwch
    INNER JOIN changes
        ON (mwch.wiki_id = changes.wiki_id
            AND mwch.page_id = changes.page_id
            AND mwch.revision_id = changes.revision_id)
""").show()

(empty result set: the header row with all 20 columns, but no data rows)

This is happening because the rows stored in the changes CTE are not present in the source_table; therefore no delete happens.
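The failure mode above can be sketched without Spark: a MERGE whose DELETE lives under WHEN MATCHED can only remove target rows that the source-side join actually produces, so a delete event for a row absent from the source never matches anything. A toy model (pure Python, illustrative data from the Skateboard example):

```python
# Toy model of MERGE INTO semantics: DELETE is only legal under
# WHEN MATCHED, so a delete event whose row is missing from the join
# source can never remove anything. Data below is illustrative.

target = {  # (wiki_id, page_id, revision_id) -> row payload
    ("itwiki", 10697151, 148212922): "Skateboard revision",
}

# 'changes' carries the delete event, but source_table does not contain
# the matching row, so source_table INNER JOIN changes is empty:
source_rows = []  # list of ((wiki_id, page_id, revision_id), change_type)

def merge_with_delete(target, source_rows):
    """Apply WHEN MATCHED AND change_type = 'delete' THEN DELETE."""
    for key, change_type in source_rows:
        if change_type == "delete" and key in target:  # WHEN MATCHED
            del target[key]
    return target

merge_with_delete(target, source_rows)
print(len(target))  # still 1: nothing matched, so nothing was deleted
```

Had the source join produced the row, the same code would have removed it; the bug is upstream of the MERGE, in what the join is allowed to see.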
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
For posterity, this is the configuration that made 2025-12-13 succeed, and that is expected to make 2025-12-14 succeed as well:
{
"start_date": "2025-03-18T00:00:00",
"sla": "PT16H",
"conda_env": "hdfs:///wmf/cache/artifacts/airflow/analytics/mediawiki-content-pipelines-0.19.0-v0.19.0.conda.tgz",
"hive_mediawiki_content_history_v1_table": "wmf_content.mediawiki_content_history_v1",
"hive_mediawiki_page_content_change_table": "event.mediawiki_page_content_change_v1",
"hive_revision_visibility_change": "event.mediawiki_revision_visibility_change",
"hive_mediawiki_content_history_reconcile_enriched_v1": "event.mediawiki_content_history_reconcile_enriched_v1",
"driver_memory": "64G",
"driver_cores": "4",
"executor_memory": "40G",
"executor_cores": "1",
"spark_extraJavaOptions": "-Xss8m",
"spark_conf": {
"spark.driver.maxResultSize": "8G",
"spark.dynamicAllocation.maxExecutors": "80",
"spark.sql.shuffle.partitions": "1024",
"spark.sql.iceberg.locality.enabled": "true",
"spark.reducer.maxReqsInFlight": 1,
"spark.shuffle.io.retryWait": "60s",
"spark.shuffle.io.maxRetries": 20
},
"big_wikis": [
"commonswiki",
"wikidatawiki",
"enwiki",
"svwiki",
"ruwiki",
"frwiki",
"dewiki",
"eswiki",
"zhwiki",
"itwiki",
"arwiki",
"enwiktionary",
"labswiki",
"ptwiki",
"idwiki"
],
"reconcile_pushdown_strategy": "earliest_revision_dt"
}

xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
Now retrying spark_process_reconciliation_events for 2025-12-14 with pushdown_strategy=earliest_revision_dt.
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
In T411803#11496102, @xcollazo wrote:
Retrying spark_process_reconciliation_events for 2025-12-13 with pushdown_strategy=earliest_revision_dt.
https://yarn.wikimedia.org/proxy/application_1764064841637_997699/
Wed, Jan 7
(Airflow DAG MR has been updated with test fixes, and it is on the release train to be deployed in the next couple days)
Tue, Jan 6
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
Retrying spark_process_reconciliation_events for 2025-12-13 with pushdown_strategy=earliest_revision_dt.
Mon, Jan 5
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
From MR 1903:
IIRC, the status page updating before it is actually ready for download is a side effect of T388378. We do not rsync the data files as often as we used to before that migration.
xcollazo added a comment to T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily.
@MGerlach identified another repro of duplicates, but this time found via page_title:
Fri, Dec 19
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
2025-12-13 and 2025-12-14 ultimately failed, both with Driver OOM, even with driver-memory = 64GB, and even with spark.sql.adaptive.autoBroadcastJoinThreshold = -1.
Dec 18 2025
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
2025-12-13 failed again, even with driver-memory = 64GB.
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
Ingesting day 2025-12-12 of spark_process_reconciliation_events was successful with the tuning of T411803#11466392.
Dec 17 2025
xcollazo updated subscribers of T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily.
Then, during reconciliation (whether daily or monthly), the process will pull from the MariaDB replica to determine which one is correct.
xcollazo updated subscribers of T411803: Fix reconcile bug where user_id is not being populated correctly..
On a rerun as per T411803#11466392 we were doing very well, but now failed on duplicate rows as in T410431!
Dec 16 2025
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
The wiki-by-wiki strategy did not succeed, at least not for mediawiki_revision_history_v1.
Dec 15 2025
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
Ok, let's recap:
Dec 12 2025
xcollazo added a comment to T409105: mediawiki.page_change.v1 event stream - Investigate mistmatched meta.dt and dt (and rev_dt) fields.
I think our model definitely has a gap if we are to account for imports.
xcollazo added a comment to T410431: Troubleshoot duplicates issue in mw_content_merge_events_to_mw_content_history_daily.
The optimization_predicates computed with the 'set_of_page_ids' pushdown_strategy list the page_id 178775087, and since it does not match the page_id 100282687 already stored in the table for revision_id 1118294298, the insert clause is performed.
This causes the duplication of revision_id 1118294298.
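This duplication mechanism can be sketched in a few lines: a pushdown predicate restricts which *target* rows the MERGE can see, so if the event's page_id differs from the page_id already stored for that revision, the existing row is hidden, the MERGE sees "not matched", and WHEN NOT MATCHED inserts a duplicate. A toy model (pure Python; the IDs are the ones from the comment above):

```python
# Toy model: an over-narrow MERGE pushdown predicate hides the existing
# target row, so WHEN NOT MATCHED inserts a duplicate revision_id.

target = [  # stored rows as (page_id, revision_id)
    (100282687, 1118294298),
]
event = (178775087, 1118294298)       # same revision, different page_id
predicate_page_ids = {event[0]}       # 'set_of_page_ids' pushdown strategy

# Target rows the MERGE can actually see after predicate pushdown:
visible = [row for row in target if row[0] in predicate_page_ids]

# Match on (page_id, revision_id), as the real MERGE condition does:
matched = any(row == event for row in visible)
if not matched:                       # WHEN NOT MATCHED THEN INSERT
    target.append(event)

print(target)  # revision_id 1118294298 now appears twice: a duplicate
```

Widening the predicate (or matching on revision_id alone when computing it) would let the MERGE see the stored row and take the update path instead.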
Awesome finding @APizzata-WMF !!
xcollazo changed the status of T409006: Update dump mirror rsync allowlist to reflect new IP address for Scatter from Open to In Progress.
In T412443#11453895, @ayounsi wrote:
@xcollazo what's the reason for the failure? No disk space?
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
Resuming all MW Content DAGs.
Dec 11 2025
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
There are a lot of reconcile events to be ingested (see Flink app, last 1h). All these events are legit, but they did inadvertently come from an airflow-devenv instance. Will debug that issue separately.
xcollazo added a comment to T409006: Update dump mirror rsync allowlist to reflect new IP address for Scatter.
@Harej please confirm whether rsync works on your end.
(Side note: I just bumped TaskManagers to 20 temporarily due to T411803).
20251210 is available at https://dumps.wikimedia.org/other/wikibase/wikidatawiki/20251210/.
( For now, we have paused the pipeline, as it is continually failing: https://airflow.wikimedia.org/dags/druid_load_network_flows_internal_daily/grid )
xcollazo added a comment to T409006: Update dump mirror rsync allowlist to reflect new IP address for Scatter.
Confirmed via contact email that the request is legit.
Dec 10 2025
xcollazo added a comment to T411803: Fix reconcile bug where user_id is not being populated correctly..
Copy pasting from MR 88, for completeness:
xcollazo added a comment to T408819: When wikis cannot be exported due to SiteInfo, don't fail them.
Yes, Presto is even better!
xcollazo added a comment to T408819: When wikis cannot be exported due to SiteInfo, don't fail them.
Thanks for the effort @brouberol!
Dec 5 2025
Dec 4 2025
In my opinion, it generally looks a bit dangerous to have a single partition on a topic; for example, if due to an issue we need to send 500M records for reconcile to that topic, we'll need days to process them, and one broker will suddenly get a lot of data.
xcollazo added a comment to T411484: Create new Druid datasource based on the `mediawiki_revision_history_v1` table.
Opened T411803: Fix reconcile bug where user_id is not being populated correctly. to fix the reconcile issue.
because this process runs only once a month.
Dec 3 2025
xcollazo added a comment to T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.
In T411116#11411658, @Reedy wrote:
I don't know if my other script runs have tidied all of those things up... But I doubt it, as they don't delete rows... But it will have fixed some of the lu_local_id is null entries as per T303590#11411628
Are those ocwiki dupes legit? ...
In T360794#11428447, @tchin wrote:
@fkaelin How urgent is the need for this stream? We're considering moving off of PyFlink, and this would be a good opportunity to spike on a Java pipeline instead of a quick implementation now and then the complexities of dealing with any migration pains later.
Dec 2 2025
xcollazo added a comment to T411484: Create new Druid datasource based on the `mediawiki_revision_history_v1` table.
I used joal.test_mediawiki_revision_history_druid, a test run of https://gerrit.wikimedia.org/r/1214023, to do some data validations.
Dec 1 2025
xcollazo added a comment to T408819: When wikis cannot be exported due to SiteInfo, don't fail them.
If that works as expected, I'll try to integrate it into the dump DAGs.
Nov 26 2025
xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.
@JAllemandou table wmf_content.mediawiki_revision_history has been backfilled, and the Airflow DAG is now live at https://airflow.wikimedia.org/dags/mw_revision_merge_events_to_mw_revision_history_daily/grid
xcollazo added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.
Backfilling rerun from T410688#11410585 finished successfully, but it struggled with a couple of task failures:
Time taken: 10965.418 seconds
xcollazo added a comment to T408819: When wikis cannot be exported due to SiteInfo, don't fail them.
In T408819#11410377, @Ahoelzl wrote:
How often do we run into this case? Could this be handled by documentation / run book for the person on OpsWeek?