Today
Yes indeed, in T416491 I have an example of this situation and a possible cure. We can either close this, or link the two together?
spark.sql("""
SELECT count(1) as count FROM (
SELECT count(1) as count,
wiki_id,
revision_id
FROM wmf_content.mediawiki_content_history_v1
GROUP BY wiki_id, revision_id
HAVING count > 1
)
""").show(300, truncate=False)
returned
+-----+
|count|
+-----+
|0    |
+-----+
As already stated in T410431#11566608 we now consider the issue solved. Will push relevant changes to the repo and close the ticket.
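The duplicate check in the query above boils down to counting (wiki_id, revision_id) keys that occur more than once. A plain-Python equivalent on toy rows (an illustration only, not the production table or job):

```python
# Toy stand-ins for (wiki_id, revision_id) pairs from
# wmf_content.mediawiki_content_history_v1 -- illustration only.
from collections import Counter

rows = [
    ("enwiki", 1118294200),
    ("enwiki", 1118806092),
    ("dewiki", 1118294200),  # same revision_id, different wiki -> not a duplicate
]

counts = Counter(rows)
duplicates = sum(1 for c in counts.values() if c > 1)
# duplicates == 0 here, matching the |0| result of the query above
```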
Tue, Feb 10
a one-time comparison to wmf.mediawiki_history
My change is the following:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

deletes_and_restores_query="""
SELECT distinct log_page AS page_id, log_action, log_timestamp
Mon, Feb 9
Thu, Feb 5
After a fruitful conversation with @JAllemandou, we came to the conclusion that, given the different nature of the data stored in wmf.mediawiki_history and the events.* tables, it would be better to think of a decoupled metric:
Wed, Feb 4
Tue, Feb 3
Mon, Feb 2
Thu, Jan 29
We have changed the pushdown_strategy to earliest_revision_dt and this should avoid the duplication. AFAICS, since 2026-01-11 (the day we made the change) we have not seen duplicates anymore.
We will continue monitoring the situation for the next 2 weeks (up to 2026-02-11) and if no duplicate shows up we can consider the situation fixed.
Thu, Jan 22
Wed, Jan 21
Tue, Jan 20
Mon, Jan 19
I have yet to understand what's happening behind the scenes, but maybe the presence of the delete does not allow for a correct reconciliation?
What are your thoughts, @xcollazo @JAllemandou?
While exploring the data and validating the solution in T414779, I found the following curious example:
page_id= 69510715 and wiki_id = 'enwiki'
Testing the following solution:
deletes_and_undeletes as (
-- select all the pages that have both delete and undelete change_kind
-- and order by the latest operation (given by meta.dt)
SELECT
wiki_id,
page['page_id'] AS page_id,
page_change_kind,
row_number() over (
partition by wiki_id, page['page_id'] order by meta.dt desc
) as rn
FROM
{source_mw_event_page_change_table}
WHERE
{hive_filter}
AND
page_change_kind in ('delete', 'undelete')
),
deleted_pages as (
-- identify the pages that have been deleted and not restored:
--page_change_kind is delete and it is the most recent operation
SELECT
wiki_id,
page_id
FROM
deletes_and_undeletes
WHERE
rn=1 AND
page_change_kind= 'delete'
),
Fri, Jan 16
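The deletes_and_undeletes / deleted_pages CTEs above can be sketched in plain Python (toy events, not the real page_change table): per (wiki_id, page_id), take the latest delete/undelete by meta.dt and flag the page as deleted iff that latest operation is a 'delete'.

```python
# Toy sketch of the deletes_and_undeletes / deleted_pages CTEs.
from itertools import groupby
from operator import itemgetter

events = [
    # (wiki_id, page_id, page_change_kind, meta_dt)
    ("enwiki", 1, "delete",   "2025-11-20T21:16:29Z"),
    ("enwiki", 1, "undelete", "2025-11-20T21:16:35Z"),  # restored -> not deleted
    ("enwiki", 2, "delete",   "2025-11-20T21:34:28Z"),  # never restored
    ("enwiki", 2, "edit",     "2025-11-20T21:10:00Z"),  # filtered out by kind
]

# WHERE page_change_kind in ('delete', 'undelete')
relevant = [e for e in events if e[2] in ("delete", "undelete")]

deleted_pages = []
for key, grp in groupby(sorted(relevant, key=itemgetter(0, 1)),
                        key=itemgetter(0, 1)):
    # row_number() over (partition by wiki_id, page_id order by meta.dt desc),
    # keeping only rn = 1:
    latest = max(grp, key=itemgetter(3))
    if latest[2] == "delete":
        deleted_pages.append(key)

# page 1 was restored after its delete; only page 2 is still deleted
```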
Wed, Jan 14
Created a document with info about ALIS and Cassandra DAGs.
Will use the same file in the future for more notes and info dumps.
Tue, Jan 13
After today's meeting with @xcollazo and @JAllemandou we realised the following.
Using the query in T410431#11465081:
spark.sql("""
select
meta.dt,
revision.rev_dt,
day,month,year,
page.page_id,
page.page_title,
revision.rev_id,
page_change_kind
from
event.mediawiki_page_content_change_v1
where wiki_id= 'commonswiki'
and page.page_id in(178775087,100282687)
order by meta.dt, page.page_id asc
""").show(truncate=False)+---------------------------+--------------------+---+-----+----+---------+----------------------------+----------+----------------+ |dt |rev_dt |day|month|year|page_id |page_title |rev_id |page_change_kind| +---------------------------+--------------------+---+-----+----+---------+----------------------------+----------+----------------+ |2025-11-19T07:26:41.89519Z |2025-11-19T07:26:37Z|19 |11 |2025|100282687|File:Flag_of_Ulleung.svg |1118294200|edit | |2025-11-20T07:39:25.814734Z|2025-11-20T07:39:21Z|20 |11 |2025|100282687|File:Flag_of_Ulleung.svg |1118806092|edit | |2025-11-20T21:16:29.814428Z|2025-11-20T07:39:21Z|20 |11 |2025|100282687|File:Flag_of_Ulleung.svg |1118806092|delete | |2025-11-20T21:16:35.813491Z|2025-11-19T07:26:37Z|20 |11 |2025|100282687|File:Flag_of_Ulleung.svg |1118294200|undelete | |2025-11-20T21:16:37.824006Z|2025-11-20T21:16:32Z|20 |11 |2025|100282687|File:Flag_of_Ulleung_(2).svg|1119085982|move | |2025-11-20T21:17:31.75636Z |2025-11-20T21:17:30Z|20 |11 |2025|100282687|File:Flag_of_Ulleung_(2).svg|1119086355|edit | |2025-11-20T21:34:28.240769Z|2025-11-20T07:39:21Z|20 |11 |2025|178775087|File:Flag_of_Ulleung.svg |1118806092|undelete | |2025-11-20T21:34:44.24225Z |2025-11-20T21:34:41Z|20 |11 |2025|178775087|File:Flag_of_Ulleung.svg |1119092139|edit | |2025-12-05T08:38:05.952091Z|2025-12-05T08:38:02Z|5 |12 |2025|100282687|File:Flag_of_Ulleung_(2).svg|1125950709|edit | |2025-12-05T08:38:40.678265Z|2025-12-05T08:38:36Z|5 |12 |2025|178775087|File:Flag_of_Ulleung.svg |1125950925|edit | +---------------------------+--------------------+---+-----+----+---------+----------------------------+----------+----------------+
Ordering the result by meta.dt, we can infer that the undelete with rev_id 1118294200 on page_id 100282687 was a way to reverse edit 1118806092.
The move 1119085982 corresponds to "moved page File:Flag of Ulleung.svg to File:Flag of Ulleung (2).svg without leaving a redirect", done via the SplitFileHistory.js procedure, which users apply consciously: it allows them to move the page to a new one without leaving a redirect.
Finally, the duplication of rev_id 1118294298 comes from the delete 1118806092 of page_id 100282687, combined with the optimization logic we apply when merging data (the 'set_of_page_ids' pushdown_strategy), which avoids a full scan of the table.
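A toy illustration of that failure mode (the MERGE mechanics and names here are simplified assumptions, not the production Iceberg job): a pushdown filter on the batch's page_ids can hide the old copy of a revision that moved between pages, so the MERGE re-inserts it as a duplicate, while a filter on the earliest revision timestamp still sees the old copy and updates it in place.

```python
# Target table rows: (wiki_id, page_id, revision_id, rev_dt)
target = [("enwiki", 100, 1, "2025-11-19T07:26:37Z")]

# Incoming batch: the same revision, now attached to a new page_id
# (e.g. after a delete/undelete moved it).
batch = [("enwiki", 200, 1, "2025-11-19T07:26:37Z")]

def merge(target, batch, target_filter):
    """Toy MERGE ON (wiki_id, revision_id): update on match, insert otherwise.
    Only target rows passing target_filter are visible for matching."""
    visible_keys = {(r[0], r[2]) for r in target if target_filter(r)}
    out = list(target)
    for row in batch:
        if (row[0], row[2]) in visible_keys:
            out = [row if (t[0], t[2]) == (row[0], row[2]) else t for t in out]
        else:
            out.append(row)  # no visible match -> re-insert -> duplicate
    return out

# 'set_of_page_ids'-style pushdown: only target rows whose page_id is in
# the batch are scanned; the old copy (page_id 100) stays hidden.
batch_page_ids = {r[1] for r in batch}
by_page_ids = merge(target, batch, lambda r: r[1] in batch_page_ids)

# 'earliest_revision_dt'-style pushdown: every target row at or after the
# batch's earliest rev_dt is scanned, so the old copy is matched.
earliest = min(r[3] for r in batch)
by_rev_dt = merge(target, batch, lambda r: r[3] >= earliest)
```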
Jan 12 2026
Done T410431#11512827 ✅
Run a deduplication like what was done in https://phabricator.wikimedia.org/T404975#11197939 :
duplicates situation for mediawiki_content_history_v1:
spark.sql("""
SELECT count(*) as total_duplicate
FROM (
SELECT
count(1) as count,
wiki_id,
revision_id
FROM wmf_content.mediawiki_content_history_v1
GROUP BY wiki_id ,revision_id
HAVING count > 1
)
""").show(3000, truncate=False)Hey @xcollazo I will run the heuristic to deduplicate. Texting you here when is everything is clean.
Jan 8 2026
Great find! I think I see the root cause now, but I will let you play with it more and come to your own conclusions!
Is it connected to the DELETE operation only being able to be used with the WHEN MATCHED clause?
The pages showing in the mediawiki_content_current_v1 table that seem duplicated are actually caused by the unexecuted delete operation during the creation of the current table.
Jan 7 2026
Performed first round of study and analysis, waiting for a call with Marco scheduled on Thursday
Jan 6 2026
Regarding T410431#11492207 after a call with @xcollazo we determined that this duplication in the current table is due to another possible bug. Created: T413888
Jan 5 2026
Dec 19 2025
Uploaded a new set of metrics with the following names:
- wmf_content_mediawiki_content_history_v1_completeness_sli_days: counter of the days that the metric has been executed
- wmf_content_mediawiki_content_history_v1_completeness_sli_alerts: count of alerts in case the completeness of the table is under the SLO threshold.
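The relationship between the two counters can be sketched like this (the SLO threshold value and the update mechanics are assumptions; the real metrics are emitted by the completeness job, not computed like this):

```python
SLO_THRESHOLD = 0.99  # assumed completeness SLO, not the real configured value

days_counter = 0    # ..._completeness_sli_days
alerts_counter = 0  # ..._completeness_sli_alerts

daily_completeness = [1.0, 0.999, 0.97, 1.0]  # toy daily measurements
for completeness in daily_completeness:
    days_counter += 1                 # one increment per executed day
    if completeness < SLO_THRESHOLD:  # alert only when under the SLO
        alerts_counter += 1
```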
Dec 17 2025
Regarding this:
I was discussing this issue with @JAllemandou, and he mentioned that these rows could very well be coming from the logging table. See discussion in an MR comment thread here.
Executed deduplication steps as we discussed: T410431#11469747
Run a deduplication like what was done in https://phabricator.wikimedia.org/T404975#11197939 :
Dec 16 2025
Thanks @BTullis, I can now see the menu!
After a discussion with @xcollazo we realised that the problem is connected to a combination of undelete and delete events.
Dec 15 2025
Dec 9 2025
Dec 8 2025
Current situation after the monthly reconciliation (query executed on 2025-12-08):
spark.sql("""
SELECT count(*) as total_duplicates
FROM (
SELECT
count(1) as count,
wiki_id,
revision_id
FROM wmf_content.mediawiki_content_history_v1
GROUP BY wiki_id ,revision_id
HAVING count > 1
)
""").show(3000, truncate=False)Dec 3 2025
Nov 26 2025
All the action items from my side have been published on the document
Nov 24 2025
Action items on the previous linked document:
Nov 21 2025
Nov 20 2025
updated the document: https://wikitech.wikimedia.org/wiki/SLO/MediaWiki_Content_History_Table
Nov 6 2025
Nov 5 2025
I have created P84877 with the results of the query. I have also changed the final columns to fit a more generic approach, not specialised for the mwch. The query now also groups by wiki_id, so there is no longer a reason to iterate over the wikis.
The spark configuration tested is the following:
config = {
"spark.driver.memory": "16g",
"spark.driver.cores": 4,
"spark.driver.maxResultSize": "8g",
"spark.dynamicAllocation.maxExecutors": 32,
"spark.executor.memory": "16g",
"spark.executor.cores": 2,
"spark.sql.shuffle.partitions": 512,
}
I feel it could be reduced in the final development stages. Please lmk what you think about this!
Oct 31 2025
I have updated the query:
Oct 29 2025
Oct 27 2025
Oct 24 2025
What about the source data we ingest?
I was thinking that a possible completeness measure should also include the source event.mediawiki_page_content_change_v1 table, unioned with wmf_content.inconsistent_rows_of_mediawiki_content_history_v1. Something like this:
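A plain-Python sketch of that idea (toy keys; the real check would run as Spark SQL over the two tables named above, and the exact semantics of the union are an assumption here):

```python
# Toy (wiki_id, revision_id) keys standing in for the two tables.
source_revisions = {("enwiki", 1), ("enwiki", 2), ("enwiki", 3), ("enwiki", 4)}
inconsistent_revisions = {("enwiki", 4)}  # flagged by the consistency check

# Revisions seen in the source events that are not flagged inconsistent:
consistent = source_revisions - inconsistent_revisions
completeness = len(consistent) / len(source_revisions)  # 3/4 = 0.75 on this toy data
```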
Oct 14 2025
Oct 10 2025
@MoritzMuehlenhoff perfect, thank you very much!
