Data Platform Engineering Bug Report or Data Problem Form.
What kind of problem are you reporting?
- Access related problem
- Service related problem
- Data related problem
For a data related problem:
- Is this a data quality issue?
- Yes
- What datasets and/or dashboards are affected?
- mediawiki_history and all derived datasets
- What are the observed vs expected results? Please include information such as location of data, any initial assessments, sql statements, screenshots.
- A snapshot of mediawiki_history should not contain more than one row for each revision. In other words, there should be no duplicate values of (wiki_db, revision_id) among revision-create events. This is the case in the 2024-05 snapshot, but in the 2024-06 snapshot there are 2.1 M rows that are duplicates of another revision. I have set the priority to high, but since these duplicates are only 0.03% of all revision-create events in the snapshots, I would not object to medium priority too.
SQL queries
Duplicates in the 2024-06 snapshot
WITH rev_id_frequencies AS (
SELECT
COUNT(*) AS frequency
FROM wmf.mediawiki_history
WHERE
event_entity = 'revision'
AND event_type = 'create'
AND snapshot = '2024-06'
GROUP BY
wiki_db,
revision_id
)
SELECT
COUNT(*) AS revisions_with_duplicates_count,
SUM(frequency - 1) AS duplicate_count
FROM rev_id_frequencies
WHERE frequency > 1revisions_with_duplicates_count duplicate_count 1080128 2146208
Duplicates in the 2024-05 snapshot
WITH rev_id_frequencies AS (
SELECT
COUNT(*) AS frequency
FROM wmf.mediawiki_history
WHERE
event_entity = 'revision'
AND event_type = 'create'
AND snapshot = '2024-05'
GROUP BY
wiki_db,
revision_id
)
SELECT
COUNT(*) AS revisions_with_duplicates_count,
SUM(frequency - 1) AS duplicate_count
FROM rev_id_frequencies
WHERE frequency > 1revisions_with_duplicates_count duplicate_count 0 NULL
