
Investigate and fix duplicate data on wmf_content.mediawiki_content_history_v1 for muswiki
Closed, ResolvedPublic

Description

While working on T388040, we found the following situation:

# sanity check
spark.sql("""
SELECT count(1) as count,
       wiki_id,
       page_id,
       revision_id
FROM wmf_content.mediawiki_content_history_v1
GROUP BY wiki_id, page_id, revision_id
HAVING count > 1
""").show(20, truncate=False)
[Stage 52:=====================================================>(286 + 3) / 289]
+-----+-------+-------+-----------+
|count|wiki_id|page_id|revision_id|
+-----+-------+-------+-----------+
|7    |muswiki|2      |2          |
|8    |muswiki|1      |1          |
+-----+-------+-------+-----------+

wmf_content.mediawiki_content_history_v1 should never have more than one row per (wiki_id, page_id, revision_id), but it does for this particular wiki.
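The uniqueness invariant above can be expressed as a small helper. This is a pure-Python sketch of the same check the Spark SQL performs; the function name and the sample rows are illustrative, not real data from the table:

```python
from collections import Counter

def find_duplicate_keys(rows):
    """Return (key, count) pairs for keys appearing more than once.

    Each row is a (wiki_id, page_id, revision_id) tuple -- the key that
    mediawiki_content_history_v1 is expected to be unique on.
    """
    counts = Counter(rows)
    return [(key, n) for key, n in counts.items() if n > 1]

# Illustrative sample mirroring the muswiki situation (not real data):
sample = [("muswiki", 1, 1)] * 8 + [("muswiki", 2, 2)] * 7 + [("enwiki", 10, 99)]
print(find_duplicate_keys(sample))
```

This mirrors the `GROUP BY ... HAVING count > 1` query: any key returned here violates the invariant.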

This wiki appears to be very new:

# sanity check
spark.sql("""
SELECT count(1) as count
FROM wmf_content.mediawiki_content_history_v1
WHERE wiki_id = 'muswiki'
""").show(20, truncate=False)
[Stage 53:>                                                         (0 + 1) / 1]
+-----+
|count|
+-----+
|290  |
+-----+

Let's try to figure out how this happened, and also let's clean up wmf_content.mediawiki_content_history_v1.

Event Timeline

Upon further inspection, muswiki is actually closed:

xcollazo@stat1011:~$ curl https://noc.wikimedia.org/conf/dblists/closed.dblist | grep mus
muswiki
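The same check can be done programmatically instead of with `curl | grep`. This is a hedged sketch: the parsing rules (one db name per line, blank lines and `#` comments ignored) are an assumption about the dblist format, and `sample_dblist` is an illustrative excerpt rather than the full, current list:

```python
def parse_dblist(text):
    """Parse a noc.wikimedia.org dblist: assumed to be one wiki db name
    per line, ignoring blank lines and '#' comments."""
    wikis = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            wikis.add(line)
    return wikis

# Illustrative excerpt of closed.dblist (not the real, full list):
sample_dblist = "aawiki\nmuswiki\n# a comment\n\n"
closed = parse_dblist(sample_dblist)
print("muswiki" in closed)  # → True
```

In practice the text would be fetched from https://noc.wikimedia.org/conf/dblists/closed.dblist, as in the curl command above.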

In fact, it was closed 14 years ago. This makes me believe that this was just a fluke when we backfilled wmf_content.mediawiki_content_history_v1 from the Dumps 1-generated table wmf.mediawiki_wikitext_history, which can very well contain duplicate data.
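A backfill-time guard against such duplicates could be a first-wins dedup on the key columns. This is a pure-Python sketch with illustrative rows; in Spark the equivalent would typically be `df.dropDuplicates(["wiki_id", "page_id", "revision_id"])`:

```python
def dedupe_by_key(rows):
    """Keep only the first row seen for each (wiki_id, page_id, revision_id)
    key; later duplicates are dropped."""
    seen = set()
    out = []
    for row in rows:
        k = (row["wiki_id"], row["page_id"], row["revision_id"])
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out

# Illustrative rows (not real table data):
rows = [
    {"wiki_id": "muswiki", "page_id": 1, "revision_id": 1, "content": "a"},
    {"wiki_id": "muswiki", "page_id": 1, "revision_id": 1, "content": "a"},  # duplicate
    {"wiki_id": "muswiki", "page_id": 2, "revision_id": 2, "content": "b"},
]
print(len(dedupe_by_key(rows)))  # → 2
```

First-wins is only safe when duplicate rows are identical; if they could differ, the backfill would need a deterministic tie-break instead.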

The current snapshot of wmf.mediawiki_wikitext_history, however, does not have duplicate data:

 presto> select count(1) as count, page_id, revision_id from wmf.mediawiki_wikitext_history where wiki_db = 'muswiki' and snapshot = '2025-01' group by page_id, revision_id order by count desc;
 count | page_id | revision_id 
-------+---------+-------------
     1 |    1465 |        2713 
     1 |    1827 |        3356 
     1 |    1466 |        2714 
     1 |       1 |        2941 
     1 |    1881 |        3553 
     1 |    1838 |        3414 
     1 |    1877 |        3577 
     1 |       1 |        2796 
...

Since this wiki is closed, and we currently do not support closed wikis (to be considered further on T371509), I'm not going to worry about this much.

I'm simply going to delete the offending wiki completely via:

DELETE
FROM wmf_content.mediawiki_content_history_v1
WHERE wiki_id = 'muswiki'

Since this wiki is not part of the open, non-private set of wikis, the reconcile mechanism will not try to recover the deleted data.

$ whoami
analytics
$ hostname -f
an-launcher1002.eqiad.wmnet

$ spark3-sql

spark-sql (default)> DELETE
                   > FROM wmf_content.mediawiki_content_history_v1
                   > WHERE wiki_id = 'muswiki';
Time taken: 6.643 seconds

spark-sql (default)> SELECT count(1) FROM wmf_content.mediawiki_content_history_v1 WHERE wiki_id = 'muswiki';
0
Time taken: 2.379 seconds, Fetched 1 row(s)

Mentioned in SAL (#wikimedia-analytics) [2025-03-17T19:30:16Z] <xcollazo> Ran "DELETE FROM wmf_content.mediawiki_content_history_v1 WHERE wiki_id = 'muswiki'" to delete bad data. T388715.

For completeness, I've just rerun the bad-data detection SQL below, and found no further offending data:

# sanity check
spark.sql("""
SELECT count(1) as count,
       wiki_id,
       page_id,
       revision_id
FROM wmf_content.mediawiki_content_history_v1
GROUP BY wiki_id, page_id, revision_id
HAVING count > 1
""").show(20, truncate=False)

+-----+-------+-------+-----------+
|count|wiki_id|page_id|revision_id|
+-----+-------+-------+-----------+
+-----+-------+-------+-----------+