While working on T388040, we found the following situation:
# sanity check
spark.sql("""
SELECT count(1) as count,
wiki_id,
page_id,
revision_id
FROM wmf_content.mediawiki_content_history_v1
GROUP BY wiki_id, page_id, revision_id
HAVING count > 1
""").show(20, truncate=False)
[Stage 52:=====================================================>(286 + 3) / 289]
+-----+-------+-------+-----------+
|count|wiki_id|page_id|revision_id|
+-----+-------+-------+-----------+
|7    |muswiki|2      |2          |
|8    |muswiki|1      |1          |
+-----+-------+-------+-----------+

wmf_content.mediawiki_content_history_v1 should never have more than one row per (wiki_id, page_id, revision_id), but it does for this particular wiki.
This wiki appears to be very new:
# sanity check
spark.sql("""
SELECT count(1) as count
FROM wmf_content.mediawiki_content_history_v1
WHERE wiki_id = 'muswiki'
""").show(20, truncate=False)
[Stage 53:> (0 + 1) / 1]
+-----+
|count|
+-----+
|290  |
+-----+

Let's try to figure out how this happened, and also clean up wmf_content.mediawiki_content_history_v1.
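
As a possible first step in the investigation, it may help to look at the duplicated rows in full rather than only their counts, so we can see which columns differ between the copies. The following is a minimal sketch, not a tested fix: the helper name duplicate_rows_query is hypothetical, and it only assumes the three key columns used in the queries above. It builds the SQL as a string so it can be reviewed before running.

```python
def duplicate_rows_query(table, key_columns, wiki_id):
    """Build SQL returning every full row whose key occurs more than once.

    Hypothetical helper for illustration. The wiki_id value is interpolated
    directly into the SQL, so only pass trusted values.
    """
    keys = ", ".join(key_columns)
    # Join the duplicated keys back to the table to retrieve the full rows.
    join_cond = " AND ".join(f"t.{c} = d.{c}" for c in key_columns)
    order_by = ", ".join(f"t.{c}" for c in key_columns)
    return f"""
SELECT t.*
FROM {table} t
JOIN (
    SELECT {keys}
    FROM {table}
    WHERE wiki_id = '{wiki_id}'
    GROUP BY {keys}
    HAVING count(1) > 1
) d
ON {join_cond}
ORDER BY {order_by}
"""


query = duplicate_rows_query(
    "wmf_content.mediawiki_content_history_v1",
    ["wiki_id", "page_id", "revision_id"],
    "muswiki",
)
print(query)

# In the same notebook session as the queries above (assumes `spark` exists):
# spark.sql(query).show(20, truncate=False)
```

If the copies turn out to be identical in every column, that would point toward the same source rows being ingested more than once; if they differ, the differing columns should hint at which writes produced them.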