
Duplicate rows on wmf_content.mediawiki_content_history_v1 after big reconcile
Closed, Resolved · Public

Description

Similar to T388715: Investigate and fix duplicate data on wmf_content.mediawiki_content_history_v1 for muswiki we have just failed the update of wmf_content.mediawiki_content_current_v1 because of duplicate rows on wmf_content.mediawiki_content_history_v1:

Total amount of duplicate rows:

spark.sql("""
SELECT count(1) as count FROM (
  SELECT count(1) as count,
         wiki_id,
         page_id,
         revision_id
  FROM wmf_content.mediawiki_content_history_v1
  GROUP BY wiki_id, page_id, revision_id
  HAVING count > 1
)
""").show(200, truncate=False)
[Stage 30:===================================================>(1023 + 1) / 1024]
+-----+
|count|
+-----+
|57397|
+-----+
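The duplicate check boils down to counting rows per (wiki_id, page_id, revision_id) key and keeping the keys seen more than once. A minimal plain-Python sketch of the same logic (hypothetical in-memory rows, not PySpark):

```python
from collections import Counter

# Hypothetical in-memory stand-ins for (wiki_id, page_id, revision_id) rows.
rows = [
    ("ptwikisource", 17063, 267136),
    ("ptwikisource", 17063, 267136),  # duplicate
    ("nowiki", 2110877, 23482275),
]

counts = Counter(rows)
duplicate_keys = [key for key, n in counts.items() if n > 1]
print(len(duplicate_keys))  # number of keys with more than one row -> 1
```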

Distribution:

spark.sql("""
SELECT count(1) as count,
       wiki_id
  FROM (
    SELECT count(1) as count,
           wiki_id,
           page_id,
           revision_id
    FROM wmf_content.mediawiki_content_history_v1
    GROUP BY wiki_id, page_id, revision_id
    HAVING count > 1
  )
GROUP BY wiki_id
ORDER BY count DESC
""").show(200, truncate=False)

+-----+-------------+
|count|wiki_id      |
+-----+-------------+
|13233|plwiki       |
|11743|ptwiki       |
|7137 |ruwiki       |
|5372 |nlwiki       |
|4782 |nowiki       |
|3167 |rowiki       |
|2726 |plwikibooks  |
|1836 |plwikiquote  |
|1826 |ruwiktionary |
|765  |ptwiktionary |
|532  |plwiktionary |
|357  |ruwikivoyage |
|351  |ruwikisource |
|328  |ptwikiquote  |
|318  |ruwikinews   |
|272  |plwikivoyage |
|244  |nlwiktionary |
|218  |ptwikibooks  |
|216  |ruwikiquote  |
|169  |rowiktionary |
|167  |ndswiki      |
|161  |nnwiki       |
|115  |plwikimedia  |
|113  |ptwikisource |
|112  |ruwikibooks  |
|107  |pflwiki      |
|105  |outreachwiki |
|82   |mznwiki      |
|82   |ptwikinews   |
|74   |nvwiki       |
|60   |nlwikibooks  |
|57   |plwikinews   |
|42   |napwiki      |
|41   |ruewiki      |
|38   |plwikisource |
|36   |nlwikiquote  |
|35   |ruwikimedia  |
|35   |ptwikivoyage |
|30   |ocwiki       |
|21   |rowikisource |
|21   |oswiki       |
|20   |rowikiquote  |
|20   |rmywiki      |
|19   |omwiki       |
|18   |sahwiki      |
|15   |pcdwiki      |
|14   |quwiki       |
|14   |papwiki      |
|13   |nowiktionary |
|13   |nds_nlwiki   |
|8    |nlwikinews   |
|8    |myvwiki      |
|7    |newiki       |
|7    |nlwikimedia  |
|7    |nowikimedia  |
|7    |rwwiktionary |
|6    |pnbwiki      |
|6    |pawiki       |
|6    |pswiki       |
|6    |napwikisource|
|5    |pmswiki      |
|5    |pdcwiki      |
|4    |rnwiki       |
|4    |ruwikiversity|
|4    |roa_tarawiki |
|4    |ptwikiversity|
|3    |rowikinews   |
|3    |novwiki      |
|3    |nowikiquote  |
|3    |mywiktionary |
|2    |rwwiki       |
|2    |orwiki       |
|2    |rmwiki       |
|2    |pamwiki      |
|2    |pmswikisource|
|1    |nywiki       |
|1    |nnwikiquote  |
|1    |rowikibooks  |
|1    |newwiki      |
|1    |nycwikimedia |
|1    |pagwiki      |
|1    |piwiki       |
|1    |nsowiki      |
|1    |nnwiktionary |
+-----+-------------+

Event Timeline

Paused mw_content_merge_changes_to_mw_content_current_daily DAG as it continues to fail with:

Job aborted due to stage failure: Task 18 in stage 15.0 failed 4 times, most recent failure: Lost task 18.3 in stage 15.0 (TID 449312) (an-worker1158.eqiad.wmnet executor 189): org.apache.spark.SparkException: The ON search condition of the MERGE statement matched a single row from the target table with multiple rows of the source table. This could result in the target row being operated on more than once with an update or delete operation and is not allowed.
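The error above is Spark enforcing SQL MERGE cardinality: if one target row matches multiple source rows, the outcome would be ambiguous, so the statement aborts. The guard can be sketched in plain Python (hypothetical helper and rows, not our pipeline code):

```python
from collections import Counter

def assert_unique_source(source_rows, key):
    """Fail fast if any join key appears more than once in the MERGE source,
    which is what triggers Spark's 'matched a single row from the target
    table with multiple rows of the source table' error."""
    counts = Counter(tuple(row[k] for k in key) for row in source_rows)
    dupes = {k: n for k, n in counts.items() if n > 1}
    if dupes:
        raise ValueError(f"non-unique MERGE source keys: {dupes}")

# Hypothetical source rows keyed on (wiki_id, page_id, revision_id):
source = [
    {"wiki_id": "nowiki", "page_id": 1, "revision_id": 10},
    {"wiki_id": "nowiki", "page_id": 1, "revision_id": 10},
]
try:
    assert_unique_source(source, ("wiki_id", "page_id", "revision_id"))
except ValueError as e:
    print("would abort before MERGE:", e)
```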

As per distribution in description above, there are no issues on the big wikis (enwiki, wikidatawiki, commonswiki).

The issue also seems to appear when we have ingested revision visibility events and the page was also moved. Here is a (redacted) example:

# drill down on a random dup: |2    |ptwikisource|17063  |267136     |
spark.sql("""
SELECT wiki_id, page_id, revision_id, row_content_update_dt, row_visibility_update_dt, row_move_update_dt
FROM wmf_content.mediawiki_content_history_v1
WHERE wiki_id = 'ptwikisource'
  AND page_id = 17063
  AND revision_id = 267136
""").show(200, truncate=False)
[Stage 163:>                                                        (0 + 1) / 1]
+------------+-------+-----------+--------------------------+--------------------------+--------------------------+
|wiki_id     |page_id|revision_id|row_content_update_dt     |row_visibility_update_dt  |row_move_update_dt        |
+------------+-------+-----------+--------------------------+--------------------------+--------------------------+
|ptwikisource|17063  |267136     |2025-06-01 19:45:26.410699|2024-07-01 00:00:00       |null                      |
|ptwikisource|17063  |267136     |2025-06-18 20:21:20.805068|2025-06-18 20:21:20.805068|2025-06-18 20:21:20.805068|
+------------+-------+-----------+--------------------------+--------------------------+--------------------------+

I have no root cause yet, but we did just finish a full reconcile run, so will investigate that.

We want to remove all the bad revisions ASAP so that the daily runs can continue over the weekend, so I am running the following cleanup script:

hostname -f
an-launcher1002.eqiad.wmnet

sudo -u analytics bash

screen -S xcollazo_mw_content_history_fix_T388715

spark3-sql \
--master yarn \
--conf spark.driver.maxResultSize=8G \
--conf spark.dynamicAllocation.maxExecutors=128 \
--conf spark.sql.shuffle.partitions=2048 \
--conf spark.sql.iceberg.locality.enabled=true \
--conf spark.reducer.maxReqsInFlight=1 \
--conf spark.shuffle.io.retryWait=180s \
--conf spark.shuffle.io.maxRetries=10  \
--executor-cores 2 \
--executor-memory 16G \
--driver-cores 4 \
--driver-memory 16G \
--name xcollazo_mw_content_history_fix_T388715



MERGE INTO wmf_content.mediawiki_content_history_v1 t
USING(
    SELECT /*+ BROADCASTJOIN(br) */
           s.wiki_id,
           s.page_id,
           s.revision_id,
           row_visibility_update_dt,
           row_number() over ( PARTITION BY s.wiki_id, s.revision_id ORDER BY s.row_visibility_update_dt ASC NULLS FIRST) AS row_num
    FROM wmf_content.mediawiki_content_history_v1 s
    INNER JOIN (
      SELECT wiki_id,
             page_id,
             revision_id
      FROM wmf_content.mediawiki_content_history_v1
      GROUP BY wiki_id, page_id, revision_id
      HAVING count(1) > 1
    ) br ON (
           s.wiki_id = br.wiki_id
           AND s.page_id = br.page_id
           AND s.revision_id = br.revision_id
         )
    WHERE s.wiki_id NOT IN ('enwiki', 'commonswiki', 'wikidatawiki')
) s

ON  t.wiki_id = s.wiki_id
AND t.page_id = s.page_id
AND t.revision_id = s.revision_id
AND t.row_visibility_update_dt = s.row_visibility_update_dt
AND t.wiki_id NOT IN ('enwiki', 'commonswiki', 'wikidatawiki')
AND s.row_num = 1

WHEN MATCHED THEN DELETE;

The SQL above identifies all duplicate revisions (leveraging the fact that the big wikis are not affected) and then picks one row from each duplicate group to delete. After looking at the duplicate data, deleting the row with the oldest row_visibility_update_dt made sense.
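The delete-the-oldest choice mirrors ORDER BY row_visibility_update_dt ASC NULLS FIRST with row_number() = 1. A plain-Python sketch of that ranking (hypothetical rows; None stands in for a NULL timestamp):

```python
from datetime import datetime

# Hypothetical duplicate rows for one (wiki_id, page_id, revision_id) key.
dupes = [
    {"row_visibility_update_dt": datetime(2025, 6, 18, 20, 21, 20)},
    {"row_visibility_update_dt": datetime(2024, 7, 1)},
]

def sort_key(row):
    """Mirror of ORDER BY row_visibility_update_dt ASC NULLS FIRST:
    NULLs (None) sort first, then oldest timestamp."""
    dt = row["row_visibility_update_dt"]
    return (dt is not None, dt or datetime.min)

ranked = sorted(dupes, key=sort_key)
to_delete, survivors = ranked[0], ranked[1:]  # row_number() = 1 is deleted
```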

Success:

analytics@an-launcher1002:/home/xcollazo$ screen -S xcollazo_mw_content_history_fix_T388715

Response code
Time taken: 7306.46 seconds

Verification that issue is gone:

# verification after hot fix
spark.sql("""
SELECT count(1) as count FROM (
  SELECT count(1) as count,
         wiki_id,
         page_id,
         revision_id
  FROM wmf_content.mediawiki_content_history_v1
  GROUP BY wiki_id, page_id, revision_id
  HAVING count > 1
)
""").show(200, truncate=False)
[Stage 166:==================================================>(1022 + 2) / 1024]
+-----+
|count|
+-----+
|0    |
+-----+

And we still have exactly one version of the offending rows:

# drill down on a random dup: |2    |ptwikisource|17063  |267136     |
spark.sql("""
SELECT wiki_id, page_id, revision_id, row_content_update_dt, row_visibility_update_dt, row_move_update_dt
FROM wmf_content.mediawiki_content_history_v1
WHERE wiki_id = 'ptwikisource'
  AND page_id = 17063
  AND revision_id = 267136
""").show(200, truncate=False)
[Stage 170:>                                                        (0 + 1) / 1]
+------------+-------+-----------+--------------------------+--------------------------+--------------------------+
|wiki_id     |page_id|revision_id|row_content_update_dt     |row_visibility_update_dt  |row_move_update_dt        |
+------------+-------+-----------+--------------------------+--------------------------+--------------------------+
|ptwikisource|17063  |267136     |2025-06-18 20:21:20.805068|2025-06-18 20:21:20.805068|2025-06-18 20:21:20.805068|
+------------+-------+-----------+--------------------------+--------------------------+--------------------------+

OK, we are good to restart all pipelines. We can root cause next week.

The issue seemed related to the parallel ingestion of visibility and page move events (this turned out not to be true; more info in the comments below). Here is an example offending row before ingesting a particular move event:

spark.sql("""
SELECT wiki_id, page_id, revision_id, row_content_update_dt, row_visibility_update_dt, row_move_update_dt
FROM wmf_content.mediawiki_content_history_v1 TIMESTAMP AS OF '2025-06-20'
WHERE wiki_id = 'nowiki'
  AND page_id = 2110877
  AND revision_id = 23482275
""").show(200, truncate=False)


+-------+-------+-----------+--------------------------+------------------------+------------------+
|wiki_id|page_id|revision_id|row_content_update_dt     |row_visibility_update_dt|row_move_update_dt|
+-------+-------+-----------+--------------------------+------------------------+------------------+
|nowiki |2110877|23482275   |2025-06-01 19:33:29.191161|2024-07-01 00:00:00     |null              |
+-------+-------+-----------+--------------------------+------------------------+------------------+

Same query but TIMESTAMP AS OF '2025-06-20 12', which shows the issue:

+-------+-------+-----------+--------------------------+--------------------------+--------------------------+
|wiki_id|page_id|revision_id|row_content_update_dt     |row_visibility_update_dt  |row_move_update_dt        |
+-------+-------+-----------+--------------------------+--------------------------+--------------------------+
|nowiki |2110877|23482275   |2025-06-18 20:07:41.820995|2025-06-18 20:07:41.820995|2025-06-18 20:07:41.820995|
|nowiki |2110877|23482275   |2025-06-01 19:33:29.191161|2024-07-01 00:00:00       |null                      |
+-------+-------+-----------+--------------------------+--------------------------+--------------------------+

Same query but TIMESTAMP AS OF '2025-06-21', which shows corrected data after fix from T397525#10935336:

+-------+-------+-----------+--------------------------+--------------------------+--------------------------+
|wiki_id|page_id|revision_id|row_content_update_dt     |row_visibility_update_dt  |row_move_update_dt        |
+-------+-------+-----------+--------------------------+--------------------------+--------------------------+
|nowiki |2110877|23482275   |2025-06-18 20:07:41.820995|2025-06-18 20:07:41.820995|2025-06-18 20:07:41.820995|
+-------+-------+-----------+--------------------------+--------------------------+--------------------------+

However, we do run MERGE INTO with what seems to be the correct ON clause to avoid this issue. (MERGE INTO code for page change event ingest, MERGE INTO code for visibility change event ingest.)

A bug in Iceberg's MERGE INTO logic that can lead to duplicate rows has been reported at https://github.com/apache/iceberg/issues/12653; however, the repro does not match our use case. Additionally, write.merge.isolation-level, a configuration mentioned on that bug, is set to its default in our case, serializable, which is exactly what we want to avoid duplicates.

TL;DR: Root cause found. The source data is an event table, and it can be reinstated. A reinstate did happen on hour=20 for a particular ingest. We calculate predicate pushdown in a query that is separate from the MERGE INTO query. It seems the pushdown calculation happened precisely before the reinstate, and the MERGE INTO after. Thus we got duplicate data, because the incoming pushdown predicates did not have the full context. I will discuss this issue with the team to see what we can do to avoid it in the future.


Longer:

As per our table's Iceberg metadata table, snapshot 8515433763264603745 was the one to add the offending rows:

Figure (1):

# drill down on a random dup: |2    |nowiki      |2110877|23482275   |
spark.sql("""
SELECT wiki_id, page_id, revision_id, row_content_update_dt, row_visibility_update_dt, row_move_update_dt
FROM wmf_content.mediawiki_content_history_v1 VERSION AS OF 8515433763264603745
WHERE wiki_id = 'nowiki'
  AND page_id = 2110877
  AND revision_id = 23482275
""").show(200, truncate=False)
+-------+-------+-----------+--------------------------+--------------------------+--------------------------+
|wiki_id|page_id|revision_id|row_content_update_dt     |row_visibility_update_dt  |row_move_update_dt        |
+-------+-------+-----------+--------------------------+--------------------------+--------------------------+
|nowiki |2110877|23482275   |2025-06-18 20:07:41.820995|2025-06-18 20:07:41.820995|2025-06-18 20:07:41.820995|
|nowiki |2110877|23482275   |2025-06-01 19:33:29.191161|2024-07-01 00:00:00       |null                      |
+-------+-------+-----------+--------------------------+--------------------------+--------------------------+

The corresponding Spark app id for this is application_1750253317527_36898.

Looking at Airflow, we see that we had three attempts:

https://yarn.wikimedia.org/spark-history/history/application_1750253317527_25640/jobs/

https://yarn.wikimedia.org/spark-history/history/application_1750253317527_29425/jobs/

https://yarn.wikimedia.org/spark-history/history/application_1750253317527_36898/jobs/

Only application_1750253317527_36898 was successful.

Interestingly, looking at the predicate pushdown for the corresponding Spark SQL job on the Spark History UI, I do *not* see page_id = 2110877 being pushed down. I would definitely have expected to see it, and in fact, if it was not included, that would explain the behavior: in the set of data queried from the target table, we would not know that that page was in there, so any updates would go through the WHEN NOT MATCHED clause, which would INSERT.

That code path would also set all *_update_dt timestamps, and indeed we can see in Figure (1) above that that is the case.
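Under that hypothesis, the failure mode can be sketched in plain Python (hypothetical simplified rows; the real MERGE runs in Spark/Iceberg): the target scan is pre-filtered by a stale pushdown list, the pre-existing row becomes invisible to the match step, and the incoming row takes the WHEN NOT MATCHED insert path, producing the duplicate:

```python
# Full target table (list of rows); one pre-existing row for the page.
target = [{"page_id": 2110877, "row_move_update_dt": None}]

# Pushdown page_id list computed BEFORE hour=20 was reinstated:
stale_pushdown = {406686, 424376}  # 2110877 is missing

# Source rows computed AFTER the reinstate:
source = [{"page_id": 2110877, "row_move_update_dt": "2025-06-18T20:07:41"}]

# The MERGE scan of the target is pre-filtered by the stale pushdown,
# so the existing row for 2110877 is invisible to the match step.
visible = {r["page_id"] for r in target if r["page_id"] in stale_pushdown}

for row in source:
    if row["page_id"] in visible:
        pass                 # WHEN MATCHED -> would update in place
    else:
        target.append(row)   # WHEN NOT MATCHED -> insert: duplicate!
```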

Let's confirm the exact SQL statement that was produced by looking at the YARN logs (note that for Spark application_1750253317527_36898, we have the corresponding skein driver at application_1750253317527_36892, as per Airflow):

ssh an-launcher1002.eqiad.wmnet

xcollazo@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics yarn logs -applicationId application_1750253317527_36892 > application_1750253317527_36892.log


# make sure this is the right file:
xcollazo@an-launcher1002:~$ cat application_1750253317527_36892.log | grep 'MERGE INTO wmf'
MERGE INTO wmf_content.mediawiki_content_history_v1 t
MERGE INTO wmf_content.mediawiki_content_history_v1 t
MERGE INTO wmf_content.mediawiki_content_history_v1 t
MERGE INTO wmf_content.mediawiki_content_history_v1 t
MERGE INTO wmf_content.mediawiki_content_history_v1 t
MERGE INTO wmf_content.mediawiki_content_history_v1 t
MERGE INTO wmf_content.mediawiki_content_history_v1 t

# confirm that the page_id for the nowiki example is not there:
xcollazo@an-launcher1002:~$ cat application_1750253317527_36892.log | grep '2110877' | wc -l
0

Very interesting. Now let's try to repro by recreating the SQL that this run would have done:

xcollazo@an-launcher1002:~$ cat application_1750253317527_36892.log | grep 'Now calculating page_ids to create or edit via:' -A 17
Now calculating page_ids to create or edit via:

            SELECT wiki_id,
                   sort_array(collect_set(page_id)) as page_ids
            FROM (
                SELECT
                    wiki_id,
                    page.page_id
                FROM event.mediawiki_content_history_reconcile_enriched_v1
                WHERE year=2025
                  AND month=6
                  AND day=18
                  
                  AND page_change_kind IN ('create', 'edit')
            )
            GROUP BY wiki_id
            ORDER BY wiki_id

Running this, modified for our purposes, yields:

spark.sql("""
SELECT wiki_id, array_contains(page_ids, 2110877) as contains_offending_page, cardinality(page_ids) as array_size FROM (
            SELECT wiki_id,
                   sort_array(collect_set(page_id)) as page_ids
            FROM (
                SELECT
                    wiki_id,
                    page.page_id
                FROM event.mediawiki_content_history_reconcile_enriched_v1
                WHERE year=2025
                  AND month=6
                  AND day=18
                  
                  AND page_change_kind IN ('create', 'edit')
                  AND wiki_id = 'nowiki'
            )
            GROUP BY wiki_id
            ORDER BY wiki_id
)
""").show(100, truncate=False)

+-------+-----------------------+----------+
|wiki_id|contains_offending_page|array_size|
+-------+-----------------------+----------+
|nowiki |true                   |1903      |
+-------+-----------------------+----------+

Thus the problem is not from that SQL statement.

Let's drill into the logs again to look for the exact pages that were pushed down:

xcollazo@an-launcher1002:~$ cat application_1750253317527_36892.log | grep 'remaining wikis together'
Processing 506 remaining wikis together
Processing delete for 79 remaining wikis together
Processing move for 53 remaining wikis together

xcollazo@an-launcher1002:~$ cat application_1750253317527_36892.log | grep 'Processing 506 remaining wikis together' -A 1150 > application_1750253317527_36892_extracted_offending_merge_into.log

xcollazo@an-launcher1002:~$ cat application_1750253317527_36892_extracted_offending_merge_into.log | grep nowiki
(t.wiki_id = 'nowiki' AND t.page_id IN (406686, 424376, 461463, 725210, 807763, 912526, 1382468, 1702519, 2008624, 2095515, 2104301, 2230005, 2252666, 2253056, 2261727, 2261751))

Bingo. For some reason, the 1903 page_ids became 16. In fact, many of the wikis show this behavior, and it is always the case that there are *more* affected page ids than are listed in the predicate pushdown.

Thus: perhaps the data changed between the time the pushdown query was executed and the time the MERGE INTO executed? This is possible, since query planning alone can take many minutes, sometimes half an hour.

Looking at nowiki data, most of it came at hour=20:

spark.sql("""
SELECT wiki_id, hour, cardinality(page_ids) as array_size FROM (
            SELECT wiki_id,
                   hour,
                   sort_array(collect_set(page_id)) as page_ids
            FROM (
                SELECT
                    wiki_id,
                    hour,
                    page.page_id
                FROM event.mediawiki_content_history_reconcile_enriched_v1
                WHERE year=2025
                  AND month=6
                  AND day=18
                  AND page_change_kind IN ('create', 'edit')
                  AND wiki_id = 'nowiki'
            )
            GROUP BY wiki_id, hour
)
ORDER BY wiki_id, hour
""").show(900, truncate=False)

+-------+----+----------+
|wiki_id|hour|array_size|
+-------+----+----------+
|nowiki |4   |12        |
|nowiki |16  |5         |
|nowiki |20  |1888      |
+-------+----+----------+

And indeed, if we look at data landing times we can see a jump at hour=20:

xcollazo@stat1011:/mnt/hdfs/wmf/data/event/mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=18$ ls -vlsha
total 104K
4.0K drwxr-x--- 26 analytics analytics-privatedata-users 4.0K Jun 19 21:54  .
4.0K drwxr-x--- 26 analytics analytics-privatedata-users 4.0K Jun 24 03:26  ..
4.0K drwxr-x---  5 analytics analytics-privatedata-users 4.0K Jun 18 03:38 'hour=0'
4.0K drwxr-x---  5 analytics analytics-privatedata-users 4.0K Jun 18 04:28 'hour=1'
4.0K drwxr-x---  5 analytics analytics-privatedata-users 4.0K Jun 18 05:31 'hour=2'
4.0K drwxr-x---  6 analytics analytics-privatedata-users 4.0K Jun 18 06:26 'hour=3'
4.0K drwxr-x---  6 analytics analytics-privatedata-users 4.0K Jun 18 07:26 'hour=4'
4.0K drwxr-x---  6 analytics analytics-privatedata-users 4.0K Jun 18 08:30 'hour=5'
4.0K drwxr-x---  5 analytics analytics-privatedata-users 4.0K Jun 18 09:33 'hour=6'
4.0K drwxr-x---  5 analytics analytics-privatedata-users 4.0K Jun 18 10:28 'hour=7'
4.0K drwxr-x---  5 analytics analytics-privatedata-users 4.0K Jun 18 11:29 'hour=8'
4.0K drwxr-x---  5 analytics analytics-privatedata-users 4.0K Jun 18 12:35 'hour=9'
4.0K drwxr-x---  5 analytics analytics-privatedata-users 4.0K Jun 18 13:28 'hour=10'
4.0K drwxr-x---  5 analytics analytics-privatedata-users 4.0K Jun 18 14:32 'hour=11'
4.0K drwxr-x---  5 analytics analytics-privatedata-users 4.0K Jun 18 15:28 'hour=12'
4.0K drwxr-x---  5 analytics analytics-privatedata-users 4.0K Jun 18 16:28 'hour=13'
4.0K drwxr-x---  5 analytics analytics-privatedata-users 4.0K Jun 18 17:27 'hour=14'
4.0K drwxr-x---  5 analytics analytics-privatedata-users 4.0K Jun 18 18:27 'hour=15'
4.0K drwxr-x---  7 analytics analytics-privatedata-users 4.0K Jun 18 19:26 'hour=16'
4.0K drwxr-x---  7 analytics analytics-privatedata-users 4.0K Jun 18 20:29 'hour=17'
4.0K drwxr-x---  6 analytics analytics-privatedata-users 4.0K Jun 18 21:32 'hour=18'
4.0K drwxr-x---  6 analytics analytics-privatedata-users 4.0K Jun 18 22:28 'hour=19'
4.0K drwxr-x---  6 analytics analytics-privatedata-users 4.0K Jun 19 22:05 'hour=20'       <<<<<< almost 24h jump, likely reinstated?
4.0K drwxr-x---  6 analytics analytics-privatedata-users 4.0K Jun 19 00:51 'hour=21'
4.0K drwxr-x---  6 analytics analytics-privatedata-users 4.0K Jun 19 01:27 'hour=22'
4.0K drwxr-x---  5 analytics analytics-privatedata-users 4.0K Jun 19 02:26 'hour=23'

Finally, I perused the refine failure emails, and indeed there was a failure for this particular hour () and @amastilovic fixed it (which was the right thing to do as per OpsWeek duties). However, in our case we would rather not have reinstated the data, and instead have allowed the system to reconcile later.

We need to discuss this with the team to see what we can do to avoid the issue in the future.

xcollazo renamed this task from Duplicate rows wmf_content.mediawiki_content_history_v1 to Duplicate rows on wmf_content.mediawiki_content_history_v1 after big reconcile.Jun 25 2025, 2:26 PM

The following question was posed earlier today when we discussed this situation: “But why did the sensor trigger if one of the hours failed to refine?” Great question, and after investigating further, the sensor wait_for_event_mediawiki_content_history_reconcile_enriched_v1_partitions is the root cause. From the Airflow logs of the sensor's successful run, we see that it was only poking the codfw datacenter:

mw-content-merge-events-to-mw-content-history-daily-wa-2onbouze
[2025-06-19, 05:27:21 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2025-06-19, 05:27:21 UTC] {connection.py:277} WARNING - Connection schemes (type: hive_metastore) shall not contain '_' according to RFC3986.
[2025-06-19, 05:27:21 UTC] {base.py:84} INFO - Retrieving connection 'analytics-hive'
[2025-06-19, 05:27:21 UTC] {hive.py:611} INFO - Trying to connect to analytics-hive.eqiad.wmnet:9083
[2025-06-19, 05:27:21 UTC] {hive.py:613} INFO - Connected to analytics-hive.eqiad.wmnet:9083
[2025-06-19, 05:27:21 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=0
[2025-06-19, 05:27:22 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=1
[2025-06-19, 05:27:22 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=2
[2025-06-19, 05:27:22 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=3
[2025-06-19, 05:27:22 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=4
[2025-06-19, 05:27:22 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=5
[2025-06-19, 05:27:22 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=6
[2025-06-19, 05:27:22 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=7
[2025-06-19, 05:27:22 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=8
[2025-06-19, 05:27:22 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=9
[2025-06-19, 05:27:22 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=10
[2025-06-19, 05:27:22 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=11
[2025-06-19, 05:27:22 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=12
[2025-06-19, 05:27:22 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=13
[2025-06-19, 05:27:22 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=14
[2025-06-19, 05:27:22 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=15
[2025-06-19, 05:27:23 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=16
[2025-06-19, 05:27:23 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=17
[2025-06-19, 05:27:23 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=18
[2025-06-19, 05:27:23 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=19
[2025-06-19, 05:27:23 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=20
[2025-06-19, 05:27:23 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=21
[2025-06-19, 05:27:23 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=22
[2025-06-19, 05:27:23 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=18/hour=23
[2025-06-19, 05:27:23 UTC] {base.py:339} INFO - Success criteria met. Exiting.
[2025-06-19, 05:27:23 UTC] {taskinstance.py:341} ▶ Post task execution logs

But we are currently switched over to eqiad, and indeed the trouble hour, as per T397525#10944260, was from that datacenter:

xcollazo@stat1011:/mnt/hdfs/wmf/data/event/mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=18$ ls -vlsha
...
4.0K drwxr-x---  6 analytics analytics-privatedata-users 4.0K Jun 18 22:28 'hour=19'
4.0K drwxr-x---  6 analytics analytics-privatedata-users 4.0K Jun 19 22:05 'hour=20'       <<<<<< almost 24h jump, likely reinstated?
4.0K drwxr-x---  6 analytics analytics-privatedata-users 4.0K Jun 19 00:51 'hour=21'
...

When I saw this, it all rang a bell: T338796: Rewrite all Airflow sensors that use datacenter prepartitions to depend on both datacenters. In that ticket we did the right thing: for all events we started depending on both datacenters. But then we reverted it, since it was unreliable (canary events were having all sorts of issues at that moment), and we forgot about it. This issue affects almost all of our event stream sensors: we are only depending on codfw while eqiad is the main datacenter, so it has been affecting us since at least the last datacenter switchover (these happen twice a year at the equinoxes, so since March 20, 2025).

We now have a more robust canary system, so I will now apply the changes from T338796 again.
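The fix amounts to the sensor poking every hourly partition in both datacenters, so a missing hour in either one blocks the DAG. A sketch of the partition-spec expansion (hypothetical helper, not the actual airflow-dags code):

```python
def partition_specs(table, year, month, day, datacenters=("eqiad", "codfw")):
    """Build the Hive partition names a sensor must poke: one per
    datacenter per hour, so a missing hour in either DC blocks the DAG."""
    return [
        f"{table}/datacenter={dc}/year={year}/month={month}/day={day}/hour={h}"
        for dc in datacenters
        for h in range(24)
    ]

specs = partition_specs(
    "event.mediawiki_content_history_reconcile_enriched_v1", 2025, 6, 18
)
print(len(specs))  # 24 hours x 2 datacenters -> 48
```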

CC @Antoine_Quhen and @JAllemandou for awareness.

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1467

global config: make all event dataset definitions depend on both datacenters

Latest wait_for_event_mediawiki_content_history_reconcile_enriched_v1_partitions sensor doing the right thing now:

mw-content-merge-events-to-mw-content-history-daily-wa-zp24rwrj
 ▶ Log message source details
[2025-06-26, 04:49:56 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2025-06-26, 04:49:57 UTC] {connection.py:277} WARNING - Connection schemes (type: hive_metastore) shall not contain '_' according to RFC3986.
[2025-06-26, 04:49:57 UTC] {base.py:84} INFO - Retrieving connection 'analytics-hive'
[2025-06-26, 04:49:57 UTC] {hive.py:611} INFO - Trying to connect to analytics-hive.eqiad.wmnet:9083
[2025-06-26, 04:49:57 UTC] {hive.py:613} INFO - Connected to analytics-hive.eqiad.wmnet:9083
[2025-06-26, 04:49:57 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=0
[2025-06-26, 04:49:57 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=1
[2025-06-26, 04:49:57 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=2
[2025-06-26, 04:49:57 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=3
[2025-06-26, 04:49:57 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=4
[2025-06-26, 04:49:57 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=5
[2025-06-26, 04:49:57 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=6
[2025-06-26, 04:49:57 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=7
[2025-06-26, 04:49:57 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=8
[2025-06-26, 04:49:57 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=9
[2025-06-26, 04:49:58 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=10
[2025-06-26, 04:49:58 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=11
[2025-06-26, 04:49:58 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=12
[2025-06-26, 04:49:58 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=13
[2025-06-26, 04:49:58 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=14
[2025-06-26, 04:49:58 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=15
[2025-06-26, 04:49:58 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=16
[2025-06-26, 04:49:58 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=17
[2025-06-26, 04:49:58 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=18
[2025-06-26, 04:49:58 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=19
[2025-06-26, 04:49:58 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=20
[2025-06-26, 04:49:58 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=21
[2025-06-26, 04:49:58 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=22
[2025-06-26, 04:49:58 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=eqiad/year=2025/month=6/day=25/hour=23
[2025-06-26, 04:49:59 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=0
[2025-06-26, 04:49:59 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=1
[2025-06-26, 04:49:59 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=2
[2025-06-26, 04:49:59 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=3
[2025-06-26, 04:49:59 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=4
[2025-06-26, 04:49:59 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=5
[2025-06-26, 04:49:59 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=6
[2025-06-26, 04:49:59 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=7
[2025-06-26, 04:49:59 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=8
[2025-06-26, 04:49:59 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=9
[2025-06-26, 04:49:59 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=10
[2025-06-26, 04:49:59 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=11
[2025-06-26, 04:49:59 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=12
[2025-06-26, 04:49:59 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=13
[2025-06-26, 04:49:59 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=14
[2025-06-26, 04:50:00 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=15
[2025-06-26, 04:50:00 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=16
[2025-06-26, 04:50:00 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=17
[2025-06-26, 04:50:00 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=18
[2025-06-26, 04:50:00 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=19
[2025-06-26, 04:50:00 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=20
[2025-06-26, 04:50:00 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=21
[2025-06-26, 04:50:00 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=22
[2025-06-26, 04:50:00 UTC] {named_hive_partition.py:94} INFO - Poking for event.mediawiki_content_history_reconcile_enriched_v1/datacenter=codfw/year=2025/month=6/day=25/hour=23
[2025-06-26, 04:50:00 UTC] {base.py:339} INFO - Success criteria met. Exiting.
[2025-06-26, 04:50:00 UTC] {taskinstance.py:341} ▶ Post task execution logs
Version: v2.10.5
Git Version: .release:b93c3db6b1641b0840bd15ac7d05bc58ff2cccbf

This confirms the patch is working. I think this concludes this particular saga.

We have also speculated that this change may have another positive effect: it may explain the ~11k missing events from T395139. But we will leave that assessment to that other ticket.