
Monthly reconcile continues to emit a very large number of events after user_id changes
Open, Needs Triage, Public

Description

In T411803: Fix reconcile bug where user_id is not being populated correctly, we introduced reconciling for mismatch_user_id and mismatch_user_text.

Months down the line, we should have reconciled the vast majority of such issues. However, looking at the last 3 monthly reconciles, we find that:

spark.sql("""
SELECT count(1), computation_dt
FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
WHERE computation_class = 'all-of-wiki-time'
GROUP BY computation_dt
ORDER BY computation_dt DESC
""").show()
+---------+-------------------+
| count(1)|     computation_dt|
+---------+-------------------+
|179893699|2026-03-01 00:00:00|
|245030841|2026-02-01 00:00:00|
|361884464|2026-01-01 00:00:00|     <<< start of changes from T411803
|  2882708|2025-12-01 00:00:00|
|  1134547|2025-11-01 00:00:00|
|  1087936|2025-10-01 00:00:00|
|  2326440|2025-09-01 00:00:00|
|  3072885|2025-08-01 00:00:00|
|  2873142|2025-07-01 00:00:00|
|  2116596|2025-06-01 00:00:00|
+---------+-------------------+

Let's investigate why this is the case, and whether we are dropping reconcile events.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
main: content: bump artifact to pickup user_id fix. | repos/data-engineering/airflow-dags!2061 | xcollazo | bump-mw-content-for-reconcile | main
Add Spark integration tests for reconcile mismatch detection and event transformation. | repos/data-engineering/mediawiki-content-pipelines!106 | xcollazo | find-user-id-bug-with-unit-testing | main
Fix bug to emit reconcile events for anonymous users. | repos/data-engineering/mediawiki-content-pipelines!105 | xcollazo | find-user-id-bug | main
main: content: run reconcile ingest with its own set of spark tunings. | repos/data-engineering/airflow-dags!2053 | xcollazo | fix-reconcile-ingest-resources | main

Event Timeline

Are we fixing the same set of revisions over and over?

# notice how dewiki consistently has 3.3M inconsistencies over the last 3 months for mismatch_user_id
# are they the same revisions?
df = spark.sql("""
WITH jan AS (
  SELECT wiki_id, page_id, revision_id, reasons
  FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
  WHERE computation_class = 'all-of-wiki-time'
    AND computation_dt = TIMESTAMP '2026-01-01 00:00:00'
),

feb AS (
  SELECT wiki_id, page_id, revision_id, reasons
  FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
  WHERE computation_class = 'all-of-wiki-time'
    AND computation_dt = TIMESTAMP '2026-02-01 00:00:00'
),

mar AS (
  SELECT wiki_id, page_id, revision_id, reasons
  FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
  WHERE computation_class = 'all-of-wiki-time'
    AND computation_dt = TIMESTAMP '2026-03-01 00:00:00'
)


SELECT jan.wiki_id, jan.page_id, jan.revision_id, jan.reasons
FROM jan
INNER JOIN feb USING (wiki_id, page_id, revision_id, reasons)
INNER JOIN mar USING (wiki_id, page_id, revision_id, reasons)
""").cache()

df.count()                                                              
178459072

df.createOrReplaceTempView("same_inconsistencies")
spark.sql("""
SELECT count(1) as count,
       wiki_id,
       reasons
FROM same_inconsistencies
GROUP BY wiki_id, reasons
ORDER BY count DESC
""").show(truncate=False)

+---------+-------------+-------------------------------------+
|count    |wiki_id      |reasons                              |
+---------+-------------+-------------------------------------+
|168593863|commonswiki  |[mismatch_user_id]                   |
|3341995  |dewiki       |[mismatch_user_id]                   |
|742731   |commonswiki  |[mismatch_user_text]                 |
|551770   |mlwiki       |[mismatch_user_id]                   |
|324937   |bhwiki       |[mismatch_user_id]                   |
|269055   |dewiki       |[page_was_restored, mismatch_user_id]|
|267274   |enwiki       |[missing_from_target]                |
|232386   |tewiki       |[mismatch_user_id]                   |
|197759   |newiki       |[mismatch_user_id]                   |
|170074   |dtywiki      |[mismatch_user_id]                   |
|142712   |viwiki       |[mismatch_user_id]                   |
|136985   |enwikibooks  |[mismatch_user_id]                   |
|123599   |guwiki       |[mismatch_user_id]                   |
|119782   |maiwiki      |[mismatch_user_id]                   |
|107400   |dewiki       |[missing_from_target]                |
|103997   |knwiki       |[mismatch_user_id]                   |
|93103    |nostalgiawiki|[mismatch_user_id]                   |
|90458    |enwiki       |[mismatch_page]                      |
|85217    |commonswiki  |[page_was_restored, mismatch_user_id]|
|82394    |enwiktionary |[mismatch_user_id]                   |
+---------+-------------+-------------------------------------+
only showing top 20 rows
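Quick back-of-the-envelope on the counts above (plain Python; numbers copied from the query outputs):

```python
# Counts copied from the query outputs above.
jan_total = 361_884_464        # inconsistencies flagged for 2026-01
mar_total = 179_893_699        # inconsistencies flagged for 2026-03
overlap = 178_459_072          # rows flagged in all of Jan, Feb, and Mar
commons_user_id = 168_593_863  # commonswiki [mismatch_user_id] repeats

# Nearly all of March's inconsistencies are rows we already flagged
# (and supposedly emitted reconcile events for) in January and February:
print(f"repeat share of March: {overlap / mar_total:.1%}")               # ~99.2%
print(f"commonswiki share of repeats: {commons_user_id / overlap:.1%}")  # ~94.5%
```

So this is not churn from new edits; the same rows keep coming back, dominated by commonswiki mismatch_user_id.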

Let's look at a random sample of commonswiki revisions that have supposedly been reconciled over and over:

spark.sql("""
SELECT wiki_id, page_id, revision_id, reasons
FROM same_inconsistencies
WHERE wiki_id = 'commonswiki'
""").show()

+-----------+-------+-----------+--------------------+
|    wiki_id|page_id|revision_id|             reasons|
+-----------+-------+-----------+--------------------+
|commonswiki|    919|  859435026|  [mismatch_user_id]|
|commonswiki|   2963|  718177397|  [mismatch_user_id]|
|commonswiki|   9588|  969896769|[mismatch_user_text]|
|commonswiki|  13336|  865071711|[mismatch_user_text]|
|commonswiki|  13947|  848625765|  [mismatch_user_id]|
|commonswiki|  14093|  845035145|  [mismatch_user_id]|
...

Let's check the reconcile stream, see if we have issues with the emitted events:

# check all 2026 reconcile events to see if we can find the issue
spark.sql("""
SELECT *
FROM event.mediawiki_content_history_reconcile_enriched_v1
WHERE wiki_id = 'commonswiki'
  AND year = 2026
  AND month >= 1
  AND revision.rev_id IN (859435026, 718177397, 969896769, 865071711, 848625765, 845035145)
""").show()
+-------+--------------+-------+---------------------+---+----+----+----------------+---------+-----------+--------+-------+-------------+---------------+----------+----+-----+---+----+
|_schema|changelog_kind|comment|created_redirect_page| dt|meta|page|page_change_kind|performer|prior_state|revision|wiki_id|is_wmf_domain|normalized_host|datacenter|year|month|day|hour|
+-------+--------------+-------+---------------------+---+----+----+----------------+---------+-----------+--------+-------+-------------+---------------+----------+----+-----+---+----+
+-------+--------------+-------+---------------------+---+----+----+----------------+---------+-----------+--------+-------+-------------+---------------+----------+----+-----+---+----+

No events emitted at all?

When did we supposedly emit these events in February?

spark.sql("""
SELECT wiki_id, page_id, revision_id, reasons, reconcile_emit_dt
FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
WHERE computation_class = 'all-of-wiki-time'
  AND computation_dt = TIMESTAMP '2026-02-01 00:00:00'
  AND wiki_id = 'commonswiki'
  AND revision_id IN (859435026, 718177397, 969896769, 865071711, 848625765, 845035145)
""").show(truncate=False)
+-----------+-------+-----------+--------------------+-----------------------+
|wiki_id    |page_id|revision_id|reasons             |reconcile_emit_dt      |
+-----------+-------+-----------+--------------------+-----------------------+
|commonswiki|2963   |718177397  |[mismatch_user_id]  |2026-02-04 11:34:45.86 |
|commonswiki|13947  |848625765  |[mismatch_user_id]  |2026-02-05 10:13:17.871|
|commonswiki|919    |859435026  |[mismatch_user_id]  |2026-02-05 12:02:35.273|
|commonswiki|13336  |865071711  |[mismatch_user_text]|2026-02-05 12:57:17.831|
|commonswiki|9588   |969896769  |[mismatch_user_text]|2026-02-05 18:19:38.815|
|commonswiki|14093  |845035145  |[mismatch_user_id]  |2026-02-05 09:39:45.517|
+-----------+-------+-----------+--------------------+-----------------------+

Do we have failures in reconcile flink app?

spark.sql("""
SELECT count(1) as count
FROM event.mw_content_history_reconcile_enrich_error
WHERE year = 2026
  AND month >= 1
""").show()
+-----+
|count|
+-----+
|    0|
+-----+

I have renamed the last all-of-wiki-time reconcile run to all-of-wiki-time-first-run to be able to rerun reconcile via Airflow:

$ hostname -f
an-launcher1003.eqiad.wmnet
$ whoami
analytics

spark3-sql \
 --master yarn \
 --conf spark.dynamicAllocation.maxExecutors=16 \
 --driver-cores 4 \
 --driver-memory 16G \
 --executor-cores 1 \
 --executor-memory 32G
 

SELECT computation_dt 
FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
WHERE computation_class = 'all-of-wiki-time'
GROUP BY computation_dt
ORDER BY computation_dt DESC

computation_dt
2026-03-01 00:00:00
2026-02-01 00:00:00
2026-01-01 00:00:00
2025-12-01 00:00:00
2025-11-01 00:00:00
2025-10-01 00:00:00
2025-09-01 00:00:00
2025-08-01 00:00:00
2025-07-01 00:00:00
2025-06-01 00:00:00
Time taken: 60.035 seconds, Fetched 10 row(s)


UPDATE wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
SET computation_class = 'all-of-wiki-time-first-run'
WHERE computation_class = 'all-of-wiki-time'
  AND computation_dt = TIMESTAMP '2026-03-01 00:00:00'

Response code
Time taken: 189.607 seconds



SELECT computation_dt, computation_class
FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
WHERE computation_class LIKE 'all-of-wiki-time%'
GROUP BY computation_dt, computation_class
ORDER BY computation_dt DESC

computation_dt	computation_class
2026-03-01 00:00:00	all-of-wiki-time-first-run
2026-02-01 00:00:00	all-of-wiki-time
2026-01-01 00:00:00	all-of-wiki-time
2025-12-01 00:00:00	all-of-wiki-time
2025-11-01 00:00:00	all-of-wiki-time
2025-10-01 00:00:00	all-of-wiki-time
2025-09-01 00:00:00	all-of-wiki-time
2025-08-01 00:00:00	all-of-wiki-time
2025-07-01 00:00:00	all-of-wiki-time
2025-06-01 00:00:00	all-of-wiki-time
Time taken: 22.284 seconds, Fetched 10 row(s)

I wish we had a cleaner way to do this, perhaps we should modify the Airflow DAG to allow us to do manual runs. Anyhow, good enough for now.
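One possible shape for that cleaner mechanism (a hypothetical sketch, not the DAG's actual API; the helper name and suffix are assumptions): have manual reruns write under their own computation_class, so we never have to rename the scheduled run's rows with a manual UPDATE.

```python
def computation_class_for_run(base: str = "all-of-wiki-time",
                              manual_rerun: bool = False) -> str:
    """Hypothetical helper: tag manual reruns with a distinct suffix so their
    rows never collide with the scheduled run's rows, avoiding the ad-hoc
    UPDATE rename used above."""
    return f"{base}-manual-rerun" if manual_rerun else base

print(computation_class_for_run())                   # all-of-wiki-time
print(computation_class_for_run(manual_rerun=True))  # all-of-wiki-time-manual-rerun
```

The DAG could expose the flag as a trigger-time parameter; the point is just that the class name is derived, not hand-edited in the table.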

Double checked that mw_content_reconcile_mw_content_history_monthly DAG has picked up the fix.

Rerun can be found in Airflow here. I am, however, preemptively failing tasks before event emission, as I want to inspect the state of wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 before I let that go through.

This is the state before allowing spark_emit_reconcile_events_to_kafka to run with the fix:

spark.sql("""
SELECT count(1), computation_dt, computation_class
FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
WHERE computation_class LIKE 'all-of-wiki-time%'
GROUP BY computation_dt, computation_class
ORDER BY computation_dt DESC
""").show(truncate=False)

+---------+-------------------+--------------------------+
|count(1) |computation_dt     |computation_class         |
+---------+-------------------+--------------------------+
|122224778|2026-03-01 00:00:00|all-of-wiki-time          |    <<<<<<<<
|179893699|2026-03-01 00:00:00|all-of-wiki-time-first-run|
|245030841|2026-02-01 00:00:00|all-of-wiki-time          |
|361884464|2026-01-01 00:00:00|all-of-wiki-time          |
|2882708  |2025-12-01 00:00:00|all-of-wiki-time          |
|1134547  |2025-11-01 00:00:00|all-of-wiki-time          |
|1087936  |2025-10-01 00:00:00|all-of-wiki-time          |
|2326440  |2025-09-01 00:00:00|all-of-wiki-time          |
|3072885  |2025-08-01 00:00:00|all-of-wiki-time          |
|2873142  |2025-07-01 00:00:00|all-of-wiki-time          |
|2116596  |2025-06-01 00:00:00|all-of-wiki-time          |
+---------+-------------------+--------------------------+


I am now clearing spark_emit_reconcile_events_to_kafka so that event emission happens. I expect the MediawikiContentHistoryReconcileEnrichHighKafkaConsumerLag alert to fire; we can ignore it.

Change #1253633 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/deployment-charts@master] Scale up mw-content-history-reconcile-enrich temporarily for big reconcile.

https://gerrit.wikimedia.org/r/1253633

Change #1253633 merged by jenkins-bot:

[operations/deployment-charts@master] Scale up mw-content-history-reconcile-enrich temporarily for big reconcile.

https://gerrit.wikimedia.org/r/1253633

spark_process_reconciliation_events ingest for 2026-03-16 failed multiple times. I suspect this happened due to the same cluster issues reported on T419291#11723404: T420168 and T415002.

Rerunning as is now via https://yarn.wikimedia.org/proxy/application_1773845446826_10194


2026-03-16 successful. Pipeline is all caught up.

(We are now waiting on the next monthly reconcile to happen to see where we are at.)

Ok confirming this particular issue seems to be solved now. Note how overall inconsistencies are way down:

# run from Apr 6 2026
spark.sql("""
SELECT count(1), computation_dt
FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
WHERE computation_class = 'all-of-wiki-time'
GROUP BY computation_dt
ORDER BY computation_dt DESC
""").show()

+---------+-------------------+
| count(1)|     computation_dt|
+---------+-------------------+
| 35252504|2026-04-01 00:00:00|
|122224778|2026-03-01 00:00:00|
|245030841|2026-02-01 00:00:00|
|361884464|2026-01-01 00:00:00|
|  2882708|2025-12-01 00:00:00|
|  1134547|2025-11-01 00:00:00|
|  1087936|2025-10-01 00:00:00|
|  2326440|2025-09-01 00:00:00|
|  3072885|2025-08-01 00:00:00|
|  2873142|2025-07-01 00:00:00|
|  2116596|2025-06-01 00:00:00|
+---------+-------------------+
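Quantifying the drop (plain Python; numbers from the table above):

```python
# Counts copied from the table above.
apr_total = 35_252_504   # 2026-04, first monthly run after the fix
mar_total = 122_224_778  # 2026-03 rerun with the fix
jan_peak = 361_884_464   # 2026-01, worst month

print(f"April vs March: {1 - apr_total / mar_total:.1%} fewer inconsistencies")  # ~71.2%
print(f"April vs January peak: {apr_total / jan_peak:.1%} of peak volume")       # ~9.7%
```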

And when drilling down, the remaining reasons are largely unrelated to mismatch_user_id:

# Apr 6 2026
spark.sql("""
SELECT count(1) as count,
       reasons
FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
WHERE computation_class = 'all-of-wiki-time'
  AND computation_dt = TIMESTAMP '2026-04-01 00:00:00'
GROUP BY reasons
ORDER BY count DESC
""").show(truncate=False)

+--------+-----------------------------------------------------------------------------+
|count   |reasons                                                                      |
+--------+-----------------------------------------------------------------------------+
|27439961|[mismatch_redirect]                                                          |
|5738582 |[mismatch_user_text]                                                         |
|979219  |[missing_from_target]                                                        |
|456625  |[page_was_deleted, page_was_restored, mismatch_redirect]                     |
|255077  |[page_was_restored, mismatch_redirect]                                       |
|103623  |[missing_from_source, mismatch_user_id]                                      |
|100957  |[mismatch_page]                                                              |
|45831   |[mismatch_page, mismatch_redirect]                                           |
|36079   |[missing_from_target, page_was_deleted, page_was_restored]                   |
|30024   |[missing_from_source, mismatch_user_id, mismatch_redirect]                   |
|19002   |[missing_from_target, page_was_restored]                                     |
|10302   |[mismatch_user_text, mismatch_redirect]                                      |
|7399    |[page_was_restored, mismatch_user_text]                                      |
|5638    |[missing_from_source, page_was_deleted, mismatch_user_id]                    |
|4808    |[missing_from_target, mismatch_redirect]                                     |
|2502    |[missing_from_source]                                                        |
|2257    |[missing_from_target, page_was_deleted, page_was_restored, mismatch_redirect]|
|2228    |[page_was_deleted, page_was_restored, mismatch_user_text]                    |
|2047    |[mismatch_page, page_was_deleted]                                            |
|1683    |[mismatch_page, page_was_restored]                                           |
+--------+-----------------------------------------------------------------------------+
only showing top 20 rows
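And the single reason mismatch_redirect dominates what is left (plain Python; numbers from the table above):

```python
# Counts copied from the reasons breakdown above.
redirect_rows = 27_439_961  # [mismatch_redirect] alone
apr_total = 35_252_504      # all 2026-04 inconsistencies

print(f"mismatch_redirect share of April: {redirect_rows / apr_total:.1%}")  # ~77.8%
```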

The main offender above, mismatch_redirect, looks like legitimate reconcile workload from T417596.