
Configuration error results in NULL content_body and content_format
Closed, ResolvedPublic

Description

On T381322, we created a renamed stream, mediawiki.content_history_reconcile_enriched, that should enrich reconcile events with content_body and content_format fetched from the MW API.

On T381375#10427763, however, we realized that every event being generated is silently failing enrichment:

Actually, every single row on event.mediawiki_content_history_reconcile_enriched_v1 is compromised, regardless of MCR or not:

presto:event> select year, month, day, count(1) as count from event.mediawiki_content_history_reconcile_enriched_v1 where revision.content_slots['main'].content_body is NULL group by year, month, day order by year, month, day;
 year | month | day |  count   
------+-------+-----+----------
 2024 |    12 |  19 |      950 
 2024 |    12 |  20 |    11138 
 2024 |    12 |  21 | 50939588 
 2024 |    12 |  22 | 50499466 
 2024 |    12 |  23 | 49399621 
 2024 |    12 |  24 | 49198996 
 2024 |    12 |  25 | 44499110 
 2024 |    12 |  26 | 43497915 
 2024 |    12 |  27 | 47597552 
 2024 |    12 |  28 | 49996717 
 2024 |    12 |  29 |  1521781 
 2024 |    12 |  30 |    12822 
 2024 |    12 |  31 |    14364 
 2025 |     1 |   1 |  4069711 
(14 rows)

Query 20250102_194941_00213_nx7ym, FINISHED, 15 nodes
Splits: 3,529 total, 3,529 done (100.00%)
[Latency: client-side: 0:13, server-side: 0:13] [391M rows, 55.4GB] [29.2M rows/s, 4.13GB/s]

presto:event> select year, month, day, count(1) as count from event.mediawiki_content_history_reconcile_enriched_v1 where revision.content_slots['main'].content_body is NOT NULL group by year, month, day order by year, month, day;
 year | month | day | count 
------+-------+-----+-------
(0 rows)

Random examples:

presto:event> select wiki_id, revision.rev_id, revision.content_slots['main'] from event.mediawiki_content_history_reconcile_enriched_v1 limit 10;
   wiki_id    |  rev_id   |                                                                                     _col2                                                            >
--------------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------>
 liwiktionary |    907945 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=d6aqvx842jujpdcv4hoh8gml4cvhkg0, content_size=107, origin_rev_id=907945>
 cawikisource |    176542 | {content_body=null, content_format=null, content_model=proofread-page, content_sha1=ie8dttghk236j5u80h4p9yrabgm5d5b, content_size=2632, origin_rev_id>
 srwiki       |  28937523 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=ikaehvsxemtljoq735ho453c1h47m8r, content_size=158, origin_rev_id=289375>
 trwiki       |  34553164 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=jfkg54p6qro8gqsldhrnq3nl15hyt9r, content_size=1433, origin_rev_id=34553>
 ruwiki       | 142346895 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=fcc2ey6rr8848rbx8bpupvi1rozq74n, content_size=41608, origin_rev_id=1423>
 idwiki       |  26696316 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=kn2cc2wukm8nm7xhckztskb9vzlny0i, content_size=1219, origin_rev_id=26696>
 kowiki       |  38377728 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=468nucbvtwoxmmjitbqjfmmms5l2pu3, content_size=1241, origin_rev_id=38377>
 zhwiki       |  85456613 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=pqrnxwvdn1164ppgqjrz0ylylnzse33, content_size=1298, origin_rev_id=85456>
 ruwiki       | 142337496 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=g7t12nckfl65gnfpb8puwpl6pkz48li, content_size=35, origin_rev_id=1423374>
 zhwiki       |  85453215 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=1yoeddo8uw1rafu5620do1ejpa4rmkp, content_size=58, origin_rev_id=8545321>
(10 rows)

I have stopped all data pipelines related to this table until we figure this out.

Unfortunately, we have been ingesting this stream into wmf_dumps.wikitext_raw_rc2 over the winter holidays, so the majority of these compromised events have already been ingested.

In this task we should figure out what is going on and fix appropriately.

Event Timeline

xcollazo changed the task status from Open to In Progress.Jan 3 2025, 3:01 PM
xcollazo triaged this task as Unbreak Now! priority.
xcollazo moved this task from Sprint Backlog to In Process on the Dumps 2.0 (Kanban Board) board.

There are only a few INFO logs in logstash for the 20-Dec to 25-Dec period, where we know many NULLs were generated.

I am blocked here unfortunately. I think the best thing to do is to wait until the team comes back on Monday Jan 6 and reassess.

( Side note: I've suggested on https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge_requests/51#note_121360 to add a check for these missing values when doing our data quality metrics. )

@xcollazo follow-up from [[ eqiad.mediawiki.content_history_reconcile_enriched.v1 | slack ]].

The streaming app consumes from eqiad.mediawiki.content_history_reconcile.v1 and produces into eqiad.mediawiki.content_history_reconcile_enriched.v1.

However, the dumps_reconcile_wikitext_raw_daily airflow dag was configured to produce into eqiad.mediawiki.content_history_reconcile_enriched.v1. The enrichment app did not see those messages.

There's a config change in the attached patch.
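The misconfiguration above can be summarized in a small sketch: the Airflow DAG was producing directly into the stream the enrichment app writes to, bypassing the app's input. The dict structure and function below are illustrative only (not the actual config keys of either component); the topic names are the real ones from this task.

```python
# Illustrative sketch of the bug: the DAG and the enrichment app must agree
# on the handoff topic. Key names here are hypothetical; topic names are real.

ENRICH_APP = {
    "input_topic": "eqiad.mediawiki.content_history_reconcile.v1",
    "output_topic": "eqiad.mediawiki.content_history_reconcile_enriched.v1",
}

# Buggy DAG config: it produced straight into the *enriched* topic,
# so the enrichment app never saw the events.
DAG_BUGGY_OUTPUT = "eqiad.mediawiki.content_history_reconcile_enriched.v1"
# Fixed DAG config: produce into the app's input topic instead.
DAG_FIXED_OUTPUT = "eqiad.mediawiki.content_history_reconcile.v1"

def wiring_ok(dag_output_topic: str, app_config: dict) -> bool:
    """The DAG must feed the enrichment app's input, not bypass it."""
    return dag_output_topic == app_config["input_topic"]

assert not wiring_ok(DAG_BUGGY_OUTPUT, ENRICH_APP)
assert wiring_ok(DAG_FIXED_OUTPUT, ENRICH_APP)
```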

Totally my bad. Thanks for the fix @gmodena. Will apply and clean things up now.

Ok now: since we currently have no way of detecting bad content_body or content_format (these fields are not part of the MariaDB database), let's remove all offending rows manually and rerun the all-of-wiki-time reconcile. Separately, let's consider whether we could at least add a NULL check on these fields.
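A NULL check of the kind suggested could look like the sketch below. It is shown on plain dicts so it is self-contained; in production it would be a Spark aggregation over revision.content_slots['main']. The field names follow the event schema; the function name and threshold are illustrative.

```python
# Sketch of a NULL-rate data quality metric for the main content slot.
# In this incident the rate was 1.0 (every row compromised).

def null_content_rate(rows):
    """Fraction of rows whose main slot is missing content_body or content_format."""
    if not rows:
        return 0.0
    bad = sum(
        1 for r in rows
        if r["content_body"] is None or r["content_format"] is None
    )
    return bad / len(rows)

sample = [
    {"content_body": None, "content_format": None},                  # compromised event
    {"content_body": "== Heading ==", "content_format": "text/x-wiki"},  # healthy event
]
print(null_content_rate(sample))  # 0.5 on this sample
```

An alert firing well below a rate of 1.0 would have caught this incident within the first partition instead of after two weeks of ingestion.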

xcollazo renamed this task from Flink reconciliation application fails to resolve content_body and content_format to Configuration error results in NULL content_body and content_format.Jan 6 2025, 3:03 PM
# config
spark = wmfdata.spark.create_custom_session(
    master='yarn',
    spark_config={
        "spark.shuffle.service.name": 'spark_shuffle_3_3',
        "spark.shuffle.service.port": '7338',
        "spark.yarn.archive": "hdfs:///user/spark/share/lib/spark-3.3.2-assembly.zip",
        "spark.dynamicAllocation.maxExecutors": 256,
        "spark.executor.memory": "8g",
        "spark.driver.memory": "32g",
        "spark.driver.cores": "16",
        "spark.executor.cores": "1",
        "spark.sql.shuffle.partitions": 1024,
        "spark.driver.maxResultSize": "8G",
        ##
        # extras to make Iceberg and MariaDB work on 3.3.2:
        ##
        "spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.6.1",
        "spark.jars.ivySettings": "/etc/maven/ivysettings.xml",  # fix jar pulling
        "spark.sql.iceberg.locality.enabled": False,
    }
)


# figure out how many rows are bad on all but the biggest wikis
spark.sql("""
SELECT count(1) as count
FROM wmf_dumps.wikitext_raw_rc2
WHERE revision_content_slots['main'].content_body IS NULL
  AND wiki_db NOT IN ('commonswiki', 'enwiki', 'wikidatawiki')
""").show()

+-------+
|  count|
+-------+
|3047048|
+-------+

# delete all bad data from all but the biggest wikis
spark.sql("""
DELETE
FROM wmf_dumps.wikitext_raw_rc2
WHERE revision_content_slots['main'].content_body IS NULL
  AND wiki_db NOT IN ('commonswiki', 'enwiki', 'wikidatawiki')
""").show()

The DELETE took ~2 hours. On to the other wikis...

# config
spark = wmfdata.spark.create_custom_session(
    master='yarn',
    spark_config={
        "spark.shuffle.service.name": 'spark_shuffle_3_3',
        "spark.shuffle.service.port": '7338',
        "spark.yarn.archive": "hdfs:///user/spark/share/lib/spark-3.3.2-assembly.zip",
        "spark.dynamicAllocation.maxExecutors": 256,
        "spark.executor.memory": "10g",
        "spark.driver.memory": "32g",
        "spark.driver.cores": "16",
        "spark.executor.cores": "1",
        "spark.sql.shuffle.partitions": 5120,
        "spark.driver.maxResultSize": "8G",
        ##
        # extras to make Iceberg and MariaDB work on 3.3.2:
        ##
        "spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.6.1",
        "spark.jars.ivySettings": "/etc/maven/ivysettings.xml",  # fix jar pulling
        "spark.sql.iceberg.locality.enabled": False,
    }
)

spark.sql("""
SELECT count(1) as count
FROM wmf_dumps.wikitext_raw_rc2
WHERE revision_content_slots['main'].content_body IS NULL
  AND wiki_db IN ('wikidatawiki')
""").show()

+--------+
|   count|
+--------+
|10503623|
+--------+

spark.sql("""
DELETE
FROM wmf_dumps.wikitext_raw_rc2
WHERE revision_content_slots['main'].content_body IS NULL
  AND wiki_db IN ('wikidatawiki')
""").show()

DELETE took 17 minutes.

# config
Same as above

spark.sql("""
SELECT count(1) as count
FROM wmf_dumps.wikitext_raw_rc2
WHERE revision_content_slots['main'].content_body IS NULL
  AND wiki_db IN ('enwiki')
""").show()

+------+
| count|
+------+
|826797|
+------+

spark.sql("""
DELETE
FROM wmf_dumps.wikitext_raw_rc2
WHERE revision_content_slots['main'].content_body IS NULL
  AND wiki_db IN ('enwiki')
""").show()

DELETE took 57 minutes.

# config
Same as above

spark.sql("""
SELECT count(1) as count
FROM wmf_dumps.wikitext_raw_rc2
WHERE revision_content_slots['main'].content_body IS NULL
  AND wiki_db IN ('commonswiki')
""").show()

+---------+
|    count|
+---------+
|370620409|
+---------+

spark.sql("""
DELETE
FROM wmf_dumps.wikitext_raw_rc2
WHERE revision_content_slots['main'].content_body IS NULL
  AND wiki_db IN ('commonswiki')
""").show()

DELETE took ~6 minutes.
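The count + DELETE pair above was repeated verbatim for each wiki group. A small helper could have built both statements from one predicate; this is a sketch (function and constant names are mine, table and column names are the ones used above):

```python
# Build the repeated count/DELETE statements over the bad rows.
TABLE = "wmf_dumps.wikitext_raw_rc2"
PREDICATE = "revision_content_slots['main'].content_body IS NULL"

def bad_rows_sql(wikis, op="SELECT count(1) as count", include=True):
    """Build the statement for a set of wikis.

    include=False excludes the listed wikis instead
    (the 'all but the biggest wikis' case above)."""
    wiki_list = ", ".join(f"'{w}'" for w in sorted(wikis))
    membership = "IN" if include else "NOT IN"
    return (
        f"{op}\nFROM {TABLE}\n"
        f"WHERE {PREDICATE}\n"
        f"  AND wiki_db {membership} ({wiki_list})"
    )

# e.g. spark.sql(bad_rows_sql({'enwiki'}, op='DELETE')).show()
print(bad_rows_sql({"commonswiki", "enwiki", "wikidatawiki"}, include=False))
```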

Confirming all offending rows are now gone:

spark.sql("""
SELECT count(1) as count
FROM wmf_dumps.wikitext_raw_rc2
WHERE revision_content_slots['main'].content_body IS NULL
""").show()

+-----+
|count|
+-----+
|    0|
+-----+

Ok, let's now reconcile.

I have triggered a manual run of dumps_reconcile_wikitext_raw_monthly here.

--computation_dt on this manual run resolved to 2025-01-01T00:00:00.

I want to run all spark_consistency_check tasks, but I have preemptively marked all spark_emit_reconcile_events_to_kafka tasks as failed, since I want to control manually how they generate data to be ingested by the dumps_merge_events_to_wikitext_raw_daily DAG.

I will let this run overnight.


I've also skipped the runs of dumps_merge_events_to_wikitext_raw_daily from 2024-12-31 to 2025-01-05. I will now unpause dumps_reconcile_wikitext_raw_daily from 2024-12-30. This should recover all the recent missing events.

DAGs dumps_merge_events_to_wikitext_raw_daily and dumps_reconcile_wikitext_raw_daily are now all caught up.

Now running spark_emit_reconcile_events_to_kafka for enwiki with 9928427 inconsistent rows. Should take ~1 hour.

This is now done, with the following emits:

spark.sql("""
SELECT count(1) as count, wiki_db, reconcile_emit_dt
FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
WHERE computation_dt = '2025-01-01T00:00:00'
  AND computation_class = 'all-of-wiki-time'
  AND wiki_db = 'enwiki'
GROUP BY wiki_db, reconcile_emit_dt
ORDER BY reconcile_emit_dt ASC NULLS LAST
""").show(truncate=False)
[Stage 38:>                                                         (0 + 1) / 1]
+-------+-------+-----------------------+
|count  |wiki_db|reconcile_emit_dt      |
+-------+-------+-----------------------+
|3641201|enwiki |2025-01-07 13:12:53.27 |
|2109097|enwiki |2025-01-07 13:15:08.146|
|2133667|enwiki |2025-01-07 13:17:40.39 |
|754142 |enwiki |2025-01-07 13:19:44.027|
|546055 |enwiki |2025-01-07 13:22:01.094|
|220736 |enwiki |2025-01-07 13:27:21.475|
|100000 |enwiki |2025-01-07 13:31:54.982|
|100000 |enwiki |2025-01-07 13:33:08.629|
|100000 |enwiki |2025-01-07 13:34:27.092|
|100000 |enwiki |2025-01-07 13:36:16.202|
|100000 |enwiki |2025-01-07 13:38:31.085|
|23529  |enwiki |2025-01-07 13:39:56.766|
+-------+-------+-----------------------+

However, I see no data on the Grafana dashboard for the Flink app for the last 1h.

Confirming that all I can see on the enrichment stream are canary events, so the app appears not to be running:

kafkacat -C -b kafka-jumbo1007.eqiad.wmnet:9092 -t eqiad.mediawiki.content_history_reconcile_enriched.v1 -o -1 -e
{"$schema":"/mediawiki/page/change/1.2.0","changelog_kind":"update","comment":"changed a thing","dt":"2021-01-01T00:00:00.0Z","meta":{"domain":"canary","dt":"2025-01-07T13:35:00.000Z","stream":"mediawiki.content_history_reconcile_enriched.v1","...

Flink app is back online as per @BTullis's help via slack thread.

Data shown in grafana.

Missing alerts patch has been merged (part of T379362).

App is now making progress, but slow:

We are going very slowly: for the big reconcile we are doing now, enwiki alone produced ~9M events, and the Flink app is doing ~50 events/sec. It would finish in months.
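A back-of-envelope check of that estimate, using the ~9.9M enwiki event count from above and assuming the full backlog (dominated by commonswiki, per the emit counts later in this task) is roughly 400M events:

```python
# Rough ETA at the observed throughput. The 400M backlog figure is an
# assumption, not a measured total.
events = 9_928_427            # inconsistent enwiki rows noted above
backlog = 400_000_000         # rough all-wiki total (assumption)
rate_slow = 50                # events/s observed on the Flink app

print(f"enwiki alone: {events / rate_slow / 86_400:.1f} days")   # ~2.3 days
print(f"full backlog: {backlog / rate_slow / 86_400:.0f} days")  # ~93 days, i.e. months
```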

I've asked to bump resources 10x for the time being. We'll see if that is reasonable, will report back.

After discussing with @gmodena, he suggested we check with MW SRE folks first. We are asking them what target RPS against the MW API would be OK for the flink app.

From #wikimedia-sre:

<xcollazo> Xabriel J. Collazo Mojica
Hello folks. Heads up: As part of https://phabricator.wikimedia.org/T382953, which runs a reconciliation process, the Data Eng team intends to bump up our Action API usage for the next 2 weeks or so. What is a reasonable rps target for temporary heavy internal use? Would 10K rps be ok? This would mostly be hitting enwiki, commonswiki and wikidatawiki APIs.

12:19 PM <akosiaris>
xcollazo: 10k rps is like 3 times the usual traffic patterns for mw-api-int (see grafana dashboard at https://grafana.wikimedia.org/d/t7EiVbdVk/mw-api-int?orgId=1&viewPanel=10&from=now-7d&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main&var-container_name=All)
12:19 PM
go for an order of magnitude less I 'd say and it's probably ok

Change #1108801 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/deployment-charts@master] dse-k8s: content-history: Temporarily 10x resources for initial reconcile

https://gerrit.wikimedia.org/r/1108801

Change #1108801 merged by jenkins-bot:

[operations/deployment-charts@master] dse-k8s: content-history: Temporarily 10x resources for initial reconcile

https://gerrit.wikimedia.org/r/1108801

Long story in slack, but we tried to redeploy the k8s Flink app following the regular procedure, and it failed multiple times.

Apparently we hit the same issues as in T347521.

@BTullis force killed everything and put the app back online.

However, as of now, the app is not making progress even though it has 20 Flink TaskManagers available.

Still investigating....

The reason was that the Kafka offset got set to latest, which is problematic as it doesn't seem to be transactional.

@Ottomata mentions that the offset is kept, and we should be able to reset, but I cannot find docs on that, so the easier way to make progress here is to resend the enwiki events from T382953#10437310.

Thus I ran the following as analytics:

UPDATE wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
SET reconcile_emit_dt = NULL
WHERE computation_dt = '2025-01-01T00:00:00'
  AND computation_class = 'all-of-wiki-time'
  AND wiki_db IN ('enwiki')

We are now rerunning the emit to Kafka via this Airflow task.

Flink app processing at 1000-2000 events/s, which creates 1000-3000 RPS for MW API, so we are right on target as agreed with MW SRE on T382953#10437712.

enwiki is pretty much done already, which is surprising; I thought it would take half a day.

Let's start emitting for wikidatawiki via this Airflow task.

Emitting for commonswiki via this Airflow task.

The spark_process_reconciliation_events task run to ingest the enwiki and wikidatawiki events failed. I have bumped the Spark partitions from 1024 to 10240 and executors from 64 to 128, and it is currently running via this Airflow task.

commonswiki continues to recover, but it looks like it will take many days.

Thus, let's move on with reconcile for all other wikis as well. I just restarted them all, except enwiki and wikidatawiki.

Flink app is now idle, meaning all events have been emitted. Confirming on inconsistencies table:

spark.sql("""
SELECT count(1) as count, wiki_db, reconcile_emit_dt is not null as emitted
FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
WHERE computation_dt = '2025-01-01T00:00:00'
  AND computation_class = 'all-of-wiki-time'
GROUP BY wiki_db, emitted
ORDER BY count DESC
""").show(10, truncate=False)

+---------+------------+-------+
|count    |wiki_db     |emitted|
+---------+------------+-------+
|373136501|commonswiki |true   |
|11932174 |wikidatawiki|true   |
|9928427  |enwiki      |true   |
|2048873  |dewiki      |true   |
|1185119  |frwiki      |true   |
|1039269  |ruwiki      |true   |
|932953   |eswiki      |true   |
|771935   |zhwiki      |true   |
|646154   |labswiki    |false  |
|568189   |itwiki      |true   |
+---------+------------+-------+
only showing top 10 rows

Note that labswiki has failed to emit due to T382645, to be confirmed fixed over there.

Ok this particular issue is now taken care of. Closing.

Change #1114790 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/deployment-charts@master] Scale down mw-content-history-reconcile-enrich for nominal events intake

https://gerrit.wikimedia.org/r/1114790

Change #1114790 merged by jenkins-bot:

[operations/deployment-charts@master] Scale down mw-content-history-reconcile-enrich for nominal events intake

https://gerrit.wikimedia.org/r/1114790