On T381322, we created a renamed stream mediawiki.content_history_reconcile_enriched that should enrich reconcile events with content_body and content_format fetched from the MW API.
In T381375#10427763, however, we realized that all the events being generated are silently failing to be enriched:
In T381375#10427763, @xcollazo wrote:
Actually, every single row on event.mediawiki_content_history_reconcile_enriched_v1 is compromised, regardless of MCR or not:
presto:event> select year, month, day, count(1) as count from event.mediawiki_content_history_reconcile_enriched_v1 where revision.content_slots['main'].content_body is NULL group by year, month, day order by year, month, day;
 year | month | day |  count
------+-------+-----+----------
 2024 |    12 |  19 |      950
 2024 |    12 |  20 |    11138
 2024 |    12 |  21 | 50939588
 2024 |    12 |  22 | 50499466
 2024 |    12 |  23 | 49399621
 2024 |    12 |  24 | 49198996
 2024 |    12 |  25 | 44499110
 2024 |    12 |  26 | 43497915
 2024 |    12 |  27 | 47597552
 2024 |    12 |  28 | 49996717
 2024 |    12 |  29 |  1521781
 2024 |    12 |  30 |    12822
 2024 |    12 |  31 |    14364
 2025 |     1 |   1 |  4069711
(14 rows)

Query 20250102_194941_00213_nx7ym, FINISHED, 15 nodes
Splits: 3,529 total, 3,529 done (100.00%)
[Latency: client-side: 0:13, server-side: 0:13] [391M rows, 55.4GB] [29.2M rows/s, 4.13GB/s]

presto:event> select year, month, day, count(1) as count from event.mediawiki_content_history_reconcile_enriched_v1 where revision.content_slots['main'].content_body is NOT NULL group by year, month, day order by year, month, day;
 year | month | day | count
------+-------+-----+-------
(0 rows)

Random examples:
presto:event> select wiki_id, revision.rev_id, revision.content_slots['main'] from event.mediawiki_content_history_reconcile_enriched_v1 limit 10;
   wiki_id    |  rev_id   | _col2 >
--------------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------>
 liwiktionary |    907945 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=d6aqvx842jujpdcv4hoh8gml4cvhkg0, content_size=107, origin_rev_id=907945>
 cawikisource |    176542 | {content_body=null, content_format=null, content_model=proofread-page, content_sha1=ie8dttghk236j5u80h4p9yrabgm5d5b, content_size=2632, origin_rev_id>
 srwiki       |  28937523 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=ikaehvsxemtljoq735ho453c1h47m8r, content_size=158, origin_rev_id=289375>
 trwiki       |  34553164 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=jfkg54p6qro8gqsldhrnq3nl15hyt9r, content_size=1433, origin_rev_id=34553>
 ruwiki       | 142346895 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=fcc2ey6rr8848rbx8bpupvi1rozq74n, content_size=41608, origin_rev_id=1423>
 idwiki       |  26696316 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=kn2cc2wukm8nm7xhckztskb9vzlny0i, content_size=1219, origin_rev_id=26696>
 kowiki       |  38377728 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=468nucbvtwoxmmjitbqjfmmms5l2pu3, content_size=1241, origin_rev_id=38377>
 zhwiki       |  85456613 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=pqrnxwvdn1164ppgqjrz0ylylnzse33, content_size=1298, origin_rev_id=85456>
 ruwiki       | 142337496 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=g7t12nckfl65gnfpb8puwpl6pkz48li, content_size=35, origin_rev_id=1423374>
 zhwiki       |  85453215 | {content_body=null, content_format=null, content_model=wikitext, content_sha1=1yoeddo8uw1rafu5620do1ejpa4rmkp, content_size=58, origin_rev_id=8545321>
(10 rows)

I have stopped all data pipelines related to this table until we figure this out.
Unfortunately, we have been ingesting this stream into wmf_dumps.wikitext_raw_rc2 over the winter holidays, so we have already ingested the majority of these compromised events.
In this task we should figure out what is going on and fix appropriately.
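To check whether a fix is taking effect, the NULL and NOT NULL counts from the quoted comment could be folded into a single pass over the table. This is only a sketch: it assumes Presto's count_if aggregate is available on our cluster, and the aliases null_body and non_null_body are made up here for illustration.

```sql
-- Sketch: per-day counts of enriched vs. non-enriched rows in one scan.
-- null_body / non_null_body are illustrative aliases, not existing columns.
select
    year,
    month,
    day,
    count_if(revision.content_slots['main'].content_body is null)     as null_body,
    count_if(revision.content_slots['main'].content_body is not null) as non_null_body
from event.mediawiki_content_history_reconcile_enriched_v1
group by year, month, day
order by year, month, day;
```

For the days listed in the quoted comment, non_null_body should be 0 on every row; once enrichment works, newer days should start showing non-zero non_null_body values.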