
[Dumps 2] Reconciliation mechanism to detect and fetch missing/mismatched revisions
Closed, ResolvedPublic

Description

Data quality checks show that we are missing or have mismatched metadata for a significant number of revisions on wmf_dumps.wikitext_raw_rc2.

Missing or bad revisions.png (449×570 px, 32 KB)
Bad revisions.png (449×561 px, 44 KB)

Code that generated these figures is at: https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/blob/main/notebooks/Can_we_emit_wiki_db__revision_id_pairs_for_reconciliation.ipynb?ref_type=heads

Status

Last updated: 2024-10

We adopted a reconciliation strategy predicated on:

  • a batch job that runs periodically against analytics mysql replicas to detect missed/mismatched revisions.
  • the job will generate "reconciliation events" for the missing/mismatched revisions.
  • enrichment of these events (wikitext and redirect info) is performed by a streaming enrichment application that piggybacks on page_content_change logic (and mediawiki-event-enrichment infra).
  • a batch job will join existing and reconciled data and write into the content history table.

A design doc for this approach, as well as alternatives we considered, currently lives in a Google Doc. We'll move it to Wikitech once the current Dumps 2.0 implementation phase is concluded and the docs are finalized.


The general idea for the reconciliation mechanism is as follows:

Reconciliation mechanism

The Dumps 2.0 project strives to have a correct and complete copy of all the revisions from all wikis over all of wiki time.

For recent data, we leverage data produced by Event Platform, namely tables event.mediawiki_page_content_change_v1 and event.mediawiki_revision_visibility_change. For historical data, we currently backfill from wmf.mediawiki_wikitext_history, which itself is populated via importing Dumps 1.0 output.

None of these systems guarantee complete, or correct, data. Event Platform, or the upstream Event Bus, may fail and miss events. Similarly, Dumps 1.0 frequently skips rows due to runtime errors. Additionally, we want to avoid the dependency on Dumps 1.0 infrastructure. For this, we need a reconciliation mechanism that should:

  • Replace the current backfill mechanism, thus decoupling us from Dumps 1.0.
  • Provide a way to sense whether revisions are missing or incomplete.
  • Provide a way to fetch revisions that are missing or incomplete.

Provide a way to sense whether revisions are missing or incomplete.

As part of our work to create data drift metrics (T354761), we also made a proof of concept mechanism to connect the Analytics Replicas to Apache Spark. We propose to use a similar mechanism to find inconsistencies between the source of truth and our data lake table in two phases: Recent Data and Historical Data. Both phases would utilize the same mechanism:

For each wiki do:

  • Fetch the revisions from wmf.wikitext_raw as of a time window (WHERE revision_dt >= dt1 AND revision_dt <= dt2)
  • Fetch the revisions from the Analytics Replica as of a time window (WHERE revision_dt >= dt1 AND revision_dt <= dt2)
  • Compare both, and determine whether we are missing records. If so, keep them in a list.
  • Compare both, and determine whether records disagree in terms of metadata and/or content. If so, keep them in a list.
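Stripped of Spark specifics, the comparison in the last two steps amounts to set logic over (wiki_db, revision_id) keys. A minimal illustrative sketch, where the fingerprint values and function name are assumptions (the real job would express this as Spark joins between the replica extract and wmf.wikitext_raw):

```python
# Illustrative only: rows are keyed by (wiki_db, revision_id); the values
# stand in for metadata/content fingerprints used to detect mismatches.
def find_offending(lake_rows, replica_rows):
    # Missing: present on the replica (source of truth) but absent from the lake.
    missing = [k for k in replica_rows if k not in lake_rows]
    # Mismatched: present on both sides, but with disagreeing fingerprints.
    mismatched = [
        k for k, v in replica_rows.items()
        if k in lake_rows and lake_rows[k] != v
    ]
    return missing, mismatched

replica = {("enwiki", 1): "sha1-aaa", ("enwiki", 2): "sha1-bbb", ("enwiki", 3): "sha1-ccc"}
lake    = {("enwiki", 1): "sha1-aaa", ("enwiki", 2): "sha1-XXX"}

missing, mismatched = find_offending(lake, replica)
# missing holds ("enwiki", 3); mismatched holds ("enwiki", 2)
```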

For the 'Recent data' run, we would run the reconciliation mechanism only over a window covering the last day, minus a gap to account for our ingestion latency (typically ~1 hour). We will run this daily.

For the 'Historical data' run, we would run the reconciliation mechanism over the entire history of a particular wiki, likely in windowed chunks so as not to overwhelm the replicas. We will run this monthly.
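The (dt1, dt2) window bounds for the daily 'Recent data' run could be derived as sketched below; the gap and span values mirror the numbers above, and the helper name is hypothetical:

```python
from datetime import datetime, timedelta

# Hypothetical helper computing the (dt1, dt2) bounds for a reconciliation run.
# The window trails "now" by a gap that accounts for ingestion latency (~1 hour).
def reconciliation_window(now, gap=timedelta(hours=1), span=timedelta(days=1)):
    dt2 = now - gap   # upper bound: now minus ingestion-latency gap
    dt1 = dt2 - span  # lower bound: one window span earlier
    return dt1, dt2

# Daily 'Recent data' run; the monthly 'Historical data' run would instead
# iterate windows (e.g. span=timedelta(days=30)) across all of wiki time.
dt1, dt2 = reconciliation_window(datetime(2024, 10, 1, 12, 0))
# dt1 = 2024-09-30 11:00, dt2 = 2024-10-01 11:00
```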

At the end of this process we will have a list of offending (wiki_db, revision_id) pairs. These pairs will be kept in a datalake table, for now named wmf_dumps.wikitext_missing_or_innaccurate_rows, for a reasonable amount of time, say 90 days.

This table will be used for the next steps.

Provide a way to fetch revisions that are missing or incomplete.

Now that we have an offending (wiki_db, revision_id) list, we need to somehow fetch this data. Sensing data at the wmf.wikitext_raw level makes sense, considering that we do ingest all events from event.mediawiki_page_content_change_v1 and event.mediawiki_revision_visibility_change.

However, other systems can leverage the fact that we have detected a set of revisions that have not made it to our eventing infrastructure. Considering this, we propose not to build a bespoke mechanism that would fetch the offending revisions for our consumption alone. Rather, we propose that an upstream system, Event Bus, should have an API through which we can do the following:

  • Given a list of (wiki_db, revision_id) pairs, this API should accept them, then fetch and produce the latest state of each pair to a new 'reconciliation' Kafka topic. This Kafka topic will be consumed by a Flink job very similar to (if not the same as) the page_content_change job, and it will produce similar output. Since this new stream will be part of Event Platform, a Gobblin process will materialize these late events into the datalake under the event schema.

This way, any interested system will also be able to consume 'reconciliation' events.

In the case of Dumps 2.0, consuming these new events will require a new MERGE INTO job to be put together, very similar to the existing events_merge_into.py. In fact, hopefully just changing the source table in this pipeline should suffice as the schema should be the same.
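A hedged sketch of what that new MERGE INTO could look like, assuming the reconciled events land in their own event table. All table names here are illustrative assumptions, not the actual pipeline's; the real change would be swapping the source table in events_merge_into.py:

```python
# Hypothetical Spark SQL statement mirroring the existing events_merge_into.py
# job, but sourcing from a reconciliation event table. Names are assumptions.
RECONCILIATION_MERGE = """
MERGE INTO wmf_dumps.wikitext_raw AS target
USING event.mediawiki_page_content_change_reconcile AS source
ON  target.wiki_db = source.wiki_db
AND target.revision_id = source.revision_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""

# In the Airflow-scheduled job this would run as:
#   spark.sql(RECONCILIATION_MERGE)
# which requires a Spark session configured with Iceberg catalog support.
```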

Mine the data available in wmf_dumps.wikitext_missing_or_innaccurate_rows

We want to keep wmf_dumps.wikitext_missing_or_innaccurate_rows around for a while, as it seems useful beyond emitting reconciliation events. It is, effectively, a good source for data quality metrics, and also for understanding data issues, current and future.

From the point of view of data quality metrics, we should use this data to produce daily metrics on the 'recent' quality of the data in wmf_dumps.wikitext_raw, as well as the historical quality. This will supersede the metrics developed on T354761: Implement first set of data quality checks, considering this is a cheaper way to get similar quality/drift information.

From the point of view of data issues, in the event that we see a lot of revisions having issues (say, more than 5%), we'd want to alert on it and perhaps even stop the emission of reconciliation events so as not to overwhelm the system.
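The circuit-breaker described here could be sketched as follows; the 5% threshold comes from the paragraph above, while the function name and alerting details are assumptions:

```python
# Hypothetical guard: if too large a fraction of revisions is offending,
# alert and skip emitting reconciliation events for this run, so a systemic
# upstream failure does not flood the reconciliation topic.
ALERT_THRESHOLD = 0.05  # 5%, as suggested above

def should_emit_reconciliation(offending_count, total_revisions):
    if total_revisions == 0:
        return True  # nothing to compare against; proceed
    ratio = offending_count / total_revisions
    if ratio > ALERT_THRESHOLD:
        # The real job would page/alert rather than print.
        print(f"ALERT: {ratio:.1%} of revisions offending; halting emission")
        return False
    return True
```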

Related Objects

Status | Subtype | Assigned | Task
Resolvedxcollazo
Resolvedxcollazo
Resolvedxcollazo
ResolvedMilimetric
ResolvedMilimetric
DeclinedNone
Declinedgmodena
Resolvedxcollazo
Resolvedxcollazo
Resolvedxcollazo
Resolvedxcollazo
DuplicateNone
ResolvedJMeybohm
Resolvedxcollazo
ResolvedMilimetric
DuplicateNone
Resolvedxcollazo
DuplicateNone
Resolvedxcollazo
ResolvedBUG REPORTxcollazo
Resolvedgmodena
Resolvedxcollazo
Resolvedxcollazo
Resolvedxcollazo
Resolvedgmodena
Resolvedtchin

Event Timeline

Ottomata renamed this task from PySpark job to detect and fetch missing/corrupted revisions to Reconcillation PySpark job to detect and fetch missing/corrupted revisions.Jun 5 2024, 3:25 PM
Ottomata renamed this task from Reconcillation PySpark job to detect and fetch missing/corrupted revisions to [Dumps 2] Reconcillation PySpark job to detect and fetch missing/corrupted revisions.

@xcollazo: Question related to reconciliation idea in T120242: Eventually Consistent MediaWiki State Change Events.

You are currently planning to query MariaDB Analytics replicas to find missing events, yes? Would it be possible for MW to handle this query for you? If there was a MW API endpoint for reconciliation queries like this, would that suffice?

I ask because e.g. diffing what you have vs what is in DB sounds like a pretty expensive query, but if you can run it in analytics replicas I don't see why you couldn't get MW to do it for you.

If it is possible for MW to do it, then making a generalized reconciliation API in MW (for more than just dumps) sounds more feasible than I would expect.

Ottomata renamed this task from [Dumps 2] Reconcillation PySpark job to detect and fetch missing/corrupted revisions to [Dumps 2] Reconcillation job to detect and fetch missing/corrupted revisions.Jun 10 2024, 6:55 PM

Here is the first draft of this mechanism:

Reconciliation mechanism

The Dumps 2.0 project strives to have a correct and complete copy of all the revisions from all wikis over all of wiki time.

For recent data, we leverage data produced by Event Platform, namely tables event.mediawiki_page_content_change_v1 and event.mediawiki_revision_visibility_change. For historical data, we currently backfill from wmf.mediawiki_wikitext_history, which itself is populated via importing Dumps 1.0 output.

None of these systems guarantee complete, or correct, data. Event Platform, or the upstream Event Bus, may fail and miss events. Similarly, Dumps 1.0 frequently skips rows due to runtime errors. Additionally, we want to avoid the dependency on Dumps 1.0 infrastructure. For this, we need a reconciliation mechanism that should:

  • Replace the current backfill mechanism, thus decoupling us from Dumps 1.0.
  • Provide a way to sense whether revisions are missing or incomplete.
  • Provide a way to fetch revisions that are missing or incomplete.

Provide a way to sense whether revisions are missing or incomplete.

As part of our work to create data drift metrics (T354761), we also made a proof of concept mechanism to connect the Analytics Replicas to Apache Spark. We propose to use a similar mechanism to find inconsistencies between the source of truth and our data lake table in two phases: Recent Data and Historical Data. Both phases would utilize the same mechanism:

For each wiki do:

  • Fetch the revisions from wmf.wikitext_raw as of a time window (WHERE revision_dt >= dt1 AND revision_dt <= dt2)
  • Fetch the revisions from the Analytics Replica as of a time window (WHERE revision_dt >= dt1 AND revision_dt <= dt2)
  • Compare both, and determine whether we are missing records. If so, keep them in a list.
  • Compare both, and determine whether records disagree in terms of metadata and/or content. If so, keep them in a list.

For the 'Recent data' run, we would run the reconciliation mechanism only over a window covering the last day, minus a gap to account for our ingestion latency (typically ~1 hour). We will run this daily.

For the 'Historical data' run, we would run the reconciliation mechanism over the entire history of a particular wiki, likely in windowed chunks so as not to overwhelm the replicas. We will run this monthly.

At the end of this process we will have a list of offending (wiki_db, revision_id) pairs.

Provide a way to fetch revisions that are missing or incomplete.

Now that we have an offending (wiki_db, revision_id) list, we need to somehow fetch this data. Sensing data at the wmf.wikitext_raw level makes sense, considering that we do ingest all events from event.mediawiki_page_content_change_v1 and event.mediawiki_revision_visibility_change.

However, other systems can leverage the fact that we have detected a set of revisions that have not made it to our eventing infrastructure. Considering this, we propose not to build a bespoke mechanism that would fetch the offending revisions for our consumption alone. Rather, we propose that an upstream system, perhaps Event Platform, should have an API through which we can do the following:

  • Given a list of (wiki_db, revision_id) pairs, this API should accept them, then fetch and produce the latest state of each pair to the stream associated with table event.mediawiki_page_content_change_v1. Perhaps these events should be marked as 'reconciliation' events, so that a consumer can distinguish them from regular revisions coming from EventBus.

This way, any system that consumes event.mediawiki_page_content_change_v1 will also be able to consume 'reconciliation' events.

In the case of Dumps 2.0, consuming these new events will require either minor or no changes.

This looks like a great system to get started with. I can think of some potential snags that could come up, so as we build it let's keep an eye out for these and similar:

  • determinism. Are mediawiki_page_content_change_v1 and mediawiki_revision_visibility_change events guaranteed to always apply in the same way? If not, it's possible that applying 'reconciliation' events won't be deterministic.
  • timing. We have to figure out how new events streaming in play with the reconciliation logic. Because we could be pulling newer state in the reconciliation that is then overwritten by older events or vice versa. I remember Xabriel's logic used dates in a clever way but I'm not sure this is 100% guaranteed to not happen.
  • indices. For the (wiki_db, revision_id) pairs, we can certainly use the MariaDB indices, and we should be able to get that lightning fast even for long periods of time. If these indices don't exist, we can always ask for them to be created on these replicas. However, for the wider tuples that allow us to check metadata as well, indices would probably be too expensive. We should still check, but most likely they would be; I can't think of optimizations for that case. So I think the hard job to optimize will be the metadata reconciliation over wide time windows. The poor performance here will probably be motivation for pushing reconciliation concerns upstream to MW itself.

fetch and produce the latest state of the pair to the stream associated with table event.mediawiki_page_content_change_v1. Perhaps these events should be marked as 'reconciliation' events, so that a consumer can distinguish them from regular revisions coming from EventBus.

I like this idea!

There are probably a few variations on this but I think keeping the late/backfilled events separate from the main streams might be helpful.

It might be better to

  • produce these to a 'mediawiki.page_change.late.v1' stream of some kind (without content).
  • Then, either the mw-page-content-change-enrich Flink job, or one similar to it, could consume this stream and produce the page_content_change events, most likely to a new mediawiki.page_content_change.late.v1 stream.
  • Then, any job that needs/wants to backfill can consume the appropriate streams (or in your case, the Hive tables) and join them together.

We might be able to produce directly into the main streams, but I'm worried about late events causing unexpected problems. We should think about that a little more.

@gmodena, @dcausse thoughts?

fetch and produce the latest state of the pair to the stream associated with table event.mediawiki_page_content_change_v1. Perhaps these events should be marked as 'reconciliation' events, so that a consumer can distinguish them from regular revisions coming from EventBus.

I like this idea!

There are probably a few variations on this but I think keeping the late/backfilled events separate from the main streams might be helpful.

I need to think a bit more about this, but given how streams are ingested by downstream Gobblin consumers, I tend to agree with you here.

It might be better to

  • produce these to a 'mediawiki.page_change.late.v1' stream of some kind (without content).
  • Then, either the mw-page-content-change-enrich Flink job, or one similar to it, could consume this stream and produce the page_content_change events, most likely to a new mediawiki.page_content_change.late.v1 stream.

I like the idea of re-using mw-page-content-change-enrich; that would reduce operational burden. Do note, however, that our Python framework does not support consuming from (and producing to) multiple streams yet. It's a feature we planned for but never got the time to implement. It should not be too complex (modulo the semantics of stream composition); it's just a matter of implementing it.

Another benefit of reusing mw-page-content-change-enrich is that we would get its nontrivial retry-on-error logic and failure handling semantics for free. We can of course extract that to a module and share it across jobs if needed.

  • Then, any job that needs/wants to backfill can consume the appropriate streams (or in your case, the Hive tables) and join them together.

We might be able to produce directly into the main streams, but I'm worried about late events causing unexpected problems. We should think about that a little more.

One thing that _might_ get funky is how time is modeled in Gobblin at ingestion time.

Hm, @xcollazo @gmodena, another thing to consider: how difficult/possible will it be to reconstruct a mediawiki/page/change event from the MariaDB replicas? Xabriel's proposal has a list of wiki_db and revision_id. We could surely get more, but, perhaps we could create an HTTP API endpoint in EventBus that would cause it to produce a page_change event for a specific revision.

This would also be better for single-producer principle: EventBus is responsible for producing all page change events.

our Python framework does not support consuming from (and producing to) multiple streams yet

We could do it now with pyflink and the Event Platform integration, but we wouldn't be able to use the simple stream_manager abstraction.

Or, we could just deploy another mw-page-content-change-late-enrich Flink job. That'd mean an extra deployment in k8s to maintain though :/

@gmodena and I discussed the following in our last 1:1:

  • Is wmf_dumps.wikitext_raw the right 'place' to check whether we are missing events or not? Shouldn't we do these checks upstream? We agreed that this table is the only one that has the context to make the decision, since it is the only table (other than wmf.mediawiki_wikitext_history, which depends on Dumps 1.0) that contains all the revisions in the data lake. Thus it is currently the only place where we can check for (wiki_db, revision_id) pair completeness.
  • What happens in the event of a catastrophic failure? We figured that, although unlikely, we could get into a situation where we need to reconstruct the whole table. If Dumps 1.0, the current backfill source, is no longer around, then where do we get the revisions from? Our current Analytics infrastructure includes the typical 3 replicas for each block in HDFS, but it is all hosted in the same datacenter, so there is a nonzero chance of losing the content in HDFS. But: considering that we will be exporting this data as XML, we can reconstruct the table from this XML. The XML will not include details of the revisions that have had visibility suppression, but considering those are a small percentage of all of the revisions, we could fetch the remaining details using the monthly reconciliation.

In T358366#9831389 I asked if other fields could be added to the schema; in particular the diff between two revisions, which is frequently used by research (wikidiff). I agree with @xcollazo's concerns, but this led me to think about the implications of computing the diff separately with regard to reconciliation.

  • the diff is expensive to compute, as the parent revision might be at any point in the past and is not necessarily the most recent previous revision. The wikidiff pipeline batches jobs by page (i.e. a batch contains the full history of the pages in the batch).
  • the full diff dataset is computed for each snapshot, following the "snapshot pattern". It is not significantly cheaper to make this pipeline incremental (e.g. only append diffs for the new month of revisions), as any revision in the past can be a parent revision, so the join is still expensive.
  • so how would one go about "enriching" wmf_dumps.wikitext_raw_rc2 with a diff column? The job could filter the full history for only the pages changed in that hour (broadcast join) and then do the self join, but that would still require a full pass over the data, which seems expensive. This is certainly solvable, e.g. one could decrease the update interval, but it is tempting to instead implement the diff as a streaming "enrichment" pipeline.
  • this would look similar to the existing "page change" job, e.g. query mediawiki for the current and parent revision text and compute the diff (maybe with a cache of the previous wikitext for each page, which is the most common parent revision).
  • however, this leads to the question of correctness/reconciliation: since this diff dataset would not be derived from wmf_dumps.wikitext_raw_rc2, it would require its own reconciliation mechanism. This is an argument in favour of the "Is wmf_dumps.wikitext_raw the right 'place' to check whether we are missing events or not? Shouldn't we do these checks upstream?" point raised above.

This example is based on the diff to be concrete, but it also applies to other derived fields that aren't part of the core schema.
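For concreteness, the per-event diff step of such a streaming enrichment could look like the sketch below; difflib is a stand-in, since the actual wikidiff algorithm may differ:

```python
import difflib

# Hypothetical enrichment step: given the parent revision's wikitext (often
# the page's previous revision, so cacheable per page) and the current
# revision's wikitext, emit a unified diff to attach to the event.
def revision_diff(parent_text: str, current_text: str) -> str:
    return "\n".join(
        difflib.unified_diff(
            parent_text.splitlines(),
            current_text.splitlines(),
            lineterm="",
        )
    )

diff = revision_diff("Hello world.\nOld sentence.", "Hello world.\nNew sentence.")
# diff contains "-Old sentence." and "+New sentence."
```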

@fkaelin I think your comment is an argument for pushing the 'eventual consistency' mechanism as far upstream as possible.

Either Outbox or equivalent solution, and/or T358373#9884375.

FYI, we discussed some of this in today's Dumps 2.0 meeting. Notes here

Action items:

I'm going to try to summarize some of the discussion to help make a decision. There are details, but I think there are two main base ideas:

Reconciliation For Dumps Alone with MariaDB
  • Batch job that:
    • reconciles wikitext_raw against MariaDB analytics replica
    • selects missing data from MariaDB
    • gets missing content from MW API
    • writes to Dumps' wikitext_raw table.

Work in T367810: Spike: Can we recreate a skeleton page_change (revision_change) event from DB replica alone? seems to indicate that this is possible.

Pros
  • No MediaWiki dev work needed, this can all be Spark in Data Lake.
Cons
  • Will not help any other downstream consumers of MediaWiki events. Search, Research, WDQS, etc will all have to figure out reconciliation on their own.
Reconciliation into Event Streams with custom MediaWiki API endpoint
  • Custom MediaWiki API endpoint that:
    • Can produce 'late' page and/or revision change events upon request
  • mediawiki event enrichment Flink job that:
    • consumes late page/revision change events and produces relevant content_change events.
  • Batch job that:
    • reconciles wikitext_raw against MariaDB analytics replica
    • Queries custom MediaWiki API endpoint, requesting it to produce late/missed events
  • Dumps processing changed to:
    • Read and merge from normal event tables (+ new late event tables if we make them)
Pros
  • Solves reconciliation for others.
  • EventBus already knows how to construct and produce events
  • No extra write step to wikitext_raw table. The table sources from the same event tables every time.
Cons
  • MediaWiki dev work needed: we'd have to add an API endpoint to EventBus.
  • Flink enrichment job work likely needed, if we decide to produce new late event streams.

Does that sound about right?

There are a few combinations of the above approaches we could do (e.g. can/should we produce into event streams from Data Lake?), but as I tried to write them the combos started expanding. I think they are all sub-optimal, but we should discuss still.

Change #1048385 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Enable the MariaDB binlog on the analytics mariadb replicas

https://gerrit.wikimedia.org/r/1048385

Does that sound about right?

It does to me.

xcollazo renamed this task from [Dumps 2] Reconcillation job to detect and fetch missing/corrupted revisions to [Dumps 2] Reconcillation job to detect and fetch missing/mismatched revisions.Jun 28 2024, 3:26 PM
xcollazo updated the task description.
xcollazo renamed this task from [Dumps 2] Reconcillation job to detect and fetch missing/mismatched revisions to [Dumps 2] Reconcillation mechanism to detect and fetch missing/mismatched revisions.Jun 28 2024, 8:10 PM

Quick summary of last meeting. Luke started working on a draft of what we were talking about (see the reconciliation flow on https://miro.com/app/board/uXjVNfaohl0=/).

What we're reconciling here are two datasets:

D1. the mediawiki version of reality, as reasonably defined by replication consensus in production MariaDB clusters
D2. the dumps 2 version of reality, as materialized from page_content_change

Some thoughts that came to mind as I reflected on reconciliation:

  • D1 can't get hit with hard queries; it would be nice to hit a replica of D1 instead, because we may need to do heavy queries in cases where we're looking at large time spans, and we don't want to influence production as we do that.
  • D2 and a replica of D1 are easily queried from Spark
  • EventBus runs in production with a direct connection to D1

So my *first* instinct was to push in the direction of just creating a skeleton event from a D1-replica + D2 query. This would have almost everything we need except for textual content, as analyzed in my earlier spike on skeleton events.

@xcollazo, how will the wmf_dumps.wikitext_mismatch_rows table work? Once reconciliation of the rows is done, will mismatched records from that table be removed? Or is it more like a snapshot based run?

@xcollazo, how will the wmf_dumps.wikitext_mismatch_rows table work? Once reconciliation of the rows is done, will mismatched records from that table be removed?

I suggested in the description above that we may want to keep it for, say, 90 days, just so that we can create a data quality dashboard on top of it, or, if there is an incident with high numbers, so that we have the data to inspect. I think this goes against the data quality framework @gmodena had put together, so happy to discuss.

If we do keep the data for ~90 days, we will need something like DELETE FROM wmf_dumps.wikitext_mismatch_rows WHERE snapshot < NOW() - INTERVAL 90 DAYS to be part of the Airflow job.

Or is it more like a snapshot based run?

Right. As of now, I have a snapshot column defined like this:

snapshot            DATE            COMMENT 'the date at which the mismatch was calculated. Useful to see trends over time, and also to be able to delete data efficiently.'

I acknowledge though that this is an overload of the term snapshot as used elsewhere in the datalake. So column name suggestions are accepted... maybe calculated_at? The intent of the column is as stated in the COMMENT, as in 'We ran reconciliation on this DATE, and we found the mismatches to be all the rows that match this DATE.'

Perhaps I should also have a separate column to discriminate recent from historical runs, or perhaps it should just include the time window?

Makes sense, thank you!

column name suggestions are accepted

A calculation_dt or computation_dt makes sense to me! And, I think you want a TIMESTAMP instead of a DATE, no?

it should just include the time window?

That sounds more flexible!

xcollazo renamed this task from [Dumps 2] Reconcillation mechanism to detect and fetch missing/mismatched revisions to [Dumps 2] Reconciliation mechanism to detect and fetch missing/mismatched revisions.Sep 27 2024, 7:09 PM

Quick update on this. We adopted a reconciliation strategy predicated on:

  • a batch job that runs periodically against analytics mysql replicas to detect missed/mismatched revisions.
  • the job will generate "reconciliation events" for the missing/mismatched revisions.
  • enrichment of these events (wikitext and redirect info) is performed by a streaming enrichment application that piggybacks on page_content_change logic (and mediawiki-event-enrichment infra).
  • a batch job will join existing and reconciled data and write into the content history table.

A design doc for this approach, as well as alternatives we considered, currently lives in a Google Doc. We'll move it to Wikitech once the current Dumps 2.0 implementation phase is concluded and the docs are finalized.

Thank you, I just pasted your comment into a Status section in the task description.

so how would one go about "enriching" wmf_dumps.wikitext_raw_rc2 with a diff column? the job could filter the full history for only the pages changed in that hour (broadcast join) and then do the self join, but that would still require a full pass over the data which seems expensive. This certainly is solvable, e.g. one could decrease the update interval, but it is tempting to instead implement the diff as a streaming "enrichment" pipeline.

@fkaelin I think that because the new mediawiki_content_history (Dumps 2) Iceberg table is primary keyed by revision id, that the lookup for the parent revision won't require a full table scan. I am really not sure though, maybe @xcollazo can comment?

so how would one go about "enriching" wmf_dumps.wikitext_raw_rc2 with a diff column? the job could filter the full history for only the pages changed in that hour (broadcast join) and then do the self join, but that would still require a full pass over the data which seems expensive. This certainly is solvable, e.g. one could decrease the update interval, but it is tempting to instead implement the diff as a streaming "enrichment" pipeline.

@fkaelin I think that because the new mediawiki_content_history (Dumps 2) Iceberg table is primary keyed by revision id, that the lookup for the parent revision won't require a full table scan. I am really not sure though, maybe @xcollazo can comment?

If mediawiki_content_history is the only source, you'd have to do a self join, which will indeed trigger a full table scan.

IIRC, @fkaelin had suggested we add a column with this diff to mediawiki_content_history. We could do that in future work, perhaps by modifying the page_content_change stream to also calculate and include this diff, and ingest it into mediawiki_content_history by the usual means.

xcollazo claimed this task.