
Event stream with latest revision HTML & parent revision HTML diff
Open, High · Public · 8 Estimated Story Points

Description

It was determined that the current motivation for doing this task (T351225 & T410940) requires more than just the latest page revision html. Edit types computation requires a diff of the latest revision html against its parent.

Including both parent and latest revision html in the same event is likely to be too big for Kafka.
But a stream containing the latest revision html is still useful for many other use cases.

To support both T351225 and other use cases, we will do the following:

Build a new streaming enrichment (Flink) job that:

  • Listens to mediawiki.page_change.v1
  • Calls MW API for HTML of latest revision + parent revision
  • Emits new stream with latest html + unified diff from parent

We hope to use this as the basis for also emitting an edit types stream for T351225: Productionized Edit Types. We may choose to emit both of these streams in the same Flink pipeline, TBD.

This job will be similar to the wikitext enrichment job.
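
For a rough picture, here is a minimal sketch of the per-event enrichment step. This is plain Python, not the actual PyFlink operator; the event field paths and output field names are assumptions, though the REST endpoint matches the one used for timing tests later in this task.

import difflib

import requests


def fetch_revision_html(domain: str, rev_id: int) -> str:
    """Fetch the rendered (Parsoid) HTML of a revision via the MediaWiki REST API."""
    resp = requests.get(f"https://{domain}/w/rest.php/v1/revision/{rev_id}/html", timeout=5)
    resp.raise_for_status()
    return resp.text


def enrich(event: dict) -> dict:
    """Attach latest revision HTML and a unified diff from the parent revision."""
    # Field paths below follow the page_change event shape approximately.
    domain = event["meta"]["domain"]
    latest = fetch_revision_html(domain, event["revision"]["rev_id"])
    parent = fetch_revision_html(domain, event["prior_state"]["revision"]["rev_id"])
    diff = "\n".join(difflib.unified_diff(
        parent.splitlines(), latest.splitlines(),
        fromfile="parent", tofile="latest", lineterm="",
    ))
    return {**event, "html": latest, "html_diff_from_parent": diff}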

Why build this?
Parsing HTML is easier than parsing wikitext, and an incremental stream of changes to a page, from a point in time, is useful for training models and/or tracking how pages are changing over time.

Important Notes:

What to do with each page_change_kind?

mediawiki.page_change.v1 stream contains a page_change_kind field that indicates what kind of page change the event is representing. Not all page change events change revision content. What should the new event stream with html + diff contain for each possible page_change_kind?

(Note: T409105#11460975 may also have implications on these choices.)

We will work out these details in comments of this ticket.

See T360794#11664477.

page_change_kind  | what to do?
edit              | enrich with latest and diff with parent html
create            | enrich with latest html
move              | enrich with latest and diff with parent html
delete            | pass through, no enrichment
undelete          | enrich with latest and diff with parent html
visibility_change | enrich with latest and diff with parent html

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
fragment/mediawiki/entity/content and revision_slots - remove content_body field | repos/data-engineering/schemas-event-primary!36 | otto | content_slot_content_body_refactor | master
Enrichment based on page_change_kind | repos/data-engineering/mediawiki-event-enrichment!115 | javiermonton | feature/enrich-by-change-type | main
[WIP] Use Java Records rather than `Row` | repos/data-engineering/mediawiki-event-enrichment!107 | javiermonton | feature/html-enrichment-use-record-4 | feature/html-enrichment-3
[WIP] - HTML Content Enrich pipeline | repos/data-engineering/mediawiki-event-enrichment!106 | javiermonton | feature/html-enrichment-3 | feature/java-base-2
Move Python code to `python/` folder | repos/data-engineering/mediawiki-event-enrichment!103 | javiermonton | feature/move-pyflink-1 | main
Enrich Page Change with HTML | repos/data-engineering/mediawiki-event-enrichment!101 | javiermonton | feature/enrich_html_page_content | main
mediawiki-event-enrichment: add pipeline for enriching page change events with HTML | repos/data-engineering/mediawiki-event-enrichment!86 | mnz | mnz-html-enrichment | main

Event Timeline


@Ottomata I've updated the MR to the latest structure inside the Python folder. I'm assuming the version .dev0 would be set in the helm files, so I'll start working on it.
Maybe the MR could be merged after a review, and we can iterate on it with the new schema once it's decided in T415158.

Sweet! Just left some comments on the MR.

Change #1235827 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] topic: Flink enrichment pipeline

https://gerrit.wikimedia.org/r/1235827

Change #1236258 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] topic: Flink enrichment pipeline

https://gerrit.wikimedia.org/r/1236258

Change #1236302 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] topic: Flink enrichment pipeline

https://gerrit.wikimedia.org/r/1236302

Change #1236305 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/puppet@production] topic: New Flink application

https://gerrit.wikimedia.org/r/1236305

Change #1236305 merged by Btullis:

[operations/puppet@production] topic: New Flink application

https://gerrit.wikimedia.org/r/1236305

Change #1236302 merged by jenkins-bot:

[operations/deployment-charts@master] topic: Flink enrichment pipeline

https://gerrit.wikimedia.org/r/1236302

I have created the s3 user for the deployment of mw-page-html-content-change-enrich-next

btullis@cephosd1001:~$ sudo radosgw-admin user create --uid=mw-page-html-content-change-enrich-next --display-name=mw-page-html-content-change-enrich-next
{
    "user_id": "mw-page-html-content-change-enrich-next",
    "display_name": "mw-page-html-content-change-enrich-next",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "subusers": [],
    "keys": [
        {
            "user": "mw-page-html-content-change-enrich-next",
            "access_key": "REDACTED",
            "secret_key": "REDACTED"
        }
    ],
    "swift_keys": [],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "default_storage_class": "",
    "placement_tags": [],
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "rgw",
    "mfa_ids": []
}

I have added these secrets to hieradata/role/common/deployment_server/kubernetes.yaml in the private git repo, so they are now available to helmfile.

root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# run-puppet-agent
<snip>
Notice: /Stage[main]/Profile::Kubernetes::Deployment_server::Helmfile/File[/etc/helmfile-defaults/private/dse-k8s_services/mw-page-html-content-change-enrich-next]/ensure: created
Notice: /Stage[main]/Profile::Kubernetes::Deployment_server::Helmfile/File[/etc/helmfile-defaults/private/dse-k8s_services/mw-page-html-content-change-enrich-next/dse-k8s-eqiad.yaml]/ensure: defined content as '{sha256}21b813646e26fefde1936259ef802a75a186553daf97ad6bb2549b2a99b9bf34'
Notice: Applied catalog in 160.62 seconds

root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# ls -l /etc/helmfile-defaults/private/dse-k8s_services/mw-page-html-content-change-enrich-next/dse-k8s-eqiad.yaml
-rw-r----- 1 mwdeploy deployment 133 Feb  5 12:45 /etc/helmfile-defaults/private/dse-k8s_services/mw-page-html-content-change-enrich-next/dse-k8s-eqiad.yaml

Change #1236258 merged by jenkins-bot:

[operations/deployment-charts@master] topic: Flink enrichment pipeline

https://gerrit.wikimedia.org/r/1236258

Change #1237451 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/mediawiki-config@master] component: New Stream

https://gerrit.wikimedia.org/r/1237451

Change #1237451 merged by jenkins-bot:

[operations/mediawiki-config@master] component: mediawiki.page_html_content_change.dev0

https://gerrit.wikimedia.org/r/1237451

Change #1237870 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/mediawiki-config@master] component: mediawiki.page_html_content_change.dev0

https://gerrit.wikimedia.org/r/1237870

Change #1237870 merged by jenkins-bot:

[operations/mediawiki-config@master] component: mediawiki.page_html_content_change.dev0

https://gerrit.wikimedia.org/r/1237870

Mentioned in SAL (#wikimedia-operations) [2026-02-09T17:11:47Z] <otto@deploy2002> Started scap sync-world: Backport for [[gerrit:1237870|component: mediawiki.page_html_content_change.dev0 (T360794)]]

Mentioned in SAL (#wikimedia-operations) [2026-02-09T17:13:45Z] <otto@deploy2002> otto, javiermonton: Backport for [[gerrit:1237870|component: mediawiki.page_html_content_change.dev0 (T360794)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-02-09T17:19:22Z] <otto@deploy2002> Finished scap sync-world: Backport for [[gerrit:1237870|component: mediawiki.page_html_content_change.dev0 (T360794)]] (duration: 07m 34s)

Change #1238377 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] component: mediawiki.page_html_content_change.dev0

https://gerrit.wikimedia.org/r/1238377

Change #1238377 merged by jenkins-bot:

[operations/deployment-charts@master] component: mediawiki.page_html_content_change.dev0

https://gerrit.wikimedia.org/r/1238377

@JMonton-WMF I edited the task to include the diff.

@fkaelin is going to let us know which (python) diff library to use.

I will work on T415158: Common event data model for data derived from parsed page revision html (and more!) ASAP to account for this.

We will also need a task for emitting the edit type stream, but there are more unknowns for that (data model, pipeline layout, etc). At the very least, this html+diff stream will enable us to compute edit types in the data lake, even if we encounter roadblocks on the way to streaming edit types.

To keep you moving forward while I figure out the data model, I suggest modifying your existing staging .dev0 enrichment job to:

  • also fetch parent revision html
  • compute diff
  • store the diff in a different revision 'content slot' (this will change once we figure out the data model).

That will let us build a job now that has the relevant data, even if not in the final shape.
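
As a very rough sketch of that interim shape (the slot name, format string, and content_slots layout here are all placeholders until T415158 settles the data model):

def with_diff_slot(revision: dict, latest_html: str, diff_text: str) -> dict:
    """Put latest html in the main slot and the parent diff in an extra, placeholder slot."""
    slots = dict(revision.get("content_slots", {}))
    slots["main"] = {"content_format": "text/html", "content_body": latest_html}
    slots["html_diff_from_parent"] = {    # placeholder slot name
        "content_format": "text/x-diff",  # placeholder format
        "content_body": diff_text,
    }
    return {**revision, "content_slots": slots}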

The content diff pipeline uses diff-match-patch from Google; see the pyspark code that uses it here.

Options to consider:

  • diff_match_patch: a character-level patch format (similar to a unified diff) that is simple to generate and apply and integrates easily into PySpark/PyFlink pipelines. It is old, hasn't been changed in years, and the repo has since been archived. It works.
  • difflib: the Python standard library has a built-in way to create a standard unified diff. At first it seemed like the clear choice; however, there is no built-in way to apply a unified diff as a patch (difflib is for reporting/comparing). The absence of a default method makes me think it might be messy to do ourselves.
  • use a library to create a binary patch, like bsdiff4. The diff bytes would likely be smaller than text patches, but they are not human-readable. Applying a patch is straightforward.

Either way, there is code to execute to use the parent revision content (e.g. using a library in a UDF), so using the parent content in Spark SQL is not out of the box. Recommendation: I might be missing something and there actually is "a good choice", but from the options above I would still go with diff_match_patch.
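
For concreteness, a minimal round trip with the Python diff-match-patch package. Which direction to diff is discussed just below; here the patch transforms latest into parent, so storing the full latest plus this patch lets a consumer reconstruct the parent. The HTML strings are made up for illustration.

from diff_match_patch import diff_match_patch

dmp = diff_match_patch()

latest_html = "<p>Hello, <b>wiki</b> world!</p>"
parent_html = "<p>Hello, world!</p>"

# Character-level patch transforming latest -> parent.
patches = dmp.patch_make(latest_html, parent_html)
patch_text = dmp.patch_toText(patches)  # serializable string to put in the event

# A consumer can recover the parent revision from latest + patch.
recovered, applied_ok = dmp.patch_apply(dmp.patch_fromText(patch_text), latest_html)
assert recovered == parent_html and all(applied_ok)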

I like that difflib generates a standard (unified?) diff that can be applied using e.g. GNU patch (IIUC).

This might be more versatile than diff_match_patch, even though that does have a lot of language implementations.

If unified diff is a very standard thing, we should probably go with that. Imagine applying the patch in, I dunno, Java or JavaScript or PHP?

@Isaac @JMonton-WMF Whatcha think?

Another q:

Which direction should the diff be created? On edit, we save latest rev html. We want to be able to apply the diff as a patch to recreate the parent rev. I suppose we want diff from latest to parent? Given latest, applying patch will create parent?

> Which direction should the diff be created? On edit, we save latest rev html. We want to be able to apply the diff as a patch to recreate the parent rev. I suppose we want diff from latest to parent? Given latest, applying patch will create parent?

Yeah, I think you retain the current HTML as the full thing and save the parent as the diff. It's a little backwards perhaps, but then the stream also easily supports the use case of someone who wants to access just the HTML for every edit (but doesn't care about the diff).

Re: library choice: diff-match-patch does have its own format that is slightly different from the standard "unified diff" that would come out of difflib. I guess to Fabian's point, if we use difflib, we'd want to make sure 1) there's a standardized+simple way of regenerating the original content, and 2) that the unified diff format isn't actually much larger than diff-match-patch's.

To get specific about language, we actually want a patch (a concise description of changes that can be easily applied to regenerate the original) as opposed to a diff (which is not always concise because it privileges legibility), because we're just using this for compression.

One thing to note is that Google stopped supporting the library in 2024 and it's now community-maintained. Obviously we use that sort of stuff for all sorts of things, but worth being aware: https://pypi.org/project/diff-match-patch/

> if we use difflib, we'd want to make sure 1) there's a standardized+simple way of regenerating the original content, and 2) that the unified diff format isn't actually much larger than diff-match-patch's.

Good points. Just from reading, I'm pretty sure point 1) is true. IIUC, the unified diff it creates is the standard one, so most things that know how to apply patches can handle it, even the standard unix patch tool.

I suppose another option: just shell out via subprocess to the diff + patch commands. I expect diff to do the same thing as Python difflib though.
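
A sketch of that subprocess option, assuming GNU diff and patch are on PATH (patch's -o - writes the patched result to stdout):

import subprocess
import tempfile
from pathlib import Path


def shell_unified_diff(old: str, new: str) -> str:
    """Create a unified diff with GNU diff (exit code 1 just means 'files differ')."""
    with tempfile.TemporaryDirectory() as d:
        a, b = Path(d, "old"), Path(d, "new")
        a.write_text(old)
        b.write_text(new)
        proc = subprocess.run(["diff", "-u", str(a), str(b)],
                              capture_output=True, text=True)
        return proc.stdout


def shell_apply_patch(old: str, patch_text: str) -> str:
    """Apply a unified diff to `old` with GNU patch, returning the patched text."""
    with tempfile.TemporaryDirectory() as d:
        target = Path(d, "old")
        target.write_text(old)
        # Passing the target file explicitly makes patch ignore the temp
        # paths in the diff headers; -o - sends the result to stdout.
        proc = subprocess.run(["patch", "-o", "-", str(target)],
                              input=patch_text, capture_output=True,
                              text=True, check=True)
        return proc.stdout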

For point 2), indeed. We need to do some size estimates! If @JMonton-WMF doesn't get around to writing a ticket before I do, I will make one! (probably Monday)

Ottomata renamed this task from "Implement stream of HTML content on mw.page_change event" to "Event stream with latest revision HTML & parent revision HTML diff". Edited Mar 2 2026, 5:10 PM.
Ottomata updated the task description.

I've edited the description with a table that explains what should happen for each possible page_change_kind.

Let's work it out!

The page_content_change enrichment job adds wikitext revision content for every page_change_kind except delete. It doesn't skip any events, even if the wikitext content has not changed.

What should this stream do?

I think we want this stream to carry mostly the same events for changelog tracking as the upstream page_change stream, so yes, we should always forward unenriched events through where it makes sense.

Let's explore each page_change_kind...

  • edit: enrich with latest and diff with parent html
  • create: enrich with latest html
  • move: no new revision content for parent or latest revision. A dummy revision is created though, and the foreign key to the revision content stays the same. So rev_id technically changes, even though content does not. I think we should: enrich with latest and diff with parent html. The rendered html content will actually change after a page move, since the page title has changed. The page's title is not part of the revision wikitext content, but I betcha it is present in the rendered html.
  • delete: We can't enrich with any content: the API will return a 404 for latest and parent revisions. Do we want delete events in this html stream? I think we do. This is still a 'changelog' stream, meaning the latest event for the key (wiki_id, page_id) is meant to represent the latest state. A consumer of this stream should be able to propagate the delete event and delete its own stored state. So: pass through, no enrichment.
  • undelete: We should probably enrich with latest and diff with parent html. This is a little weird, because the parent and latest revisions were created at a prior date (potentially even years before this undelete event), but we will be enriching with what the html rendering is now.
  • visibility_change: For the edit types use case, they want to use the parent and latest html to understand what a page looked like to a user when they made a change. visibility_change events in the source page_change stream are only emitted if the revision visibility was changed on the latest revision, so an enriched html event for this particular latest + parent revision pair will already have been emitted in our new stream. Should we repeat the content again? I'm not sure, but seeing as page_content_change does repeat the content, perhaps we should too? I'm inclined to do so to keep the logic simpler, unless we have a reason not to. So: enrich with latest and diff with parent html. Not so sure about this one.
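
A tiny sketch of how the job could dispatch on page_change_kind per the above. The two enrichment callables are stand-ins for the real fetch-and-diff logic, injected here to keep the example self-contained:

def enrich_for_change_kind(event: dict, enrich_latest, enrich_latest_and_diff) -> dict:
    """Route a page_change event according to the table in the description."""
    kind = event["page_change_kind"]
    if kind == "delete":
        return event                 # pass through, no enrichment
    if kind == "create":
        return enrich_latest(event)  # first revision: nothing to diff against
    # edit, move, undelete, visibility_change: latest + parent both exist
    return enrich_latest_and_diff(event)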

I had concerns about delete since we aren't doing anything with it, but it seems good to keep it so the stream represents the state of the pages.

I'll try to adapt the code so it is clearer what we are doing with each case.

Change #1248539 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: mw-page-html-content-change

https://gerrit.wikimedia.org/r/1248539

Change #1248539 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-page-html-content-change

https://gerrit.wikimedia.org/r/1248539

Change #1248808 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich-next

https://gerrit.wikimedia.org/r/1248808

Change #1248808 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich-next

https://gerrit.wikimedia.org/r/1248808

Change #1235827 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich

https://gerrit.wikimedia.org/r/1235827

Hi @BTullis,
Could you please help us do this: https://phabricator.wikimedia.org/T360794#11587133 for the user mw-page-html-content-change-enrich? We tried deploying an application with this username and namespace in the dse-k8s-eqiad cluster and it's getting a 403 accessing S3, and I see the secret-key is null:

high-availability.storageDir: s3://mw-page-html-content-change-enrich.dse-k8s-eqiad/production/high-availability
s3.secret-key: null

I'm assuming I don't have access to do that, as I don't have sudo or access to the private repository.

Thanks!

@JMonton-WMF from this slack discussion, I'm wondering if the right thing to do is to error if the html content is too large.

The html field has a uri subfield that allows users of this data to fetch the content themselves. I wonder if a better behavior would be to set all the other fields and add a warning/message field that indicates why the content_body is absent for that particular event.

This would certainly be better for other streaming usages of HTML events (e.g. inputs to ML models), as they wouldn't have to also consume from the error stream to know that an event was missing just because its HTML was too large to fit in an event.
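
Something like this hypothetical guard (the size cap and the message field name are made up for illustration):

MAX_CONTENT_BYTES = 4 * 1024 * 1024  # hypothetical cap, tuned to Kafka limits

def with_size_guard(event: dict, html: str) -> dict:
    """Keep the event even when the html body is too large to embed."""
    if len(html.encode("utf-8")) <= MAX_CONTENT_BYTES:
        return {**event, "html": {**event.get("html", {}), "content_body": html}}
    # Omit only the body; consumers can still fetch content via html.uri.
    return {
        **event,
        "html": {**event.get("html", {}), "content_body": None},
        "message": "content_body omitted: html larger than per-event size limit",
    }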

I'm not sure though. What do you think?

> Hi @BTullis,
> Could you please help us do this: https://phabricator.wikimedia.org/T360794#11587133 for the user mw-page-html-content-change-enrich? We tried deploying an application with this username and namespace in the dse-k8s-eqiad cluster and it's getting a 403 accessing S3, and I see the secret-key is null:

I have done that for you now. I just ran a helmfile -e dse-k8s-eqiad diff from /srv/deployment-charts/helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich and I can see the tokens that I created for this application.

Please let me know if you experience any issues with it.

Change #1251298 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich

https://gerrit.wikimedia.org/r/1251298

Change #1251298 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich

https://gerrit.wikimedia.org/r/1251298

Change #1254863 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich

https://gerrit.wikimedia.org/r/1254863

Change #1254863 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich

https://gerrit.wikimedia.org/r/1254863

Change #1254938 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] mw-page-html-content-change-enrich - increase taskmanager replicas to 20

https://gerrit.wikimedia.org/r/1254938

Change #1254938 merged by jenkins-bot:

[operations/deployment-charts@master] mw-page-html-content-change-enrich - increase taskmanager replicas to 20

https://gerrit.wikimedia.org/r/1254938

We spawned the html enrichment job today and started consuming from the beginning of the mediawiki.page_change.v1 topic, i.e. from 7 days ago. Effectively a backfill test!

The pipeline is slow.

We encountered a few interesting questions to answer:

  • We increased the MW API request timeout to 5 seconds. This helped a lot, but we still see some timeouts. Any repeated timeout will slow the pipeline, as each request is retried 6 times! @fkaelin's batch code with 16 Spark executors can do a day of all-wikipedia rev and parent html in about an hour (slack thread).
  • We increased the parallelism of the Flink pipeline, first to 10 and then to 20 task managers. 10 TMs got us to about 60 msgs/sec; 20 TMs got us to about 90 msgs/sec. At this rate we can backfill 7 days in about 2 or 3 days (see the back-of-envelope check after this list), but that still seems quite slow. Why? I would expect 2x TMs to roughly double the throughput (unless the API latency is affected). Where is the bottleneck?
  • Fabian is using the Wikimedia REST API, instead of the (newer) MediaWiki REST API. Does that make a difference somehow?
  • There are a bunch of tunables we should look into: increasing source Kafka partitions, increasing the pyflink ProcessFunction batch size, increasing TM parallelism, etc.
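
A back-of-envelope check of the backfill estimate above; the ~20 msgs/sec steady input rate is the figure used in the size estimates later in this task:

backlog_days = 7
input_rate = 20   # msgs/sec still arriving while we backfill
drain_rate = 90   # msgs/sec observed with 20 task managers

backlog = backlog_days * 86400 * input_rate             # ~12.1M events
catch_up = backlog / (drain_rate - input_rate) / 86400  # net drain rate
print(f"~{catch_up:.1f} days to catch up")              # ~2.0 days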

Anyway, uh, it...works? Just needs some fine tuning I think.

This comment was removed by AKhatun_WMF.

Change #1260091 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] EventStreamConfig - Increase spark_job_ingestion_scale for larger event streams

https://gerrit.wikimedia.org/r/1260091

@JMonton-WMF looks like we should just set a 60 second timeout for MW API requests.

These long request times will stall the pipe, but hopefully there won't be so many of them in succession that the whole thing starts lagging.

I'll do the change and we'll see!

Change #1260637 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich

https://gerrit.wikimedia.org/r/1260637

Change #1260637 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich

https://gerrit.wikimedia.org/r/1260637

> Fabian is using the Wikimedia REST API, instead of the (newer) MediaWiki REST API. Does that make a difference somehow?

As expected, the result is the same.

Both

time curl -I  "https://vi.wikipedia.org/w/rest.php/v1/revision/74883140/html"

and

time curl -I  'https://vi.wikipedia.org/api/rest_v1/page/html/Danh_sách_xã_tại_Việt_Nam/74883140'

are about the same.

Change #1260091 merged by jenkins-bot:

[operations/mediawiki-config@master] EventStreamConfig - Increase spark_job_ingestion_scale for larger event streams

https://gerrit.wikimedia.org/r/1260091

Mentioned in SAL (#wikimedia-operations) [2026-03-25T13:42:01Z] <otto@deploy2002> Started scap sync-world: Backport for [[gerrit:1260091|EventStreamConfig - Increase spark_job_ingestion_scale for larger event streams (T360794 T351225)]]

Mentioned in SAL (#wikimedia-operations) [2026-03-25T13:44:21Z] <otto@deploy2002> otto: Backport for [[gerrit:1260091|EventStreamConfig - Increase spark_job_ingestion_scale for larger event streams (T360794 T351225)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-03-25T13:49:49Z] <otto@deploy2002> Finished scap sync-world: Backport for [[gerrit:1260091|EventStreamConfig - Increase spark_job_ingestion_scale for larger event streams (T360794 T351225)]] (duration: 07m 48s)

Change #1267152 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] dse-k8s - add common dir for mw-page-html-content-change-enrich

https://gerrit.wikimedia.org/r/1267152

Change #1267152 merged by jenkins-bot:

[operations/deployment-charts@master] dse-k8s - add common dir for mw-page-html-content-change-enrich

https://gerrit.wikimedia.org/r/1267152

Change #1267161 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] dse-k8s - set flinkConfiguration properly after directory reorg

https://gerrit.wikimedia.org/r/1267161

Change #1267161 merged by jenkins-bot:

[operations/deployment-charts@master] dse-k8s - set flinkConfiguration properly after directory reorg

https://gerrit.wikimedia.org/r/1267161

@JMonton-WMF something we should keep an eye on: kafka topic size. I think the html topics will end up being the largest on kafka jumbo. It looks like the largest topic is about 420GB right now.

Crunching some numbers based on T419495#11700223

p50 output size is ~38kB.

We are diffing, and want to estimate a worst case, so let's assume an average of 100KB per message.

For a week of data:
100KB * 20 msgs/sec * 60 * 60 * 24 * 7 / 1024 / 1024 == ~1.15TB.
Assuming snappy compression of maybe 50% (is that a good assumption?), that puts us at ~600GB.

Or maybe a more accurate estimate comes from the normal byte-in rate of about 500KB / second.
500KB * 60 * 60 * 24 * 7 / 1024 / 1024 == ~290TB. That's better!

I don't think it is a cause for concern, but just in case I'll let our friendly DPE SREs know. :)
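
For reference, the same steady-rate arithmetic with the unit conversions spelled out:

kb_per_sec = 500
week_seconds = 60 * 60 * 24 * 7
total_gb = kb_per_sec * week_seconds / 1024 / 1024  # KB -> MB -> GB
print(f"~{total_gb:.0f} GB for a week")             # ~288 GB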

@Ottomata Assuming you mean 290GB and not 290TB, we should be all good :)