
Implement stream of HTML content on mw.page_change event
Open, Needs Triage · Public · 8 Estimated Story Points

Description

Build a new Flink job that:

  • Listens to mw.page_change.v1
  • Calls MW API for HTML of page
  • Outputs to new stream (name TBD)

Job will be very similar to the wikitext enrichment job.
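
As a rough sketch of the per-event transform (the field names and the injected fetcher here are assumptions, not the final design), the enrichment step could look like:

```python
from typing import Callable

def enrich_with_html(event: dict, fetch_html: Callable[[str, str], str]) -> dict:
    """Hypothetical enrichment step for a mw.page_change.v1 event: attach the
    page's rendered HTML. fetch_html(domain, title) is injected so the
    transform stays testable; in the real Flink job it would be an HTTP call
    to the MediaWiki API, as in the wikitext enrichment job."""
    enriched = dict(event)  # shallow copy; don't mutate the input event
    domain = event["meta"]["domain"]
    title = event["page"]["page_title"]
    enriched["revision"] = dict(event.get("revision", {}))
    enriched["revision"]["content_html"] = fetch_html(domain, title)
    return enriched
```

In the actual job this function would sit inside a Flink operator with async I/O and error-stream handling, mirroring the wikitext pipeline.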

Why build this?
Parsing HTML is easier than wikitext and an incremental stream of changes to a page, from a point in time, is useful to train models and/or track how pages are changing over time.

Important Notes:

  • The stream will contain the rendered HTML of a page when the page is created, edited, or deleted. Note that if a change to a template changes the rendered HTML of a page, no event is emitted for that page.
  • The work in this ticket doesn’t cover backfilling the stream.
  • It is expected that the existing schema for the wikitext stream can be reused for this new stream, but this is still to be discussed.
  • It is likely we will need to increase the max message size of Kafka jumbo to ~15 MB (currently at 10 MB).
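
Related to the size limit above, the enrichment job will need some guard for pages whose rendered HTML pushes an event over the producer cap. A minimal sketch (the 15 MB figure is the proposed limit from this ticket, not a confirmed broker setting):

```python
PROPOSED_MAX_MESSAGE_BYTES = 15 * 1024 * 1024  # proposed Kafka jumbo cap (~15 MB)

def fits_in_kafka(serialized_event: bytes,
                  limit: int = PROPOSED_MAX_MESSAGE_BYTES) -> bool:
    """Return True if a serialized event fits under the broker's message cap.
    Oversized events would have to be dropped, truncated, or routed to an
    error topic rather than produced as-is."""
    return len(serialized_event) <= limit
```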


Details

Related Changes in GitLab:
  • Move Python code to `pyflink/` folder (repos/data-engineering/mediawiki-event-enrichment!103, javiermonton, feature/move-pyflink-1 → main)
  • [WIP] - Enrich Page Change with HTML (repos/data-engineering/mediawiki-event-enrichment!101, javiermonton, feature/enrich_html_page_content → main)
  • mediawiki-event-enrichment: add pipeline for enriching page change events with HTML (repos/data-engineering/mediawiki-event-enrichment!86, mnz, mnz-html-enrichment → main)

Related Objects

Event Timeline

Hey @lbowmaker -- I wanted to check in on the status of this. For the article quality model (T360455), I would like to run a batch job that builds a distribution of a bunch of features from article HTML. For the moment, I've had to move the individual dump files onto HDFS but this isn't sustainable long-term and the ability to make incremental updates to these distributions based on a stream would be fantastically helpful.

Also, I know this is closely related to T305688 (dump snapshots). I chose this ticket for commenting because it enables us to start building a historical record too but I'm also curious about that other aspect of making the HTML content available to us.

Hi @Isaac - we were hoping to get to this last quarter but didn’t manage to. This quarter we are working on the Dumps 2.0 implementation, the annual plan KR for data lineage, and probably temp account changes now, on top of essential work and lots of vacations in the team.

I think we could try and get to this or the batch ingestion one next quarter (starting October).

This is something I really want to do but it’s hard to find the time.

Would you have any interest in building a Flink job with our support - that could help speed up getting this done? We have a very stable job that enriches page change events with wikitext and this would be very similar. Let me know what you think.


Thanks @lbowmaker for the update and the pointers. I created an official request (T371062) for our Research Engineers to consider prioritizing it in the meantime. Flink is beyond me but I'll let them decide whether jumpstarting this makes sense :)

Hi, I took a stab at this and was able to put together a job that enriches page change events by retrieving the HTML from MW Rest API (the existing examples in mediawiki-event-enrichment really helped with this!). I've opened an MR against mediawiki-event-enrichment with a first draft and would really appreciate it if I could get someone to give me some feedback on it!
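
For context on the API side: MediaWiki’s core REST API serves a page’s rendered HTML at `/w/rest.php/v1/page/{title}/html` (the Parsoid `/page/html` endpoint is an alternative). A small helper to build that URL, purely illustrative, might look like:

```python
from urllib.parse import quote

def html_endpoint(domain: str, title: str) -> str:
    """Build the core REST API URL for a page's rendered HTML."""
    return f"https://{domain}/w/rest.php/v1/page/{quote(title, safe='')}/html"
```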

@MunizaA okay if feedback comes in a week or two (or even three)? Just wanted to know if it's on a time-sensitive critical path - folks are handling some time-sensitive matters at the moment.

I'd like to ponder a bit on forward compatibility with T305688: Make HTML Dumps available in hadoop, although just getting something out to begin learning from, assuming the scale can be managed in the earliest version of this, would be nice!

gmodena updated https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_requests/86

mediawiki-event-enrichment: add pipeline for enriching page change events with HTML

@Ahoelzl (following up on our conversation on November 7th, where I flagged the need for investing in HTML dumps) This task will block future work on SDS 1.2.3 (who are moderators?) during Q3. As I mentioned, we have a workaround for moving work forward this quarter, but it's clear to us that we will be blocked next quarter unless this task is resolved. If you need additional information about why this is a high priority, please let me know.

In terms of a deadline for the task: the sooner your team can do this the better, because we will need time in Q3 to work with HTML dumps at scale to further build, scale, and test the models. Would January 17th work?

@fkaelin is your point of contact on our end for technical specifications. @diego is the hypothesis owner. I will stick around for some time here to help with determining a deadline. Thank you for your help.

@leila I know that @Ahoelzl will be meeting next week with @XiaoXiao-WMF to talk through the next steps here. Please think about maintenance and sustainability plans prior to discussion of handovers.

@Ahoelzl and @XiaoXiao-WMF please keep this task updated post meeting so we can all stay on the same page here.

Thanks for following up, Virginia. Can you add the specific questions you want us to think about on our end somewhere in this task? thanks.

@Ahoelzl Please include Isaac in the meeting you will have with Xiao. Some context for this below (which hopefully answers some of your questions for me in Slack as well)


Product/User stakeholder: @SonjaPerry
For this specific task if you need research engineering support: @XiaoXiao-WMF
For next steps wrt T371865: @Isaac
For questions about the longer term maintenance of the metrics and measurements: @OSefu-WMF

FYI that I declined Research spending time on T394065 because once this task is done, that issue will be addressed as well (albeit too late for that particular issue). You can add that request as a type of use case for this task. If you have updates on the timeline for prioritizing this task, please share. Thanks.

Hi all,
I have some room to work on Event Platform tasks and I could take the work done on this one and try to push it to the finish line.
As the ticket is a bit old, I'd like to confirm if this is still needed.

  • @MunizaA, can I take over your MR and continue working on it?
  • Is it ok if the data is written to Kafka jumbo (eqiad only)? Not sure who can confirm this, maybe @Isaac? Recently we had a discussion about writing page_content_change to Kafka main and it was discouraged (T409469). It's fine to have it in jumbo, but we should open a wider discussion if it needs to be in main.

This is very exciting @JMonton-WMF ! I think @fkaelin is the best person to answer both of your questions.

It would be great if you pick this up @JMonton-WMF .

Regarding the Kafka cluster, jumbo would seem ok; I defer to DPE on that. Our main usage of this dataset will be offline. Storing the HTML dataset at the time a revision was created is a good first step in that direction.

  • For the dataset to be "correct", it will have to be reconciled like the wikitext content history (i.e. the page_content_change stream). There are ongoing discussions on how to do this in a general way; this HTML use case is hopefully part of that.
  • Most analytics or ML use cases also require historical data, but backfilling this HTML dataset is non-trivial. Another complexity is that for research/ML use cases we often want the diff between a revision and its parent revision. Research runs an offline HTML pipeline that creates a dataset, research.mediawiki_content_html, based on the content history: a daily DAG that queries the MediaWiki API for the HTML of each revision and its parent revision. This dataset was started on March 1st, 2025, so it contains over six months of data by now, and it is partially reconciled since it is based on the content history DAG.
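
On the revision-vs-parent diff point above: once both HTML payloads are available, a simple line-level diff can be computed offline with stdlib tooling. A minimal sketch (a real pipeline would more likely diff at the DOM or section level):

```python
import difflib

def html_diff(parent_html: str, revision_html: str) -> list[str]:
    """Unified line diff between a revision's HTML and its parent's HTML."""
    return list(difflib.unified_diff(
        parent_html.splitlines(),
        revision_html.splitlines(),
        fromfile="parent",
        tofile="revision",
        lineterm="",
    ))
```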

@fkaelin How urgent is the need for this stream? We're considering moving off of PyFlink, and this would be a good opportunity to spike on a Java pipeline, rather than doing a quick implementation now and dealing with the complexities of a migration later.

+ 1000

This HTML stream (or rather the events table that will be ingested into the data lake via Gobblin) will be the first step towards a production HTML dataset in the data lake, followed by other challenges such as reconciliation and backfilling. Since the complexity of this stream is limited (it is almost identical to page content change), having this initial building block in place is, in my opinion, a high priority from an essential work perspective. We have to get started somewhere, and once this stream is in place it will also be easier to scope and plan the other pieces that need to follow.

Regarding the Java migration, there does not seem to be a difference between migrating the existing page content change pipeline and piloting on the HTML content change pipeline (since they are almost identical). Given the amount of work and tooling that has gone into the Python-based event utilities etc., such a py->java switch seems like a larger undertaking (is there a Phab task/design doc for this? I would be curious to learn more about the motivation); my guess would be that a product team will be blocked on an HTML dataset before that migration is done. I defer to management/PM on the urgency for product development. Examples of projects: moderator metrics (T410940), which depends on edit types (T351225), and semantic search (for splitting content into sub-article chunks, in plain text).