Build a new Flink job that:
- Listens to mw.page_change.v1
- Calls MW API for HTML of page
- Outputs to new stream (name TBD)
Job will be very similar to the wikitext enrichment job.
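The three steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the job itself: the REST route, the event fields read (`meta.domain`, `revision.rev_id`) and the output field `revision.html` are all assumptions, and `fetch` stands in for whatever HTTP client the job would use.

```python
import copy
from typing import Callable

# Assumption: revision HTML is served by a MediaWiki REST route of this shape.
HTML_URL_TEMPLATE = "https://{domain}/w/rest.php/v1/revision/{rev_id}/html"


def html_url(event: dict) -> str:
    """Build the (assumed) URL for the revision named in a page change event."""
    return HTML_URL_TEMPLATE.format(
        domain=event["meta"]["domain"],      # assumed field name
        rev_id=event["revision"]["rev_id"],  # assumed field name
    )


def enrich_with_html(event: dict, fetch: Callable[[str], str]) -> dict:
    """Return a copy of the event with the rendered HTML attached.

    `fetch` takes a URL and returns the response body as text; it is
    injected so the enrichment logic can be tested without network access.
    """
    enriched = copy.deepcopy(event)
    # Hypothetical output field; the real schema is still to be decided.
    enriched["revision"]["html"] = fetch(html_url(event))
    return enriched
```

In the Flink job, a function like `enrich_with_html` would presumably be applied to each event read from the input stream before writing to the output stream.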
Why build this?
Parsing HTML is easier than parsing wikitext. An incremental stream of changes to a page, starting from a point in time, is useful for training models and for tracking how pages change over time.
Important Notes:
- The stream will contain the rendered HTML of a page when the page is created, edited or deleted. If a template change alters a page's rendered HTML, the stream will not emit an event for that page.
- The work in this ticket doesn’t cover the backfilling of the stream
- The existing schema for the wikitext stream is expected to be reusable for this new stream, but this is still to be discussed.
- We will likely need to increase the max message size of Kafka jumbo to ~15MB (currently 10MB)
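To get a feel for the message size note, a rough check like the one below could be run against sample events. It is only a sketch: it reads "10MB" and "15MB" as MiB, measures the UTF-8 JSON encoding of the event, and ignores compression and Kafka record overhead. All names are hypothetical.

```python
import json

MIB = 1024 * 1024
CURRENT_LIMIT_BYTES = 10 * MIB   # assumption: "10MB" read as 10 MiB
PROPOSED_LIMIT_BYTES = 15 * MIB  # assumption: "~15MB" read as 15 MiB


def serialized_size(event: dict) -> int:
    """Size in bytes of the event when encoded as UTF-8 JSON."""
    # ensure_ascii=False keeps non-ASCII characters as UTF-8 rather than
    # \uXXXX escapes, which would inflate the measured size.
    return len(json.dumps(event, ensure_ascii=False).encode("utf-8"))


def fits(event: dict, limit_bytes: int) -> bool:
    """True if the serialized event is within the given size limit."""
    return serialized_size(event) <= limit_bytes
```

Running `fits` with both limits over a sample of large pages would show how many events the current limit rejects and the proposed one accepts.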