The current motivation for this task (T351225 & T410940) requires more than just the latest page revision HTML: edit types computation requires a diff between the latest revision's HTML and its parent's.
Including both the parent and latest revision HTML in the same event would likely be too large for Kafka.
However, a stream containing the latest revision HTML is still useful for many other use cases.
To support both T351225 and other use cases, we will do the following:
Build a new streaming enrichment (Flink) job that:
- Listens to mediawiki.page_change.v1
- Calls MW API for HTML of latest revision + parent revision
- Emits new stream with latest html + unified diff from parent
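The enrichment step above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the Flink job itself. It assumes the event carries `revision.rev_id` and `revision.rev_parent_id`, and it takes a caller-supplied `fetch_html` function standing in for the MW API call. The output field names `html` and `html_diff` are hypothetical, since the data model is still to be developed.

```python
import difflib
from typing import Callable


def unified_html_diff(parent_html: str, latest_html: str) -> str:
    """Return a unified diff (as one string) from parent HTML to latest HTML."""
    diff_lines = difflib.unified_diff(
        parent_html.splitlines(keepends=True),
        latest_html.splitlines(keepends=True),
        fromfile="parent",
        tofile="latest",
    )
    return "".join(diff_lines)


def enrich(event: dict, fetch_html: Callable[[int], str]) -> dict:
    """Attach latest HTML, plus a diff against the parent when one exists.

    `fetch_html` is a hypothetical stand-in for the MW API call; it maps a
    revision id to that revision's rendered HTML.
    """
    revision = event["revision"]  # assumed field layout
    latest_html = fetch_html(revision["rev_id"])

    enriched = dict(event)  # shallow copy, leave the input event untouched
    enriched["html"] = latest_html  # hypothetical output field

    parent_id = revision.get("rev_parent_id")
    if parent_id is not None:
        parent_html = fetch_html(parent_id)
        enriched["html_diff"] = unified_html_diff(parent_html, latest_html)  # hypothetical output field
    return enriched
```

Emitting only the diff rather than the full parent HTML keeps the event closer to the size of a single rendered page, which is the reason for this design given the Kafka size concern above.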
We hope to use this as the basis for also emitting an edit types stream for T351225: Productionized Edit Types. We may choose to emit both of these streams in the same Flink pipeline, TBD.
This job will be similar to the wikitext enrichment job.
Why build this?
Parsing HTML is easier than parsing wikitext, and an incremental stream of changes to a page, from a point in time, is useful for training models and/or tracking how pages change over time.
Important Notes:
- The stream will contain the rendered HTML of a page when the page is created, edited or deleted. If a change to a template changes the HTML of the page, you will not receive an event for this.
- The work in this ticket doesn’t cover the backfilling of the stream
- Data model will be developed in T415158: Common event data model for data derived from parsed page revision html (and more!)
- It is likely we would need to increase the max message size of Kafka jumbo to at least ~15MB (currently at 10MB)
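To get a feel for how close enriched events come to the message size limit, a rough pre-check on the serialized size could look like the sketch below. It assumes events are serialized as UTF-8 JSON and reads the 10MB figure as 10 * 1024 * 1024 bytes. Both are assumptions, and the limit Kafka actually enforces may be measured differently (for example on batches, or after compression).

```python
import json

# Assumption: the "10MB" limit is interpreted as 10 MiB.
MAX_MESSAGE_BYTES = 10 * 1024 * 1024


def serialized_size(event: dict) -> int:
    """Size in bytes of the event when serialized as UTF-8 JSON."""
    return len(json.dumps(event, ensure_ascii=False).encode("utf-8"))


def fits_in_message(event: dict, limit: int = MAX_MESSAGE_BYTES) -> bool:
    """True if the serialized event is no larger than the given limit."""
    return serialized_size(event) <= limit
```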
What to do with each page_change_kind?
The mediawiki.page_change.v1 stream contains a page_change_kind field that indicates what kind of page change the event represents. Not all page change events change revision content. What should the new event stream with HTML + diff contain for each possible page_change_kind?
(Note: T409105#11460975 may also have implications on these choices.)
We will work out these details in comments of this ticket.
See T360794#11664477.
| page_change_kind | what to do? |
|---|---|
| edit | enrich with latest and diff with parent html |
| create | enrich with latest html |
| move | enrich with latest and diff with parent html |
| delete | pass through, no enrichment |
| undelete | enrich with latest and diff with parent html |
| visibility_change | enrich with latest and diff with parent html |
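The table above could be encoded as a simple lookup, sketched here in Python. The `Action` enum and `action_for` helper are hypothetical names introduced for illustration. Raising on an unknown page_change_kind is one possible choice, not something decided in this ticket.

```python
from enum import Enum


class Action(Enum):
    LATEST_AND_DIFF = "enrich with latest and diff with parent html"
    LATEST_ONLY = "enrich with latest html"
    PASS_THROUGH = "pass through, no enrichment"


# Mirrors the table above, one entry per page_change_kind.
ACTIONS = {
    "edit": Action.LATEST_AND_DIFF,
    "create": Action.LATEST_ONLY,
    "move": Action.LATEST_AND_DIFF,
    "delete": Action.PASS_THROUGH,
    "undelete": Action.LATEST_AND_DIFF,
    "visibility_change": Action.LATEST_AND_DIFF,
}


def action_for(page_change_kind: str) -> Action:
    """Look up the enrichment action for a page_change_kind."""
    try:
        return ACTIONS[page_change_kind]
    except KeyError:
        # Assumption: unknown kinds are treated as errors rather than passed through.
        raise ValueError(f"unknown page_change_kind: {page_change_kind!r}") from None
```

Keeping the mapping as data means that any change agreed in the ticket comments only touches one dictionary entry.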