To support T410940: WE1.5.3 Productize Data for Monthly Active Moderator Actions, Data Engineering will be deploying 2 pyflink streaming applications. These will result in 2 new event stream data products:
- mediawiki.page_html_content_change
- mediawiki.page_html_feature_counts_change
As of April 20, 2026, we are ready to move from the development phase to a release candidate, and eventually to a final .v1 of these event streams.
This task is a checklist / container task to track the remaining work needed to reach v1 release for these streams.
Please update this task description with details (new subtasks) and additional work.
For release candidate
- Finalize schemas
- T415158: Common event data model for data derived from parsed page revision html (and more!)
- Move rendering_content_change and rendering_feature_counts_change out of development namespace.
- -> MR41
- Create mediawiki.page_html_content_change.rc0 and mediawiki.page_html_feature_counts_change.rc0 streams and use them with finalized schema URIs in stream jobs.
- Refactor edit type helmfiles to use _*_common_*/ values file symlinks pattern, rather than duplicating common settings in different files.
- Settle all helmfiles to their final values. (Stop using --set overrides)
- Create mw-page-html-feature-counts-change-enrich (edit type) helmfile for production deployment
- Rename mw-page-edit-type-enrich-next (staging) (edit-type) helmfile to mw-page-html-feature-counts-change-enrich-next
- Use process_async_enabled_default=false for edit types feature-change enrichment.
- Set up alerting for stream jobs (should be able to re-use prior work): T423996: HTML Enrichment - Alerting; T424224: Edit type enrichment: Alerting
- Ensure error DLQ streams are working properly
- Resolve any other outstanding TODOs in code and helmfiles
- Drop rc0 and dev tables; delete data from HDFS
For v1 release
- SLOs?
- Update https://wikitech.wikimedia.org/wiki/MediaWiki_Event_Enrichment docs
- Especially note which kinds of events are and are not included in these streams (e.g. content_model=wikitext, no enrichment on visibility change, etc.)
- Update datahub docs for the streams as well?
- Create mediawiki.page_html_content_change.v1 and mediawiki.page_html_feature_counts_change.v1 (name TBD) streams and use them in stream jobs.
- Remove .rc0 streams and any remaining development/ schemas that are no longer used
- Add relevant event table(s) to event_sanitized_main_allowlist.yaml so that we keep data longer than 90 days.
- mediawiki_html_feature_counts_change for sure
- Do we want to also keep HTML? Would be nice but: PII issues? Size issues?
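For the docs update above, the inclusion rule given as an example (content_model=wikitext, no enrichment on visibility change) could be written down roughly as follows. This is only a sketch for documentation purposes, not the logic in the stream jobs: the function name `is_in_scope` is hypothetical, and the field names (`page_change_kind`, `revision.content_slots.main.content_model`) are assumed from mediawiki.page_change.v1 and should be checked against the schema. The real conditions may include more cases than the two named here.

```python
from typing import Any, Mapping

# Assumption: only the exclusion named in this task is modeled here.
EXCLUDED_KINDS = {"visibility_change"}


def is_in_scope(event: Mapping[str, Any]) -> bool:
    """Hypothetical check: would this page_change event be enriched?"""
    # No enrichment on visibility changes.
    if event.get("page_change_kind") in EXCLUDED_KINDS:
        return False
    # Assumed field layout: revision.content_slots.main.content_model
    revision = event.get("revision") or {}
    slots = revision.get("content_slots") or {}
    main_slot = slots.get("main") or {}
    # Only wikitext content is enriched.
    return main_slot.get("content_model") == "wikitext"
```

Something like this could sit next to the prose in the wikitech docs so that consumers can see at a glance why an event they expected is missing from the stream.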
Additional tasks
Not blocking v1 release.
- T418996: Audit and fix observability (logging and metrics) for pyflink jobs
- T422928: HTML Pipeline - Performance improvements
- T409464: mediawiki.page_change.v1 event - add a 'new revision created' field and use it to simplify enrichment conditional logic.
- Update Event_Platform/Schemas/Guidelines with docs about new delta state modeling convention. - @Ottomata TODO
- Expose mediawiki_html_feature_counts in EventStreams?
- Actually, to do this we'd need these events in Kafka main (or other work). But in hindsight, perhaps we should produce these to Kafka main either way?
- Delete development schemas
- Delete unused Kafka topics: https://phabricator.wikimedia.org/T427951