
Streaming HTML & Edit Types - productionization checklist
Open, High, Public

Description

To support T410940: WE1.5.3 Productize Data for Monthly Active Moderator Actions, Data Engineering will deploy two PyFlink streaming applications. These will produce two new event stream data products:

  • mediawiki.page_html_content_change
  • mediawiki.page_html_feature_counts_change

As of April 20, 2026, we are ready to move from the development phase to a release candidate, and eventually to a final .v1 of these event streams.

This task is a checklist / container task to track the remaining work needed to reach v1 release for these streams.

Please update this task description with details (new subtasks) and additional work.


For release candidate

For v1 release

Additional tasks

Not blocking v1 release.

Related Objects

Status    Assigned
Open      Isaac
Resolved  AKhatun_WMF
Open      None
Open      None
Resolved  AKhatun_WMF
Open      None
Resolved  Ottomata
Resolved  JMonton-WMF
Resolved  JMonton-WMF
Resolved  JMonton-WMF
Open      JMonton-WMF
Open      JMonton-WMF
Open      None
Open      None
Resolved  JMonton-WMF
Resolved  JMonton-WMF
Open      None
Resolved  Ottomata
Open      JMonton-WMF
Resolved  JMonton-WMF
Open      JMonton-WMF
Resolved  JMonton-WMF
Resolved  JMonton-WMF
Resolved  Ottomata
Resolved  Ottomata
Open      JMonton-WMF
Resolved  JMonton-WMF
Resolved  JMonton-WMF
Resolved  AKhatun_WMF
Resolved  AKhatun_WMF
Resolved  AKhatun_WMF
Resolved  AKhatun_WMF
Resolved  AKhatun_WMF
Open      AKhatun_WMF
Resolved  JMonton-WMF
Resolved  AKhatun_WMF
Resolved  AKhatun_WMF
Open      JMonton-WMF
Open      None
Open      None
Open      Ottomata

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change #1276397 merged by jenkins-bot:

[operations/mediawiki-config@master] EventStreamConfig - add rc0 streams for html and feature count change

https://gerrit.wikimedia.org/r/1276397

Mentioned in SAL (#wikimedia-operations) [2026-04-23T19:06:32Z] <otto@deploy1003> Started scap sync-world: Backport for [[gerrit:1276699|Remove stream 'mediawiki.dump.revision_content_history.reconcile.rc0' (T417694)]], [[gerrit:1276397|EventStreamConfig - add rc0 streams for html and feature count change (T423920)]]

Mentioned in SAL (#wikimedia-operations) [2026-04-23T19:14:07Z] <otto@deploy1003> xcollazo, otto: Backport for [[gerrit:1276699|Remove stream 'mediawiki.dump.revision_content_history.reconcile.rc0' (T417694)]], [[gerrit:1276397|EventStreamConfig - add rc0 streams for html and feature count change (T423920)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-04-23T19:28:37Z] <otto@deploy1003> Finished scap sync-world: Backport for [[gerrit:1276699|Remove stream 'mediawiki.dump.revision_content_history.reconcile.rc0' (T417694)]], [[gerrit:1276397|EventStreamConfig - add rc0 streams for html and feature count change (T423920)]] (duration: 22m 05s)

Change #1277098 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich

https://gerrit.wikimedia.org/r/1277098

Change #1277098 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich

https://gerrit.wikimedia.org/r/1277098

Change #1278476 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] EventStreamConfig - Declare .v1 streams for html content and feature counts

https://gerrit.wikimedia.org/r/1278476

Change #1278476 merged by jenkins-bot:

[operations/mediawiki-config@master] EventStreamConfig - Declare .v1 streams for html content and feature counts

https://gerrit.wikimedia.org/r/1278476

Mentioned in SAL (#wikimedia-operations) [2026-04-28T15:19:44Z] <otto@deploy1003> Started scap sync-world: Backport for [[gerrit:1278476|EventStreamConfig - Declare .v1 streams for html content and feature counts (T423920)]]

Mentioned in SAL (#wikimedia-operations) [2026-04-28T15:21:34Z] <otto@deploy1003> otto: Backport for [[gerrit:1278476|EventStreamConfig - Declare .v1 streams for html content and feature counts (T423920)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-04-28T15:28:05Z] <otto@deploy1003> Finished scap sync-world: Backport for [[gerrit:1278476|EventStreamConfig - Declare .v1 streams for html content and feature counts (T423920)]] (duration: 08m 21s)

Change #1279247 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich

https://gerrit.wikimedia.org/r/1279247

Change #1279247 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-page-html-content-change-enrich

https://gerrit.wikimedia.org/r/1279247

Change #1279429 had a related patch set uploaded (by AKhatun; author: AKhatun):

[operations/deployment-charts@master] stream: move mw-page-html-feature-counts-change-enrich to v1

https://gerrit.wikimedia.org/r/1279429

Change #1279429 merged by jenkins-bot:

[operations/deployment-charts@master] stream: move mw-page-html-feature-counts-change-enrich to v1

https://gerrit.wikimedia.org/r/1279429

What do we need to do to have these datasets in event_sanitized?

> What do we need to do to have these datasets in event_sanitized?

Oh ya! Good thinking! I'll add to checklist.

Add the table here:
https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/static_data/sanitization/event_sanitized_main_allowlist.yaml

We should definitely add mediawiki_rendering_feature_counts_change. Let's discuss if we should add the html events. I think we should, but I'm not certain that we want to keep HTML forever? It will be big? OTOH, we did just have a really nice use for almost a year of data from Research's HTML.

Change #1280508 had a related patch set uploaded (by AKhatun; author: AKhatun):

[operations/deployment-charts@master] stream: mw-page-html-feature-counts-change-enrich; increase source parallelism to 6

https://gerrit.wikimedia.org/r/1280508

Change #1280508 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-page-html-feature-counts-change-enrich; increase source parallelism to 6

https://gerrit.wikimedia.org/r/1280508

Change #1287366 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/mediawiki-config@master] stream: mediawiki.page_html_content_change

https://gerrit.wikimedia.org/r/1287366

We have to delete the following topics in Kafka Jumbo:

codfw.mediawiki.page_html_content_change.dev0
codfw.mediawiki.page_html_content_change.dev1
codfw.mediawiki.page_html_content_change.dev4
codfw.mediawiki.page_html_content_change.dev5
codfw.mediawiki.page_html_content_change.rc0

codfw.mediawiki.page_html_feature_counts_change.rc0

eqiad.mediawiki.page_html_content_change.dev0
eqiad.mediawiki.page_html_content_change.dev1
eqiad.mediawiki.page_html_content_change.dev4
eqiad.mediawiki.page_html_content_change.dev5
eqiad.mediawiki.page_html_content_change.rc0

eqiad.mediawiki.page_html_feature_counts_change.rc0

staging.mediawiki.page_html_content_change.dev5
staging.mediawiki.page_html_content_change.rc0

codfw.mediawiki.page_edit_type_simple.dev0
codfw.mediawiki.page_edit_type_simple.dev1
eqiad.mediawiki.page_edit_type_simple.dev0
eqiad.mediawiki.page_edit_type_simple.dev1

eqiad.mw_page_edit_type_enrich.error

temp.page_change.v1_repartition

I tried to delete them, but it seems I don't have permission to do so, at least not using kafka topics from a broker. I might be able to do it by connecting to Zookeeper, but I'd like to confirm that with SRE first, as I'm not sure which Zookeeper server we should use.
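For whoever ends up running the deletion, the topic list above can be generated rather than typed by hand. The following is a minimal dry-run sketch in Python: it only prints commands and deletes nothing. It assumes the stock Apache Kafka `kafka-topics.sh` tool with its `--bootstrap-server`, `--delete` and `--topic` options. The broker address is a hypothetical placeholder, and the `kafka topics` wrapper mentioned above may take different arguments.

```python
import shlex

# Hypothetical placeholder: the real Kafka Jumbo broker address is not given in this task.
BOOTSTRAP_SERVER = "BROKER_HOST:9092"

# (prefixes, stream name, version suffixes), copied from the list in the comment above.
TOPIC_GROUPS = [
    (("codfw", "eqiad"), "mediawiki.page_html_content_change",
     ("dev0", "dev1", "dev4", "dev5", "rc0")),
    (("codfw", "eqiad"), "mediawiki.page_html_feature_counts_change", ("rc0",)),
    (("staging",), "mediawiki.page_html_content_change", ("dev5", "rc0")),
    (("codfw", "eqiad"), "mediawiki.page_edit_type_simple", ("dev0", "dev1")),
]

# Topics that do not follow the <prefix>.<stream>.<version> pattern.
EXTRA_TOPICS = [
    "eqiad.mw_page_edit_type_enrich.error",
    "temp.page_change.v1_repartition",
]


def topics_to_delete():
    """Expand the groups into full topic names, then append the one-off topics."""
    topics = [
        f"{prefix}.{stream}.{version}"
        for prefixes, stream, versions in TOPIC_GROUPS
        for prefix in prefixes
        for version in versions
    ]
    return topics + EXTRA_TOPICS


def delete_commands(bootstrap_server=BOOTSTRAP_SERVER):
    """Build one delete command per topic as a shell-quoted string (dry run only)."""
    return [
        shlex.join([
            "kafka-topics.sh",
            "--bootstrap-server", bootstrap_server,
            "--delete",
            "--topic", topic,
        ])
        for topic in topics_to_delete()
    ]


if __name__ == "__main__":
    for command in delete_commands():
        print(command)
```

Printing the commands first makes it easy to paste the exact list into a request to SRE if the permission issue described above cannot be resolved directly.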

Want to add

eqiad.mw_page_edit_type_enrich.error

We should also get rid of the Hive tables for the dev and rc0 versions. Can we just drop the tables? Do we also need to clean up the HDFS files?

> Can we just drop the tables? Do we also need to clean up the HDFS files?

Since these are external tables with explicit LOCATIONs, you need both: dropping an external table removes only the Hive metadata and leaves the files in place. So DROP TABLE event.<table_name> and then hdfs dfs -rm -r /wmf/event/<table_name>, or something like that.
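The two cleanup steps can be sketched as a small dry-run generator in Python. It only builds the statements and runs nothing. The `/wmf/event` base path is taken from the comment above, which itself hedges it with "or something like that", so treat it as unverified and check each table's actual LOCATION (for example with Hive's `DESCRIBE FORMATTED`) before deleting anything. The function name and its parameters are illustrative assumptions.

```python
# Tables reported as dropped later in this thread.
TABLES = [
    "mediawiki_page_html_content_change_dev0",
    "mediawiki_page_html_content_change_dev1",
    "mediawiki_page_html_content_change_dev4",
    "mediawiki_page_html_content_change_dev5",
    "mediawiki_page_html_content_change_rc0",
    "mediawiki_page_html_feature_counts_change_rc0",
    "mediawiki_page_edit_type_simple_dev0",
    "mediawiki_page_edit_type_simple_dev1",
    "mw_page_edit_type_enrich_error",
]


def cleanup_commands(tables, database="event", hdfs_base="/wmf/event"):
    """Return (hive_statements, shell_commands) for dropping external tables.

    hdfs_base is an assumption copied from the comment above; confirm the
    real LOCATION of each table before running the shell commands.
    """
    hive_statements = [f"DROP TABLE {database}.{table};" for table in tables]
    shell_commands = [f"hdfs dfs -rm -r {hdfs_base}/{table}" for table in tables]
    return hive_statements, shell_commands


if __name__ == "__main__":
    hive_statements, shell_commands = cleanup_commands(TABLES)
    print("-- Hive statements (remove table metadata only)")
    print("\n".join(hive_statements))
    print("# Shell commands (remove the underlying files)")
    print("\n".join(shell_commands))
```

Keeping the Hive statements and the HDFS commands in separate lists reflects that they are run with different tools and that neither step implies the other for external tables.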

Change #1287443 had a related patch set uploaded (by AKhatun; author: AKhatun):

[analytics/refinery@master] Add mw feature counts table for event sanitization

https://gerrit.wikimedia.org/r/1287443

Change #1287443 merged by AKhatun:

[analytics/refinery@master] Add mw feature counts table for event sanitization

https://gerrit.wikimedia.org/r/1287443

Change #1287366 merged by jenkins-bot:

[operations/mediawiki-config@master] stream: mediawiki.page_html_content_change

https://gerrit.wikimedia.org/r/1287366

Mentioned in SAL (#wikimedia-operations) [2026-05-18T08:31:14Z] <javiermonton@deploy1003> Started scap sync-world: Backport for [[gerrit:1287366|stream: mediawiki.page_html_content_change (T423920)]]

Mentioned in SAL (#wikimedia-operations) [2026-05-18T08:50:16Z] <javiermonton@deploy1003> javiermonton: Backport for [[gerrit:1287366|stream: mediawiki.page_html_content_change (T423920)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-05-18T09:02:49Z] <javiermonton@deploy1003> Finished scap sync-world: Backport for [[gerrit:1287366|stream: mediawiki.page_html_content_change (T423920)]] (duration: 31m 35s)

Wanted to note here:

  • The event_sanitized.mediawiki_page_html_feature_counts_change_v1 dataset has data from 2026-05-17 14 since the change was deployed on ~18th May.
  • event.mediawiki_page_html_feature_counts_change_v1 has from 2026-05-01 00. (A bit earlier than that actually)

@Ottomata I assumed it would take all data available. Apparently not. Do we need to get the data from the beginning of May into event_sanitized somehow?

The following tables were dropped and their data deleted from HDFS:

  • mediawiki_page_html_content_change_dev0
  • mediawiki_page_html_content_change_dev1
  • mediawiki_page_html_content_change_dev4
  • mediawiki_page_html_content_change_dev5
  • mediawiki_page_html_content_change_rc0
  • mediawiki_page_html_feature_counts_change_rc0
  • mediawiki_page_edit_type_simple_dev0
  • mediawiki_page_edit_type_simple_dev1
  • mw_page_edit_type_enrich_error

> I assumed it would take all data available.

https://wikitech.wikimedia.org/wiki/Data_Platform/Event_Sanitization#Hive_event_sanitization_job

Hm, I think...it will eventually get picked up:

# RefineSanitize job declarations go below.
# Each job has an 'immediate' and a 'delayed' version.
# immediate is executed right after data collection. Runs once per hour.
# delayed is executed on data that is 45 days old, to allow for automated backfilling
# data in the input database if it has changed since the immediate sanitize job ran.
# Jobs start a few minutes after the hour, to leave time for the salt files to be updated.
$delayed_since = 1104 # 46 days ago
$delayed_until = 1080 # 45 days ago

> event.mediawiki_page_html_feature_counts_change_v1 has from 2026-05-01 00. (A bit earlier than that actually)

So 45 days after the first data, the RefineSanitized delayed job will copy the data over.
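To make the timing concrete, here is a rough sketch of the window arithmetic in Python. It assumes the two values above are hour offsets subtracted from the job's run time and that an hourly partition becomes eligible once it is 1080 hours old. The exact boundary handling and the function names are assumptions, not taken from the job's code.

```python
from datetime import datetime, timedelta

# Values quoted from the job declaration above, in hours.
DELAYED_SINCE_HOURS = 1104  # 46 days
DELAYED_UNTIL_HOURS = 1080  # 45 days


def delayed_window(run_time):
    """Return the (since, until) range of data hours a delayed run would cover."""
    since = run_time - timedelta(hours=DELAYED_SINCE_HOURS)
    until = run_time - timedelta(hours=DELAYED_UNTIL_HOURS)
    return since, until


def first_delayed_pickup(data_hour):
    """Earliest time at which a given data hour is at least 45 days old."""
    return data_hour + timedelta(hours=DELAYED_UNTIL_HOURS)


if __name__ == "__main__":
    # Dates reported earlier in this thread.
    first_event_hour = datetime(2026, 5, 1, 0)        # start of data in event
    first_sanitized_hour = datetime(2026, 5, 17, 14)  # start of data in event_sanitized

    gap = first_sanitized_hour - first_event_hour
    gap_hours = int(gap.total_seconds() // 3600)
    last_missing_hour = first_sanitized_hour - timedelta(hours=1)

    print("hours missing from event_sanitized:", gap_hours)
    print("first missing hour eligible at:", first_delayed_pickup(first_event_hour))
    print("last missing hour eligible at:", first_delayed_pickup(last_missing_hour))
```

Under these assumptions the gap would fill in gradually, one hour at a time, starting about 45 days after 2026-05-01 and finishing about 45 days after 2026-05-17.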

It is pretty easy to launch a manual RefineSanitized job from an-launcher1003 if we want it to pick up all the data now. Lemme know and I'll show you (and uh, remember) how!

Ah, that makes sense! We don't need to get data into sanitized right now. Just wanted to inform. But looks like we are good. Thanks!

Change #1294113 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/alerts@master] html-enrichment: relax offset lag monitors

https://gerrit.wikimedia.org/r/1294113

Change #1294113 merged by jenkins-bot:

[operations/alerts@master] html-enrichment: relax offset lag monitors

https://gerrit.wikimedia.org/r/1294113

Change #1296623 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] mw_page_html_content_change_enrich_next - remove temporary kafka cluster override

https://gerrit.wikimedia.org/r/1296623

Change #1296623 merged by jenkins-bot:

[operations/deployment-charts@master] mw_page_html_content_change_enrich_next - remove temporary kafka cluster override

https://gerrit.wikimedia.org/r/1296623