
Incremental HTML wiki content dataset to support "Who are moderators"
Open, Needs Triage, Public

Description

The "who are moderators" SDS 1.2.3 (T371865) project requires HTML data. This task tracks Research's request for DE to prioritize the incremental HTML dataset described in T360794.

  • DE to deploy and monitor the streaming HTML pipeline (T360794) and configure Gobblin to create an HTML dataset in the data lake.
  • A significant part of the technical work is done with (MR). Research would like DE to deploy, own, and maintain the service, as Research doesn't own infrastructure and has no SRE (T371062).

Why do we need this?

  • Without an HTML dataset in the data lake, the model to classify moderator actions produced by SDS 1.2.3 can't be used beyond the development dataset (T380871).
  • An incremental HTML dataset will make it possible to create a dataset of moderator actions for future revisions (for >99%). It is the lowest-effort option for starting an HTML dataset in the data lake that Research could use.

For context, here is some background on the types of HTML datasets needed to match the wikitext-based datasets already available:

  • The HTML diff dataset would require the historical HTML dumps (T333419), the equivalent of the wikitext history dataset. This is a significant effort and not an option for SDS 1.2.3.
  • HTML snapshot datasets (equivalent to the wikitext current dataset) are partly available (T305688), but are not sufficient for SDS 1.2.3 as they only contain the most recent HTML.
  • [this request] An incremental dataset (T360794) is the basis for both the historical and snapshot datasets. It is the lowest-effort of the HTML dataset options and will make it possible to generate an incremental dataset of moderator actions.

Event Timeline

Ottomata renamed this task from Incremental HTML dataset to support "Who are moderators" SDS 1.2.3 to Incremental HTML wiki content dataset to support "Who are moderators" SDS 1.2.3. Jan 6 2025, 6:41 PM

DP decided to not prioritize it in Q3. Moving to freezer.

leila renamed this task from Incremental HTML wiki content dataset to support "Who are moderators" SDS 1.2.3 to Incremental HTML wiki content dataset to support "Who are moderators". May 15 2025, 3:56 PM
leila removed a project: Research-Freezer.
leila moved this task from Backlog to Support Needed on the Research board.
leila removed a subscriber: XiaoXiao-WMF.