
[FY25-WE1.5.3] HTML wiki content dataset to support Wikipedia Patrolling Measurement
Closed, Declined (Public)

Description

The [WE1.5.3] Wikipedia Patrolling Measurement (T392210) project requires HTML data. This task tracks Research's request for a dataset similar to the one resulting from T380874.

Why do we need this? Without an HTML dataset in the data lake, our ability to characterize patrolling actions (e.g., addition/removal of templates, references, etc.) will be limited. We have been working from a static snapshot of HTML for several wikis from the month of October 2024 (T380871), but that poses a few problems as we continue with this research direction:

  • That snapshot keeps getting older, which means it may be less relevant to current trends.
  • Because the snapshot covers just a single month, we don't know whether it is representative of trends in general or was an outlier.
  • Due to an error, that snapshot includes arzwiki when the intention was to include arwiki. The other wikis are large (dewiki, enwiki, eswiki, frwiki, itwiki, jawiki, nlwiki, plwiki, ruwiki, svwiki, zhwiki) and cover a lot of cultures/regions, but Arabic Wikipedia is an important missing piece.

Further background on similar available datasets is provided in T380874.
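To illustrate why HTML (rather than wikitext alone) helps here, the following is a minimal sketch of counting template and reference changes between two HTML revisions. It assumes Parsoid-style markup, where transclusions carry `typeof="mw:Transclusion"` and references carry `typeof="mw:Extension/ref"`; the task itself does not specify the HTML flavor. The names `FeatureCounter`, `count_features`, and `feature_delta` are hypothetical and not part of any existing pipeline.

```python
from collections import Counter
from html.parser import HTMLParser


class FeatureCounter(HTMLParser):
    """Counts template transclusions and reference markers in one HTML revision."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Assumption: Parsoid-style markup, where `typeof` describes the element's role.
        typeof = (dict(attrs).get("typeof") or "").split()
        if any(t.startswith("mw:Transclusion") for t in typeof):
            self.counts["templates"] += 1
        if "mw:Extension/ref" in typeof:
            self.counts["references"] += 1


def count_features(html):
    """Return a Counter of template and reference occurrences in an HTML string."""
    parser = FeatureCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def feature_delta(parent_html, revision_html):
    """Return the per-feature change from the parent revision to the new revision."""
    before = count_features(parent_html)
    after = count_features(revision_html)
    # Counter returns 0 for missing keys, so absent features are handled.
    return {key: after[key] - before[key] for key in ("templates", "references")}
```

A positive delta for `templates` alongside a negative delta for `references` could, for example, hint at an edit that removed a citation and added a maintenance tag, though interpreting such signals would need validation against real patrolling data.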

Event Timeline

leila renamed this task from HTML wiki content dataset to support [WE1.5.3] Wikipedia Patrolling Measurement to [FY25-WE1.5.3] HTML wiki content dataset to support Wikipedia Patrolling Measurement. May 14 2025, 9:09 PM
leila closed this task as Declined.
leila added a subscriber: fkaelin.
Reedy renamed this task from [FY25-WE1.5.3] HTML wiki content dataset to support Wikipedia Patrolling Measurement to [FY25-WE1.5.3] HTML wiki content dataset to support Wikipedia Patrolling Measurement. May 14 2025, 9:11 PM

@Pablo, thanks for capturing this request. I considered it for prioritization and, after hearing the input from Isaac (who I know works closely with you on this front) and Fabian, I have decided not to prioritize it for research engineering. Some more information below:

  • My understanding is that not prioritizing this task has a negative consequence for the parent task, in that the hope was both to fix an issue with the dataset (replacing a language with ar) and to have fresher data. Isaac shared that the first one will need to be delayed in light of the decision, and the second one is a caveat for the work you will communicate as part of sharing the results.
  • This request would not have come to research engineering had T360794 been resolved. I would like to use this use case as another reason to nudge and request prioritization of T360794, which will help solve this and other needs we run into in this space.

Thanks.