The [WE1.5.3] Wikipedia Patrolling Measurement (T392210) project requires HTML data. This task tracks Research's request for a dataset similar to the one produced in T380874.
Why do we need this? Without an HTML dataset in the data lake, our approach to characterizing patrolling actions (e.g., addition/removal of templates or references) will be limited. We have been working from a static snapshot of HTML for several wikis from October 2024 (T380871), but that poses a few problems as we continue with this research direction:
- That snapshot keeps getting older, which means it may be less relevant to current trends.
- Because it covers only a single month, we don't know whether that snapshot is representative of general trends or was an outlier.
- Due to an error, that snapshot includes arzwiki when the intention was to include arwiki. The other wikis (dewiki, enwiki, eswiki, frwiki, itwiki, jawiki, nlwiki, plwiki, ruwiki, svwiki, zhwiki) are large and cover many cultures and regions, but Arabic Wikipedia is an important missing piece.
Further background on similar available datasets is provided in T380874.