Data request: make rendered HTML page dumps available on stats machines or labs
Closed, Resolved | Public

Description

There's a new data source produced by the Wikimedia Enterprise project, which includes the Parsoid-rendered HTML output for all pages in many wikis. This is exactly what my team needs in order to analyze reference usage in pages, since references are commonly rendered using templates that differ from one wiki to the next. The HTML format makes it possible to find all rendered refs and also to discover which templates are being used.
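As a rough illustration of the kind of analysis meant here, the sketch below scans one page of Parsoid HTML for rendered refs and for template names. It assumes that refs carry `typeof="mw:Extension/ref"` and that transclusions carry `typeof="mw:Transclusion"` with a JSON `data-mw` attribute holding `parts[].template.target.wt`; the `RefScanner` class and `scan_html` function are hypothetical names, not part of any existing tool.

```python
import json
from collections import Counter
from html.parser import HTMLParser


class RefScanner(HTMLParser):
    """Counts rendered refs and template names in one page of Parsoid HTML."""

    def __init__(self):
        super().__init__()
        self.ref_count = 0
        self.templates = Counter()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # typeof can hold several space-separated values.
        types = (attr.get("typeof") or "").split()
        if "mw:Extension/ref" in types:
            self.ref_count += 1
        if "mw:Transclusion" in types:
            try:
                data = json.loads(attr.get("data-mw") or "{}")
            except json.JSONDecodeError:
                return
            if not isinstance(data, dict):
                return
            # Assumed layout: parts is a list mixing plain strings and
            # {"template": {"target": {"wt": "<name>"}}} objects.
            for part in data.get("parts", []):
                if isinstance(part, dict) and "template" in part:
                    target = part["template"].get("target", {})
                    name = (target.get("wt") or "").strip()
                    if name:
                        self.templates[name] += 1


def scan_html(html):
    """Return (number of rendered refs, Counter of template names)."""
    scanner = RefScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.ref_count, scanner.templates
```

Template names are counted as written in the wikitext, so aliases and redirects of the same template would still need to be merged per wiki.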

I see that this data source is also being considered for other offline projects such as T329779: Explore moving MWOffliner/Kiwix from API calls to Wikimedia Enterprise HTML dumps.

Can we provide access to this data set within one of our clusters, to save on external bandwidth? I can work either on stats machines or in a wmcloud instance.

Event Timeline

@awight it's already on Toolforge and WMCS, see /public/dumps/public/other/enterprise_html/

awight claimed this task.

Great to hear, thanks for pointing this out!
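
For anyone following up, here is a minimal sketch of reading one dump file from that directory. It assumes each dump is a gzipped tar archive whose members are newline-delimited JSON with one object per page; that layout, and the `name` field used in the usage note, are assumptions to verify against the actual files. `iter_articles` is a hypothetical helper name.

```python
import json
import tarfile


def iter_articles(tar_path):
    """Yield one decoded JSON object per line from every file in the archive."""
    with tarfile.open(tar_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            # Read line by line so the archive is never fully loaded in memory.
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)
```

Usage would be along the lines of `for page in iter_articles(path): print(page.get("name"))`, where `path` points at one archive under /public/dumps/public/other/enterprise_html/.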