
Historical HTML dumps
Open, Needs Triage, Public

Description

Request Status: New Request
Request Type: project support request
Related OKRs:

Request Title: Generate HTML dumps

  • Request Description: Generate a dataset of the rendered HTML of all historical revisions.
  • Indicate Priority Level: Medium
  • Main Requestors: @Miriam @fkaelin
  • Ideal Delivery Date: August 2023
  • Stakeholders: Research team

Request Documentation

Document Type | Required? | Document/Link
Related PHAB Tickets | Yes | T182351, T305688, T161773
Product One Pager | Yes | https://docs.google.com/document/d/1UYzNHyq0kmfv4ehZtZk_QVCAvaWoSxrES0-3yRo_Wz0/edit#
Product Requirements Document (PRD) | Yes | Business case here: https://docs.google.com/document/d/1wILUbuzz8NqKY6Q6TUHc03Y_RjkV_wmKafkz_xvMZCM/edit?usp=sharing
Product Roadmap | No | <add link here>
Product Planning/Business Case | No | <add link here>
Product Brief | No | <add link here>
Other Links | No | <add links here>

Event Timeline

The Structured Data team is actively processing Enterprise HTML dumps to detect tabular and list data in Wikipedias as per T330848: [XL] Exclude sections with non-standard tables and lists, part of the Section-Topics and Section-Level-Image-Suggestions projects.
The current implementation relies on the Network File System (NFS) disk where the HTML dumps are available, which limits processing to machines where it is mounted, typically the stat hosts.

It would be beneficial to have those dumps available in the Data Lake, since that would enable scheduling the task as a Data Pipelines job.
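As a rough illustration of that first step, here is a minimal PySpark sketch of copying extracted Enterprise HTML dump files (assuming they have already been unpacked to JSON Lines) from the NFS mount into a Data Lake table. The dump path and the target table name are assumptions, not an agreed design; the field names follow the published Enterprise dump schema but should be verified against the actual files.

```python
# Hypothetical sketch: load extracted Enterprise HTML dump files (JSON Lines)
# from the NFS mount into a Data Lake table, so downstream jobs can run as
# scheduled Data Pipelines tasks instead of only on stat hosts.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("enterprise-html-to-datalake").getOrCreate()

DUMP_PATH = "/mnt/data/dumps/other/enterprise_html/runs/20230620/enwiki-*.ndjson"  # assumed layout
TARGET_TABLE = "research.enterprise_html"  # hypothetical table name

rows = spark.read.json(DUMP_PATH)

# Keep only what the section-processing jobs need: page metadata plus the
# rendered HTML body.
subset = rows.select(
    F.col("name").alias("page_title"),
    F.col("identifier").alias("page_id"),
    F.col("version.identifier").alias("revision_id"),
    F.col("article_body.html").alias("html"),
)

subset.write.mode("overwrite").saveAsTable(TARGET_TABLE)
```

Once the HTML sits in a table like this, the section-detection job from T330848 could be scheduled as an ordinary Data Pipelines task rather than running only where the NFS disk is mounted.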

Summarizing some discussions with @ssastry, @Ottomata, @dr0ptp4kt:

Considerations:

  • How do we handle templates?
  • How far do we go back? Even going back only as far as 2020 could take a significant amount of time to process, depending on the method used to obtain the historic HTML.
  • What is the best way to get historic HTML revisions? We need to work with the Content Transform team to figure this out (see the sketch after this list).
  • How usable will the data be once it is in the Data Lake? We will likely need to look at performance tuning (storage options, etc.).
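On the question of how to get historic HTML revisions: one path that exists today is re-rendering old revisions on demand through the public REST API's Parsoid endpoint, sketched below. This is only an illustration of the semantics, not a backfill strategy: it renders the old wikitext against current templates (which is exactly the template question above), and rate limits make it far too slow to cover all revisions, hence the need to work this out with the Content Transform team. The function name and User-Agent value are hypothetical.

```python
# Illustrative sketch only: fetch Parsoid HTML for a single historic revision
# via the public REST API. This renders against *current* templates and is
# not suitable for a full backfill.
import urllib.parse

import requests


def fetch_revision_html(domain: str, title: str, rev_id: int) -> str:
    # Titles use underscores and must be URL-encoded for the REST API.
    encoded_title = urllib.parse.quote(title.replace(" ", "_"), safe="")
    url = f"https://{domain}/api/rest_v1/page/html/{encoded_title}/{rev_id}"
    # A descriptive User-Agent is expected by API etiquette; placeholder value.
    resp = requests.get(url, headers={"User-Agent": "historical-html-dumps-exploration"})
    resp.raise_for_status()
    return resp.text


# Example usage (substitute a real title/revision pair):
# html = fetch_revision_html("en.wikipedia.org", "Coffee", 123456789)
```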

In summary, this is a large amount of effort and would need coordination/work across a few teams to get this done.

I could see this easily being over a quarter's worth of work for the Event Platform team (with input needed from other groups too).

How do we handle templates?

And/or: Do we want the historical revision HTML to match the 'current' parsing? If we only want to generate the HTML for all revisions once, fine. But if we want to 'update' the parsed HTML when templates change, and also when the wikitext parser changes, we'll either have to periodically regenerate historical snapshots (like we do for mediawiki history), or we'll need a way to do incremental updates, probably via Iceberg.
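A minimal sketch of what the Iceberg-based incremental option could look like, assuming a target Iceberg table keyed on (wiki_db, revision_id) and a staging table holding revisions that were re-rendered after a template or parser change. All table names, columns, and the catalog configuration are hypothetical.

```python
# Hypothetical sketch: upsert re-rendered HTML for only the affected revisions,
# instead of regenerating a full historical snapshot each time.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("historical-html-incremental-update")
    # Requires the Iceberg Spark runtime jar and a configured catalog (omitted).
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .getOrCreate()
)

spark.sql("""
    MERGE INTO datalake.historical_html t        -- hypothetical Iceberg table
    USING staging.rerendered_revisions s         -- revisions re-rendered after a template/parser change
    ON  t.wiki_db = s.wiki_db
    AND t.revision_id = s.revision_id
    WHEN MATCHED THEN UPDATE SET
        t.html = s.html,
        t.rendered_at = s.rendered_at
    WHEN NOT MATCHED THEN INSERT *
""")
```

The periodic-snapshot alternative is essentially the same job without the MERGE: rewrite the whole table on each run, the way mediawiki history snapshots are produced today.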

We are still very green with Iceberg, so if we want historical updates (I think we do?), we should probably complete some other 'easier' Iceberg-based projects (incremental wikitext dumps, incremental mediawiki history, etc.) before we go for this one :)