
Historical HTML dumps
Open, Needs Triage, Public

Description

Request Status: New Request
Request Type: project support request
Related OKRs:

Request Title: Generate HTML dumps

  • Request Description: Generate a dataset of the rendered HTML of all historical revisions.
  • Indicate Priority Level: Medium
  • Main Requestors: @Miriam @fkaelin
  • Ideal Delivery Date: August 2023
  • Stakeholders: Research team

Request Documentation

Document Type | Required? | Document/Link
Related PHAB Tickets | Yes | T182351, T305688, T161773
Product One Pager | Yes | https://docs.google.com/document/d/1UYzNHyq0kmfv4ehZtZk_QVCAvaWoSxrES0-3yRo_Wz0/edit#
Product Requirements Document (PRD) | Yes | Business case here: https://docs.google.com/document/d/1wILUbuzz8NqKY6Q6TUHc03Y_RjkV_wmKafkz_xvMZCM/edit?usp=sharing
Product Roadmap | No | <add link here>
Product Planning/Business Case | No | <add link here>
Product Brief | No | <add link here>
Other Links | No | <add links here>

Event Timeline

The Structured Data team is actively processing Enterprise HTML dumps to detect tabular and list data in Wikipedias as per T330848: [XL] Exclude sections with non-standard tables and lists, part of the Section-Topics and Section-Level-Image-Suggestions projects.
The current implementation relies on the Network File System (NFS) disk where the HTML dumps are available, which limits processing to machines where it is mounted, typically the stat hosts.

It would be beneficial to have those dumps available in the Data Lake, since that would enable scheduling the task as a Data Pipelines job.
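As a rough illustration of that first step, here is a minimal PySpark sketch of copying extracted Enterprise HTML dump files (assuming they have already been unpacked to JSON Lines) from the NFS mount into a Data Lake table. The dump path and the target table name are assumptions, not an agreed design; the field names follow the published Enterprise dump schema but should be verified against the actual files.

```python
# Hypothetical sketch: load extracted Enterprise HTML dump files (JSON Lines)
# from the NFS mount into a Data Lake table, so downstream jobs can run as
# scheduled Data Pipelines tasks instead of only on stat hosts.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("enterprise-html-to-datalake").getOrCreate()

DUMP_PATH = "/mnt/data/dumps/other/enterprise_html/runs/20230620/enwiki-*.ndjson"  # assumed layout
TARGET_TABLE = "research.enterprise_html"  # hypothetical table name

rows = spark.read.json(DUMP_PATH)

# Keep only what the section-processing jobs need: page metadata plus the
# rendered HTML body.
subset = rows.select(
    F.col("name").alias("page_title"),
    F.col("identifier").alias("page_id"),
    F.col("version.identifier").alias("revision_id"),
    F.col("article_body.html").alias("html"),
)

subset.write.mode("overwrite").saveAsTable(TARGET_TABLE)
```

Once the HTML sits in a table like this, the section-detection job from T330848 could be scheduled as an ordinary Data Pipelines task rather than running only where the NFS disk is mounted.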

Summarizing some discussions with @ssastry, @Ottomata, @dr0ptp4kt:

Considerations:

  • How do we handle templates?
  • How far do we go back? Even going back only as far as 2020 could take a significant amount of time to process, depending on the method used to obtain the historic HTML.
  • What is the best way to get historic HTML revisions? We need to work with the Content Transform team to figure this out (see the sketch after this list).
  • How usable will the data be once it is in the Data Lake? We will likely need to look at performance tuning (storage options, etc.).
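On the question of how to get historic HTML revisions: one path that exists today is re-rendering old revisions on demand through the public REST API's Parsoid endpoint, sketched below. This is only an illustration of the semantics, not a backfill strategy: it renders the old wikitext against current templates (which is exactly the template question above), and rate limits make it far too slow to cover all revisions, hence the need to work this out with the Content Transform team. The function name and User-Agent value are hypothetical.

```python
# Illustrative sketch only: fetch Parsoid HTML for a single historic revision
# via the public REST API. This renders against *current* templates and is
# not suitable for a full backfill.
import urllib.parse

import requests


def fetch_revision_html(domain: str, title: str, rev_id: int) -> str:
    # Titles use underscores and must be URL-encoded for the REST API.
    encoded_title = urllib.parse.quote(title.replace(" ", "_"), safe="")
    url = f"https://{domain}/api/rest_v1/page/html/{encoded_title}/{rev_id}"
    # A descriptive User-Agent is expected by API etiquette; placeholder value.
    resp = requests.get(url, headers={"User-Agent": "historical-html-dumps-exploration"})
    resp.raise_for_status()
    return resp.text


# Example usage (substitute a real title/revision pair):
# html = fetch_revision_html("en.wikipedia.org", "Coffee", 123456789)
```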

In summary, this is a large amount of effort and would need coordination/work across a few teams to get this done.

I could see this easily being over a quarter's worth of work for the Event Platform team (with input needed from other groups too).

How do we handle templates?

And/or: Do we want the historical revision HTML to match the 'current' parsing? If we only want to generate the HTML for all revisions once, fine. But if we want to 'update' the parsed HTML when templates change, and also when the wikitext parser changes, we'll either have to periodically regenerate historical snapshots (like we do for mediawiki history), or we'll need a way to do incremental updates, probably via Iceberg.
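A minimal sketch of what the Iceberg-based incremental option could look like, assuming a target Iceberg table keyed on (wiki_db, revision_id) and a staging table holding revisions that were re-rendered after a template or parser change. All table names, columns, and the catalog configuration are hypothetical.

```python
# Hypothetical sketch: upsert re-rendered HTML for only the affected revisions,
# instead of regenerating a full historical snapshot each time.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("historical-html-incremental-update")
    # Requires the Iceberg Spark runtime jar and a configured catalog (omitted).
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .getOrCreate()
)

spark.sql("""
    MERGE INTO datalake.historical_html t        -- hypothetical Iceberg table
    USING staging.rerendered_revisions s         -- revisions re-rendered after a template/parser change
    ON  t.wiki_db = s.wiki_db
    AND t.revision_id = s.revision_id
    WHEN MATCHED THEN UPDATE SET
        t.html = s.html,
        t.rendered_at = s.rendered_at
    WHEN NOT MATCHED THEN INSERT *
""")
```

The periodic-snapshot alternative is essentially the same job without the MERGE: rewrite the whole table on each run, the way mediawiki history snapshots are produced today.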

We are still very green with Iceberg, so if we want historical updates (I think we do?), we should probably complete some other 'easier' Iceberg-based projects (incremental wikitext dumps, incremental mediawiki history, etc.) before we go for this one :)