
Make HTML Dumps available in hadoop
Open, Needs Triage, Public, 13 Estimated Story Points

Description

NOTE: The original intent of this ticket was to ingest https://dumps.wikimedia.org/other/enterprise_html/, but those files are experimental and not fully supported. To implement this ticket we plan to go directly to Enterprise for the files.

The Enterprise HTML dumps are a very valuable resource for many research purposes (see T182351 for a more detailed explanation). While they are available locally as json files on the stat machines, parsing the whole dump is computationally very expensive and takes a lot of time. Could we add the dumps to hadoop to make bulk processing feasible? I am thinking of something similar to the wikitext_current dumps (T238858).

Implementation Steps:

  • Write up SLO (wikitech) for ingestion job
  • Setup Data Engineering login for Enterprise access (ask Enterprise to turn limits off)
  • Design Iceberg schema and deploy (the schema for this version of the dumps is simpler than the file itself - we could possibly use that)
  • Check how much space this will take up and review with team (if too much we can look at prioritizing specific wikis)
  • Files can be downloaded using the Enterprise snapshot API (need a pre-step to get available snapshots/projects - see the sketch after this list)
  • Build Airflow job to read files and load to Iceberg table
  • Ensure that job retains only last 2 snapshots
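
As a rough sketch of that snapshot pre-step, listing the available snapshots could look something like the following. The endpoint paths, authentication flow, and field names are assumptions about the Wikimedia Enterprise API and would need to be checked against its documentation:

import requests

# Assumed Enterprise endpoints; verify against the API documentation.
AUTH_URL = 'https://auth.enterprise.wikimedia.com/v1/login'
SNAPSHOTS_URL = 'https://api.enterprise.wikimedia.com/v2/snapshots'

def list_snapshots(username: str, password: str) -> list:
    """Return metadata for the snapshots the account can access (project, date, size, ...)."""
    token = requests.post(AUTH_URL, json={'username': username, 'password': password},
                          timeout=60).json()['access_token']
    resp = requests.get(SNAPSHOTS_URL, headers={'Authorization': f'Bearer {token}'}, timeout=60)
    resp.raise_for_status()
    return resp.json()

# e.g. restrict to main-namespace snapshots before downloading
# ns0 = [s for s in list_snapshots(user, pw) if s.get('namespace', {}).get('identifier') == 0]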

Useful links:

Event Timeline

@MGerlach Where can one find the enterprise html json files on the stat machines? I didn't read this carefully enough before experimenting a bit - it will make things easier.

The .tar.gz format of the enterprise dumps is not ideal for batch processing. Unpacking the tar into multiple uncompressed json files is not efficient, and it also takes more space on hdfs because the json data is no longer compressed. See T298436 for a discussion about the data format.

That said, we can put the json files on hdfs as is. Then they can be read from spark.

wget https://dumps.wikimedia.org/other/enterprise_html/runs/20230320/simplewiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz
mkdir -p html_enterprise/simple
tar -C html_enterprise/simple -I pigz -xf simplewiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz
hdfs dfs -put html_enterprise/simple/simplewiki_*  html_enterprise/simple

Then in a notebook

import wmfdata
spark = wmfdata.spark.create_session(app_name='enterprise_html')
df = spark.read.json('/user/fab/html_enterprise/simple/*ndjson')
df.printSchema()

root
 |-- additional_entities: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- aspects: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- identifier: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- article_body: struct (nullable = true)
 |    |-- html: string (nullable = true)
 |    |-- wikitext: string (nullable = true)
 |-- categories: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- date_modified: string (nullable = true)
 |-- identifier: long (nullable = true)
 |-- in_language: struct (nullable = true)
 |    |-- identifier: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- is_part_of: struct (nullable = true)
 |    |-- identifier: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- license: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- identifier: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- main_entity: struct (nullable = true)
 |    |-- identifier: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- name: string (nullable = true)
 |-- namespace: struct (nullable = true)
 |    |-- identifier: long (nullable = true)
 |    |-- name: string (nullable = true)
 |-- protection: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- expiry: string (nullable = true)
 |    |    |-- level: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- redirects: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- templates: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- url: string (nullable = true)
 |-- version: struct (nullable = true)
 |    |-- comment: string (nullable = true)
 |    |-- editor: struct (nullable = true)
 |    |    |-- identifier: long (nullable = true)
 |    |    |-- is_anonymous: boolean (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |-- identifier: long (nullable = true)
 |    |-- is_minor_edit: boolean (nullable = true)
 |    |-- tags: array (nullable = true)
 |    |    |-- element: string (containsNull = true)

@MGerlach Where can one find the enterprise html json files on the stat machines? I didn't read this carefully enough before experimenting a bit - it will make things easier.

Here you should find the different snapshots:

/mnt/data/xmldatadumps/public/other/enterprise_html/runs

Thanks @MGerlach - the most recent run of the data on /mnt/data is from October 2022. Luckily I had already started the download for enwiki as well, so I went ahead and put the March 20th 2023 html dumps for simplewiki and enwiki on hdfs.

fab@stat1008:~$ hdfs dfs -du -h /wmf/data/research/html_enterprise/
631.4 G  /wmf/data/research/html_enterprise/enwiki
9.6 G    /wmf/data/research/html_enterprise/simplewiki

Doing this for all dumps would not be too cumbersome, though it should use the existing mechanism for getting the html dumps onto /mnt/data, followed by a script to extract and put the json files on hdfs. This should be discussed with data engineering, especially if we want to do this regularly, and in view of the request for the full historical dumps (T333419).

As a proof of concept this seems to work and scale well, though. I created a quick notebook with an example job: https://gitlab.wikimedia.org/repos/research/wikidiff/-/blob/main/notebooks/html_enterprise.ipynb. Processing the full enwiki dataset takes 5 minutes, albeit for a simple job that computes a histogram of the difference in number of bytes between wikitext and html.
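
For reference, the core of such a job is only a few lines of PySpark. This is a rough re-creation of the idea rather than the notebook itself (the hdfs path is the one listed above; the bucketing is illustrative):

import wmfdata
from pyspark.sql import functions as F

spark = wmfdata.spark.create_session(app_name='enterprise_html_poc')
df = spark.read.json('/wmf/data/research/html_enterprise/enwiki/*ndjson')

# Per-article size difference between the rendered HTML and the wikitext,
# bucketed into 10 kB bins and counted.
hist = (
    df.select((F.length('article_body.html') - F.length('article_body.wikitext')).alias('byte_diff'))
      .withColumn('bucket', (F.col('byte_diff') / 10000).cast('int') * 10000)
      .groupBy('bucket')
      .count()
      .orderBy('bucket')
)
hist.show(50)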

The Structured Content team is consuming the HTML dumps as a script in a production data pipeline. Data-Engineering: we'd also like to see this ticket happen!
Some pointers:

CC @matthiasmullie .

Moving this to discuss with the team. Seems reasonable to have 1 or 2 versions of this if we source it from the Enterprise dumps.

Having historical revisions and creating this ourselves would need a lot more consideration: https://phabricator.wikimedia.org/T333419#8779470

Moving this to discuss with the team. Seems reasonable to have 1 or 2 versions of this if we source it from the Enterprise dumps.

Thanks @lbowmaker for considering and @mfossati for raising! Just chiming in to add my support that having a current snapshot of Parsoid HTML from Enterprise would be very helpful. We've developed a Python library (mwparserfromhtml) that enables us to extract lots of features (references, infoboxes, plaintext, etc.) easily from the HTML so are in a good position to make use of it. Within Research, we're working on switching more of our models to using it too because the gap between wikitext and HTML is definitely growing (example with references). For example, we have an intern who will be working on converting the quality model used for knowledge gap metrics from using wikitext to HTML for this reason, so having a regular snapshot that could be used for computing article quality for all articles would be very helpful.

From what I understand this would be the work:

  • Design schema and implement Iceberg table for the data
  • Build Airflow job that checks for latest Enterprise dumps file
  • Load the new data, drop the oldest (maybe we keep 2 versions? - see the retention sketch after this list)
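
A minimal sketch of that retention step, assuming an Iceberg table with a snapshot column (the table and column names below are placeholders, not a decided schema):

import wmfdata

spark = wmfdata.spark.create_session(app_name='enterprise_html_retention')

TABLE = 'wmf_dumps.enterprise_html'  # placeholder table name

# List the ingested snapshots, newest first, and delete everything but the two most recent.
snapshots = [row.snapshot for row in spark.sql(
    f'SELECT DISTINCT snapshot FROM {TABLE} ORDER BY snapshot DESC').collect()]

for old_snapshot in snapshots[2:]:
    spark.sql(f"DELETE FROM {TABLE} WHERE snapshot = '{old_snapshot}'")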

Anything else to consider?

I'll let others chime in, but that would be my feeling about the correct scope. Going historical indeed adds a lot of complications, and I think current snapshots are a huge first step. I'd coordinate with Enterprise, obviously, just to see if any changes are going to happen with the schema etc., but hopefully it is relatively straightforward.

Pasting this reply from a slack thread for context

  • Agree with Isaac that no changes to the schema are necessary, especially since this is a custom format developed by Enterprise (in fact it contains both the wikitext and the html)
  • The enterprise dumps are current snapshots, i.e. no historical html is included. We want these html dumps available because they are the best option currently available, but I want to make the larger point that this is not a good final solution; we do require the historical html data for ML use cases.
    • The most common/convenient way to consume the html dumps is via spark from hdfs. In this notebook I was experimenting with downloading the enterprise dumps and putting them on hdfs; the unpacking steps don't seem ideal for distributed compute, and you end up with uncompressed json. Likely you would want to do some processing and have them end up in Iceberg. This is a roundabout solution requiring some gluing, but it seems to be the most attainable approach at the moment.
    • That said, with the new page change event data, DE does have a source of html data if my understanding is right. That data is an incremental dataset of all revisions with their html. We could create a “current html snapshot” using an airflow job that runs e.g. biweekly and creates something similar to the enterprise snapshot (a rough sketch follows this list). The benefit of this is that you do have the incremental dataset, and, after (albeit challenging) backfilling, that dataset can become the historical html dataset as well.
    • I am making this last point to reinforce the importance of T120242, to free ourselves from the snapshot prison. Or more appropriately, to move the control of how to create/expose snapshots into the DE itself.
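
To illustrate that alternative, collapsing an incremental page-change dataset into a "current html snapshot" is essentially one window function. The source table name and columns below are assumptions made for illustration, not the actual page change event schema:

import wmfdata
from pyspark.sql import Window, functions as F

spark = wmfdata.spark.create_session(app_name='html_current_snapshot')

# Hypothetical source: one row per (wiki, page, revision) including the rendered HTML.
events = spark.read.table('event.page_change_with_html')  # placeholder table name

latest_rev = Window.partitionBy('wiki_db', 'page_id').orderBy(F.col('rev_id').desc())

current_html = (
    events
    .withColumn('rn', F.row_number().over(latest_rev))
    .where('rn = 1')  # keep only the most recent revision per page
    .drop('rn')
)
# current_html.writeTo('wmf_dumps.html_current').createOrReplace()  # e.g. into an Iceberg table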

Hi folks - chiming in here from the Enterprise side. @fkaelin is correct that our snapshots are not historical. We likely will not be able to support that piece of the request if a specific historical dump is needed.

We would be happy to help with access to the snapshots, and I would add that we are currently looking to add chunking to snapshots, which may make them much easier to batch process.

This work is slated for mid-Q4 and can be tracked via T355443.

I'm interested as well, as I intend to look at some image dumping stuff, and the surrounding HTML will be important for understanding context.

If it isn't too much trouble and storage isn't too much of a worry, a step toward the following may be interesting on a per wiki_db / page_id basis (a rough sketch follows the list):

  • earliest captured HTML, revision ID, revision retrieval datetime, parser version
  • latest captured HTML, revision ID, revision retrieval datetime, parser version
  • previous captured HTML, revision ID, revision retrieval datetime, parser version
  • diff of HTML compared to previous capture, revision ID, revision retrieval datetime(s), parser version. This would probably allow reconstruction of HTML history in a coarse-grained fashion without sacrificing so much storage.
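
A rough sketch of how the latest/previous capture columns could be derived with window functions, assuming a table of HTML captures keyed by wiki_db and page_id (all names below are hypothetical):

import wmfdata
from pyspark.sql import Window, functions as F

spark = wmfdata.spark.create_session(app_name='html_capture_history')

# Hypothetical input: one row per captured parse of a page, with columns
# wiki_db, page_id, rev_id, retrieved_at, parser_version, html.
captures = spark.read.table('html_captures')  # placeholder table name

by_page = Window.partitionBy('wiki_db', 'page_id').orderBy(F.col('retrieved_at').desc())

per_page = (
    captures
    .withColumn('rank', F.row_number().over(by_page))
    .withColumn('previous_html', F.lead('html').over(by_page))  # the next-older capture
    .where('rank = 1')  # keep the latest capture, with the previous one alongside it
    .drop('rank')
)
# A coarse diff against previous_html (e.g. difflib in a UDF) could then replace
# storing the full HTML of every capture.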

Eventually we could fetch from Parsoid for the page change stream (this doesn't introduce a lot of extra load, as it would prewarm the parser cache, or, when backlogged, get an already-warmed response), or, if we had dailies for the HTML of all new revisions in another system (e.g., Enterprise), we could slide that in.

OTOH if it's too much work, monthlies (i.e., latest captured HTML, previous captured HTML) as said above would be an excellent start.