
Make HTML Dumps available in hadoop
Open, Needs Triage, Public, 13 Estimated Story Points

Description

NOTE: The original intent of this ticket was to ingest https://dumps.wikimedia.org/other/enterprise_html/, but those files are experimental and not fully supported. To implement this ticket we plan to go directly to Enterprise for the files.

The Enterprise HTML dumps are a very valuable resource for many research purposes (see T182351 for a more detailed explanation). While they are available locally as json files on the stat machines, parsing the whole dump is computationally very expensive and takes a lot of time. Could we add the dumps to hadoop to make bulk processing feasible? I am thinking of something similar to the wikitext_current dumps (T238858).

Implementation Steps:

  • Write up SLO (wikitech) for ingestion job
  • Setup Data Engineering login for Enterprise access (ask Enterprise to turn limits off)
  • Design Iceberg schema and deploy (the schema for this version of the dumps is simpler than the file itself - we could possibly use that)
  • Check how much space this will take up and review with team (if too much we can look at prioritizing specific wikis)
  • Files can be downloaded using the Enterprise snapshot API (need a pre-step to get available snapshots/projects - see the sketch after this list)
  • Build Airflow job to read files and load to Iceberg table
  • Ensure that job retains only last 2 snapshots
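
As a rough sketch of that snapshot pre-step, listing the available snapshots could look something like the following. The endpoint paths, authentication flow, and field names are assumptions about the Wikimedia Enterprise API and would need to be checked against its documentation:

import requests

# Assumed Enterprise endpoints; verify against the API documentation.
AUTH_URL = 'https://auth.enterprise.wikimedia.com/v1/login'
SNAPSHOTS_URL = 'https://api.enterprise.wikimedia.com/v2/snapshots'

def list_snapshots(username: str, password: str) -> list:
    """Return metadata for the snapshots the account can access (project, date, size, ...)."""
    token = requests.post(AUTH_URL, json={'username': username, 'password': password},
                          timeout=60).json()['access_token']
    resp = requests.get(SNAPSHOTS_URL, headers={'Authorization': f'Bearer {token}'}, timeout=60)
    resp.raise_for_status()
    return resp.json()

# e.g. restrict to main-namespace snapshots before downloading
# ns0 = [s for s in list_snapshots(user, pw) if s.get('namespace', {}).get('identifier') == 0]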

Useful links:

Event Timeline

@MGerlach Where can one find the enterprise html json files on the stat machines? I didn't read this carefully enough before experimenting a bit - it will make things easier.

The .tar.gz format of the enterprise dumps is not ideal for batch processing. Unpacking the tar into multiple uncompressed json files is not efficient, and it also takes more space on hdfs because the json data is no longer compressed. See T298436 for a discussion about the data format.

That said, we can put the json files on hdfs as is. Then they can be read from spark.

wget https://dumps.wikimedia.org/other/enterprise_html/runs/20230320/simplewiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz
mkdir -p html_enterprise/simple
tar -C html_enterprise/simple -I pigz -xf simplewiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz
hdfs dfs -put html_enterprise/simple/simplewiki_*  html_enterprise/simple

Then in a notebook

import wmfdata
spark = wmfdata.spark.create_session(app_name='enterprise_html')
df = spark.read.json('/user/fab/html_enterprise/simple/*ndjson')
df.printSchema()

root
 |-- additional_entities: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- aspects: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- identifier: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- article_body: struct (nullable = true)
 |    |-- html: string (nullable = true)
 |    |-- wikitext: string (nullable = true)
 |-- categories: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- date_modified: string (nullable = true)
 |-- identifier: long (nullable = true)
 |-- in_language: struct (nullable = true)
 |    |-- identifier: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- is_part_of: struct (nullable = true)
 |    |-- identifier: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- license: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- identifier: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- main_entity: struct (nullable = true)
 |    |-- identifier: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- name: string (nullable = true)
 |-- namespace: struct (nullable = true)
 |    |-- identifier: long (nullable = true)
 |    |-- name: string (nullable = true)
 |-- protection: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- expiry: string (nullable = true)
 |    |    |-- level: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- redirects: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- templates: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- url: string (nullable = true)
 |-- version: struct (nullable = true)
 |    |-- comment: string (nullable = true)
 |    |-- editor: struct (nullable = true)
 |    |    |-- identifier: long (nullable = true)
 |    |    |-- is_anonymous: boolean (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |-- identifier: long (nullable = true)
 |    |-- is_minor_edit: boolean (nullable = true)
 |    |-- tags: array (nullable = true)
 |    |    |-- element: string (containsNull = true)

@MGerlach Where can one find the enterprise html json files on the stat machines? I didn't read this carefully enough before experimenting a bit - it will make things easier.

Here you should find the different snapshots:

/mnt/data/xmldatadumps/public/other/enterprise_html/runs

Thanks @MGerlach - the most recent run of the data on /mnt/data is from October 2022. Luckily I had already started the download for enwiki as well, so I went ahead and put the March 20th 2023 html dumps for simplewiki and enwiki on hdfs.

fab@stat1008:~$ hdfs dfs -du -h /wmf/data/research/html_enterprise/
631.4 G  /wmf/data/research/html_enterprise/enwiki
9.6 G    /wmf/data/research/html_enterprise/simplewiki

Doing this for all dumps would not be too cumbersome, though it should use the existing mechanism for getting the html dumps onto /mnt/data, followed by a script to extract and put the json files on hdfs. This should be discussed with data engineering, especially if we want to do this regularly, and in view of the request for the full historical dumps (T333419).

As a proof of concept this seems to work and scale well, though. I created a quick notebook with an example job: https://gitlab.wikimedia.org/repos/research/wikidiff/-/blob/main/notebooks/html_enterprise.ipynb. Processing the full enwiki dataset takes 5 minutes, albeit for a simple job that computes a histogram of the difference in number of bytes between wikitext and html.
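
For reference, the core of such a job is only a few lines of PySpark. This is a rough re-creation of the idea rather than the notebook itself (the hdfs path is the one listed above; the bucketing is illustrative):

import wmfdata
from pyspark.sql import functions as F

spark = wmfdata.spark.create_session(app_name='enterprise_html_poc')
df = spark.read.json('/wmf/data/research/html_enterprise/enwiki/*ndjson')

# Per-article size difference between the rendered HTML and the wikitext,
# bucketed into 10 kB bins and counted.
hist = (
    df.select((F.length('article_body.html') - F.length('article_body.wikitext')).alias('byte_diff'))
      .withColumn('bucket', (F.col('byte_diff') / 10000).cast('int') * 10000)
      .groupBy('bucket')
      .count()
      .orderBy('bucket')
)
hist.show(50)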

The Structured Content team is consuming the HTML dumps as a script in a production data pipeline. Data-Engineering: we'd also like to see this ticket happen!
Some pointers:

CC @matthiasmullie .

Moving this to discuss with the team. Seems reasonable to have 1 or 2 versions of this if we source it from the Enterprise dumps.

Having historical revisions and creating this ourselves would need a lot more consideration: https://phabricator.wikimedia.org/T333419#8779470

Moving this to discuss with the team. Seems reasonable to have 1 or 2 versions of this if we source it from the Enterprise dumps.

Thanks @lbowmaker for considering and @mfossati for raising! Just chiming in to add my support that having a current snapshot of Parsoid HTML from Enterprise would be very helpful. We've developed a Python library (mwparserfromhtml) that enables us to extract lots of features (references, infoboxes, plaintext, etc.) easily from the HTML so are in a good position to make use of it. Within Research, we're working on switching more of our models to using it too because the gap between wikitext and HTML is definitely growing (example with references). For example, we have an intern who will be working on converting the quality model used for knowledge gap metrics from using wikitext to HTML for this reason, so having a regular snapshot that could be used for computing article quality for all articles would be very helpful.

From what I understand this would be the work:

  • Design schema and implement Iceberg table for the data
  • Build Airflow job that checks for latest Enterprise dumps file
  • Load the new data, drop the oldest (maybe we keep 2 versions? - see the retention sketch after this list)
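
A minimal sketch of that retention step, assuming an Iceberg table with a snapshot column (the table and column names below are placeholders, not a decided schema):

import wmfdata

spark = wmfdata.spark.create_session(app_name='enterprise_html_retention')

TABLE = 'wmf_dumps.enterprise_html'  # placeholder table name

# List the ingested snapshots, newest first, and delete everything but the two most recent.
snapshots = [row.snapshot for row in spark.sql(
    f'SELECT DISTINCT snapshot FROM {TABLE} ORDER BY snapshot DESC').collect()]

for old_snapshot in snapshots[2:]:
    spark.sql(f"DELETE FROM {TABLE} WHERE snapshot = '{old_snapshot}'")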

Anything else to consider?

I'll let others chime in, but that would be my feeling about the correct scope. Going historical indeed adds a lot of complications, and I think current snapshots are a huge first step. I'd coordinate with Enterprise, obviously, just to see if any changes are going to happen with the schema etc., but hopefully it is relatively straightforward.

Pasting this reply from a slack thread for context

  • Agree with Isaac that no changes to the schema are necessary, especially since this is a custom format developed by Enterprise (in fact it contains both the wikitext and the html)
  • The enterprise dumps are current snapshots, i.e. no historical html is included. We want these html dumps available because they are the best option currently available, but I want to make the larger point that this is not a good final solution; we do require the historical html data for ML use cases.
    • The most common/convenient way to consume the html dumps is via spark from hdfs. In this notebook I was experimenting with downloading the enterprise dumps and putting them on hdfs; the unpacking steps don't seem ideal for distributed compute, and you end up with uncompressed json. Likely you would want to do some processing and have them end up in Iceberg. This is a roundabout solution requiring some gluing, but it seems to be the most attainable approach at the moment.
    • That said, with the new page change event data, DE does have a source of html data if my understanding is right. That data is an incremental dataset of all revisions with their html. We could create a “current html snapshot” using an airflow job that runs e.g. biweekly and creates something similar to the enterprise snapshot (a rough sketch follows this list). The benefit of this is that you do have the incremental dataset, and, after (albeit challenging) backfilling, that dataset can become the historical html dataset as well.
    • I am making this last point to reinforce the importance of T120242, to free ourselves from the snapshot prison. Or more appropriately, to move the control of how to create/expose snapshots into the DE itself.
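
To illustrate that alternative, collapsing an incremental page-change dataset into a "current html snapshot" is essentially one window function. The source table name and columns below are assumptions made for illustration, not the actual page change event schema:

import wmfdata
from pyspark.sql import Window, functions as F

spark = wmfdata.spark.create_session(app_name='html_current_snapshot')

# Hypothetical source: one row per (wiki, page, revision) including the rendered HTML.
events = spark.read.table('event.page_change_with_html')  # placeholder table name

latest_rev = Window.partitionBy('wiki_db', 'page_id').orderBy(F.col('rev_id').desc())

current_html = (
    events
    .withColumn('rn', F.row_number().over(latest_rev))
    .where('rn = 1')  # keep only the most recent revision per page
    .drop('rn')
)
# current_html.writeTo('wmf_dumps.html_current').createOrReplace()  # e.g. into an Iceberg table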

Hi folks - chiming in here from the Enterprise side. @fkaelin is correct that our snapshots are not historical. We likely will not be able to support that piece of the request if a specific historical dump is needed.

We would be happy to help with access to the snapshots, and I would add that we are currently looking to add chunking to snapshots, which may make them much easier to batch process.

This work is slated for mid-Q4 and can be tracked via T355443.

I'm interested as well, as I intend to look at some image dumping stuff, and the surrounding HTML will be important for understanding context.

If it isn't too much trouble and storage isn't too much of a worry, a step toward the following may be interesting on a per wiki_db / page_id basis (a rough sketch follows the list):

  • earliest captured HTML, revision ID, revision retrieval datetime, parser version
  • latest captured HTML, revision ID, revision retrieval datetime, parser version
  • previous captured HTML, revision ID, revision retrieval datetime, parser version
  • diff of HTML compared to previous capture, revision ID, revision retrieval datetime(s), parser version. This would probably allow reconstruction of HTML history in a coarse-grained fashion without sacrificing so much storage.
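
A rough sketch of how the latest/previous capture columns could be derived with window functions, assuming a table of HTML captures keyed by wiki_db and page_id (all names below are hypothetical):

import wmfdata
from pyspark.sql import Window, functions as F

spark = wmfdata.spark.create_session(app_name='html_capture_history')

# Hypothetical input: one row per captured parse of a page, with columns
# wiki_db, page_id, rev_id, retrieved_at, parser_version, html.
captures = spark.read.table('html_captures')  # placeholder table name

by_page = Window.partitionBy('wiki_db', 'page_id').orderBy(F.col('retrieved_at').desc())

per_page = (
    captures
    .withColumn('rank', F.row_number().over(by_page))
    .withColumn('previous_html', F.lead('html').over(by_page))  # the next-older capture
    .where('rank = 1')  # keep the latest capture, with the previous one alongside it
    .drop('rank')
)
# A coarse diff against previous_html (e.g. difflib in a UDF) could then replace
# storing the full HTML of every capture.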

Eventually we could fetch from Parsoid for the page change stream (this doesn't introduce a lot of extra load, as it would prewarm the parser cache, or, when backlogged, get an already-warmed response), or, if we had dailies for the HTML of all new revisions in another system (e.g., Enterprise), we could slide that in.

OTOH if it's too much work, monthlies (i.e., latest captured HTML, previous captured HTML) as said above would be an excellent start.