Scraper will output to a simple ND-JSON file
Closed, Resolved · Public

Description

We're having trouble with any direct Hadoop integration, so the next approach will be to write to a simple file, and then ingest that file with a Spark task in the same Airflow job.

  • Adapt the EventGate output stage to write JSON to a file (see the sketch after this list).
  • The destination path is passed into the program as a command-line argument.
  • Run the scraper as an Airflow SimpleSkeinOperator with the output directory parameter set to a temporary directory which will have several GB available.
  • Write an Airflow SparkSqlOperator task to load this file into our destination table, overwriting the entire partition given by {wiki_db, snapshot_date}.
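
A minimal sketch of what the adapted output stage could look like, assuming the scraper yields one dict per page; the function and argument names here are hypothetical, not the scraper's actual API:

import argparse
import json

def parse_args():
    parser = argparse.ArgumentParser()
    # The destination path arrives as a command-line argument, per the list above.
    parser.add_argument("--output-path", required=True)
    return parser.parse_args()

def write_ndjson(records, output_path):
    # ND-JSON is one JSON object per line, with no enclosing array or
    # trailing commas, which Spark's json reader ingests natively.
    with open(output_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")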

Event Timeline

awight updated the task description.

January data is being imported manually, which serves as a verification of the queries here.

Table creation is uneventful:

sudo -u analytics-wmde kerberos-run-command analytics-wmde spark3-sql -f create_table_cite_ref_errors_by_type.hql --database wmde -d location=/wmf/data/wmde/cite_ref_errors_by_type
sudo -u analytics-wmde kerberos-run-command analytics-wmde spark3-sql -f create_table_transclusions_containing_only_refs.hql --database wmde -d location=/wmf/data/wmde/transclusions_containing_only_refs
sudo -u analytics-wmde kerberos-run-command analytics-wmde spark3-sql -f create_table_transclusions_containing_refs.hql --database wmde -d location=/wmf/data/wmde/transclusions_containing_refs
sudo -u analytics-wmde kerberos-run-command analytics-wmde spark3-sql -f create_table_transclusions_within_refs.hql --database wmde -d location=/wmf/data/wmde/transclusions_within_refs
sudo -u analytics-wmde kerberos-run-command analytics-wmde spark3-sql -f create_table_wiki_page_cite_references_monthly.hql --database wmde -d location=/wmf/data/wmde/wiki_page_cite_references_monthly
sudo -u analytics-wmde kerberos-run-command analytics-wmde spark3-sql -f create_table_wiki_page_cite_references_raw.hql --database wmde -d location=/wmf/data/wmde/wiki_page_cite_references_raw

We load from a "local" file (a path on the client filesystem rather than HDFS), which may cause problems later:

sudo -u analytics-wmde kerberos-run-command analytics-wmde spark3-sql -f load_data_wiki_page_cite_references_raw.hql --database wmde -d local_path=/home/awight/scrape-wiki-html-dump/dewiki-2026-02-02-page-summary.ndjson -d snapshot_date=2026-02-02 -d wiki_db=dewiki
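
In PySpark terms the load step presumably amounts to something like the sketch below; the .hql file is the source of truth, and this assumes the JSON fields line up with the table's non-partition columns. One plausible reading of the "local" concern is that a client-side path is only visible to the driver, not to cluster executors.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read the ND-JSON scraper output; each line becomes one row.
# The file:// prefix marks this as a local (non-HDFS) path.
raw = spark.read.json(
    "file:///home/awight/scrape-wiki-html-dump/dewiki-2026-02-02-page-summary.ndjson"
)
raw.createOrReplaceTempView("scraped")

# Overwrite exactly the {wiki_db, snapshot_date} partition being loaded.
spark.sql("""
    INSERT OVERWRITE TABLE wmde.wiki_page_cite_references_raw
    PARTITION (wiki_db = 'dewiki', snapshot_date = '2026-02-02')
    SELECT * FROM scraped
""")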

However, aggregation succeeds:

sudo -u analytics-wmde kerberos-run-command analytics-wmde spark3-sql -f wiki_page_cite_references_monthly.hql --database wmde -d source_table=wiki_page_cite_references_raw -d snapshot_date=2026-02-02
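
The deduplication the aggregation relies on (see the note below) can be pictured as a group-by over pages. This is only a hedged sketch with hypothetical column names, not the contents of wiki_page_cite_references_monthly.hql:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Collapse duplicate pages (e.g. from re-scraped chunks) to one row each,
# regardless of which revision each duplicate was scraped from.
monthly = spark.sql("""
    SELECT wiki_db, page_id, MAX(ref_count) AS ref_count
    FROM wmde.wiki_page_cite_references_raw
    WHERE snapshot_date = '2026-02-02'
    GROUP BY wiki_db, page_id
""")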

Note that the output file is reopened for each chunk, an odd combination of reading saved chunks from separate files while writing them all to a single output file. I don't think this will hurt anything; the aggregation step rejects duplicate pages (ignoring whether the revisions change). A sketch of the pattern follows.
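
Roughly, each chunk reopens the destination in append mode (names hypothetical):

import json

def append_chunk(records, output_path):
    # Reopened per chunk in append mode, so earlier chunks are preserved.
    # A retried chunk can write the same page twice; the downstream
    # aggregation deduplicates by page, so this is tolerable.
    with open(output_path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")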

Why not write to multiple files? Depending on how many rows we're talking about, the downstream Spark read via a CREATE TEMPORARY VIEW would be happy to read multiple files.
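
For example, a temporary view over a glob of ND-JSON files reads the whole directory as one dataset (the path here is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# One view over every chunk file; Spark unions them transparently.
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW scraped
    USING json
    OPTIONS (path '/tmp/scraper-output/*.ndjson')
""")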

andrewtavis-wmde merged https://gitlab.wikimedia.org/repos/wmde/analytics/-/merge_requests/22 ("Change Cite reference pipeline to use an intermediate file").

Tobi_WMDE_SW claimed this task.