
Implement a backfill job for the dumps hourly table that can handle simplewiki
Closed, ResolvedPublic8 Estimated Story Points

Description

For Data Products Sprint 00, we want a fully working dump process for one of the smaller wikis. We chose simplewiki.

In this task we want to make sure we can backfill for simplewiki.

Done is:

  • the hourly table has all the revisions of all articles of simplewiki.
  • there is an Airflow job in production running for simplewiki.
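To illustrate what "backfilling the hourly table" entails, here is a minimal sketch of enumerating the hourly partitions a backfill would need to fill between two timestamps. This is illustrative only: the function and names are hypothetical, and the real scheduling logic lives in the Airflow DAG (airflow-dags!484), not in this snippet.

```python
from datetime import datetime, timedelta

def hourly_partitions(start: datetime, end: datetime):
    """Yield hourly partition timestamps in [start, end).

    Hypothetical helper for illustration; the production DAG
    handles partitioning and scheduling itself.
    """
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        yield current
        current += timedelta(hours=1)

# One day of backfill yields 24 hourly partitions.
parts = list(hourly_partitions(datetime(2023, 8, 1), datetime(2023, 8, 2)))
print(len(parts))  # 24
```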

Details

Title | Reference | Author | Source Branch | Dest Branch
Fix ivy issue on mediawiki dumps, and some cosmetic fixes. | repos/data-engineering/airflow-dags!495 | xcollazo | fix-ivy-issue | main
Add DAG to backfill wmf_dumps.wikitext_raw. | repos/data-engineering/airflow-dags!484 | xcollazo | add-mediawiki-dumps-backfill | main

Event Timeline

WDoranWMF set the point value for this task to 8.Aug 24 2023, 1:49 PM

wmf_dumps.wikitext_raw_rc1 has been backfilled with all of simplewiki:

spark-sql (default)> select count(1) as count from wikitext_raw_rc1 where wiki_db = 'simplewiki';
count
8087771
Time taken: 38.38 seconds, Fetched 1 row(s)

simplewiki's scale does not trigger the issues we had seen on T340861.

T340861#9135613 should take care of delivering the Airflow job to production. Thus, marking this task as ready for code review.

With https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/484 and T341383 in production, we now have a full backfill that handles simplewiki. We're done here.