
[L] Periodically regenerate various variable data sets/files
Open, Needs Triage, Public

Description

Below is a list of datasets used across section-topics, section-image-recs and image-suggestions:

  • section titles denylist (section_titles_denylist.json)
  • table filter parquet (20230301_target_wikis_tables)
  • section alignments (/user/mnz/secmap_results/aligned_sections_subset/aligned_sections_subset_9.0_2022-02.parquet) See T325316
  • check_bad_parsing page filter (2022-10_ptwiki_bad)
  • QID filter (qids_for_all_points_in_time.txt & qids_for_media_outlets.txt)
  • placeholder images parquet (image_placeholders)

Some of these are currently bundled in the repo(s) that use them, while others are Parquet files in Hive produced by one-off runs.

Either way, they all contain data that changes over time, and will need periodic regeneration in order to remain accurate.
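
Several of the names above embed a snapshot date (e.g. 20230301_target_wikis_tables, 2022-10_ptwiki_bad), so one rough way to decide what is due for regeneration is to parse that date and compare it to a maximum age. The sketch below is illustrative only and not part of any of the repos: the helper names, the 90-day threshold, and the choice to treat undated names as stale are all assumptions.

```python
import re
from datetime import date

# Assumption: anything older than this many days is considered stale.
MAX_AGE_DAYS = 90

# Matches YYYYMMDD, YYYY-MM-DD or YYYY-MM embedded in a dataset name.
_DATE_RE = re.compile(r"(?<!\d)(20\d{2})-?(\d{2})(?:-?(\d{2}))?(?!\d)")


def snapshot_date(name):
    """Return the date embedded in a dataset name, or None if there is none."""
    match = _DATE_RE.search(name)
    if match is None:
        return None
    year, month, day = match.group(1), match.group(2), match.group(3) or "1"
    try:
        return date(int(year), int(month), int(day))
    except ValueError:
        # Digits that look like a date but are not one (e.g. month 13).
        return None


def is_stale(name, today, max_age_days=MAX_AGE_DAYS):
    """True if the dataset is older than max_age_days or has no parseable date."""
    snapshot = snapshot_date(name)
    if snapshot is None:
        # Assumption: an unknown age means the dataset should be regenerated.
        return True
    return (today - snapshot).days > max_age_days


if __name__ == "__main__":
    names = [
        "section_titles_denylist.json",
        "20230301_target_wikis_tables",
        "aligned_sections_subset_9.0_2022-02.parquet",
        "2022-10_ptwiki_bad",
        "image_placeholders",
    ]
    for name in names:
        print(name, snapshot_date(name), is_stale(name, date.today()))
```

A scheduled job could run such a check and only trigger the expensive regeneration scripts for datasets that are flagged, though regenerating everything on a fixed schedule is the simpler option.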

Details

Title | Reference | Author | Source Branch | Dest Branch
Draft: Update section-topics pipeline arguments | repos/data-engineering/airflow-dags!584 | mlitn | T339129_2 | main
Draft: Periodically invoke a bunch of section-topics scripts | repos/data-engineering/airflow-dags!583 | mlitn | T339129_1 | main
Draft: Accept denylist as parquet | repos/structured-data/image-suggestions!39 | mlitn | T339129 | main
Draft: Consume parquets instead of static files | repos/structured-data/section-topics!30 | mlitn | T339129_2 | main
Alter scripts/*.py to write their outputs to parquet | repos/structured-data/section-topics!29 | mlitn | T339129_1 | main

Event Timeline

MarkTraceur renamed this task from "Periodically regenerate various variable data sets/files" to "[L] Periodically regenerate various variable data sets/files". (Jun 14 2023, 4:53 PM)

All merge requests reviewed, Airflow test runs now ongoing.