
[L] Periodically regenerate various variable data sets/files
Open, Needs Triage, Public

Description

Below is a list of datasets used across section-topics, section-image-recs and image-suggestions:

  • section titles denylist (section_titles_denylist.json)
  • table filter parquet (20230301_target_wikis_tables)
  • section alignments (/user/mnz/secmap_results/aligned_sections_subset/aligned_sections_subset_9.0_2022-02.parquet) See T325316
  • check_bad_parsing page filter (2022-10_ptwiki_bad)
  • QID filter (qids_for_all_points_in_time.txt & qids_for_media_outlets.txt)
  • placeholder images parquet (image_placeholders)

Some of these are currently bundled in the repo(s) using them, while others are Parquet tables in Hive produced by one-off runs.

Either way, they all contain data that changes over time and will need periodic regeneration to remain accurate.

NOTE: we'll manually copy a frozen snapshot of the HTML table filter until this ticket is unblocked.

Event Timeline

MarkTraceur renamed this task from Periodically regenerate various variable data sets/files to [L] Periodically regenerate various variable data sets/files.Jun 14 2023, 4:53 PM

All merge requests reviewed; Airflow test runs are now ongoing.

Deployed. New DAGs will start next Monday 2024-09-23.
This ticket definitely needs monitoring.

I've tweaked this week's section topics DAG so that it generates the 2024-09-09 snapshot, allowing image suggestions to run again after a (now resolved) missing upstream dependency:

  • manually generated the section titles denylist with the latest available SEAL alignments: python section_topics/scripts/gather_section_titles_denylist.py -a /user/analytics-platform-eng/structured-data/seal/alignments/2024-08-19 -o 2024-08-19
  • bad parsing & HTML tables as per previous runs

The new section titles denylist takes SEAL alignments as input and is much more aggressive, significantly decreasing the total number of section topics:

# In a pyspark shell, where `spark` is the active SparkSession
curr = spark.read.parquet('/user/analytics-platform-eng/structured-data/section_topics/2024-09-09')
prev = spark.read.parquet('/user/analytics-platform-eng/structured-data/section_topics/2024-08-19')
prev.count(), curr.count()

(258520059, 146648150)

For instance, ptwiki now has an order of magnitude more denylisted section titles than with the old denylist: 1,403 vs. 172.
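As an illustration, per-wiki denylist sizes can be compared with a few lines of Python. Note that the `{wiki: [titles]}` layout assumed for section_titles_denylist.json, and the sample titles below, are assumptions for the sketch, not confirmed by this ticket:

```python
# Sketch: compare per-wiki denylist sizes between two denylist snapshots.
# The {wiki: [titles]} JSON layout is an assumption about the file format.
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def denylist_sizes(path):
    """Return {wiki: number of denylisted section titles}."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return {wiki: len(titles) for wiki, titles in data.items()}

# Tiny stand-in snapshots for illustration (real files live in the repo).
with TemporaryDirectory() as tmp:
    old = Path(tmp) / "denylist_old.json"
    new = Path(tmp) / "denylist_new.json"
    old.write_text(json.dumps({"ptwiki": ["Ligações externas"]}), encoding="utf-8")
    new.write_text(
        json.dumps({"ptwiki": ["Ligações externas", "Ver também", "Referências"]}),
        encoding="utf-8",
    )
    sizes_old = denylist_sizes(old)
    sizes_new = denylist_sizes(new)

print(sizes_old, sizes_new)
```

The same diffing approach would surface wikis whose denylists grew the most between runs.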

SLIS have also significantly decreased due to the denylist:

snapshot      count
2024-08-19    3504732
2024-09-09    1321659
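For reference, the counts reported above amount to roughly a 43% drop in section topics and a 62% drop in SLIS between the two snapshots; a quick check:

```python
# Relative decrease implied by the snapshot counts reported above.
def pct_drop(prev, curr):
    return 100 * (prev - curr) / prev

topics_drop = pct_drop(258520059, 146648150)  # section topics, 2024-08-19 vs 2024-09-09
slis_drop = pct_drop(3504732, 1321659)        # SLIS, same snapshots
print(f"section topics: -{topics_drop:.1f}%, SLIS: -{slis_drop:.1f}%")
```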

Update

  • check bad parsing: the next run should have started on 2024-10-01, but for some reason the DAG didn't start, so I've triggered it manually. Now running and waiting for upstream dependencies (2024-09 wikitext)
  • detect HTML tables: started on 2024-10-03, now waiting for upstream dependencies (20241001 HTML dumps)
  • section titles denylist: started on 2024-09-30, successful run! 🎉
  • all data pipelines successful!!! 🚀

I'll keep monitoring the new DAGs.

Update

  • check bad parsing: 2024-09 wikitext was still missing, so the DAG run was expected to time out; it failed, was hotfixed, and is now running
  • detect HTML tables: blocked, see this Slack thread

Update

  • check bad parsing: successful run that generated the 2024-09 snapshot
  • detect HTML tables: paused. Manually fed a frozen snapshot to avoid breaking image suggestions.

This ticket is currently blocked: discussion needed to prioritize T305688: Make HTML Dumps available in hadoop.