
[L] Periodically regenerate various variable data sets/files
Open, Needs Triage, Public

Description

Below is a list of datasets used across section-topics, section-image-recs and image-suggestions:

  • section titles denylist (section_titles_denylist.json)
  • table filter parquet (20230301_target_wikis_tables)
  • section alignments (/user/mnz/secmap_results/aligned_sections_subset/aligned_sections_subset_9.0_2022-02.parquet) See T325316
  • check_bad_parsing page filter (2022-10_ptwiki_bad)
  • QID filter (qids_for_all_points_in_time.txt & qids_for_media_outlets.txt)
  • placeholder images parquet (image_placeholders)

Some of these are currently bundled in the repo(s) using them, while others are Parquet tables in Hive produced by one-off runs.

Either way, they all contain data that changes over time and will need periodic regeneration to remain accurate.

NOTE: we'll manually copy a frozen snapshot of the HTML table filter until this ticket is unblocked.

Event Timeline

MarkTraceur renamed this task from Periodically regenerate various variable data sets/files to [L] Periodically regenerate various variable data sets/files.Jun 14 2023, 4:53 PM

All merge requests reviewed; Airflow test runs are now ongoing.

Deployed. New DAGs will start next Monday 2024-09-23.
This ticket definitely needs monitoring.

I've tweaked this week's section topics DAG so that it generates the 2024-09-09 snapshot, allowing image suggestions to run again after a (now resolved) missing upstream dependency:

  • manually generated the section titles denylist with the latest available SEAL alignments: python section_topics/scripts/gather_section_titles_denylist.py -a /user/analytics-platform-eng/structured-data/seal/alignments/2024-08-19 -o 2024-08-19
  • bad parsing & HTML tables as per previous runs

The new section titles denylist takes SEAL alignments as input and is much more aggressive, significantly decreasing the total number of section topics:

# In a pyspark shell, where `spark` is the active SparkSession
curr = spark.read.parquet('/user/analytics-platform-eng/structured-data/section_topics/2024-09-09')
prev = spark.read.parquet('/user/analytics-platform-eng/structured-data/section_topics/2024-08-19')
prev.count(), curr.count()

(258520059, 146648150)

For instance, ptwiki now has an order of magnitude more denylisted section titles than with the old denylist: 1,403 vs. 172.
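As an illustration, per-wiki denylist sizes can be compared with a few lines of Python. Note that the `{wiki: [titles]}` layout assumed for section_titles_denylist.json, and the sample titles below, are assumptions for the sketch, not confirmed by this ticket:

```python
# Sketch: compare per-wiki denylist sizes between two denylist snapshots.
# The {wiki: [titles]} JSON layout is an assumption about the file format.
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def denylist_sizes(path):
    """Return {wiki: number of denylisted section titles}."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return {wiki: len(titles) for wiki, titles in data.items()}

# Tiny stand-in snapshots for illustration (real files live in the repo).
with TemporaryDirectory() as tmp:
    old = Path(tmp) / "denylist_old.json"
    new = Path(tmp) / "denylist_new.json"
    old.write_text(json.dumps({"ptwiki": ["Ligações externas"]}), encoding="utf-8")
    new.write_text(
        json.dumps({"ptwiki": ["Ligações externas", "Ver também", "Referências"]}),
        encoding="utf-8",
    )
    sizes_old = denylist_sizes(old)
    sizes_new = denylist_sizes(new)

print(sizes_old, sizes_new)
```

The same diffing approach would surface wikis whose denylists grew the most between runs.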

SLIS have also significantly decreased due to the denylist:

snapshot      count
2024-08-19    3504732
2024-09-09    1321659
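For reference, the counts reported above amount to roughly a 43% drop in section topics and a 62% drop in SLIS between the two snapshots; a quick check:

```python
# Relative decrease implied by the snapshot counts reported above.
def pct_drop(prev, curr):
    return 100 * (prev - curr) / prev

topics_drop = pct_drop(258520059, 146648150)  # section topics, 2024-08-19 vs 2024-09-09
slis_drop = pct_drop(3504732, 1321659)        # SLIS, same snapshots
print(f"section topics: -{topics_drop:.1f}%, SLIS: -{slis_drop:.1f}%")
```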

Update

  • check bad parsing: the next run should have started on 2024-10-01, but for some reason the DAG didn't start, so I've triggered it manually. Now running and waiting for upstream dependencies (2024-09 wikitext)
  • detect HTML tables: started on 2024-10-03, now waiting for upstream dependencies (20241001 HTML dumps)
  • section titles denylist: started on 2024-09-30, successful run! 🎉
  • all data pipelines successful!!! 🚀

I'll keep monitoring the new DAGs.

Update

  • check bad parsing: 2024-09 wikitext was still missing, so the DAG run was expected to time out; it failed, was hotfixed, and is now running
  • detect HTML tables: blocked, see this Slack thread

Update

  • check bad parsing: successful run that generated the 2024-09 snapshot
  • detect HTML tables: paused. Manually fed a frozen snapshot to avoid breaking image suggestions.

This ticket is currently blocked: discussion needed to prioritize T305688: Make HTML Dumps available in hadoop.