Below is a list of datasets used across section-topics, section-image-recs and image-suggestions:
- section titles denylist (section_titles_denylist.json)
- table filter parquet (20230301_target_wikis_tables)
- section alignments (/user/mnz/secmap_results/aligned_sections_subset/aligned_sections_subset_9.0_2022-02.parquet). See T325316
- check_bad_parsing page filter (2022-10_ptwiki_bad)
- QID filter (qids_for_all_points_in_time.txt & qids_for_media_outlets.txt)
- placeholder images parquet (image_placeholders)
Some of these are currently bundled in the repositories that use them, while others are Parquet datasets in Hive produced by one-off runs.
Either way, they all contain data that changes over time, and will need periodic regeneration in order to remain accurate.