Context
The main work on article-level (ALIS) and section-level (SLIS) image suggestions is complete:
- T318017: [EPIC] Section-level Image Suggestions Notifications for More Experienced Contributors
- T296814: [EPIC] Article-level image suggestions data pipeline
- T311814: [EPIC] Section-level image suggestions data pipeline
- T311745: [EPIC] Section topics data pipeline
This epic covers the maintenance work needed to keep image suggestions stable and of good quality.
There are 3 areas, each with relevant tickets, listed in order of priority.
Essential immediately [FY23/24]
Goals
- to let scripts run without constant manual intervention
- to prevent pipelines from breaking
- to prevent regressions in image suggestion quality that would affect stakeholders and user behavior
Tickets
- T339146: [M] Fix detect_html_tables.py execution error on incomplete article data
- T325316: [XL] Productionize section alignment model training
- T338013: [L] Create search index deltas by comparing to `discovery.cirrus_index_without_content` in hive
- T347566: [M] Send an alert in case of no ALIS or SLIS
- T347569: [L] Block search indices update and Cassandra tasks in case of no ALIS or SLIS data
- T339129: [Spike] Periodically regenerate various variable data sets/files
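As an illustration of what T347566 and T347569 ask for, here is a minimal sketch of a guard that sends an alert and then fails when a run produces no ALIS or SLIS rows, so that the search index update and Cassandra tasks never start. It is not the pipelines' actual code: `NoSuggestionsError`, `find_empty_datasets`, `guard_downstream`, the `alis`/`slis` keys and the `send_alert` callback are all hypothetical names, and the row counts are assumed to be computed upstream.

```python
class NoSuggestionsError(RuntimeError):
    """Raised when a pipeline run produced no ALIS or no SLIS rows (hypothetical)."""


def find_empty_datasets(row_counts):
    """Return the sorted names of datasets whose row count is zero or missing."""
    return sorted(name for name, count in row_counts.items() if not count)


def guard_downstream(row_counts, send_alert):
    """Alert and raise if any dataset is empty, so downstream tasks do not run.

    row_counts: mapping of dataset name to row count, e.g. {"alis": 1200, "slis": 0}
    send_alert: any callable taking a message string (assumed; e.g. an email hook)
    """
    empty = find_empty_datasets(row_counts)
    if empty:
        message = "No data produced for: " + ", ".join(empty)
        send_alert(message)
        # Failing here is what blocks the dependent tasks in the scheduler.
        raise NoSuggestionsError(message)


if __name__ == "__main__":
    alerts = []
    try:
        guard_downstream({"alis": 1200, "slis": 0}, alerts.append)
    except NoSuggestionsError as error:
        print("blocked:", error)
```

In a scheduler, a check like this would typically run as its own task between data generation and the downstream tasks, so that its failure stops everything that depends on it.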
Essential longer term
SLO & SLA [FY23/24]
- Ensure that image suggestions can be used as a product for longer than a year: T338949: [L] Define SLOs/SLAs for image-suggestions pipelines
- T374434: [M] Decouple ALIS from SLIS
Code health [FY24/25]
All tickets are intertwined and require high effort:
- T339120: [XL] Deduplicate code in section-topics, section-image-recs and image-suggestions
- T333699: [XL] Unify files that are duplicated between section-topics, section-image-recs and image-suggestions
- T331968: [XL] Let the model that learns section alignments consume section topics output
- T331522: [XL] Let section alignment consume section topics output
Infrastructure health [FY24/25]
- T347561: [Maintenance] Set up deletion jobs for Structured Data's data pipelines - depends on Data Engineering’s T262201: Gather all data-purge into a single job, which is unlikely to happen anytime soon. This affects the cluster, not the stability of the data pipelines
- T350012: Schedule all data pipeline DAGs on Thursdays - tiny effort
Nice to have
Low priority; we will most likely not do these unless resource usage becomes a problem.
- Improve data pipeline efficiency
- Data cleanliness