Sections with images still appearing in the section-level image suggestions pipeline
Closed, Resolved, Public

Description

Sections that already have images should not be suggested in the section-level image suggestions pipeline.

However, during manual testing, we discovered that some sections with images still generate suggestions, even though there is code called prune_sections_with_images that is meant to filter them out.

This ticket is to look at the code in prune_sections_with_images, see where it's failing, suggest improvements, and make them (within reason; we know it won't be perfect).

Examples:

Event Timeline

matthiasmullie subscribed.

A discrepancy in page title format (spaces vs. underscores) between two datasets meant that, for all multi-word titles, suggestions for sections with existing images were never pruned.
Fixing this brings the total number of remaining suggestions for the datasets used from 237,406,321 down to 178,212,325, a decrease of about 25%.
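
For illustration, here is a minimal sketch of how a title-format mismatch like this can defeat pruning, and how normalizing titles on both sides avoids it. This is not the pipeline's actual code: normalize_title, prune_suggestions, the record layout, and the sample rows are all hypothetical, and plain Python lists stand in for the real datasets.

```python
def normalize_title(title: str) -> str:
    """Convert a page title to the underscore form so both datasets agree."""
    return title.strip().replace(" ", "_")


def prune_suggestions(suggestions, sections_with_images):
    """Drop suggestions whose (page title, section) already has an image.

    Both inputs are hypothetical in-memory stand-ins for the real datasets.
    """
    # Build the lookup set with normalized titles, so "New York City"
    # and "New_York_City" refer to the same page.
    illustrated = {
        (normalize_title(title), section)
        for title, section in sections_with_images
    }
    return [
        s for s in suggestions
        if (normalize_title(s["page_title"]), s["section"]) not in illustrated
    ]


# Made-up sample data: one dataset uses spaces, the other underscores.
suggestions = [
    {"page_title": "New York City", "section": "History", "image": "A.jpg"},
    {"page_title": "New York City", "section": "Geography", "image": "B.jpg"},
    {"page_title": "Paris", "section": "History", "image": "C.jpg"},
]
sections_with_images = [("New_York_City", "History"), ("Paris", "History")]

remaining = prune_suggestions(suggestions, sections_with_images)
```

Without the normalization step, the single-word title "Paris" would still match, but "New York City" would not match "New_York_City", so its History suggestion would survive pruning. That is the same pattern as the multi-word-title failure described above.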

Reviewed the relevant commit at https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/commit/448a05d400d00426bf9f14910affa4ac3532af83: it looked like an easy fix, so there was no need to branch out and submit a merge request.

I ran the pipeline with the latest section topics dataset (as per this commit, which should close T323505: [L] Exclude sections-tables from having section topics) and got a total of 110 M (109,627,229) suggestions.
This additional decrease may come from the exclusion of tables and lists, which usually contain plenty of links.

Closing.