
[XL] Deduplicate code in section-topics, section-image-recs and image-suggestions
Open, Needs Triage, Public

Description

There is a good amount of duplicate code in these 3 repos: section-topics, section-image-recs and image-suggestions.

This is brittle because a change or fix in one place will not automatically carry over to the others; it's all too easy to miss one.
It is also a burden, because more work is required to see a change through in multiple places (where other logic might interfere or require an alternative implementation).

In this task we should:

  1. Compare duplicate code between these repos. See https://phabricator.wikimedia.org/T339120#8931247
  2. For any duplicate(-ish) code, we should come up with a strategy to share it from one source. Some ideas:
  • keep it in one of them, delete it from the others, and copy it into the others at build time.
  • keep it in one of them, delete it from the others, and pull it into the others as a git submodule.
  • have a 'common' repo where these files reside.
  • merge the scripts into a monorepo <- PREFERRED OPTION

Note: there is some overlap with T333699; both should follow a similar resolution.

Event Timeline

supported image extensions

section-image-recs: https://gitlab.wikimedia.org/repos/structured-data/section-image-recs/-/blob/main/imagerec/article_images.py#L46
image-suggestions: https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/blob/main/image_suggestions/unillustratable.py#L68

Both serve a similar purpose: identifying the types of images that we may suggest. The two lists of file extensions currently differ, though.
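
One way to share this would be a single constant plus a helper that both repos import. The sketch below is only an illustration: `SUPPORTED_IMAGE_EXTENSIONS`, `has_supported_extension` and the extension values are assumptions, not what either repo currently uses, and the two existing lists would need to be reconciled first.

```python
from pathlib import PurePosixPath

# Hypothetical single source of truth. These values are placeholders;
# the real lists in section-image-recs and image-suggestions differ.
SUPPORTED_IMAGE_EXTENSIONS = frozenset({"jpg", "jpeg", "png", "svg", "gif"})


def has_supported_extension(file_name: str) -> bool:
    """Return True if the file name ends in one of the shared extensions."""
    suffix = PurePosixPath(file_name).suffix  # ".JPG", or "" when there is none
    return suffix[1:].lower() in SUPPORTED_IMAGE_EXTENSIONS
```

The comparison is case-insensitive so that `Foo.JPG` and `Foo.jpg` are treated alike.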

minimum section character length

section-image-recs: https://gitlab.wikimedia.org/repos/structured-data/section-image-recs/-/blob/main/imagerec/article_images.py#L168
section-topics: https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/blob/main/section_topics/pipeline.py#L279
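
A shared threshold could be as small as the sketch below. The names `MIN_SECTION_CHARS` and `is_long_enough` are hypothetical, and the value 500 is a placeholder rather than the value used in either repo.

```python
# Placeholder threshold; the actual value lives in the two linked files.
MIN_SECTION_CHARS = 500


def is_long_enough(section_text: str, min_chars: int = MIN_SECTION_CHARS) -> bool:
    """Return True if the section, ignoring surrounding whitespace, meets the threshold."""
    return len(section_text.strip()) >= min_chars
```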

naive (wikitext) check for sections with lists/tables

section-image-recs: https://gitlab.wikimedia.org/repos/structured-data/section-image-recs/-/blob/main/imagerec/article_images.py#L154
section-topics: https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/blob/main/section_topics/pipeline.py#L270
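
For illustration, a naive wikitext check of this kind might look like the sketch below. It is not the code from either repo: the function name and the exact markers are assumptions, and it only looks for lines that start a table (`{|`) or a list item (`*` or `#`).

```python
import re

# A line that opens a wikitext table ("{|") or starts a list item ("*" or "#").
_LIST_OR_TABLE_LINE = re.compile(r"^(\{\||[*#])", re.MULTILINE)


def looks_like_list_or_table(section_wikitext: str) -> bool:
    """Naive check: True if any line starts with a table or list marker."""
    return _LIST_OR_TABLE_LINE.search(section_wikitext) is not None
```

Being naive, this misses lists and tables produced by templates and can be fooled by any other line that happens to start with one of these characters.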

full (html) check for sections with tables

section-image-recs: https://gitlab.wikimedia.org/repos/structured-data/section-image-recs/-/blob/main/imagerec/recommendation.py#L526
section-topics: https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/blob/main/section_topics/pipeline.py#L827

Note: the HTML-parsing-based check could perhaps replace the naive wikitext check altogether?
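
As a rough sketch of what an HTML-based table check can look like, the example below uses only Python's standard `html.parser`. The linked implementations may use a different parser and different rules; `html_has_table` and `_TableFinder` are hypothetical names.

```python
from html.parser import HTMLParser


class _TableFinder(HTMLParser):
    """Records whether a <table> start tag was seen."""

    def __init__(self) -> None:
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "table":
            self.found = True


def html_has_table(section_html: str) -> bool:
    """Return True if the rendered section HTML contains a table element."""
    finder = _TableFinder()
    finder.feed(section_html)
    finder.close()
    return finder.found
```

Unlike a wikitext check, this also catches tables that only appear after templates have been expanded, which is the argument for letting it replace the naive variant.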

converting wikitext headings to id/anchor format

section-image-recs: https://gitlab.wikimedia.org/repos/structured-data/section-image-recs/-/blob/main/imagerec/article_images.py#L192
section-topics: https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/blob/main/section_topics/pipeline.py#L131
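
A simplified sketch of this conversion follows, assuming a heading is already isolated on its own line. It only strips the surrounding `=` markers and replaces whitespace with underscores. Real MediaWiki anchor generation does more (for example, handling markup inside the heading), so treat `heading_to_anchor` as a hypothetical helper rather than the logic from either repo.

```python
import re

# "== Title ==" with the same number of "=" on both sides.
_HEADING = re.compile(r"^(=+)\s*(.*?)\s*\1\s*$")


def heading_to_anchor(wikitext_heading: str) -> str:
    """Strip heading markers and join the words with underscores."""
    stripped = wikitext_heading.strip()
    match = _HEADING.match(stripped)
    title = match.group(2) if match else stripped
    return re.sub(r"\s+", "_", title)
```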

further normalizing headings to allow comparison with denylist or section alignment heading format

section-image-recs: https://gitlab.wikimedia.org/repos/structured-data/section-image-recs/-/blob/main/imagerec/recommendation.py#L244
section-topics: https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/blob/main/section_topics/pipeline.py#L204
image-suggestions: https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/blob/main/image_suggestions/unillustratable.py#L258

Note: section-topics has a second variant (https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/blob/main/section_topics/pipeline.py#L161) that applies this transformation in Python rather than in Spark. It might be good to check whether we can refactor that one away as well.
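
If the normalization lived in one plain Python function, all three repos could import the same definition. The sketch below is an assumption about what such a function might do (underscores to spaces, collapsed whitespace, case folding); the actual rules are in the linked files, and `normalize_heading` is a hypothetical name.

```python
def normalize_heading(heading: str) -> str:
    """Normalize a heading or anchor so it can be compared with a denylist entry."""
    # Treat anchor-style underscores as spaces, collapse runs of whitespace,
    # then fold case so the comparison is case-insensitive.
    return " ".join(heading.replace("_", " ").split()).casefold()
```

A pure function like this can be called directly from Python code and could presumably also be wrapped as a Spark UDF, at some performance cost compared with native column expressions.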

excluding denylisted sections

section-topics: https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/blob/main/section_topics/pipeline.py#L343
image-suggestions: https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/blob/main/image_suggestions/unillustratable.py

The implementations differ, but both do the same thing.
Strictly speaking, the one in section-topics is not needed: this is an image-suggestions-specific transformation (read: we only discard these topics because they're in sections not suitable for suggestions, not because they're bad topics), and it is already applied in image-suggestions. The same argument holds for the minimum section character length and the table detection. Removing it from section-topics would come at the cost of a larger section-topics dataset, though.
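
To make the shared behaviour concrete, here is a minimal sketch of a denylist filter over plain Python dicts. The record shape (a `"heading"` key), the function name and the example headings are all assumptions for illustration; the real pipelines operate on Spark DataFrames.

```python
from typing import Dict, Iterable, List


def drop_denylisted(sections: Iterable[Dict], denylist: Iterable[str]) -> List[Dict]:
    """Keep only the sections whose heading is not on the denylist."""
    # Compare on a trimmed, case-folded form so "References" matches " references ".
    blocked = {heading.strip().casefold() for heading in denylist}
    return [
        section
        for section in sections
        if section["heading"].strip().casefold() not in blocked
    ]
```

For example, filtering sections headed "History", " References " and "see also" against the denylist `["References", "See also"]` leaves only the "History" section.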

detecting media in sections
See T331522

MarkTraceur renamed this task from Deduplicate code in section-topics, section-image-recs and image-suggestions to [XL] Deduplicate code in section-topics, section-image-recs and image-suggestions. Jun 14 2023, 4:56 PM