
Unify files that are duplicated between section-topics, section-image-recs and image-suggestions
Open, Needs Triage, Public

Description

We now have duplicate data between section-topics, section-image-recs and image-suggestions.

In this task we should:

  1. Compare data between these repos. See https://phabricator.wikimedia.org/T333699#8931094
  • wikipedias.txt (section-topics) & wikipedias.json (section-image-recs)
  • section_titles_denylist.json (image-suggestions & section-topics)
  2. For any identical data, come up with a strategy to share it from a single source. Some ideas:
  • keep it in one repo, delete it from the other, and copy it over at compile time.
  • keep it in one repo, delete it from the other, and pull it in as a git submodule.
  • have a 'common' repo where these files reside.
  • merge the scripts into a monorepo <- PREFERRED OPTION
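
As a rough illustration of the comparison step for the two copies of section_titles_denylist.json, the contents can be compared after parsing, so that whitespace or indentation differences do not count as divergence. This is only a sketch: the helper name `same_json_content` is hypothetical, and nothing is assumed about the internal structure of the denylist beyond it being valid JSON.

```python
import json


def same_json_content(text_a: str, text_b: str) -> bool:
    """Return True if both strings parse to equal JSON values.

    Formatting (indentation, spacing) and object key order are ignored;
    the order of elements inside JSON arrays still matters.
    """
    return json.loads(text_a) == json.loads(text_b)
```

In practice the two strings would be read from the checked-out copies of the file in each repo, e.g. with `pathlib.Path(...).read_text(encoding="utf-8")`. If array order is not meaningful for the denylist, the parsed values would need to be normalized (for instance sorted) before comparing.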

Note: there is some overlap with T339120; both should follow a similar resolution.

Event Timeline

List of wikipedias

section-topics: https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/blob/main/section_topics/data/wikipedias.txt
section-image-recs: https://gitlab.wikimedia.org/repos/structured-data/section-image-recs/-/blob/main/imagerec/data/wikipedias.json

Not only is this data duplicated in two places, it is also stored in different formats, which makes divergence hard to spot.
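
One way to spot divergence despite the differing formats is to normalize both lists into sets and diff them. The sketch below rests on assumptions that are not confirmed here: that wikipedias.txt holds one wiki identifier per line and that wikipedias.json is a flat JSON array of strings. All function names are hypothetical.

```python
import json


def wikis_from_txt(text: str) -> set[str]:
    # Assumed format: one wiki identifier per line; blank lines are ignored.
    return {line.strip() for line in text.splitlines() if line.strip()}


def wikis_from_json(text: str) -> set[str]:
    # Assumed format: a flat JSON array of strings.
    return set(json.loads(text))


def divergence(txt_text: str, json_text: str) -> tuple[list[str], list[str]]:
    """Return (entries only in the .txt list, entries only in the .json list)."""
    from_txt = wikis_from_txt(txt_text)
    from_json = wikis_from_json(json_text)
    return sorted(from_txt - from_json), sorted(from_json - from_txt)
```

Two empty lists would mean the files describe the same set of wikis. If the real JSON file has a different shape (for example an object keyed by wiki), `wikis_from_json` would need adjusting.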

Section denylist

image-suggestions: https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/blob/main/image_suggestions/data/section_titles_denylist.json
section-topics: https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/blob/main/section_topics/data/section_titles_denylist.json

Note: it is a little odd that section-topics pre-filters denylisted sections (thus not generating suggestions for these) while alignment does not (those are only filtered out later, in image-suggestions).

matthiasmullie renamed this task from Unify files that are duplicated between image_suggestions and section_topics to Unify files that are duplicated between section-topics, section-image-recs and image-suggestions. Jun 14 2023, 11:50 AM
matthiasmullie updated the task description.
matthiasmullie updated the task description.

As part of T339129, I believe I'll end up deduplicating section_titles_denylist.json.