Allow communities to configure which sections are excluded from link suggestion generation
Closed, Resolved (Public)

Description

In T279519: Add a link: algorithm improvements: Avoid recommending links in sections that usually don't have links, we modified the link recommendation service to accept a sections_to_exclude query parameter (a pipe-delimited list). The parameter takes a list of section titles that the research/mwaddlink app then skips when iterating over the text of an article. The idea is that, for example, the enwiki community could define "References" as a section title that should never have link suggestions, and the link recommendation service will honor that setting.
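As a rough illustration of the parameter format, the sketch below joins section titles with a pipe and URL-encodes the result. Only the sections_to_exclude parameter name and the pipe delimiter come from the description above; the helper name build_exclusion_query is hypothetical, and no service URL is shown because this task does not give one.

```python
from urllib.parse import urlencode


def build_exclusion_query(section_titles):
    """Return a query string with a pipe-delimited sections_to_exclude value.

    Hypothetical helper for illustration; not part of mwaddlink or
    GrowthExperiments.
    """
    for title in section_titles:
        # A title containing "|" cannot be represented in a pipe-delimited list.
        if "|" in title:
            raise ValueError(f"section title contains a pipe: {title!r}")
    return urlencode({"sections_to_exclude": "|".join(section_titles)})


print(build_exclusion_query(["References", "External links"]))
# prints: sections_to_exclude=References%7CExternal+links
```

The pipe is percent-encoded as %7C and the space as +, so the receiving side sees the original pipe-delimited string after decoding.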

In this task, we need to:

Event Timeline

I didn't have a chance to start work on this, so unassigning myself.

Tgr changed the task status from Open to In Progress. Apr 20 2022, 7:00 PM
Tgr claimed this task.

Change 785243 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] Add Link: Add 'excluded sections' task setting

https://gerrit.wikimedia.org/r/785243

Moving to Code Review for now, will return to the optional part.

Change 785822 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] Add Link: Add array validation for excludedSections field

https://gerrit.wikimedia.org/r/785822

I've filed T306792: initWikiConfig should set excludedSections for link-recommendation task type to cover the optional part. There's a follow-up patch here that needs review, but after that this can go to QA.

Change 785243 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Add Link: Add 'excluded sections' task setting

https://gerrit.wikimedia.org/r/785243

Tgr changed the task status from In Progress to Open. Apr 25 2022, 1:54 PM
Tgr moved this task from Code Review to QA on the Growth-Team (Current Sprint) board.
Tgr moved this task from Backlog to Done / QA on the Add-Link board.

Change 785822 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Add Link: Add array validation for excludedSections field

https://gerrit.wikimedia.org/r/785822

Change 785926 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@wmf/1.39.0-wmf.8] Add Link: Add 'excluded sections' task setting

https://gerrit.wikimedia.org/r/785926

Change 785926 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.39.0-wmf.8] Add Link: Add 'excluded sections' task setting

https://gerrit.wikimedia.org/r/785926

Change 785937 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@wmf/1.39.0-wmf.8] Revert "Add Link: Add 'excluded sections' task setting"

https://gerrit.wikimedia.org/r/785937

Change 785937 merged by Urbanecm:

[mediawiki/extensions/GrowthExperiments@wmf/1.39.0-wmf.8] Revert "Add Link: Add 'excluded sections' task setting"

https://gerrit.wikimedia.org/r/785937

Here is a list of the most commonly used section titles to exclude.

I used English as the default, but I also checked French, German, Japanese (by identification) and Spanish (~100 random articles each). Synonyms are separated by a /.

| en | fr | es | de | ja |
| --- | --- | --- | --- | --- |
| References | Références | Referencias | Einzelnachweise | 参考文献 / 参考資料 |
| Notes | Notes | Notas | Anmerkungen | 脚注 / 注釈 |
| Notes and references | Notes et références | - | - | - |
| See also / Further reading | Voir aussi / Voir également | Véase también | Siehe auch | - |
| Bibliography | Bibliographie | - | - | - |
| Weblinks | Webographie | - | Weblinks | WEB |
| External links | Liens externes | Enlaces externos | - | 外部リンク |
| Sources | Sources | Fuentes | Quellen / Literatur | 文献 |
| - | Articles connexes | - | - | 関連項目 |
| - | Autorité | - | - | 出典 |

The list is not exhaustive, and the goal is not to cover every local case, since communities will be able to use Special:EditGrowthConfig to edit the excluded section titles. The table only lists titles that I spotted on at least two wikis.
The terms are mostly plural, but they also appear in their singular form.

Some other sections could be worth excluding, as they are lists of items (books, artworks...):

  • (Selected) Works
  • Œuvres
  • Publications, Publikationen, Veröffentlichungen
  • Galerie (where you only find images using <gallery>)
  • Rundfunkberichte

Some of these titles can be sub-sections of another listed section, in which case the entire parent section should be excluded:

== Voir aussi ==
 === Articles connexes ===
 === Liens externes ===
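To make the nesting concrete, here is a minimal sketch that collects a parent heading together with every heading nested under it, which is the set of titles that would have to be listed to exclude the whole section. The function name titles_to_exclude and the simplified heading regex are assumptions for illustration; this is not how mwaddlink or GrowthExperiments parses wikitext.

```python
import re

# Simplified wikitext heading pattern: the same number of "=" on both sides.
HEADING = re.compile(r"^\s*(={2,6})\s*(.+?)\s*\1\s*$")


def titles_to_exclude(wikitext, parent_title):
    """Return parent_title followed by the titles of all headings nested under it."""
    titles = []
    parent_level = None
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2)
        if parent_level is None:
            # Still looking for the parent heading.
            if title == parent_title:
                parent_level = level
                titles.append(title)
        elif level > parent_level:
            # Deeper heading: part of the parent section.
            titles.append(title)
        else:
            # A heading at the same or a higher level ends the parent section.
            break
    return titles


sample = """== Voir aussi ==
 === Articles connexes ===
 === Liens externes ===
"""
print(titles_to_exclude(sample, "Voir aussi"))
# prints: ['Voir aussi', 'Articles connexes', 'Liens externes']
```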

Thank you @Trizek-WMF! Just noting that the implementation code for the service is pretty basic -- it does an exact match on whatever strings are provided. So if "Reference" is configured, a section titled "Reference" will match but "References" won't, unless it's also set in the configuration.

Also, the sections do not cascade -- you have to manually specify sub-sections if you want them excluded too.
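The behaviour described in these two comments can be modelled roughly as follows. This is a sketch of the semantics only (exact string match, no cascading to sub-sections), not the actual mwaddlink code; the function name sections_to_process and the flat list of titles are assumptions, and case sensitivity is assumed because the comments don't say otherwise.

```python
def sections_to_process(section_titles, excluded):
    """Keep every section whose title is not an exact member of the excluded list."""
    excluded_set = set(excluded)
    # Exact comparison only: no plural handling, and no knowledge of which
    # section is nested under which.
    return [title for title in section_titles if title not in excluded_set]


# Titles in document order; "Liens externes" is a sub-section of "Voir aussi".
sections = ["Histoire", "Reference", "References", "Voir aussi", "Liens externes"]
print(sections_to_process(sections, ["Reference", "Voir aussi"]))
# prints: ['Histoire', 'References', 'Liens externes']
```

"References" and "Liens externes" both survive here, so a configuration would need to list the plural form and each sub-section title explicitly.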

kostajh triaged this task as Medium priority. May 12 2022, 8:58 AM