initWikiConfig should set excludedSections for link-recommendation task type
Closed, Resolved, Public

Description

Research has a database (https://analytics.wikimedia.org/published/datasets/one-off/section_alignment/enwiki/) containing section names for various languages, e.g. "References" or "Citations". We could potentially use that to pre-populate the excludedSections field for the link-recommendation task type.


Since initWikiConfig only works when deploying GrowthExperiments for the first time, and most wikis already have some Growth features, changeWikiConfig.php was updated instead to be able to write configuration sub-fields.

Event Timeline

Note that this script is probably going to be executed very rarely these days, since Growth features are on all Wikipedias as of today (and the script doesn't do anything if the config already exists).

Tgr moved this task from Incoming to In Progress on the Growth-Team (Sprint 0 (Growth Team)) board.
Tgr subscribed.

My plan for this is to extend changeWikiConfig.php to handle variable names like link-recommendation.excludedSections and then script it to add the sections to the config pages.

The Research docs aren't self-explanatory - do we take the enwiki-XXwiki file and then map the relevant sections to the highest-ranking matches? Or do we exclude all of them?

Yes, I think what you'd do is start with a list of section names we want from enwiki, then look up the entries for the relevant languages, e.g. WHERE source = "references" AND target_language = "elwiki"
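
The lookup described above can be sketched in Python with the standard sqlite3 module. The table and column names (titles, source_language, source, target_language, target, probability) follow the sqlite3 command used later in this thread; the rows are invented for the example and are not taken from the Research dataset.

```python
import sqlite3

# In-memory stand-in for the aligned-sections database (invented rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE titles (source_language TEXT, source TEXT, "
    "target_language TEXT, target TEXT, probability REAL)"
)
conn.executemany(
    "INSERT INTO titles VALUES (?, ?, ?, ?, ?)",
    [
        ("enwiki", "references", "simplewiki", "references", 0.99),
        ("enwiki", "references", "simplewiki", "roles", 0.10),
        ("enwiki", "see also", "simplewiki", "related pages", 0.84),
        ("enwiki", "history", "simplewiki", "history", 0.95),
    ],
)

# English section names we want to exclude, lowercased like the data.
wanted = ["references", "see also"]
placeholders = ", ".join("?" for _ in wanted)
query = (
    "SELECT target, probability FROM titles "
    "WHERE source_language = 'enwiki' AND target_language = ? "
    f"AND lower(source) IN ({placeholders}) "
    "ORDER BY target"
)
matches = conn.execute(query, ["simplewiki", *wanted]).fetchall()
for target, probability in matches:
    print(target, probability)
```

Every candidate translation of a wanted section comes back, including low-probability ones, so some cutoff on probability is still needed afterwards.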

Unfortunately, it looks like all of that data is lowercased, and IIRC the linkrecommendation code is case sensitive, so we need a patch to research/mwaddlink to have it do a case insensitive check on the section title.
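
A minimal sketch of such a case-insensitive check, in Python since mwaddlink is a Python project. This is not the code from the patch; is_excluded_section is a hypothetical helper name, and the sketch only assumes that the configured names and the section title are plain strings.

```python
from typing import Iterable


def is_excluded_section(section_title: str, excluded_sections: Iterable[str]) -> bool:
    """Return True if section_title matches a configured name, ignoring case."""
    # casefold() is a more aggressive lower() intended for caseless comparison.
    excluded = {name.strip().casefold() for name in excluded_sections}
    return section_title.strip().casefold() in excluded


configured = ["references", "see also"]  # lowercased, like the Research data
print(is_excluded_section("References", configured))
print(is_excluded_section("History", configured))
```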

Change 787543 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[research/mwaddlink@main] Make section exclusion case insensitive

https://gerrit.wikimedia.org/r/787543

Change 787416 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] Community configuration: Allow writing sub-fields programmatically

https://gerrit.wikimedia.org/r/787416

With the patch, something like this can be used to set the sections:

mwscript extensions/GrowthExperiments/maintenance/changeWikiConfig.php $WIKI --page MediaWiki:NewcomerTasks.json --json link-recommendation.excludedSections '["foo","bar","baz"]'
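
For illustration, the effect of a dotted name such as link-recommendation.excludedSections on the JSON config can be sketched in Python. This is not the changeWikiConfig.php implementation: set_subfield is a hypothetical helper, and otherSetting is a placeholder key standing in for whatever the task type config already contains.

```python
import copy
import json


def set_subfield(config: dict, dotted_name: str, value) -> dict:
    """Return a copy of config with the nested field named by dotted_name set."""
    result = copy.deepcopy(config)
    *parents, leaf = dotted_name.split(".")
    node = result
    for key in parents:
        # Assumption: the parent field must already exist and be an object.
        if not isinstance(node.get(key), dict):
            raise KeyError(f"parent field {key!r} does not exist")
        node = node[key]
    node[leaf] = value
    return result


config = {"link-recommendation": {"otherSetting": "kept"}}
updated = set_subfield(
    config, "link-recommendation.excludedSections", ["foo", "bar", "baz"]
)
print(json.dumps(updated))
```

Only the named sub-field changes; sibling keys under link-recommendation are left as they were.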

A simple script for processing the CSV files:

for file in enwiki-*; do wiki=${file:7:6}; csvgrep enwiki-$wiki-11-02-22.csv -c source -r 'Notes|Notes and references|References|Sources|External links|Weblinks|See also|Further reading|Bibliography' | csvcut -c target | tail -n+2 | sort | uniq | jq --raw-input --null-input --compact-output '[inputs]' > excludedSections-$wiki.json; done

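(Uses csvkit and jq; tail -n+2 drops the CSV header row. --null-input is needed because without it jq consumes the first line as its regular input, so inputs would only yield the remaining lines.)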
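
The same processing step can be sketched in Python with only the standard library, for readers without csvkit and jq. The source and target column names come from the command above; excluded_sections is a hypothetical helper and the sample rows are placeholders, not real data. Unlike the regular expression above, this sketch matches the whole section name and ignores case.

```python
import csv
import io
import json

# English section names to look up, lowercased for comparison.
WANTED_SOURCES = {
    "notes", "notes and references", "references", "sources",
    "external links", "weblinks", "see also", "further reading",
    "bibliography",
}


def excluded_sections(csv_file) -> list:
    """Collect the distinct target names whose source is a wanted section."""
    reader = csv.DictReader(csv_file)  # consumes the header row
    targets = {
        row["target"]
        for row in reader
        if row["source"].strip().lower() in WANTED_SOURCES
    }
    return sorted(targets)


# Placeholder rows standing in for one enwiki-XXwiki CSV file.
sample = io.StringIO(
    "source,target\n"
    "references,target-b\n"
    "see also,target-a\n"
    "history,target-c\n"
    "references,target-b\n"
)
sections = excluded_sections(sample)
# Compact JSON array, suitable as the value passed via --json.
print(json.dumps(sections, ensure_ascii=False, separators=(",", ":")))
```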

Once the patch is merged, the update can be done with a command like

for WIKI in bnwiki cawiki elwiki fiwiki guwiki hewiki idwiki mlwiki ruwiki; do mwscript extensions/GrowthExperiments/maintenance/changeWikiConfig.php $WIKI --page MediaWiki:NewcomerTasks.json --json link-recommendation.excludedSections "`cat excludedSections-$WIKI.json`"; done

(This assumes the tasktype config already has a link-recommendation field; for some wikis, it might have to be created first.)

Note that the list of wikis for which we have section maps (bnwiki cawiki elwiki fiwiki guwiki hewiki idwiki mlwiki ruwiki) and the list of AddLink tier 3 wikis (cawiki hewiki hiwiki kowiki nowiki ptwiki simplewiki svwiki ukwiki) are fairly different: only cawiki and hewiki are present in both.

Here is a small script for extracting the sections from enwiki_aligned_sections_2022-02.sqlite.gz (note this is a 1.7G file, 4.6G when unpacked):

sqlite3 enwiki_aligned_sections_2022-02.sqlite "SELECT json_object('wiki', target_language, 'section', target, 'probability', probability) FROM titles WHERE source_language = 'enwiki' AND lower(source) IN ('notes', 'notes and references', 'references', 'sources', 'external links', 'weblinks', 'see also', 'further reading', 'bibliography') ORDER BY target_language, source, target;" > wiki_sections.jsonl

# using a somewhat arbitrary .25 probability cutoff
for WIKI in cawiki hewiki hiwiki kowiki nowiki ptwiki simplewiki svwiki ukwiki; do jq "select(.wiki==\"$WIKI\" and .probability > 0.25) | .section" wiki_sections.jsonl | jq --slurp --compact-output unique | mwscript extensions/GrowthExperiments/maintenance/changeWikiConfig.php $WIKI --page MediaWiki:NewcomerTasks.json --json link-recommendation.excludedSections "`cat`"; done

Here's the wiki_sections.jsonl file for convenience: F35092312

It does seem to be unreliable though. This is the section list for simplewiki:

additional reading
albums
ancestors
ancient sources
awards
awards and honours
awards and records
bases
bibliography
biography
births
book
book writing
books
branches
camera shutters
causes
channel links
citations
connected counties
content
death
description
details
discography
education
events
examples
external link
external links
external links[edit
external references
external sources
external videos
famous people
filmography
footnotes
further information
further reading
further readings
gallery
geography
history
international links
international relations
later life
legacy
life and career
list
list of books
literature
literature and sources
memorials
more information
more reading
more readings
note
notes
notes and references
observances
official links
origin
origins
other
other books
other page
other pages
other reading
other sources
other uses
other website
other websites
overview
publications
quotes
recordings
records
reference
references
references and notes
related pages
resources
results and standings
reviews
roles
sales
see
see also
selected bibliography
sequel
sequels
services and connections
sights and attractions
sites
song
source
sources
statistics
story
threats
titles and styles
uses
weblinks
websites
works
writings

We probably wouldn't want to exclude most of these (Books, Albums, Filmography etc).

I think we would want to use the probability score to remove results with a score of less than .9, can you try that to see how it looks?

Change 787543 merged by jenkins-bot:

[research/mwaddlink@main] Make section exclusion case insensitive

https://gerrit.wikimedia.org/r/787543

Change 788404 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] linkrecommendation: Bump version

https://gerrit.wikimedia.org/r/788404

Change 788404 merged by jenkins-bot:

[operations/deployment-charts@master] linkrecommendation: Bump version

https://gerrit.wikimedia.org/r/788404

How good are the lists of excluded items? Is there any need to set up a list manually?

Here are the probabilities for simplewiki, filtering out the ones below 0.1:

cat wiki_sections.jsonl | jq "select(.wiki==\"simplewiki\" and .probability > 0.1)" | jq --slurp "sort_by(\"\(.section)|\(.probability)\") | .[]" | jq '{key: .section, value: (.probability *100.0 + 0.5 | floor / 100.0 )}' | jq --slurp 'from_entries'

{
  "additional reading": 0.25,
  "awards and records": 0.32,
  "bibliography": 0.99,
  "biography": 0.41,
  "books": 0.8,
  "book": 0.15,
  "discography": 0.67,
  "external links": 0.38,
  "filmography": 0.34,
  "footnotes": 0.65,
  "further reading": 0.98,
  "list of books": 0.16,
  "literature": 0.2,
  "memorials": 0.16,
  "more readings": 0.63,
  "more reading": 0.94,
  "notes and references": 0.97,
  "notes": 0.99,
  "note": 0.29,
  "origins": 0.53,
  "origin": 0.4,
  "other reading": 0.48,
  "other sources": 0.33,
  "other websites": 0.88,
  "other": 0.34,
  "quotes": 0.11,
  "recordings": 0.21,
  "records": 0.39,
  "references and notes": 0.2,
  "references": 0.99,
  "reference": 0.71,
  "related pages": 0.84,
  "resources": 0.34,
  "roles": 0.1,
  "see also": 0.57,
  "sources": 0.99,
  "source": 0.63,
  "threats": 0.15,
  "writings": 0.27
}

Or, sorted by probability:

cat wiki_sections.jsonl | jq "select(.wiki==\"simplewiki\" and .probability > 0.1)" | jq --slurp "sort_by(\"\(.section)|\(.probability)\") | .[]" | jq '{key: .section, value: (.probability *100.0 + 0.5 | floor / 100.0 )}' | jq --slurp 'from_entries | to_entries | sort_by(.value) | reverse | from_entries'

{
  "sources": 0.99,
  "references": 0.99,
  "notes": 0.99,
  "bibliography": 0.99,
  "further reading": 0.98,
  "notes and references": 0.97,
  "more reading": 0.94,
  "other websites": 0.88,
  "related pages": 0.84,
  "books": 0.8,
  "reference": 0.71,
  "discography": 0.67,
  "footnotes": 0.65,
  "source": 0.63,
  "more readings": 0.63,
  "see also": 0.57,
  "origins": 0.53,
  "other reading": 0.48,
  "biography": 0.41,
  "origin": 0.4,
  "records": 0.39,
  "external links": 0.38,
  "resources": 0.34,
  "other": 0.34,
  "filmography": 0.34,
  "other sources": 0.33,
  "awards and records": 0.32,
  "note": 0.29,
  "writings": 0.27,
  "additional reading": 0.25,
  "recordings": 0.21,
  "references and notes": 0.2,
  "literature": 0.2,
  "memorials": 0.16,
  "list of books": 0.16,
  "threats": 0.15,
  "book": 0.15,
  "quotes": 0.11,
  "roles": 0.1
}

Looking at this list, 0.25 seems like a good cutoff: it includes all the sections that should not be linked. It also includes a number of sections we have no reason to exclude, but that is the safer direction to err in: tasks may contain fewer links than would otherwise be possible, but all the links will be good. It also means the community can adjust the configuration without having to worry about how to apply the change retroactively to existing tasks.
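
To make the trade-off concrete, here is a small Python sketch comparing the 0.25 cutoff with the stricter 0.9 suggested earlier. The probabilities are a subset of the rounded simplewiki values listed above; sections_above is a hypothetical helper name.

```python
# Subset of the rounded simplewiki probabilities from the listing above.
probabilities = {
    "sources": 0.99,
    "references": 0.99,
    "notes": 0.99,
    "bibliography": 0.99,
    "further reading": 0.98,
    "other websites": 0.88,
    "related pages": 0.84,
    "see also": 0.57,
    "external links": 0.38,
    "filmography": 0.34,
    "roles": 0.1,
}


def sections_above(scores: dict, cutoff: float) -> list:
    """Return the section names whose probability exceeds the cutoff."""
    return sorted(name for name, p in scores.items() if p > cutoff)


strict = sections_above(probabilities, 0.9)
lenient = sections_above(probabilities, 0.25)

print("excluded at 0.9: ", strict)
print("excluded at 0.25:", lenient)
print("missed at 0.9:   ", sorted(set(lenient) - set(strict)))
```

With the stricter cutoff, sections such as "see also", "external links" and "other websites" would stay open to link recommendations, while the 0.25 cutoff also sweeps in "filmography".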

Here is the list with 0.25 probability threshold:

  • cawiki: altres lectures - anotacions - bibliografia - bibliografia addicional - bibliografia complementària - bibliografia i referències - bibliografía - biografia - cites - discografia - enllaços - enllaços externs - fons - fonts - fonts i referències - fuentes - literatura - llibres - nota - notes - notes i referències - observacions - origen - orígens - per a més informació - per llegir més - poblacions properes - premis i nominacions - references - referències - referències i notes - vegeu també
  • hewiki: אזכורים - ביאורים - ביבליוגרפיה - ביוגרפיה - הערות - הערות שוליים - לקריאה נוספת - מקור - מקורות - ספריה - ספריו - ספרים - קישורים חיצוניים - ראו גם - תצפית
  • hiwiki: external links - further reading - notes - references - see also - sources - अग्रिम पठन - अतिरिक्त पठन - अभिलेख - आगे की पढ़ाई - आगे की पढाई - आगे पढ़ने - आगे पढ़े - आगे पढ़ें - आगे पढें - इन्हें भी देखें - उद्गम - और पढ़ें - और पढें - ग्रंथ सूची - ग्रंथसूची - ग्रन्थ सूची - ग्रन्थसूची - जीवनी - टिप्पणियाँ - टिप्पणियाँ एवं सन्दर्भ - टिप्पणियाँ और सन्दर्भ - टिप्पणियां - टिप्पणियां और संदर्भ - टिप्पणी - टिप्पणी और संदर्भ - देखें - नोट - नोट्स - नोट्स और संदर्भ - नोट्स और सन्दर्भ - पुस्तक सूची - पुस्तकें - बाहरी कड़ियाँ - बाहरी कड़ियां - बाहरी कडियाँ - बाहरी कडियां - बाहरी लिंक - बाहरी संबंध - भी देखें - यह भी देखिए - यह भी देखिये - यह भी देखें - यह सभी देखें - ये भी देखें - विस्तृत पठन - श्रोत - संदर्भ - संदर्भ ग्रंथ - सन्दर्भ - सन्दर्भ ग्रन्थ - सूत्रों - स्त्रोत - स्रोत - स्रोत्र
  • kowiki: 각주 및 인용 - 각주 및 참고 문헌 - 각주 및 참고 자료 - 각주 및 참고문헌 - 각주와 참고자료 - 같이 보기 - 관련 서적 - 관련 항목 - 근원 - 기록 - 기록들 - 기원 - 노트 - 더 보기 - 더 읽기 - 더 읽어 보기 - 더 읽어보기 - 더 읽을거리 - 도서 - 메모 - 문헌 - 발생원인 - 서적 - 외부 링크 - 외부링크 - 원천 - 원천 자료 - 유래 - 읽어보기 - 자료 - 저서 - 저작 - 참고 - 참고 문헌 - 참고 및 참조 - 참고 사항 - 참고 서적 - 참고 자료 - 참고문헌 - 참고사항 - 참고자료 - 참조 - 추가 문헌 - 추가 읽기 - 추가 자료 - 추가읽기 - 출전 및 참고 자료 - 출전, 주해 및 참고 자료 - 출처 - 출처 자료
  • nowiki: bibliografi - bibliografi (utvalg) - bibliografi i utvalg - biografi - bøker - diskografi - eksterne lenker - fotnoter - fotnoter og referanser - kilde - kildene - kilder - litteratur - merknader - noter - noter og referanser - opphav - opprinnelse - publikasjoner - referanser - referanser og fotnoter - referanser og noter - se også - videre lesing - videre lesning
  • ptwiki: antecedentes - artigo & referências - bibiliografia - bibliografa - bibliografia - bibliografia e referências - bibliografias - bibliografía - bibliography - biblioteca - biografia - causas - citações - críticas - discografia - fonte - fontes - fontes e referências - further reading - indicações - leia mais - leia também - leitura adicional - leitura complementar - leitura posterior - leitura recomendada - leituras adicionais - ler mais - letra - ligação externa - ligações externas - links - links externos - literatura - livros - livros publicados - nota - notas - notas e referencias - notas e referências - notação - notes - notes and references - origem - origens - outras fontes - outras leituras - prêmios e indicações - publicações - recordes - references - referencias - referência - referências - referências bibliográficas - relações familiares - sources - veja também - ver também
  • simplewiki: additional reading - awards and records - bibliography - biography - books - discography - external links - filmography - footnotes - further reading - more reading - more readings - note - notes - notes and references - origin - origins - other - other reading - other sources - other websites - records - reference - references - related pages - resources - see also - source - sources - writings
  • svwiki: anmärkningar - anmärkningslista - att notera - bibliografi - bibliografi (i urval) - bibliografi (urval) - bibliografi (utgivet på svenska) - bibliografi i urval - bibliografier - bibliography - biografi - böcker - diskografi - extern länk - externa länkar - fortsatt läsning - fotnoter - geografi - kommentarer - kommunikationer - källa - källhänvisningar - källor - källor och referenser - litteratur - noter - noter och referenser - noteringar - referenser - se även - tryckta källor - ursprung - vidare läsning - weblinks - ytterligare läsning
  • ukwiki: bibliography - external links - notes - references - sources - бібліографія - біографія - вебпосилання - види - витоки - відзнаки - джерела - джерела та література - джерела та посилання - джерела і посилання - джерела інформації - джерело - див. також - див.також - дивись також - дивитися також - додаткова література - життєпис - записи - зауваги - зауваження - зноски - зовнішні посилання - книги - коментарі - література - література та джерела - нотатки - пам'ять - подальше читання - посилання - посилання на джерела - походження - примітка - примітки - примітки та джерела - примітки та посилання - притоки - ресурси інтернету - родина - родовід - см. також - список літератури - також

@Trizek-WMF what do you think? Can it be applied like this?

@Tgr, I think 0.25 is acceptable. I will suggest that communities review the lists and remove the false positives.

Mentioned in SAL (#wikimedia-operations) [2022-05-04T21:11:05Z] <tgr> running extensions/GrowthExperiments/maintenance/changeWikiConfig.php for T306792

Change 787416 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Community configuration: Allow writing sub-fields programmatically

https://gerrit.wikimedia.org/r/787416

Change 789185 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@wmf/1.39.0-wmf.9] Community configuration: Allow writing sub-fields programmatically

https://gerrit.wikimedia.org/r/789185

Change 789326 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@wmf/1.39.0-wmf.10] Community configuration: Allow writing sub-fields programmatically

https://gerrit.wikimedia.org/r/789326

Change 789326 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.39.0-wmf.10] Community configuration: Allow writing sub-fields programmatically

https://gerrit.wikimedia.org/r/789326

Mentioned in SAL (#wikimedia-operations) [2022-05-05T07:34:44Z] <tgr@deploy1002> Synchronized php-1.39.0-wmf.10/extensions/GrowthExperiments: Backport: [[gerrit:789326|Community configuration: Allow writing sub-fields programmatically (T306792)]] (duration: 00m 54s)

Change 789185 merged by Gergő Tisza:

[mediawiki/extensions/GrowthExperiments@wmf/1.39.0-wmf.9] Community configuration: Allow writing sub-fields programmatically

https://gerrit.wikimedia.org/r/789185

Mentioned in SAL (#wikimedia-operations) [2022-05-05T07:38:46Z] <tgr@deploy1002> Synchronized php-1.39.0-wmf.9/extensions/GrowthExperiments: Backport: [[gerrit:789185|Community configuration: Allow writing sub-fields programmatically (T306792)]] (duration: 00m 52s)

Mentioned in SAL (#wikimedia-operations) [2022-05-05T07:39:02Z] <tgr> running extensions/GrowthExperiments/maintenance/changeWikiConfig.php for T306792

Configuration has been updated with machine-generated section exclusion data. The exact command was

for WIKI in cawiki hewiki hiwiki kowiki nowiki ptwiki simplewiki svwiki ukwiki; do jq "select(.wiki==\"$WIKI\" and .probability > 0.25) | .section" wiki_sections.jsonl | jq --slurp --compact-output unique | mwscript extensions/GrowthExperiments/maintenance/changeWikiConfig.php $WIKI --page MediaWiki:NewcomerTasks.json --json link-recommendation.excludedSections --summary 'machine-generated configuration for excluding sections from link recommendations ([[phab:T306792]]), feel free to improve' "`cat`"; read; done

It was a bit cumbersome because most of those wikis did not have a link-recommendation field in the task configuration yet, so that had to be added manually; that's not going to scale to more wikis. I suppose instead of writing link-recommendation.excludedSections, we can just write the whole link-recommendation field - that will overwrite any existing details though. I think we need a "write field if it doesn't exist" option for changeWikiConfig.php.
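
One possible shape for such an option, sketched in Python rather than in the PHP of changeWikiConfig.php. write_subfield and create_missing are hypothetical names, and otherSetting is a placeholder key; the sketch creates a missing parent field but leaves an existing one, and its other details, untouched.

```python
import copy


def write_subfield(config: dict, dotted_name: str, value, create_missing: bool = False) -> dict:
    """Return a copy of config with the nested field set.

    Missing parent fields are created only when create_missing is True;
    existing parents are never replaced, so their other keys survive.
    """
    result = copy.deepcopy(config)
    *parents, leaf = dotted_name.split(".")
    node = result
    for key in parents:
        if key not in node:
            if not create_missing:
                raise KeyError(f"parent field {key!r} does not exist")
            node[key] = {}
        node = node[key]
    node[leaf] = value
    return result


name = "link-recommendation.excludedSections"

# Wiki without a link-recommendation field yet: the parent is created.
created = write_subfield({}, name, ["foo"], create_missing=True)

# Wiki with existing details: only the sub-field is added.
existing = {"link-recommendation": {"otherSetting": "kept"}}
kept = write_subfield(existing, name, ["foo"], create_missing=True)

print(created)
print(kept)
```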

See also T307496: Community configuration descriptions should warn when changes aren't immediate.

kostajh triaged this task as Medium priority. May 12 2022, 8:58 AM

Growth is in the process of scaling the Add a link feature to more wikis, and we're finding wikis that are missing from the wiki_sections.jsonl file (F35092312), for instance novwiki, nrmwiki and nvwiki; see T308138. How can I add more wikis to the file (or maybe to a separate file, if size is a concern)? Is the existing file the result of executing the sqlite query (T306792#7897336) with different wikis? cc @Tgr Thanks!

The file is the result of the command as written, and should contain all wikis (my memory is vague but I think each xxwiki_aligned_sections_2022-02.sqlite.gz file contained section pairs between xxwiki and every other wiki); maybe the source dataset just wasn't able to align any sections on some wikis?

With SDAW we now have a properly productized mechanism for section alignment, so maybe it's time to rethink what data to use.