
Recommendation API: resolve interlanguage conflicts
Open, Stalled, High, Public

Description

Problem 1: linking Q1216998 (en:Neoplasm) with Q133212 (de:Tumor)

en:Neoplasm used to have a link to de:Neoplasma, which redirects to de:Tumor. From that link we could infer that Q1216998 (en:Neoplasm) and Q133212 (de:Tumor) describe the same concept. However, editor-added interlanguage links can no longer be used as suggested in the paper, because links between articles are now generated by Wikidata only. See T203041#4641969 for more info. How do we make sure that we're not recommending an existing article for creation if it is represented differently in Wikidata?

Problem 2: linking en:Cherry with de:Vogel-Kirsche

The fruit cherry has its own article on enwiki, but on dewiki it is covered as part of another article. How do we make sure that we don't recommend "Cherry" for creation to dewiki users?

Possible solutions are being drafted here.

Event Timeline

bmansurov triaged this task as High priority. Oct 18 2018, 6:29 PM
bmansurov created this task.
bmansurov updated the task description. (Show Details)
leila moved this task from Staged to In Progress on the Research board.
leila added subscribers: diego, Cervisiarius.
bmansurov moved this task from In Progress to Staged on the Research board. Mar 4 2019, 2:28 PM
leila edited projects, added Research-Backlog; removed Research. Jul 11 2019, 3:57 PM
Isaac added a subscriber: Isaac. Feb 17 2020, 4:16 PM

Just a note if/when we return to this, and apologies if this was already discussed but decided against: the editor-provided interlanguage links that override Wikidata are still part of articles and are extracted (alongside the Wikidata-maintained interlanguage links) in language-specific dump files. For English 2020-02-01, the file is named enwiki-20200201-langlinks.sql.gz. If you search it for the page ID of Neoplasm (1236730) as follows, you get (after some cleaning up of the output from my crude grep query) all of the Wikidata AND user-provided language links (in this case, German) out of the English article:

grep -o '(1236730,........................' ~/Downloads/enwiki-20200201-langlinks.sql
(1236730,'af','Gewas')
(1236730,'ar','ورم')
(1236730,'bn','নিওপ্লাজম')
(1236730,'ca','Neoplàsia')
(1236730,'cy','Tiwmor')
(1236730,'da','Neoplasi')
(1236730,'de','Neoplasie')
(1236730,'el','Νεόπλασμα')
(1236730,'es','Neoplasia')
(1236730,'et','Kasvaja')
(1236730,'eu','Neoplasia')
(1236730,'fa','نئوپلاسم')
(1236730,'fr','Néoplasie')
(1236730,'ga','Sceachaill')
(1236730,'gl','Neoplasia')
(1236730,'he','נאופלזיה')
(1236730,'hi','फुलाव')
(1236730,'hu','Neoplasia')
(1236730,'id','Neoplasma')
(1236730,'it','Neoplasia')
(1236730,'kn','ಗಂತಿ')
(1236730,'ko','신생물')
(1236730,'la','Neoplasma')
(1236730,'lt','Neoplazma')
(1236730,'ms','Neoplasma')
(1236730,'nl','Neoplasie')
(1236730,'nn','Neoplasi')
(1236730,'no','Neoplasi')
(1236730,'pl','Nowotwór')
(1236730,'pt','Neoplasma')
(1236730,'ro','Neoplasm')
(1236730,'sh','Neoplazma')
(1236730,'sr','Неоплазма')
(1236730,'sv','Neoplasi')
(1236730,'ta','கட்டி (உயிரியல்)')
(1236730,'th','เนื้องอก')
(1236730,'ur','نُفّاخ')
(1236730,'vi','Khối u')
(1236730,'zh','贅生物')
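
The fixed-width grep pattern above needs manual cleanup afterwards. A minimal Python sketch of the same lookup follows, assuming each row in the dump has the three-column form shown in the output (page ID, language code, title) and that quotes inside titles are backslash-escaped. The names ROW_RE and langlinks_for_page are hypothetical helpers, not part of any existing tool.

```python
import re

# One (page_id,'lang','title') tuple; titles may contain backslash-escaped characters.
ROW_RE = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)','((?:[^'\\]|\\.)*)'\)")

def langlinks_for_page(sql_text, page_id):
    """Return {lang: title} for every row in sql_text whose page ID equals page_id."""
    links = {}
    for match in ROW_RE.finditer(sql_text):
        if int(match.group(1)) == page_id:
            # Undo the SQL backslash escaping (e.g. \' becomes ').
            links[match.group(2)] = re.sub(r"\\(.)", r"\1", match.group(3))
    return links

# A shortened sample line using rows from the output above.
sample = ("INSERT INTO `langlinks` VALUES (1236730,'af','Gewas'),"
          "(1236730,'de','Neoplasie'),(229103,'nl','Puntlassen');")
print(langlinks_for_page(sample, 1236730))
```

To run this against the real dump, the lines could be read with gzip.open(path, "rt", encoding="utf-8", errors="replace") and passed to the function one at a time, merging the resulting dictionaries.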

Cherry is much harder because no one has added interlanguage links within the English article to indicate that the German content does exist, but as a section under a slightly different topic. Uncovering missing interlanguage links like this is very difficult given all of the good work that has already been done by the editor community. That being said, the analogous example for spot welding, where the English article explicitly links to article sections in German and Italian, can also be recovered through the langlinks dump:

grep -o '(229103,....................' ~/Downloads/enwiki-20200201-langlinks.sql
(229103,'ar','لحام بقعة')
(229103,'ca','Soldadura per punts')
(229103,'de','Widerstandsschweißen#Widerstandspunkt- und Buckelschweißen')
(229103,'es','Soldadura por puntos')
(229103,'fa','جوشکاری نقطهای')
(229103,'he','ריתוך נקודות')
(229103,'it','Saldatura#Saldatura a punti')
(229103,'ja','スポット溶接')
(229103,'nl','Puntlassen')
(229103,'pt','Solda ponto')
(229103,'ro','Sudură în puncte')
(229103,'ru','Точечная контактная сварка')
(229103,'simple','Resistance welding')
(229103,'sv','Punktsvetsning')
(229103,'te','స్పాట్ వెల్డింగు')
(229103,'uk','Точкове зварювання')
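
Links that point at a section, like the German and Italian entries above, can be told apart from whole-article links by splitting the title on "#". A minimal sketch, assuming titles in the form extracted from the dump; split_section_link and section_targets are hypothetical helper names.

```python
def split_section_link(title):
    """Split 'Page#Section' into (page, section); section is None for whole-article links."""
    page, sep, section = title.partition("#")
    return (page, section if sep and section else None)

def section_targets(links):
    """Given {lang: title}, keep only the languages whose link points at a section."""
    result = {}
    for lang, title in links.items():
        page, section = split_section_link(title)
        if section is not None:
            result[lang] = (page, section)
    return result

# Three of the spot welding rows from the output above.
spot_welding = {
    "de": "Widerstandsschweißen#Widerstandspunkt- und Buckelschweißen",
    "it": "Saldatura#Saldatura a punti",
    "nl": "Puntlassen",
}
print(section_targets(spot_welding))
```

A language that shows up in section_targets already covers the topic inside a larger article, so it is a candidate for exclusion from "create this article" recommendations.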
Isaac added a comment. Edited Feb 17 2020, 10:09 PM

Following up because I stumbled across a prior discussion of the solution I raised above (T203041#4640203). It looks like the tumor example was handled differently back then, so the langlinks solution had been considered but deemed insufficient.

We could also consider mining article / sub-article relationships to identify topics that are likely to be a section of a larger article before becoming a stand-alone article. This would specifically help cover the Cherry use case. In the English article for Prunus avium, a species of cherry that is also linked via Wikidata to the German article on cherries, there's a link to the English Cherry article as the main article for more information. We might recommend creating the article for Cherry in another language only if that language has neither an article for Cherry nor one for Prunus avium. This might be overly restrictive, but it should hopefully remove a bunch of false positives.
More information: http://brenthecht.com/publications/cscw17_subarticles.pdf
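
The filtering rule proposed in this comment could be sketched roughly as below. This is only an illustration under stated assumptions: coverage (topic to the set of languages that have an article) and parents (topic to the articles that link to it as their main article) are hypothetical in-memory structures, should_recommend is a hypothetical name, and the data is a toy version of the Cherry example with "xx" as a placeholder language code.

```python
def should_recommend(topic, target_lang, coverage, parents):
    """Recommend topic for creation in target_lang only if neither the topic itself
    nor any article that treats it as a sub-topic already exists in that language."""
    candidates = {topic} | parents.get(topic, set())
    return not any(target_lang in coverage.get(c, set()) for c in candidates)

# Toy data mirroring the Cherry example; not taken from a real dump.
coverage = {"Cherry": {"en"}, "Prunus avium": {"en", "de"}}
parents = {"Cherry": {"Prunus avium"}}  # Prunus avium links to Cherry as its main article

print(should_recommend("Cherry", "de", coverage, parents))  # False
print(should_recommend("Cherry", "xx", coverage, parents))  # True
```

Checking every parent article makes the rule conservative: a single related article in the target language suppresses the recommendation, which is the "overly restrictive" trade-off mentioned above.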