Description
Status | Assigned | Task
---|---|---
Stalled | None | T203041 Output 2.1: An improved task recommendation API
Resolved | bmansurov | T203263 Measure translation recommendations against the baseline
Stalled | None | T207406 Recommendation API: resolve interlanguage conflicts
Open | None | T210433 Identify and release data on similar Wikidata items
Resolved | bmansurov | T213222 Recommendation API improvements
Resolved | bmansurov | T213866 Filter out newly created articles
Event Timeline
@leila I've been experimenting with implementing Section 2.1 of the paper. We can get redirects from Hive (prod.redirect), but I'm not sure how to retrieve interlanguage links, as they are no longer used in Wikipedia according to this (see the intro). Do you know how?
Edit: Had a chat with @Amire80 and he's pointed out that I can use Wikidata for this too \o/.
Turns out we cannot reliably detect redirects across languages. For example, '"Them"' redirects to 'Them_(King_Diamond_album)' (Q1756739). Since we're trying to figure out the Wikidata ID of '"Them"' we can only search Wikidata items by English labels. There are many items with that label:
- Them (Q1338638)
- Them (Q37545106)
- Them (Q1112469)
- Them (Q3591139)
- etc.
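The ambiguity described above can be illustrated with a small label lookup. This is only a sketch: the in-memory sample reuses the item IDs listed above, and the helper names (`normalize`, `build_label_index`, `resolve_label`) are hypothetical. A real run would read labels from a Wikidata dump or API instead.

```python
from collections import defaultdict

# Hypothetical in-memory sample: (item ID, English label) pairs copied from
# the list above. Real data would come from a Wikidata dump or API.
SAMPLE_LABELS = [
    ("Q1338638", "Them"),
    ("Q37545106", "Them"),
    ("Q1112469", "Them"),
    ("Q3591139", "Them"),
]


def normalize(label):
    """Strip surrounding double quotes and ignore case, so '"Them"' matches 'Them'."""
    return label.strip('"').casefold()


def build_label_index(pairs):
    """Map each normalized label to the list of item IDs that carry it."""
    index = defaultdict(list)
    for item_id, label in pairs:
        index[normalize(label)].append(item_id)
    return index


def resolve_label(label, index):
    """Return the only matching item ID, or None if zero or several items match."""
    matches = index.get(normalize(label), [])
    return matches[0] if len(matches) == 1 else None


index = build_label_index(SAMPLE_LABELS)
print(resolve_label('"Them"', index))  # None: four candidates, no unique match
```

With several items sharing one label, the lookup cannot pick the redirect target without some extra signal.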
@bmansurov it's probably worth going over Ellery's code. For example, please check get_missing_articles.py in /home/ellery/translation-recs-app/model_building/find_missing on stat1005. The "find_missing" folder there can be helpful for you at this point. Let me know if you want me to dig deeper into it to refresh my memory. ;)
@leila thanks for the lead. Do you remember whether in 2015 (when the scripts were written) Neoplasm (en) was linked to Neoplasma (de) in langlinks? Right now, it seems that's not the case:
mysql:research@analytics-store.eqiad.wmnet [enwiki]> SELECT page.page_title AS "source (en)", ll_lang AS "target language", ll_title AS "title (target lang)"
    FROM langlinks JOIN page ON ll_from=page.page_id
    WHERE ll_from=1236730
    ORDER BY ll_lang;
+-------------+-----------------+-----------------------------+
| source (en) | target language | title (target lang)         |
+-------------+-----------------+-----------------------------+
| Neoplasm    | af              | Gewas                       |
| Neoplasm    | ar              | ورم                         |
| Neoplasm    | bn              | নিওপ্লাজম                   |
| Neoplasm    | ca              | Neoplàsia                   |
| Neoplasm    | cy              | Tiwmor                      |
| Neoplasm    | da              | Neoplasi                    |
| Neoplasm    | el              | Νεόπλασμα                   |
| Neoplasm    | es              | Neoplasia                   |
| Neoplasm    | et              | Kasvaja                     |
| Neoplasm    | eu              | Neoplasia                   |
| Neoplasm    | fa              | نئوپلاسم                    |
| Neoplasm    | fr              | Néoplasie                   |
| Neoplasm    | ga              | Sceachaill                  |
| Neoplasm    | gl              | Neoplasia                   |
| Neoplasm    | he              | נאופלזיה                    |
| Neoplasm    | hi              | फुलाव                       |
| Neoplasm    | hu              | Neoplasia                   |
| Neoplasm    | id              | Neoplasma                   |
| Neoplasm    | ko              | 신생물                      |
| Neoplasm    | la              | Neoplasma                   |
| Neoplasm    | lt              | Neoplazma                   |
| Neoplasm    | nl              | Neoplasie                   |
| Neoplasm    | nn              | Neoplasi                    |
| Neoplasm    | no              | Neoplasi                    |
| Neoplasm    | pl              | Nowotwór                    |
| Neoplasm    | pt              | Neoplasma                   |
| Neoplasm    | sh              | Neoplazma                   |
| Neoplasm    | sr              | Неоплазма                   |
| Neoplasm    | sv              | Neoplasi                    |
| Neoplasm    | th              | เนื้องอก                    |
| Neoplasm    | ur              | نُفّاخ                      |
| Neoplasm    | zh              | 贅生物                      |
+-------------+-----------------+-----------------------------+
32 rows in set (0.00 sec)
@bmansurov to be clear: if I understand correctly, Neoplasm doesn't exist as a concept in de, it only exists in en. What does exist in de is Tumor. So the system should figure out that these two concepts are the same, and it should not recommend creating the Neoplasm article in de, because it's already covered by Tumor. (Figure 2 of https://arxiv.org/pdf/1604.03235.pdf explains this as well.)
@leila that's what I understood too. So in order to link Neoplasm (en) to Tumor (de), we'd go from Neoplasm (en) to Neoplasma (de) and then from that article to Tumor (de).
To quote the paper:
In order to group Wikidata concepts that are semantically nearly identical, we leverage two signals. First, we extract inter-language links which Wikipedia editors use to override the mapping specified by Wikidata and to directly link articles across languages (e.g., in Fig. 2, English NEOPLASM is linked via an inter-language link to German NEOPLASMA).
My problem is this part:
English NEOPLASM is linked via an inter-language link to German NEOPLASMA
As you can see from the SQL output above, there is no such link.
@bmansurov I may be blanking a bit here. ;) Here is what I /think/ is happening in the paper. Please double-check with Bob and confirm. What the text of the paper calls ILL is inline interlanguage links, not the ones captured through the language bar of the article page (check https://en.wikipedia.org/wiki/Help:Interlanguage_links for the distinction). The former are not captured in Wikidata dumps. We should have a dataset somewhere (or we can create it ourselves) which contains these in-text interlanguage links, or ILLs. That's what the paper refers to. So basically, we get from Wikidata that Neoplasm (en) and Tumor (de) are connected. From redirects, we get that Neoplasma (de) is the same as Tumor (de), and the ILL data will tell us the last missing piece: that Neoplasm (en) is the same as Neoplasma (de).
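One way to read the chain described above is as a graph-merging problem: every article is a node, and each signal (Wikidata sitelinks, redirects, in-text ILLs) contributes "same concept" edges. The sketch below uses a plain union-find structure for that. It is an assumption about how the grouping could be done, not the paper's or Ellery's actual code; the class name `ConceptGroups` is hypothetical, and the edges are hard-coded from the Neoplasm example in this thread instead of being read from real data.

```python
class ConceptGroups:
    """Union-find over (language, title) article nodes."""

    def __init__(self):
        self._parent = {}

    def find(self, node):
        """Return the representative of the group that contains `node`."""
        self._parent.setdefault(node, node)
        root = node
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the walked path at the root.
        while self._parent[node] != root:
            self._parent[node], node = root, self._parent[node]
        return root

    def union(self, a, b):
        """Record that articles `a` and `b` describe the same concept."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_concept(self, a, b):
        return self.find(a) == self.find(b)


groups = ConceptGroups()
# Redirect signal: Neoplasma (de) redirects to Tumor (de).
groups.union(("de", "Neoplasma"), ("de", "Tumor"))
# In-text ILL signal: Neoplasm (en) links to Neoplasma (de).
groups.union(("en", "Neoplasm"), ("de", "Neoplasma"))

print(groups.same_concept(("en", "Neoplasm"), ("de", "Tumor")))  # True
```

Wikidata sitelinks would be fed in as further `union` calls of the same kind, one per pair of articles attached to the same item.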
@bmansurov asked me to chime in, and since I've been obsessed with interlanguage links for many, many years, I'm happy to help :)
The Neoplasm / Tumor situation is an example of an interlanguage conflict, also known as an interwiki conflict. It happens in the following interrelated cases:
- When old-style interlanguage links are used, especially with links to redirects, sections (with #), or esoteric spelling (as silly as it sounds, "hedgehog" instead of "Hedgehog" may cause a conflict!) I think that @Ladsgroup has a bot script that can find remaining articles that have this.
- Articles were simply connected incorrectly by a human or a bot at some point. This happens because of misunderstandings, translation mistakes, and so on. (For example, the Hebrew article for Glacier was once connected to Iceberg, because the Hebrew word is the same.)
- When some languages have articles about topics that are closely related, but not identical. For example, "Emirate of Dubai" and "Dubai" (I remember cleaning it up a long time ago). Neoplasm / Tumor is probably also such a case. In such a case languages can be roughly broken up into four groups: those with articles about topic A, those with articles about topic B, those with both articles, and those with an article that covers both topics. A human must go over all the articles manually and clean them up.
I don't know how many conflicts there are. My super-rough ballpark estimate is about 5000, but this estimate can be far off. The real number can be 100, or 10,000, or even more. If you have tools that find suspect clusters of conflicting links, it will be very useful if you publish a list.
Resolving these conflicts is challenging and time-consuming, but it's nevertheless feasible. There are Wikipedians and Wikidatans (including myself) who will enjoy fixing them, especially if you publish such a list.
@Amire80 thanks for chiming in. I think we'll all benefit from identifying these problematic interlanguage links and fixing them. Hopefully we can publish a list of issues.
@leila it turns out that interlanguage links have been removed from the article in recent revisions. If you compare the most recent version of the article with an earlier version from 2015, you'll notice in the language sidebar that a link to German Wikipedia exists in the earlier version, while it's gone from the newer version. This is also obvious from the source of the earlier version, which contains these lines:
[[de:Neoplasma]] [[ru:Опухоль]] [[ta:கட்டி (உயிரியல்)]] [[uk:Пухлина]]
Given the above information, while I think we can resolve some missing articles using Wikidata links, we'll have some issues where Wikipedia interlanguage links are missing (as is the case with dewiki), which I hope we can resolve over time and with the help of @Amire80 and others.
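The old-style links quoted above can be pulled out of an old revision's wikitext with a small pattern match. The following is a minimal sketch, not code from this task: the function name `extract_langlinks` and the `known_langs` parameter are hypothetical, and the caller is assumed to have already fetched the revision text.

```python
import re

# Matches old-style links such as [[de:Neoplasma]]: a lowercase prefix, a
# colon, then the target title. A leading colon ([[:de:...]]) produces an
# inline link rather than a sidebar link, so it is deliberately not matched.
OLD_STYLE_LINK = re.compile(r"\[\[([a-z][a-z-]*):([^\[\]|]+)\]\]")


def extract_langlinks(wikitext, known_langs):
    """Return {language code: title} for old-style interlanguage links.

    `known_langs` filters out lowercase prefixes that are not language
    codes (for example namespace or interwiki shortcuts).
    """
    links = {}
    for lang, title in OLD_STYLE_LINK.findall(wikitext):
        if lang in known_langs:
            links[lang] = title.strip()
    return links


old_source = "[[de:Neoplasma]] [[ru:Опухоль]] [[ta:கட்டி (உயிரியல்)]] [[uk:Пухлина]]"
print(extract_langlinks(old_source, {"de", "ru", "ta", "uk"}))
```

Running this over historical revisions would be one way to recover links, such as en to de here, that no longer appear in the current langlinks table.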
Resolving these conflicts is challenging and time-consuming, but it's nevertheless feasible.
In the case of Tumor/Neoplasm, what's preventing us from merging these two Wikidata items? Also, do you see cases where two related articles cannot be merged in Wikidata? What can we do about those?
If the articles describe the same thing, then the items should indeed be merged. (I'm really not sure that they describe the same thing because I'm not a doctor.)
Now, even if they indeed are about the same thing, they cannot be merged as long as there is at least one language to which there is a link in both items. I didn't check the whole list of languages in both items, but I can see that both items have links to Danish (da), Spanish (es), and probably more languages.
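The merge precondition described above (no language may have an article linked from both items) amounts to a set intersection over the two items' sitelinks. This is a hedged sketch: the function names are hypothetical, and the sample sitelinks are illustrative. The Neoplasm titles come from the SQL output earlier in this thread, while the Danish and Spanish titles on the Tumor item are placeholders because they were not given here.

```python
def conflicting_languages(sitelinks_a, sitelinks_b):
    """Return the languages that have an article linked from both items."""
    return sorted(set(sitelinks_a) & set(sitelinks_b))


def can_merge(sitelinks_a, sitelinks_b):
    """Two items can only be merged if no language is linked from both."""
    return not conflicting_languages(sitelinks_a, sitelinks_b)


# Illustrative sitelinks as {language code: article title}.
neoplasm_item = {"af": "Gewas", "da": "Neoplasi", "es": "Neoplasia"}
tumor_item = {
    "da": "<Danish article on the Tumor item>",    # placeholder title
    "es": "<Spanish article on the Tumor item>",   # placeholder title
    "de": "Tumor",
}

print(conflicting_languages(neoplasm_item, tumor_item))  # ['da', 'es']
print(can_merge(neoplasm_item, tumor_item))              # False
```

The list of conflicting languages is also the work list for the manual cleanup: each of those languages has two articles that a human needs to compare.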
Resolving this will require a bit of work:
- Understanding the terms Neoplasm and Tumor. (I don't understand them, but somebody who studied something about medicine probably does.)
- Going to each article in each language in each item and checking: Is this about Neoplasm, about Tumor, or about both things?
- Can the articles in Danish, Spanish, and other languages that have two articles be merged? If yes, talk to the community in the relevant language and merge them. (And if you know the language yourself, merge them yourself!)
- If the topics are distinct and there should be two articles, then there should probably be two Wikidata items. Keep them distinct, and if needed, move the articles that are more about Neoplasm to the Neoplasm item, and move the articles that are more about Tumor to the Tumor item.
Yes, it's time-consuming, but it's totally doable. And it's even fun! I did it to some other articles in the past.
Status update: Directly implementing Section 2.1 wasn't feasible because site links are now generated at Wikidata only. As an alternative, I've tried using Wikidata item labels, descriptions, and aliases to figure out whether I can link tumor to neoplasm. I've used word2vec (and, separately, doc2vec) over Wikidata item descriptions, with cosine similarity, to find similar items. There were many false positives because (my guess is that) a lot of the Wikidata items have short and similar descriptions. Here are the results: T210433: Identify and release data on similar Wikidata items.
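The guess about short descriptions can be illustrated with a toy example. Note that this sketch swaps in plain bag-of-words counts for the word2vec and doc2vec embeddings actually used, so it only shows the general effect: when descriptions are short and generic, unrelated items can end up with near-identical vectors. The sample descriptions are hypothetical, not taken from Wikidata.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(count * vec_b[token] for token, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical descriptions of two unrelated items and one unique one.
item_1 = "type of disease"
item_2 = "type of disease"
item_3 = "abnormal mass of tissue resulting from excessive cell growth"

print(cosine_similarity(item_1, item_2))  # close to 1.0 despite unrelated items
print(cosine_similarity(item_1, item_3))  # much lower: only "of" is shared
```

Any similarity threshold high enough to be useful will still accept the first pair, which is one plausible source of the false positives.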