[Story] Phase 0: Automate interwiki language links for Wiktionary
Closed, Resolved · Public

Description

Links between the individual language versions of Wiktionary work differently than on Wikipedia and the other sister projects. They are based on representation, not concept, so Wikidata is not the right way to link the main content together. We need to develop a new extension that automatically links pages based on their page title. For a given page we need to check whether a page with the same title exists on any of the other Wiktionary language editions and then add links to the sidebar accordingly. Some normalization of titles is needed here.
Project pages, help pages, etc. will be linked on Wikidata as usual because they are based on concepts, not representation.

GPHemsley updated the task description. Oct 30 2014, 12:31 AM
GPHemsley updated the task description.
GPHemsley updated the task description.
Gilles triaged this task as Normal priority. Nov 24 2014, 1:44 PM
Gilles added a subscriber: Gilles.
Lydia_Pintscher renamed this task from Implement Phase 0: Interwiki links to language links for Wiktionary. Nov 27 2014, 10:39 AM
Lydia_Pintscher lowered the priority of this task from Normal to Low.
GPHemsley updated the task description. Dec 1 2014, 3:18 AM
Lydia_Pintscher removed a subscriber: Unknown Object (MLST).
GPHemsley renamed this task from language links for Wiktionary to Phase 0: Centralize interwiki language links for Wiktionary. Mar 8 2015, 3:19 PM
GPHemsley raised the priority of this task from Low to Normal.
GPHemsley moved this task from Backlog to Up Next on the Wiktionary board. Mar 8 2015, 3:26 PM

Is there a reason why only "main namespace" pages are mentioned in the above proposal?
Also, does this proposal take into account the different ways the different communities represent words? E.g. different apostrophes used for aujourd'hui in French and in English (with corresponding redirects in each project).

mxn added a subscriber: mxn. Mar 17 2015, 6:21 AM
Gilles removed a subscriber: Gilles. Apr 23 2015, 6:11 AM

When will this task get completed?

I unfortunately can't tell you yet. I am currently looking for someone to work on it.

Restricted Application added a subscriber: Aklapper. Jul 28 2015, 5:54 AM
JAnD added a subscriber: JAnD. Oct 9 2015, 9:36 AM
dg711 added a subscriber: dg711. Dec 30 2015, 1:20 AM

The Wikimedia-Hackathon-2016 starts tomorrow and this task is featured at T119703. We want to use T130776: Wikimedia Hackathon 2016 Opening Session to promote these projects and help recruiting volunteers to work for them.

If this task is ripe for hackathon work, please follow these instructions. If it is not ready, remove it from T119703 in order to avoid volunteers' frustration. Thank you!

gabriel-wmde updated the task description. Apr 2 2016, 9:16 AM
gabriel-wmde added a comment. Edited Apr 2 2016, 9:53 AM

Some preliminary thoughts regarding the implementation:

The task will be implemented as an extension that does two things:

  • On page edit, record in a central data store that the page exists on the Wiktionary of the current language.
  • On page render, query the central data store for all language editions that have the same page and insert the links into the page structure.

The "central data store" must be a database that is accessible from all wiktionary projects. Proposed table structure:

lang | title | last_updated

lang and title must be indexed.
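
To make the proposal above concrete, here is a minimal sketch of the central data store using SQLite from Python. It is only an illustration under assumptions: the table name `wiktionary_titles`, the function names, and the use of SQLite are all hypothetical, and the real extension would talk to a shared database from inside MediaWiki.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Create the (hypothetical) central table, indexed on lang and title."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wiktionary_titles ("
        " lang TEXT NOT NULL,"
        " title TEXT NOT NULL,"
        " last_updated INTEGER NOT NULL,"
        " PRIMARY KEY (lang, title))"
    )
    # Lookups at render time go by title, so index that column as well.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_title ON wiktionary_titles (title)"
    )
    return conn


def record_page(conn, lang, title, now=None):
    """On page edit: note that `title` exists on the `lang` Wiktionary."""
    timestamp = int(time.time()) if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO wiktionary_titles (lang, title, last_updated)"
        " VALUES (?, ?, ?)",
        (lang, title, timestamp),
    )
    conn.commit()


def language_links(conn, lang, title):
    """On page render: list the other editions that have the same title."""
    rows = conn.execute(
        "SELECT lang FROM wiktionary_titles"
        " WHERE title = ? AND lang != ? ORDER BY lang",
        (title, lang),
    )
    return [row[0] for row in rows]
```

With this sketch, `record_page` corresponds to the edit hook and `language_links` to the render hook; the sidebar would be built from the returned language codes.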

On initialization the table must be filled with the existing data. This could be done by querying the API and crawling each wiktionary with a bot or by direct database access. This still has to be determined.

Regarding the comment by @Darkdadaah: Wiktionary is in the unique situation that each word should be the same in each language version. Where they differ, there must be a redirect page, as in the case pointed out by Darkdadaah.

JAnD added a comment. Apr 2 2016, 6:47 PM

Regarding the comment by @Darkdadaah: Wiktionary is in the unique situation that each word should be the same in each language version. Where they differ, there must be a redirect page, as in the case pointed out by Darkdadaah.

There are some exception cases:

  • proverbs are written as full sentences in some languages (initial capital and a full stop at the end), and with a lowercase first letter and no full stop in others
  • words containing an apostrophe (')
Qgil removed a subscriber: Qgil. Apr 3 2016, 7:18 AM

@gabriel-wmde worked on this at Wikimedia-Hackathon-2016 - "not ready but getting there"

@gabriel-wmde It'd be great if you can post a status update here. Then I can help you figure out who can review it and so on. If you want I can also see if you can continue this on work-time.

gabriel-wmde added a comment. Edited Apr 6 2016, 9:14 AM

Current status:

Next steps:

  • Add code for the remaining TODO comments.
  • Address the cases where the title is different in the original language (as raised by @JAnD and @Darkdadaah, I'll add a separate comment with a proposal on how to address this).
  • Create a maintenance script that fills the database.
  • ???
  • Profit Deploy

After the comments from @JAnD and @Darkdadaah I have lots of questions about how much the page titles differ between language editions.

As far as I understand, some Wiktionary projects have their own, localized spelling that deviates from the "standard" spelling of Wiktionary, for adhering to the spelling standards of the language they are in.

Some questions:

  • Does the same word exist under more than 2 different titles across all Wiktionaries? If so, how big is the variation?
  • Does each Wiktionary apply its localized spelling only to its own language (e.g. the French Wiktionnaire uses French apostrophes in French words and phrases, but leaves the apostrophes in words and phrases of other languages as-is)?
  • Is the localized spelling consistent inside one Wiktionary?
  • For roughly what percentage of localized pages does a redirect from the standard spelling to the localized spelling exist?
  • Would it be possible to "normalize" page titles algorithmically (search and replace with regular expressions)?
  • Do words in other scripts (Cyrillic, Hebrew, Arabic, etc.) avoid these issues, or do they have even graver ones?

Is there someone who can reliably answer these questions for all wiktionaries? Otherwise the next step in this issue would be to write a program that analyzes the deviations.
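
A program that analyzes the deviations could start from something like the following sketch. It assumes a list of (language, title) pairs is already available, for example from dumps, and groups titles by a deliberately loose key; the key function and all names are hypothetical, and a loose key like this will also group titles that are not really equivalent, so the output is only a list of candidates to review.

```python
import unicodedata
from collections import defaultdict


def loose_key(title):
    """Case-fold a title and keep only letters, marks and digits (assumed heuristic)."""
    return "".join(
        ch for ch in title.casefold()
        if unicodedata.category(ch)[0] in "LMN"
    )


def find_deviations(pages):
    """Group (lang, title) pairs by loose key; keep keys with several spellings."""
    groups = defaultdict(lambda: defaultdict(set))
    for lang, title in pages:
        groups[loose_key(title)][title].add(lang)
    return {
        key: {title: sorted(langs) for title, langs in spellings.items()}
        for key, spellings in groups.items()
        if key and len(spellings) > 1
    }
```

Each entry of the result maps a loose key to the distinct spellings found and the editions that use each spelling, which is enough to count how many titles differ and in what way.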

hoo added a subscriber: hoo. Apr 14 2016, 9:27 AM

@gabriel-wmde I did some work like this a long time ago, see https://www.wikidata.org/wiki/Wikidata_talk:Wiktionary#First_and_second_phases. I'll see if I can do an update...

Here is an update! See details here https://fr.wiktionary.org/wiki/Utilisateur:Darkdadaah/Analyses/Interwikis.

To put it simply, as far as I know, the only accepted differences are apostrophes and punctuation differences. So it is not a matter of localized spelling, but of typographic rules.

To answer your questions @gabriel-wmde :

  • Word variations have separate pages in all chapters, so they are not an issue for interwiki links. The only cases where a difference can be seen are those where the typographic rules differ.
  • Typographic apostrophes are used for all languages.
  • The rules are supposedly consistent in each Wiktionary.
  • I found that only a small fraction of pages have a link to a different spelling (1% of 24M pages). For "acceptable" differences, the fraction is even smaller (0.01%).
  • Normalization is possible for apostrophes. For punctuation it may be possible, assuming the punctuation mark is not an integral part of the phrase. Pages about punctuation marks themselves should be left out.
  • The only problems I found in other scripts concern punctuation (e.g. an ellipsis written as one character or as three periods).
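
The apostrophe and ellipsis cases above could be normalized with a simple character mapping, as in this sketch. The function name and the choice of target characters are assumptions for illustration; which characters count as equivalent would have to be decided by the projects, and nothing here touches diacritics or other punctuation.

```python
# Hypothetical mapping: typographic apostrophe and one-character ellipsis
# are folded to their plain ASCII forms before titles are compared.
_EQUIVALENTS = str.maketrans({
    "\u2019": "'",    # right single quotation mark -> straight apostrophe
    "\u2026": "...",  # horizontal ellipsis -> three periods
})


def normalize_title(title):
    """Return the title with typographic variants folded to one form."""
    return title.translate(_EQUIVALENTS)
```

Two titles would then be linked when their normalized forms are equal, while each wiki keeps displaying its own typography.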
mxn added a comment. May 26 2016, 5:52 PM

The Vietnamese Wiktionary uses redirects for systematic variations involving diacritics, for example xoá to xóa, whereas the English Wiktionary does not. This probably doesn't show up in interwiki stats because the English Wiktionary has relatively few Vietnamese words and the French and Chinese Wiktionaries have little coverage of these variations (which affect maybe a tenth of the overall corpus). On the flip side, automatically normalizing diacritics would be problematic for the same reasons described in T78485.

It wouldn't be the end of the world for the Vietnamese Wiktionary to turn its redirects into soft redirects to preserve these links, but soft redirects are highly inconvenient for very systematic variations.

Nikki added a subscriber: Nikki. May 26 2016, 6:00 PM

In the French Wiktionary our rules would put xoá and xóa in two different articles, with a soft redirect to the corresponding content. The rule of thumb is that hard redirects shouldn't imply any linguistic information, e.g. a variation is informative and should be explained explicitly in both articles.

A hard redirect should only mean that the 2 pages are typographically equivalent. In that regard, the interwiki language links should be able to link to any typographic variation chosen by the various projects.

Note that the projects would still be free to use hard redirects when they choose to, but interwiki links should only link to the corresponding written form, not the corresponding "word". We will need an actual word/lexeme entity for that.

Noe added a subscriber: Noe. Jul 1 2016, 2:37 PM

Current status:

The new location is R1890 extension-Cognate.

Lydia_Pintscher renamed this task from Phase 0: Centralize interwiki language links for Wiktionary to Phase 0: Automate interwiki language links for Wiktionary. Sep 11 2016, 3:33 PM
Meno25 added a subscriber: Meno25. Oct 10 2016, 4:20 PM
Lydia_Pintscher renamed this task from Phase 0: Automate interwiki language links for Wiktionary to [Story] Phase 0: Automate interwiki language links for Wiktionary. Jan 3 2017, 3:07 PM
-sche added a subscriber: -sche. Apr 13 2017, 8:43 PM

Late reply to the question of whether "words in different scripts" have issues like the French/English wikis' apostrophes:

Wiktionarians regularly fail to create redirects / interwiki links for this, so if Cognate ignores this until later, it doesn't make anything any worse than it is, IMO.

Sometimes, the English Wiktionary uses a straight apostrophe and a curly or modifier apostrophe contrastively: Mopan Maya ka'an ("sky") vs Yucatec Maya ka’an ("sky"): https://en.wiktionary.org/wiki/Wiktionary:Beer_parlour/2013/July#Apostrophe_conflict_between_Yucatec_Maya_and_Mopan_Maya . This may be an issue the English Wiktionary needs to fix internally by standardizing on one apostrophe! It may also not be any obstacle to what you're doing here.

Btw, hard redirects like the Vietnamese Wiktionary uses for xoá→xóa (mentioned above) would never be used by the English Wiktionary because the strings could be separate words in other languages, like Icelandic sóa and Hungarian soá are.

Question: will it be possible for templates/modules to query the existence of a page in another language's Wiktionary? This would be useful to the English Wiktionary's https://en.wiktionary.org/wiki/Template:t%2B.

Lydia_Pintscher closed this task as Resolved. Apr 24 2017, 3:01 PM
Lydia_Pintscher claimed this task.