[Story] Phase 0: Automate interwiki language links for Wiktionary
Closed, ResolvedPublic

Description

Links between the individual language versions of Wiktionary work differently than on Wikipedia and the other sister projects. They are based on representation not concept. So for the main content Wikidata is not the right way to link them together. We need to develop a new extension that automatically links pages based on their page title. For a given page we need to check if a page with the same title exists on any of the other Wiktionary language editions and then add links to the sidebar accordingly. Some normalization is needed here.
Project pages, help pages, etc. will be linked on Wikidata as usual because they are based on concepts, not representation.

Event Timeline

I unfortunately can't tell you yet. I am currently looking for someone to work on it.

The Wikimedia-Hackathon-2016 starts tomorrow and this task is featured at T119703. We want to use T130776: Wikimedia Hackathon 2016 Opening Session to promote these projects and help recruit volunteers to work on them.

If this task is ripe for hackathon work, please follow these instructions. If it is not ready, remove it from T119703 in order to avoid volunteers' frustration. Thank you!

Some preliminary thoughts regarding the implementation:

The task will be implemented as an extension that does two things:

  • On page edit, record in a central data store that the page exists in the language of the current Wiktionary.
  • On page render, query the central data store for all existing translations and insert the information in the page structure.

The "central data store" must be a database that is accessible from all wiktionary projects. Proposed table structure:

lang | title | last_updated

lang and title must be indexed.
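
The two steps and the proposed table can be sketched roughly as follows. This is only an illustration, not the extension's code: it assumes SQLite as a stand-in for the shared database, and the names `page_titles`, `open_store`, `record_page` and `other_languages` are hypothetical.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Create the proposed table (lang, title, last_updated) plus an index."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS page_titles ("
        " lang TEXT NOT NULL,"
        " title TEXT NOT NULL,"
        " last_updated INTEGER NOT NULL,"
        " PRIMARY KEY (lang, title))"
    )
    # Render-time lookups go by title, so the extra index leads with title.
    db.execute(
        "CREATE INDEX IF NOT EXISTS idx_title ON page_titles (title, lang)"
    )
    return db


def record_page(db, lang, title, now=None):
    """On page edit: note that `title` exists on the `lang` Wiktionary."""
    timestamp = int(time.time()) if now is None else now
    db.execute(
        "INSERT OR REPLACE INTO page_titles (lang, title, last_updated)"
        " VALUES (?, ?, ?)",
        (lang, title, timestamp),
    )
    db.commit()


def other_languages(db, lang, title):
    """On page render: other languages that have a page with the same title."""
    rows = db.execute(
        "SELECT lang FROM page_titles"
        " WHERE title = ? AND lang != ? ORDER BY lang",
        (title, lang),
    )
    return [row[0] for row in rows]
```

Making (lang, title) the primary key means a repeated edit only refreshes last_updated instead of adding a second row.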

On initialization the table must be filled with the existing data. This could be done by querying the API and crawling each wiktionary with a bot or by direct database access. This still has to be determined.
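
For the API-crawling option, a bot could page through `list=allpages` on each Wiktionary. The sketch below only builds the request URL and parses one JSON response; it makes no network calls, and the helper names `allpages_url` and `parse_allpages` are hypothetical. The response shape is the one I understand the MediaWiki Action API to return for this query.

```python
import json
from urllib.parse import urlencode


def allpages_url(lang, continue_from=None, namespace=0):
    """Build a list=allpages request URL for one Wiktionary language edition."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": namespace,
        "aplimit": "max",
        "format": "json",
    }
    if continue_from is not None:
        # Continuation token taken from the previous response.
        params["apcontinue"] = continue_from
    return f"https://{lang}.wiktionary.org/w/api.php?{urlencode(params)}"


def parse_allpages(body):
    """Return (titles, next apcontinue token or None) from one JSON response."""
    data = json.loads(body)
    titles = [page["title"] for page in data["query"]["allpages"]]
    next_token = data.get("continue", {}).get("apcontinue")
    return titles, next_token
```

The bot would repeat the request with the returned token until it comes back as None, writing each title into the table as it goes.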

Regarding the comment by @Darkdadaah: Wiktionary is in the unique situation that each word should be the same in each language version. Where they differ, there must be a redirect page, as in the case pointed out by Darkdadaah.

There are some exception cases:

  • Proverbs are written as full sentences in some languages (initial capital and a period at the end) and with a lowercase first letter and no period in others.
  • Words containing an apostrophe (').

@gabriel-wmde It'd be great if you can post a status update here. Then I can help you figure out who can review it and so on. If you want I can also see if you can continue this on work-time.

Current status:

Next steps:

  • Add code for the remaining TODO comments.
  • Address the cases where the title is different in the original language (as raised by @JAnD and @Darkdadaah, I'll add a separate comment with a proposal on how to address this).
  • Create a maintenance script that fills the database.
  • ???
  • Profit Deploy

After the comments from @JAnD and @Darkdadaah I have lots of questions on how disparate the page titles for each language are.

As far as I understand, some Wiktionary projects have their own, localized spelling that deviates from the "standard" spelling of Wiktionary, for adhering to the spelling standards of the language they are in.

Some questions:

  • Does any word exist in more than two title variations across all language editions? If so, how large is the variation?
  • Does each Wiktionary project apply the localized spelling only to its own language (e.g. the French Wiktionnaire uses French apostrophes in French words and phrases, but leaves the apostrophes in other languages' words and phrases as-is)?
  • Is the localized spelling consistent inside one Wiktionary?
  • For which percentage (roughly) of localized pages does a redirect from the standard spelling to the localized spelling exist?
  • Would it be possible to "normalize" page titles algorithmically (search and replace with regular expressions)?
  • Do words in different scripts (Cyrillic, Hebrew, Arabic, etc.) avoid these issues, or do they have even graver ones?

Is there someone who can reliably answer these questions for all wiktionaries? Otherwise the next step in this issue would be to write a program that analyzes the deviations.
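
Such an analysis program could start as small as the sketch below, which sorts pairs of (local title, linked title) into rough difference classes. The class names, the set of trailing punctuation marks and the function names are all assumptions made for illustration.

```python
from collections import Counter

TRAILING_PUNCTUATION = ".!?\u2026"  # includes the single-character ellipsis


def fold_apostrophes(title):
    """Replace the typographic apostrophe with the straight one."""
    return title.replace("\u2019", "'")


def lower_first(title):
    """Lowercase only the first character, as in sentence-style proverbs."""
    return title[:1].lower() + title[1:]


def classify(local_title, linked_title):
    """Name the smallest kind of difference that explains a title pair."""
    if local_title == linked_title:
        return "identical"
    a, b = fold_apostrophes(local_title), fold_apostrophes(linked_title)
    if a == b:
        return "apostrophe"
    a, b = a.rstrip(TRAILING_PUNCTUATION), b.rstrip(TRAILING_PUNCTUATION)
    if a == b:
        return "punctuation"
    if lower_first(a) == lower_first(b):
        return "capitalization"
    return "other"


def summarize(pairs):
    """Count how many title pairs fall into each difference class."""
    return Counter(classify(a, b) for a, b in pairs)
```

Running `summarize` over the existing interwiki links of each Wiktionary would give a first estimate of how much of the deviation is purely typographic.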

Here is an update! See details here https://fr.wiktionary.org/wiki/Utilisateur:Darkdadaah/Analyses/Interwikis.

To put it simply, as far as I know, the only accepted differences are apostrophes and punctuation differences. So it is not a matter of localized spelling, but of typographic rules.

To answer your questions @gabriel-wmde :

  • Word variations have separate pages in all chapters, so they are not an issue for interwiki links. The only case where a difference can be seen is when the typographic rules differ.
  • Typographic apostrophes are used for all languages.
  • The rules are supposedly consistent in each Wiktionary.
  • I found that only a small fraction of pages have a link to a different spelling (1% of 24M pages). For "acceptable" differences, the fraction is even smaller (0.01%).
  • Normalization is possible for apostrophes. For punctuation it may be possible, assuming the punctuation mark is not an integral part of the phrase. Pages about punctuation marks themselves should be left out.
  • The only problems I found in other scripts are punctuation differences (e.g. an ellipsis written as one character or as three periods).
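
A minimal sketch of the normalization described in these answers, assuming only the two substitutions mentioned (typographic apostrophe and single-character ellipsis). The function names and the replacement table are hypothetical, and titles that consist solely of one of these characters are left untouched so that pages about the punctuation marks themselves are not merged.

```python
REPLACEMENTS = {
    "\u2019": "'",    # typographic apostrophe -> straight apostrophe
    "\u2026": "...",  # single-character ellipsis -> three periods
}


def normalize_title(title):
    """Return a comparison key for a page title."""
    if title in REPLACEMENTS:
        # The page is about the punctuation mark itself; keep it as-is.
        return title
    for source, target in REPLACEMENTS.items():
        title = title.replace(source, target)
    return title


def same_entry(title_a, title_b):
    """True if two titles differ only by the typographic variants above."""
    return normalize_title(title_a) == normalize_title(title_b)
```

The key is only meant for matching pages across wikis; each link would still point at the title as it is actually spelled on the target wiki.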

The Vietnamese Wiktionary uses redirects for systematic variations involving diacritics, for example xoá to xóa, whereas the English Wiktionary does not. This probably doesn't show up in interwiki stats because the English Wiktionary has relatively few Vietnamese words and the French and Chinese Wiktionaries have little coverage of these variations (which affect maybe a tenth of the overall corpus). On the flip side, automatically normalizing diacritics would be problematic for the same reasons described in T78485.

It wouldn't be the end of the world for the Vietnamese Wiktionary to turn its redirects into soft redirects to preserve these links, but soft redirects are highly inconvenient for very systematic variations.
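
To make the diacritics point concrete, the sketch below uses Python's standard `unicodedata` module to show that xoá and xóa stay distinct under Unicode normalization and only coincide once every combining mark is stripped. `strip_diacritics` is a hypothetical helper written for this illustration, not something the extension is known to do.

```python
import unicodedata

XOA_FINAL_ACCENT = "xo\u00e1"  # xoá
XOA_MID_ACCENT = "x\u00f3a"    # xóa


def strip_diacritics(text):
    """Decompose the text and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def nfc_equal(a, b):
    """Compare two titles after canonical composition only."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

`nfc_equal` keeps the two spellings apart, whereas `strip_diacritics` maps both to the same string, so a blanket rule of that kind would also merge any other pair of titles that differ only in diacritics.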

In the French Wiktionary our rules would put xoá and xóa in two separate articles, with a soft redirect to the corresponding content. The rule of thumb is that hard redirects shouldn't imply any linguistic information, e.g. a variation is informative and should be explained explicitly in both articles.

A hard redirection should only mean that the 2 pages are typographically equivalent. In that regard, the interwiki language links should be able to link to any typography variation chosen by the various projects.

Note that the projects would still be free to use hard redirects when they choose to, but interwiki links should only link to the corresponding graphy, not the corresponding "word". We will need an actual word/lexeme entity for that.

Lydia_Pintscher renamed this task from Phase 0: Centralize interwiki language links for Wiktionary to Phase 0: Automate interwiki language links for Wiktionary. Sep 11 2016, 3:33 PM
Lydia_Pintscher renamed this task from Phase 0: Automate interwiki language links for Wiktionary to [Story] Phase 0: Automate interwiki language links for Wiktionary. Jan 3 2017, 3:07 PM

A late reply to the question of whether "words in different scripts" have issues like the French/English wikis' apostrophes:

Wiktionarians regularly fail to create redirects / interwiki links for this, so if Cognate ignores this until later, it doesn't make anything any worse than it is, IMO.

Sometimes, the English Wiktionary uses a straight and curly or modifier apostrophe contrastively: Mopan Maya ka'an ("sky") vs Yucatec Maya ka’an ("sky"): https://en.wiktionary.org/wiki/Wiktionary:Beer_parlour/2013/July#Apostrophe_conflict_between_Yucatec_Maya_and_Mopan_Maya . This may be an issue the English Wiktionary needs to fix internally by standardizing on one apostrophe! It may also not be any obstacle to what you're doing here.

Btw, hard redirects like the Vietnamese Wiktionary uses for xoá→xóa (mentioned above) would never be used by the English Wiktionary because the strings could be separate words in other languages, like Icelandic sóa and Hungarian soá are.

Question: will it be possible for templates/modules to query the existence of a page in another language's Wiktionary? This would be useful to the English Wiktionary's https://en.wiktionary.org/wiki/Template:t%2B.

Ad_job set Security to Software security bug. Jan 17 2018, 11:18 PM
Ad_job added a project: acl*security.
Ad_job changed the visibility from "Public (No Login Required)" to "Custom Policy".
Ad_job subscribed.
This comment was removed by Reedy.
Reedy changed the visibility from "Custom Policy" to "Public (No Login Required)". Jan 17 2018, 11:22 PM
Reedy removed a project: acl*security.

Is it possible to use Cognate in a private MediaWiki installation to keep the interwiki links updated?
How do I configure it? There is no description of the settings at https://www.mediawiki.org/wiki/Extension:Cognate

In T987#3971550, @Magol wrote:

Is it possible to use Cognate in a private MediaWiki installation to keep the interwiki links updated?
How do I configure it? There is no description of the settings at https://www.mediawiki.org/wiki/Extension:Cognate

Hi @Magol. This is totally possible. I added the few missing bits of docs for the config vars to the extension.json file in https://gerrit.wikimedia.org/r/#/c/410789/

Example config for the wmf sites is:

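	wfLoadExtension( 'Cognate' );
	// Database that holds the shared Cognate tables.
	$wgCognateDb = 'cognate_wiktionary';
	// Database cluster on which that database lives.
	$wgCognateCluster = 'extension1';
	// Namespace IDs Cognate runs in (0 is the main namespace).
	$wgCognateNamespaces = [ 0 ];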

@Addshore If I understand this right, this is only usable if I have a cluster of Mediawiki sites. Or am I wrong?
How do I have to configure my private Mediawiki site if I want Cognate to show links to pages with the same name in Wikipedia, Wikisource, Commons etc?

In T987#3975186, @Magol wrote:

@Addshore If I understand this right, this is only usable if I have a cluster of Mediawiki sites. Or am I wrong?

Yes

How do I have to configure my private Mediawiki site if I want Cognate to show links to pages with the same name in Wikipedia, Wikisource, Commons etc?

So, that isn't really what Cognate is for :)

@Addshore OK, then I will have to keep updating the interwiki links to other Wikipedia languages manually. That's okay.

But what do I do if I want the same type of interwiki links to, let's say, Wikisource?
On Wikipedia there is an "In other projects" section in the sidebar. Is it possible to manually create interwiki links to e.g. Wikisource in the "In other projects" section of my private wiki?

The "In other projects" sidebar is provided by https://www.mediawiki.org/wiki/Beta_Features/Other_projects_sidebar, which I believe is part of Wikibase.

In T987#4006828, @Addshore wrote:

The "In other projects" sidebar is provided by https://www.mediawiki.org/wiki/Beta_Features/Other_projects_sidebar, which I believe is part of Wikibase.

See also T173626.