Develop a standalone classifier for section translation (alignment) across languages
Closed, Resolved · Public

Assigned To

Authored By

	diego
	Dec 6 2017, 4:54 PM

Description

The goal is to create a data set of section names aligned across languages: given a section name in a source language, we want a mapping to the equivalent section name in each of the other languages.

Example: Given the section name: 'Awards' (English), we want to map this title to 'Premios' (Spanish) and 'Prêmios' (Portuguese).

Our preliminary studies show that automatic translation (a.k.a. machine translation) is not accurate enough. There are several reasons for this. For example, conventions change across languages (e.g. English Wikipedia usually has one section "References" and another section "Notes", while French Wikipedia usually uses a single section named "Notes et références"), and section names are not translated literally.

Within this task we will study, compare, and combine several approaches, such as the aforementioned machine translation, cross-lingual word embeddings, and heuristics based on Wikidata information.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		leila	T171224 [Objective 9.1.1] Article expansion recommendations
		Resolved		diego	T182211 Develop a standalone classifier for section translation (alignment) across languages

Event Timeline

diego created this task. Dec 6 2017, 4:54 PM

Restricted Application added a subscriber: Aklapper. Dec 6 2017, 4:54 PM

leila added a parent task: T171224: [Objective 9.1.1] Article expansion recommendations. Dec 6 2017, 5:00 PM

leila added a subscriber: Cervisiarius.

leila triaged this task as High priority. Dec 15 2017, 9:00 PM

leila renamed this task from Section Alignment across languages to Develop a standalone classifier for section translation (alignment) across languages. Dec 15 2017, 9:02 PM

leila mentioned this in T183039: Gather labels as ground truth for translation and synonym section classifiers. Dec 15 2017, 9:10 PM

diego moved this task from Backlog to In Progress on the Research board. Jan 2 2018, 11:35 PM

• bmansurov subscribed. Jan 16 2018, 5:25 PM

leila mentioned this in T184212: Gather labels as ground truth for section translation. Jan 17 2018, 11:08 PM

@bmansurov please find the candidates here: @stat1005:/home/dsaez/code/alignment/resultsMapping

You should find 6 files named mapping_xx.json, where xx is the source language. The information inside is formatted in the following way:

key1: {"rank": n, "l1": [candidate1, ..., candidate5], ..., "l5": [candidate1, ..., candidate5]}

where:
key1: string, section name in language xx, e.g. 'References'
rank: int, 'popularity' ranking of the section in the source language, where n=1 is the most frequent section, e.g. 45
l1: string -> list, l1 is one target language and the list holds the candidate translations/mappings for key1, e.g. 'es': ['Referencias', 'Notas', ...]
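
For illustration, a minimal Python sketch of reading one of these files, assuming the layout described above. The helper names (load_mapping, candidates_for, sections_by_rank) and the rank values in the sample are made up for this example and are not part of the actual code or data.

```python
import json


def load_mapping(path):
    """Load one mapping_xx.json file into a dict (whole file in memory)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def candidates_for(mapping, section, target_lang):
    """Return the candidate list for a section and target language.

    Returns an empty list if the section or the language is absent.
    """
    entry = mapping.get(section, {})
    return entry.get(target_lang, [])


def sections_by_rank(mapping):
    """Return section names ordered from most to least frequent (rank 1 first)."""
    return sorted(mapping, key=lambda section: mapping[section]["rank"])


# Hypothetical sample in the described layout; ranks are invented.
sample = {
    "Awards": {"rank": 120, "es": ["Premios"]},
    "References": {"rank": 45, "es": ["Referencias", "Notas"]},
}

print(candidates_for(sample, "References", "es"))
print(sections_by_rank(sample))
```

With a real file, load_mapping("mapping_en.json") would take the place of the sample dict.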

@diego would it be possible to generate each entry on a separate line, with the mappings bundled under the key "targets"? Something like this:

{key1: {"rank": n, "targets": {"l1": [candidate1, ..., candidate5], ..., "l5": [candidate1, ..., candidate5]}}}

{key2: {"rank": n, "targets": {"l1": [candidate1, ..., candidate5], ..., "l5": [candidate1, ..., candidate5]}}}

This would let me parse the data without loading all of it into memory.
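
To illustrate why the one-entry-per-line layout helps, here is a rough Python sketch of a streaming reader. It assumes each line is a complete JSON object with a single section key, as in the format requested above. The function name iter_entries and the sample values are hypothetical.

```python
import io
import json


def iter_entries(lines):
    """Yield (section, rank, targets) tuples from an iterable of JSON lines.

    Only one line is parsed at a time, so memory use does not grow with
    the size of the file.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between entries
        record = json.loads(line)
        for section, entry in record.items():
            yield section, entry["rank"], entry["targets"]


# Hypothetical sample; with a real file, pass the open file object instead.
sample_lines = io.StringIO(
    '{"Awards": {"rank": 120, "targets": {"es": ["Premios"]}}}\n'
    "\n"
    '{"References": {"rank": 45, "targets": {"es": ["Referencias", "Notas"]}}}\n'
)

for section, rank, targets in iter_entries(sample_lines):
    print(section, rank, sorted(targets))
```

Because a file object iterates line by line, `with open(path, encoding="utf-8") as f: ... iter_entries(f)` would process the file without holding all entries at once.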

In T182211#3967826, @bmansurov wrote:
@diego would it be possible to generate each entry on a separate line and mappings bundled under the key "targets"? Something like this:
{key1: {"rank": n, "targets": {"l1": [candidate1, ..., candidate5], ..., "l5": [candidate1, ..., candidate5]}}}
{key2: {"rank": n, "targets": {"l1": [candidate1, ..., candidate5], ..., "l5": [candidate1, ..., candidate5]}}}
This would allow me to not load all data in memory in order to parse it.

Served: /home/dsaez/code/alignment/resultsMappingLines

@diego thanks. I also updated the request above, so you must have missed the new format. Please add "targets" to the output.

Another thing I noticed is that some entries are missing mappings. For example, here are some entries in the English file that lack mappings for one or more target languages:

External features is missing ar mappings.
External features is missing ja mappings.
Extracurriculars is missing ar mappings.
Location and features is missing ar mappings.
International honours is missing ja mappings.
Junction list is missing ar mappings.
New books is missing es mappings.
Opinion of the Court is missing ar mappings.
List of representatives is missing ar mappings.
Construction and commissioning is missing ar mappings.
Ladder is missing ar mappings.
Ladder is missing es mappings.
Ladder is missing ja mappings.
Ladder is missing ru mappings.
Assembly segments is missing ar mappings.
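
A check like the one above could be sketched in Python as follows. This is not the script that produced the list. It assumes that a "missing" mapping means the language key is absent or its candidate list is empty, and it accepts entries both with and without the "targets" wrapper. The function names, the expected-language list, and the sample data are all hypothetical.

```python
def target_lists(entry):
    """Return the {language: candidates} part of an entry, for either layout."""
    if "targets" in entry:
        return entry["targets"]
    # Flat layout: every key except "rank" is a target language.
    return {key: value for key, value in entry.items() if key != "rank"}


def missing_report(mapping, expected_langs):
    """List one message per (section, language) pair that has no candidates."""
    lines = []
    for section, entry in mapping.items():
        targets = target_lists(entry)
        for lang in expected_langs:
            if not targets.get(lang):  # key absent or empty list (assumption)
                lines.append(f"{section} is missing {lang} mappings.")
    return lines


# Hypothetical entries; "candidate1" stands in for real section names.
sample = {
    "Ladder": {"rank": 3, "targets": {}},
    "New books": {"rank": 2, "ar": ["candidate1"], "ja": ["candidate1"], "ru": ["candidate1"]},
    "Awards": {"rank": 1, "ar": ["candidate1"], "es": ["candidate1"],
               "ja": ["candidate1"], "ru": ["candidate1"]},
}

for message in missing_report(sample, ["ar", "es", "ja", "ru"]):
    print(message)
```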

Amire80 mentioned this in T188099: Content Translation should suggest automatic translations for common heading names. Feb 23 2018, 1:59 PM

@diego I'm moving this task to Done as you've built the first version of the classifier and we're gathering labels now to improve it. Improvements will be tracked at T190770.

leila moved this task from In Progress to Done (current quarter) on the Research board. Mar 29 2018, 9:02 PM

• DarTar closed this task as Resolved. May 2 2018, 9:54 PM

• DarTar edited projects, added Research-Archive; removed Research.

• DarTar moved this task from Default to Q3-FY18 on the Research-Archive board. May 2 2018, 10:41 PM
