Develop a standalone classifier for section translation (alignment) across languages
Closed, Resolved, Public

Description

The goal is to create a data set of section names aligned across languages: given a section name in a source language, we want a mapping to the equivalent section name in other languages.

Example: Given the section name: 'Awards' (English), we want to map this title to 'Premios' (Spanish) and 'Prêmios' (Portuguese).

Our preliminary studies show that automatic translation (a.k.a. machine translation) is not accurate enough. There are several reasons for this. Conventions change across languages: for example, English Wikipedia usually has one section named "References" and another named "Notes", while French Wikipedia usually uses a single section named "Notes et références". In addition, section names are not translated literally.

Within this task we will study, compare, and combine several approaches, such as the aforementioned machine translation, cross-lingual word embeddings, and heuristics based on Wikidata information.

Event Timeline

leila triaged this task as High priority. Dec 15 2017, 9:00 PM
leila renamed this task from Section Alignment across languages to Develop a standalone classifier for section translation (alignment) across languages. Dec 15 2017, 9:02 PM

@bmansurov please find the candidates here: @stat1005:/home/dsaez/code/alignment/resultsMapping

You should find 6 files named mapping_xx.json, where xx is the source language. The information inside is formatted in the following way:

key1: {"rank": n, "l1": [candidate1, ..., candidate5], ..., "l5": [candidate1, ..., candidate5]}

where:
key1: string, section name in lang xx, e.g. 'References'
rank: int, section 'popularity' ranking in the source language, where n=1 is the most frequent section, e.g. 45
l1: string -> list, where l1 is one target language and the list holds the candidate translations/mappings for key1, e.g. 'es': ['Referencias', 'Notas', ...]
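
A minimal sketch of reading this layout in Python, assuming each mapping_xx.json file is a single JSON object keyed by section name. The helper name `candidates` is hypothetical, and the sample string only reuses the example values given above; it is not real output.

```python
import json

# Sample that mirrors the layout described above, built from the
# example values in this comment (not real data).
SAMPLE = '{"References": {"rank": 45, "es": ["Referencias", "Notas"]}}'


def candidates(mapping, section, lang):
    """Return the candidate list for `section` in target language `lang`.

    Returns an empty list when the section or the language is absent.
    """
    entry = mapping.get(section, {})
    return entry.get(lang, [])


mapping = json.loads(SAMPLE)
print(candidates(mapping, "References", "es"))
print(mapping["References"]["rank"])
```

Note that in this layout "rank" sits next to the language codes, so code that iterates over an entry has to skip that key explicitly.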

@diego would it be possible to generate each entry on a separate line and mappings bundled under the key "targets"? Something like this:

{key1: {"rank": n, "targets": {"l1": [candidate1, ..., candidate5], ..., "l5": [candidate1, ..., candidate5]}}}
{key2: {"rank": n, "targets": {"l1": [candidate1, ..., candidate5], ..., "l5": [candidate1, ..., candidate5]}}}

This would let me parse the data without loading all of it into memory.
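
To illustrate why the one-entry-per-line layout helps, here is a rough sketch of a streaming reader, assuming every line is a standalone JSON object with a single section key. The function name `iter_entries` and the sample lines (including the rank values) are hypothetical; an in-memory buffer stands in for the real file.

```python
import io
import json


def iter_entries(lines):
    """Yield (section, rank, targets) for one line at a time.

    Only the current line is parsed, so the whole file never has to
    be held in memory.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        for section, info in record.items():
            yield section, info["rank"], info.get("targets", {})


# Illustrative input only; a real run would pass an open file object.
sample = io.StringIO(
    '{"References": {"rank": 1, "targets": {"es": ["Referencias", "Notas"]}}}\n'
    '{"Awards": {"rank": 2, "targets": {"es": ["Premios"]}}}\n'
)
entries = list(iter_entries(sample))
for section, rank, targets in entries:
    print(section, rank, targets)
```

With a real file, `with open(path, encoding="utf-8") as f:` followed by `for entry in iter_entries(f):` would process it lazily.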

Served: /home/dsaez/code/alignment/resultsMappingLines

@diego thanks. I also updated the request above, so you must have missed the new format. Please add "targets" to the output.

Another thing I noticed is that some entries are missing mappings. For example, here are some of the gaps in the English file:

External features is missing ar mappings.
External features is missing ja mappings.
Extracurriculars is missing ar mappings.
Location and features is missing ar mappings.
International honours is missing ja mappings.
Junction list is missing ar mappings.
New books is missing es mappings.
Opinion of the Court is missing ar mappings.
List of representatives is missing ar mappings.
Construction and commissioning is missing ar mappings.
Ladder is missing ar mappings.
Ladder is missing es mappings.
Ladder is missing ja mappings.
Ladder is missing ru mappings.
Assembly segments is missing ar mappings.
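
A check along these lines can produce a report like the one above. This is only a sketch: the function name `missing_mappings` is hypothetical, the expected language list is limited to the codes that appear in the report (the full set of target languages is an assumption), and the sample entries use placeholder candidates rather than real data.

```python
def missing_mappings(entries, expected_langs):
    """Return (section, lang) pairs whose candidate list is absent or empty.

    `entries` maps a section name to its {lang: [candidates]} dict.
    """
    missing = []
    for section, targets in entries.items():
        for lang in expected_langs:
            if not targets.get(lang):
                missing.append((section, lang))
    return missing


# Assumption: only the language codes seen in the report above.
EXPECTED = ["ar", "es", "ja", "ru"]

# Placeholder entries shaped like the report; candidates are dummies.
sample = {
    "Ladder": {},
    "New books": {"ar": ["candidate1"], "ja": ["candidate1"], "ru": ["candidate1"]},
}

report = missing_mappings(sample, EXPECTED)
for section, lang in report:
    print(f"{section} is missing {lang} mappings.")
```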

@diego I'm moving this task to Done, as you've built the first version of the classifier and we're now gathering labels to improve it. Improvements will be tracked in T190770.

DarTar edited projects, added Research-Archive; removed Research.