
Annotation mapping for MT
Closed, ResolvedPublic

Description

Context

Implementation of http://etherpad.wikimedia.org/p/cx-markup-alignment

Implementation plan

  1. The MT backend will expose an interface named TranslateHTML.
  2. Create an annotation mapping module that exposes an interface returning plain-text word sequences for a given HTML source input: the full plain-text version of the HTML, plus the word sequences covered by the HTML's inline annotations.
  3. Pass the plain text and these annotation subsequences to MT. Once the translations are received, pass the input array of sequences and the corresponding plain-text MT output back to the annotation module.
  4. Implement a generic, minimal algorithm that uses only edit distance to find the ranges in the translated plain text that correspond to each tag in the source HTML. Use lineardoc to apply these annotations.
  5. Enhance the above implementation so that the matching step of the algorithm can be overridden in language-specific modules, similar to the language-specific segmentation modules.
  6. Enhance the algorithm to use n-grams so that subsequence matching tolerates word-order changes.
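
Steps 2 and 4 of the plan can be illustrated with a minimal sketch. This is not the code from the patch: the names extract_subsequences, edit_distance and find_range, the INLINE_TAGS set, and the word-window search are all assumptions made for illustration, and Python's standard html.parser stands in for lineardoc. The sketch pulls the plain text and the inline-annotation subsequences out of an HTML fragment, then uses edit distance alone to locate the range in the translated plain text that best matches a translated subsequence.

```python
import re
from html.parser import HTMLParser

# Hypothetical set of inline annotation tags; the real module may differ.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span"}


class _Extractor(HTMLParser):
    """Collects the plain text and the text covered by each inline tag."""

    def __init__(self):
        super().__init__()
        self.parts = []          # text chunks in document order
        self.length = 0          # length of the plain text so far
        self.open_tags = []      # stack of (tag, start offset)
        self.subsequences = []   # (tag, covered text)

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            self.open_tags.append((tag, self.length))

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1][0] == tag:
            _, start = self.open_tags.pop()
            self.subsequences.append((tag, "".join(self.parts)[start:]))

    def handle_data(self, data):
        self.parts.append(data)
        self.length += len(data)


def extract_subsequences(html):
    """Return (plain_text, [(tag, subsequence_text), ...]) for an HTML fragment."""
    parser = _Extractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts), parser.subsequences


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def find_range(translated_text, translated_subsequence):
    """Return (start, end) character offsets of the word window in
    translated_text with the smallest edit distance to the subsequence,
    or None if the text has no words. Windows of n-1, n and n+1 words are
    tried, where n is the number of words in the subsequence."""
    words = list(re.finditer(r"\S+", translated_text))
    target = translated_subsequence.lower()
    n = len(target.split())
    best = None
    for size in range(max(1, n - 1), n + 2):
        for i in range(len(words) - size + 1):
            start, end = words[i].start(), words[i + size - 1].end()
            dist = edit_distance(translated_text[start:end].lower(), target)
            if best is None or dist < best[0]:
                best = (dist, start, end)
    return None if best is None else (best[1], best[2])
```

For example, extract_subsequences('<p>The <b>quick fox</b> jumps</p>') yields the plain text "The quick fox jumps" together with the subsequence ("b", "quick fox"). Each subsequence would be translated separately, and find_range would then locate its translation inside the translated full text so that the tag can be re-applied over that range. Because the comparison is purely character-level edit distance, a sketch like this tolerates small inflection differences but not word reordering, which is what step 6 is meant to address.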

Event Timeline

Arrbee created this task. Nov 28 2014, 6:58 AM
Arrbee assigned this task to santhosh.
Arrbee raised the priority of this task from to High.
Arrbee updated the task description.
Arrbee changed Security from none to None.

Change 175420 had a related patch set uploaded (by Santhosh):
MT: Subsequence extraction and mapping

https://gerrit.wikimedia.org/r/175420

Patch-For-Review

Arrbee moved this task from Backlog to Sprint Backlog on the Language-Team board. Nov 28 2014, 9:06 AM
Arrbee moved this task from In Progress to In Review on the Language-Team board.
Nemo_bis updated the task description. Nov 28 2014, 9:14 AM
Arrbee moved this task from Sprint 79 to Sprint 80 on the ContentTranslation-Release3 board.
Arrbee moved this task from Backlog to In Review on the LE-Sprint-79 board. Nov 28 2014, 12:23 PM

I did a quick comparison of the existing uppercase algorithm and the new subsequence algorithm here: http://etherpad.wikimedia.org/p/cx-markup-alignment-uppercase

You can see that the existing uppercase algorithm sends a huge amount of data to the MT engine.

Two samples:

Number of sentences: 3
Data multiplication by uppercase algorithm: 25 copies of the source text.
Data multiplication by new algorithm: a single version of the source text + 12 subsequences (a single word or a group of 2-3 words).

Number of sentences: 4
Data multiplication by uppercase algorithm: 49 copies of the source text.
Data multiplication by new algorithm: a single version of the source text + 45 subsequences (a single word or a group of 2-3 words).

Besides the large difference in data volume, and the resulting savings in bandwidth and MT time, the output HTML is cleaner with the new algorithm (no duplicate segments). The new algorithm can also map more annotations and is potentially not limited by the nature of the language or script.

The algorithms are documented at https://www.mediawiki.org/wiki/Content_translation/Markup

Change 175420 merged by jenkins-bot:
MT: Subsequence extraction and mapping

https://gerrit.wikimedia.org/r/175420

santhosh closed this task as Resolved. Dec 4 2014, 1:29 PM

We can open a separate task to improve the algorithm for languages we support in the future.

santhosh moved this task from In Review to Done on the LE-Sprint-79 board. Dec 4 2014, 1:29 PM