MT: Subsequence extraction and mapping

Authored by santhosh.


This is a new algorithm to map annotations from the source text
to the translated text when the machine translation engine does not
support HTML translation (for example, Apertium).

This replaces the uppercasing algorithm previously used for the same purpose.

A brief explanation of the algorithm is given below.

1. For the text to translate, find the text of inline annotations such as
   bold, italics and links. We call these subsequences.
2. Pass the full text and the subsequences to the plain text machine
   translation engine. Use a delimiter so that the source items (full text
   and subsequences) can be mapped, as arrays, to the translated items.
3. The translated full text will contain the translation of each
   subsequence somewhere in it. To locate a subsequence translation in the
   full text translation, use an approximate search algorithm.
4. The approximate search algorithm returns the start position and the
   length of the match. The annotation from the source HTML is mapped to
   that range.
5. The approximate match involves calculating the edit distance between
   words in the translated full text and the translated subsequence. It is
   not strings that are searched, but n-grams with n = number of words in
   the subsequence. Each word in an n-gram is matched independently.
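
The search described in steps 3 to 5 can be sketched roughly as follows. This
is an illustrative Python sketch, not the code in this change: the function
names, the 0.34 tolerance and the choice to return the first matching window
are all assumptions made for the example.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def words_match(a: str, b: str, tolerance: float = 0.34) -> bool:
    """Hypothetical generic matcher: the edit distance must be small
    relative to the longer word. The tolerance value is an assumption."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return edit_distance(a, b) / longest <= tolerance


def find_subsequence(text: str, subsequence: str):
    """Return (start, length) of the first n-gram in `text` whose words
    each approximately match the words of `subsequence`, or None."""
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    target = re.findall(r"\w+", subsequence)
    n = len(target)
    if n == 0 or n > len(tokens):
        return None
    # Slide a window of n words over the text; match word by word.
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(words_match(w[0], t) for w, t in zip(window, target)):
            start, end = window[0][1], window[-1][2]
            return start, end - start
    return None


# The subsequence was translated as "casas grandes", while the full
# text translation contains "casa grande".
match = find_subsequence("Esta es una casa grande y roja", "casas grandes")
```

Here `match` is a character range in the translated full text, which is the
range the source annotation would be applied to.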

The LinearDoc representation of the source and translated HTML is used
heavily to work with the HTML structure and to apply annotations.

The approximate match algorithm can be tailored per language. Currently there
is only a generic matching implementation. Future commits will introduce
language-specific matching algorithms.
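
One possible shape for per-language tailoring is a lookup that falls back to
the generic matcher. The sketch below is hypothetical: the registry, the
function names and the "xx" language code are invented for illustration, and
`difflib.SequenceMatcher` from the Python standard library stands in for an
edit distance comparison.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict

WordMatcher = Callable[[str, str], bool]


def generic_match(a: str, b: str) -> bool:
    # Stand-in for an edit distance check: similarity ratio from difflib.
    # The 0.8 threshold is an assumption for this example.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= 0.8


# Hypothetical registry: language code -> word matcher.
_MATCHERS: Dict[str, WordMatcher] = {}


def register_matcher(language: str, matcher: WordMatcher) -> None:
    _MATCHERS[language] = matcher


def get_matcher(language: str) -> WordMatcher:
    # Languages without a tailored matcher use the generic one.
    return _MATCHERS.get(language, generic_match)


# "xx" is a placeholder language code with a stricter, exact matcher.
register_matcher("xx", lambda a, b: a == b)
```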

Depending on the capabilities of the machine translation engine, clients can
inherit from MTClient and override any of the following methods.

  • translate - if the MT engine supports both HTML and text translation
  • translateHTML - if the MT engine supports HTML translation
  • translateText - if the MT engine supports plain text translation. All machine translation engines do, so this must be written for every MT client. If an MT engine supports HTML translation, it is implied that it supports plain text too.
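
The override pattern could look roughly like this. It is a Python sketch for
illustration only: apart from the names MTClient, translate, translateHTML and
translateText, everything here (the signatures, the `fmt` parameter, the demo
client) is an assumption and not taken from this change.

```python
class MTClient:
    """Hypothetical base class; only the method names come from the text."""

    def translate(self, source_lang, target_lang, content, fmt="html"):
        # Assumed dispatch on the content format.
        if fmt == "text":
            return self.translateText(source_lang, target_lang, content)
        return self.translateHTML(source_lang, target_lang, content)

    def translateHTML(self, source_lang, target_lang, html):
        # For engines without HTML support, the subsequence mapping
        # described earlier would run here; it is left out of this sketch.
        raise NotImplementedError("subsequence mapping not shown")

    def translateText(self, source_lang, target_lang, text):
        raise NotImplementedError("each client implements plain text")


class DictionaryDemoClient(MTClient):
    """Toy plain-text-only client that replaces words from a fixed table."""

    LOOKUP = {"red": "roja", "house": "casa"}

    def translateText(self, source_lang, target_lang, text):
        return " ".join(self.LOOKUP.get(w, w) for w in text.split())


client = DictionaryDemoClient()
result = client.translate("en", "es", "red house", fmt="text")
```

A client for an engine that handles HTML natively would override
translateHTML (or translate) as well, instead of relying on the base class.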

Existing MT unit tests should pass. More tests will follow after further
refactoring in follow-up patches.

Bug: T76189
Change-Id: I5b97362d1bd75f7719eabd85bea19169ef3bc230