MT: Subsequence extraction and mapping
This is a new algorithm for mapping annotations from the source text
to the translated text when the machine translation engine does not
support HTML translation (for example, Apertium).
It replaces the upper casing algorithm previously used for the same purpose.
A brief explanation of the algorithm is given below.
1. In the text to translate, find the text of inline annotations such as
   bold, italics and links. We call these subsequences.
2. Pass the full text and the subsequences to the plain text machine
   translation engine. Use a delimiter so that the source items (full text
   and subsequences) can be mapped by array index to the translated items.
3. The translated full text will contain the translation of each
   subsequence somewhere in it, though not necessarily word for word.
   To locate a subsequence translation within the full text translation,
   use an approximate search algorithm.
4. The approximate search algorithm returns the start position and the
   length of the match. The annotation from the source HTML is mapped to
   that range.
5. The approximate match involves calculating the edit distance between
   words in the translated full text and words in the translated
   subsequence. The units being searched are not strings but n-grams,
   with n equal to the number of words in the subsequence. Each word in
   the n-gram is matched independently.
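The steps above can be sketched roughly as follows. This is an
illustration in Python, not the code in this commit: the delimiter, the
tolerance value, the whitespace tokenization and all function names are
assumptions made for the example.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical delimiter; the commit message does not say which one is used.
DELIMITER = "\n.\n"


def join_for_translation(full_text: str, subsequences: List[str]) -> str:
    """Step 2: send the full text and its subsequences as one payload."""
    return DELIMITER.join([full_text, *subsequences])


def split_translation(translated: str) -> Tuple[str, List[str]]:
    """Step 2: recover the items by position after translation."""
    parts = translated.split(DELIMITER)
    return parts[0], parts[1:]


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def words_match(a: str, b: str, tolerance: float = 0.34) -> bool:
    """Assumed rule: the edit distance must be small relative to word length."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return levenshtein(a, b) / longest <= tolerance


def find_approximate_match(text: str,
                           subsequence: str) -> Optional[Tuple[int, int]]:
    """Steps 3-5: return (start, length) in characters of the first word
    n-gram in `text` whose words each approximately match the words of
    `subsequence`, or None if there is no such n-gram."""
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\S+", text)]
    target = subsequence.split()
    n = len(target)
    if n == 0:
        return None
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        # Each word of the n-gram is compared independently.
        if all(words_match(word, wanted)
               for (word, _, _), wanted in zip(window, target)):
            start, end = window[0][1], window[-1][2]
            return start, end - start
    return None
```

For example, find_approximate_match("Esta es una casa grande y bonita",
"casas grandes") returns (12, 11), the range covering "casa grande", even
though the subsequence was translated in a slightly different form. The
annotation from the source HTML would then be applied to that range.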
The LinearDoc representation of the source and translated HTML is used
heavily to work with the HTML structure and to apply annotations.
The approximate match algorithm can be tailored per language. Currently there
is only a generic matching implementation. Future commits will introduce
language-specific matching algorithms.
Depending on the capabilities of the machine translation engine, a client can
inherit from MTClient and override any of the following methods:
- translate - if the MT engine supports both HTML and text translation
- translateHTML - if the MT engine supports HTML translation
- translateText - if the MT engine supports plain text translation. All
  machine translation engines do, so this method must be written for every
  MT client. If an MT engine supports HTML translation, it is implied that
  it supports plain text too.
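The override points listed above could look roughly like this. The sketch
is in Python for illustration only; the method names follow this commit
message, but the signatures, the tag-based dispatch in translate, the toy
glossary engine and the simplified fallback (which drops tags instead of
mapping annotations back) are all assumptions.

```python
import re


class MTClient:
    """Hypothetical base class; only the method names come from the commit."""

    def translate(self, source_lang, target_lang, content):
        # Assumed dispatch: anything that contains a tag is treated as HTML.
        if re.search(r"<[A-Za-z/][^>]*>", content):
            return self.translateHTML(source_lang, target_lang, content)
        return self.translateText(source_lang, target_lang, content)

    def translateHTML(self, source_lang, target_lang, html):
        # Placeholder for the fallback path. A real client would translate
        # plain text and map the annotations back; here tags are dropped.
        plain = re.sub(r"<[^>]+>", "", html)
        return self.translateText(source_lang, target_lang, plain)

    def translateText(self, source_lang, target_lang, text):
        raise NotImplementedError("every MT client must implement translateText")


class ToyTextOnlyClient(MTClient):
    """Engine without HTML support: only translateText is overridden."""

    GLOSSARY = {"hello": "hola", "world": "mundo"}  # toy stand-in for an engine

    def translateText(self, source_lang, target_lang, text):
        return re.sub(r"\w+",
                      lambda m: self.GLOSSARY.get(m.group().lower(), m.group()),
                      text)


class ToyHtmlClient(ToyTextOnlyClient):
    """Engine with HTML support: translateHTML is overridden as well."""

    def translateHTML(self, source_lang, target_lang, html):
        # Translate only the text between tags and leave the markup untouched.
        parts = re.split(r"(<[^>]+>)", html)
        return "".join(
            part if part.startswith("<")
            else self.translateText(source_lang, target_lang, part)
            for part in parts)
```

With these toy classes, ToyHtmlClient().translate("en", "es",
"<b>hello</b> world") keeps the markup and returns "<b>hola</b> mundo",
while ToyTextOnlyClient goes through the inherited translateHTML fallback.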
Existing MT unit tests should pass. More tests will follow in follow-up
patches, after some more refactoring.