Implementation of http://etherpad.wikimedia.org/p/cx-markup-alignment
- The MT backend will provide an interface named TranslateHTML
- Create an annotation mapping module. It exposes an interface that provides plain-text word sequences for a given HTML source input: the full plain-text version of the HTML, plus the word sequences covered by the inline annotations of the HTML.
- Pass these annotation subsequences to MT. Once the translations are received, pass the input array of sequences and the plain-text MT sequences back to the annotation module
- Implement a generic, minimal algorithm that uses only edit distance to find the ranges in the translated plain text from MT that correspond to each tag in the source HTML. Use lineardoc to apply these annotations
- Enhance the above implementation so that the matching step of the algorithm can be overridden in language-specific modules, similar to the segmentation language modules
- Enhance the algorithm to use n-grams to support word order changes in subsequence matching
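
The edit-distance matching step described above could look roughly like the sketch below. It is only an illustration under assumptions, not the actual implementation: the function names `levenshtein` and `find_range` are hypothetical, tokenization is plain whitespace splitting, and the search is a brute-force scan over word windows whose length is within one word of the translated subsequence.

```python
from typing import List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def find_range(target_words: List[str],
               sub_words: List[str]) -> Optional[Tuple[int, int]]:
    """Hypothetical helper: find the word range (start, end), end exclusive,
    in the translated full text that is closest by edit distance to the
    separately translated annotation subsequence."""
    needle = " ".join(sub_words).lower()
    n = len(sub_words)
    best: Optional[Tuple[int, int]] = None
    best_dist: Optional[int] = None
    # Assumption: the matching window may differ by one word in length.
    for length in (n, n - 1, n + 1):
        if length < 1 or length > len(target_words):
            continue
        for start in range(len(target_words) - length + 1):
            window = " ".join(target_words[start:start + length]).lower()
            dist = levenshtein(window, needle)
            if best_dist is None or dist < best_dist:
                best_dist, best = dist, (start, start + length)
    return best


# The full translated sentence, and the annotation text translated alone.
translated = "El gato negro duerme en la casa".split()
annotation = "gatos negros".split()
matched = find_range(translated, annotation)
```

Here the annotation text translated in isolation ("gatos negros") differs slightly from how it appears in the full translation ("gato negro"), and the closest window is still found. A scan like this assumes the subsequence stays contiguous in the translation; the n-gram enhancement in the last item is meant to handle word order changes that this approach would miss.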