Map tokens from TTS responses to HTML
Added a mapping between the tokens received from the TTS server and the
"words" in the HTML. Tokens are stored in the utterance elements and are
assigned a position attribute, which is the index of the start of the
corresponding HTML substring. Removed HTML tags are stored in the tokens
element.
Also removed segmenting by newlines, since newlines in HTML source do not
reliably mark sentence or paragraph boundaries.
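
For illustration only, the token-to-HTML mapping described above could be
sketched roughly as follows. This is not the code in this change: the
function names (strip_tags, map_tokens), the dictionary keys, and the
regex-based tag stripping are all assumptions made for the example, and it
assumes each token appears verbatim in the tag-stripped text.

```python
import re

# Hypothetical, simplified tag matcher; real HTML needs a proper parser.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(html):
    """Return (text, offsets, removed_tags).

    offsets[i] is the index in `html` of the character text[i].
    removed_tags records each stripped tag and where it started.
    """
    text_chars = []
    offsets = []
    removed_tags = []
    cursor = 0
    for match in TAG_RE.finditer(html):
        # Copy the plain text that precedes this tag.
        for i in range(cursor, match.start()):
            text_chars.append(html[i])
            offsets.append(i)
        removed_tags.append({"tag": match.group(0), "position": match.start()})
        cursor = match.end()
    # Copy whatever text follows the last tag.
    for i in range(cursor, len(html)):
        text_chars.append(html[i])
        offsets.append(i)
    return "".join(text_chars), offsets, removed_tags


def map_tokens(html, tokens):
    """Assign each token the HTML index where its text starts."""
    text, offsets, removed_tags = strip_tags(html)
    mapped = []
    search_from = 0
    for token in tokens:
        start = text.find(token, search_from) if token else -1
        if start == -1:
            # Token not found in the remaining text; leave it unmapped.
            mapped.append({"token": token, "position": None})
            continue
        mapped.append({"token": token, "position": offsets[start]})
        search_from = start + len(token)
    return mapped, removed_tags
```

Tokens are searched for in order, so a repeated word maps to its next
occurrence rather than its first. A token that cannot be found gets a
position of None instead of raising an error.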
Bug: T140105
Change-Id: Ie784328fa3d7bcf7941b6b89146687272fe3b0ca