In order to highlight the text being recited (T122158), skip by token (T140089, T133687), etc., the time-stamped token information returned by the TTS must be mapped to the HTML on the page.
The HTML is passed through the Cleaner before it reaches the TTS, so the text the TTS receives does not match the page HTML one to one; hence the need for a mapping.
Expected result:
Map the tokens in the TTS response to the words in the HTML.
Ideas:
- Add markup for elements removed by the Cleaner, and make the TTS ignore these during audio generation but keep them in the response.
- Make the Cleaner add a marker to the page HTML for any skipped elements. These can then be ignored when mapping the tokens sequentially.
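
A minimal sketch of the second idea, sequential mapping that ignores marked regions. Everything here is an assumption made for illustration: the `Token` fields, the `map_tokens` function, and the representation of the Cleaner's markers as `(start, end)` character ranges are hypothetical and do not reflect the actual TTS response format or Cleaner output.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    text: str    # word as returned by the TTS (hypothetical field name)
    end_ms: int  # time stamp at which the word ends in the audio (hypothetical)


def map_tokens(
    page_text: str,
    tokens: List[Token],
    skipped: List[Tuple[int, int]],
) -> List[Tuple[int, int]]:
    """Return the (start, end) offsets in page_text for each token, in order.

    `skipped` holds the character ranges the Cleaner is assumed to have
    marked as removed; matches that overlap such a range are ignored.
    """
    positions = []
    cursor = 0
    for token in tokens:
        while True:
            start = page_text.find(token.text, cursor)
            if start == -1:
                raise ValueError(f"token {token.text!r} not found after offset {cursor}")
            end = start + len(token.text)
            # Does this match overlap a region the Cleaner skipped?
            overlap = next(
                (s for s in skipped if start < s[1] and end > s[0]), None
            )
            if overlap is None:
                break
            # Resume the search just past the skipped region.
            cursor = overlap[1]
        positions.append((start, end))
        cursor = end
    return positions


# "[two]" stands in for an element removed by the Cleaner.
page_text = "One [two] two three"
skipped = [(4, 9)]
tokens = [Token("One", 300), Token("two", 600), Token("three", 1000)]
positions = map_tokens(page_text, tokens, skipped)
# positions == [(0, 3), (10, 13), (14, 19)]
```

Without the skipped range, "two" would be matched inside the removed element at offset 5. Because the mapping is sequential, each token is searched for only after the end of the previous match, so repeated words map to successive occurrences.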