Unicode characters increase length of highlighting
Closed, Resolved. Public. 2.5 Estimated Story Points.

Description

If an utterance contains non-ASCII Unicode characters, the highlighting extends past the end of the utterance. This is likely caused by how Segmenter calculates endOffset.

Event Timeline

Lokal_Profil set the point value for this task to 4.5.
Sebastian_Berlin-WMSE renamed this task from "Non ASCII characters increase length of highlighting" to "Unicode characters increase length of highlighting". Mar 10 2017, 7:22 AM
Sebastian_Berlin-WMSE updated the task description.

I found that many (all?) string functions have a multibyte version. Switching to these will hopefully be enough.
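To illustrate the suspected cause, here is a minimal sketch in Python rather than the extension's own language. It assumes that the end offset is inclusive and that the bug comes from measuring the utterance in UTF-8 bytes instead of characters, which is what PHP's `strlen` does and what `mb_strlen` avoids. The function names `end_offset_bytes` and `end_offset_chars` are hypothetical and are not taken from Segmenter.

```python
def end_offset_bytes(text: str, start: int = 0) -> int:
    """Inclusive end offset measured in UTF-8 bytes.

    Mimics a byte-counting length function (such as PHP's strlen):
    every non-ASCII character adds extra bytes, so the offset overshoots.
    """
    return start + len(text.encode("utf-8")) - 1


def end_offset_chars(text: str, start: int = 0) -> int:
    """Inclusive end offset measured in characters (code points).

    Mimics a multibyte-aware length function (such as PHP's mb_strlen).
    """
    return start + len(text) - 1


# Swedish example with three non-ASCII characters (one å, two ä),
# written with escapes so the string is unambiguous.
utterance = "Bl\u00e5b\u00e4rssylt \u00e4r gott."

# Each of the three characters takes two bytes in UTF-8, so the
# byte-based offset lands three positions past the real end.
overshoot = end_offset_bytes(utterance) - end_offset_chars(utterance)
```

For a pure ASCII utterance both functions agree, which would explain why the problem only shows up when non-ASCII characters are present.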

This is implemented in a local branch. Review will wait until T158954 ("Use XPath to get text nodes related to utterances") is done. In solving this, a fair bit of the segmenting was rewritten, which also solves the following tasks:

and the following may well be solved (and should be rechecked) when both this and T158954 are done:

Sebastian_Berlin-WMSE changed the point value for this task from 4.5 to 2.5. Mar 22 2017, 9:44 AM

Worked on in Wikispeech (Sprint 2017-03-08):

  • Local implementation.

To do in Wikispeech (Sprint 2017-03-22):

  • Upload patch to gerrit.
  • Review.

Change 344616 had a related patch set uploaded (by Lokal Profil; owner: Sebastian Berlin (WMSE)):
[mediawiki/extensions/Wikispeech@master] Calculate correct offsets for unicode characters

https://gerrit.wikimedia.org/r/344616

Change 344616 had a related patch set uploaded (by Sebastian Berlin (WMSE)):
[mediawiki/extensions/Wikispeech@master] Calculate correct offsets for unicode characters

https://gerrit.wikimedia.org/r/344616

Change 344616 merged by jenkins-bot:
[mediawiki/extensions/Wikispeech@master] Calculate correct offsets for unicode characters

https://gerrit.wikimedia.org/r/344616

Mentioned in SAL (#wikimedia-labs) [2017-03-30T10:15:39Z] <Sebastian-WMSE> Deploy latest from Git master: 64cbd96 (T159545, T159811, T159809)