Reconcile byte offsets coming from Tokenizer with unicode char offsets used by rest of ported code
Closed, Resolved, Public

Description

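The PEG tokenizer emits byte offsets. The rest of the ported code uses mb_* string functions, which work in Unicode character (code point) offsets. JS also uses character offsets, but those are UCS2 (16-bit code unit) offsets, so the three conventions can disagree on the same string.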
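
To make the mismatch concrete, here is a minimal sketch in Python (not Parsoid code) that computes where the same substring starts under each of the three conventions. The sample string and variable names are made up for illustration.

```python
# Illustrative string: one 2-byte character (U+00EF) and one character
# outside the Basic Multilingual Plane (U+1F600, 4 bytes in UTF-8).
text = "na\u00efve \U0001F600 wiki"
target = "wiki"

# Unicode code point offset: the unit PHP's mb_* functions count in.
codepoint_offset = text.index(target)

# UTF-8 byte offset: the unit the tokenizer emits, per the description.
byte_offset = len(text[:codepoint_offset].encode("utf-8"))

# 16-bit code unit offset: the unit JS string indices count in.
utf16_offset = len(text[:codepoint_offset].encode("utf-16-le")) // 2

print(codepoint_offset, byte_offset, utf16_offset)  # prints: 8 12 9
```

All three values agree only for pure-ASCII text, which is why the mismatch is easy to miss in simple test cases.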

Strategy 1
One strategy is to use byte offsets everywhere internally in the ported codebase and, at the boundaries where Parsoid offsets meet the outside world, translate them to JS offsets via the PHPUtils::convertOffsets routine.
However, using byte offsets internally requires at least a minimal audit of the codebase to identify fixes needed around character accesses, strlen, and substring operations, i.e. whether any code depends on substrings returning whole characters. If so, byte offsets would have to be aligned at character boundaries (which they are likely to be). Offhand, without a real code audit, the only place likely to have gotchas with byte offsets is the diffing code in bin/roundtrip-test.js, but since it is not performance critical, there may be ways to work around that.
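
The boundary translation in Strategy 1 could look roughly like the following Python sketch. It is not the PHPUtils::convertOffsets implementation; the function name `byte_to_utf16_offset` is hypothetical, and the alignment check illustrates the "aligned at char boundaries" concern raised above.

```python
def byte_to_utf16_offset(text: str, byte_offset: int) -> int:
    """Translate a UTF-8 byte offset into a 16-bit code unit offset.

    Hypothetical helper for illustration only.
    """
    data = text.encode("utf-8")
    if not 0 <= byte_offset <= len(data):
        raise ValueError("byte offset out of range")
    # Bytes of the form 0b10xxxxxx are UTF-8 continuation bytes, so an
    # offset pointing at one falls in the middle of a character.
    if byte_offset < len(data) and (data[byte_offset] & 0xC0) == 0x80:
        raise ValueError("byte offset is not on a character boundary")
    prefix = data[:byte_offset].decode("utf-8")
    # Each UTF-16 code unit is 2 bytes in the utf-16-le encoding.
    return len(prefix.encode("utf-16-le")) // 2


sample = "a\u00e9\U0001F600b"  # 1 + 2 + 4 + 1 = 8 bytes in UTF-8
print(byte_to_utf16_offset(sample, 7))  # prints: 4
```

This sketch re-encodes the prefix on every call, so it costs time proportional to the offset. That seems acceptable if conversion only happens at the boundaries, but it would be a poor fit for hot internal paths.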

Strategy 2
Convert tokenizer offsets to Unicode character offsets and use mb_* functions elsewhere, converting to UCS2 offsets on demand wherever they need to be compared with JS-generated offsets.
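
For comparison, Strategy 2 involves two conversions, sketched below in Python. Both function names are hypothetical and neither reflects actual Parsoid code: the first runs once on tokenizer output, the second runs on demand when comparing against JS offsets.

```python
def byte_to_codepoint_offset(text: str, byte_offset: int) -> int:
    """Convert a tokenizer-style UTF-8 byte offset to a code point offset."""
    # Decoding raises UnicodeDecodeError if the offset splits a character.
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))


def codepoint_to_utf16_offset(text: str, codepoint_offset: int) -> int:
    """Convert a code point offset to a 16-bit code unit offset."""
    # Code points above U+FFFF need a surrogate pair, i.e. one extra unit.
    extra = sum(1 for ch in text[:codepoint_offset] if ord(ch) > 0xFFFF)
    return codepoint_offset + extra


sample = "a\u00e9\U0001F600b"
cp = byte_to_codepoint_offset(sample, 7)
print(cp, codepoint_to_utf16_offset(sample, cp))  # prints: 3 4
```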

Event Timeline

ssastry renamed this task from Reconcile byte offsets coming from Tokenizer with unicode char offsets used by rest of porte code to Reconcile byte offsets coming from Tokenizer with unicode char offsets used by rest of ported code. Mar 23 2019, 2:09 PM
ssastry triaged this task as High priority.

We definitely want strategy 1, at least long-term. The mb_* functions all take O(length of string) time, since you need to scan the string from the beginning in order to count codepoints.

Change 502868 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Convert TSR/DSR to UTF-8 byte indices

https://gerrit.wikimedia.org/r/502868

Change 502868 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Convert TSR/DSR to UTF-8 byte indices

https://gerrit.wikimedia.org/r/502868