Maniphest T219069

Reconcile byte offsets coming from Tokenizer with unicode char offsets used by rest of ported code
Closed, Resolved, Public

Assigned To: cscott
Authored By: ssastry, Mar 23 2019, 2:08 PM

Description

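The PEG tokenizer emits byte offsets. The rest of the ported code uses mb_* string functions, which work in Unicode char (code point) offsets. JS also uses char offsets, but those are UCS2 offsets: they count 16-bit code units, so a character outside the BMP counts as two.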
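
To make the mismatch concrete, here is a small Python sketch that reports the position of the same character in all three units. The function name `offsets_of` and the sample string are illustrative only and do not come from the Parsoid codebase.

```python
def offsets_of(text: str, needle: str) -> dict:
    """Return the position of `needle` in `text` in three offset units."""
    cp = text.index(needle)          # Python str indices are code point offsets
    prefix = text[:cp]
    return {
        "byte": len(prefix.encode("utf-8")),           # UTF-8 byte offset
        "codepoint": cp,                               # Unicode code point offset
        "ucs2": len(prefix.encode("utf-16-le")) // 2,  # 16-bit code unit offset
    }


# "a", EURO SIGN (3 bytes in UTF-8), an emoji outside the BMP (4 bytes), "b"
sample = "a\u20ac\U0001F600b"
print(offsets_of(sample, "b"))
# → {'byte': 8, 'codepoint': 3, 'ucs2': 4}
```

The three numbers agree only for pure ASCII text; any multi-byte character before the position makes them diverge.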

Strategy 1
One strategy is to use byte offsets everywhere internally in the ported codebase and, at the boundaries where Parsoid offsets meet the world, translate them to JS offsets via the PHPUtils::convertOffsets routine.
Using byte offsets internally requires at least a minimal audit of the codebase to find code around char accesses, strlen, and substring operations that might need fixing, i.e. any code that depends on substrings returning full chars. Where it does, byte offsets have to be aligned at char boundaries (which they are likely to be). Offhand, without a real code audit, the only place likely to have gotchas with byte offsets is the diffing code in bin/roundtrip-test.js. Since that code is not performance critical, there are probably ways to work around the problem.
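
A rough Python sketch of the two pieces Strategy 1 relies on: a check that a byte offset sits on a char boundary, and a single-pass translation of byte offsets to UCS2 offsets at the boundary. This is not the PHPUtils::convertOffsets implementation; the function names `is_char_boundary` and `bytes_to_ucs2_offsets` are hypothetical, and the input is assumed to be valid UTF-8.

```python
def is_char_boundary(data: bytes, offset: int) -> bool:
    """True if `offset` does not fall inside a multi-byte UTF-8 sequence."""
    if offset < 0 or offset > len(data):
        return False
    if offset == 0 or offset == len(data):
        return True
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


def bytes_to_ucs2_offsets(data: bytes, byte_offsets: list[int]) -> list[int]:
    """Map byte offsets to 16-bit code unit offsets in one pass over `data`.

    Assumes `data` is valid UTF-8 (hypothetical helper, for illustration).
    """
    converted = {}
    pos = 0      # current byte position, always on a char boundary
    units = 0    # 16-bit code units seen before `pos`
    for target in sorted(set(byte_offsets)):
        if not is_char_boundary(data, target):
            raise ValueError(f"byte offset {target} is not on a char boundary")
        while pos < target:
            lead = data[pos]
            if lead < 0x80:
                width = 1
            elif lead < 0xE0:
                width = 2
            elif lead < 0xF0:
                width = 3
            else:
                width = 4
            # 4-byte sequences encode non-BMP chars: a surrogate pair in UTF-16.
            units += 2 if width == 4 else 1
            pos += width
        converted[target] = units
    return [converted[offset] for offset in byte_offsets]


data = "a\u20ac\U0001F600b".encode("utf-8")
print(bytes_to_ucs2_offsets(data, [8, 0, 1, 4]))
# → [4, 0, 1, 2]
```

Sorting the offsets first lets one scan of the string serve all of them, so the conversion cost is paid once per document rather than once per offset.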

Strategy 2
Convert tokenizer offsets to Unicode char offsets and use mb_* functions elsewhere, converting to UCS2 offsets on demand wherever they need to be compared with JS-generated offsets.
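
For comparison, a Python sketch of the two conversions Strategy 2 would need. Both helper names are hypothetical and nothing here is taken from the Parsoid code.

```python
def byte_to_codepoint_offset(data: bytes, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset to a code point offset.

    Decoding raises UnicodeDecodeError if the offset splits a character.
    """
    return len(data[:byte_offset].decode("utf-8"))


def codepoint_to_ucs2_offset(text: str, cp_offset: int) -> int:
    """Convert a code point offset to a 16-bit code unit offset.

    Every code point above U+FFFF takes two code units in UTF-16.
    """
    return cp_offset + sum(1 for ch in text[:cp_offset] if ord(ch) > 0xFFFF)


text = "a\u20ac\U0001F600b"
data = text.encode("utf-8")
cp = byte_to_codepoint_offset(data, 8)
print(cp, codepoint_to_ucs2_offset(text, cp))
# → 3 4
```

Note that each helper scans the whole prefix before the offset, so every individual conversion is linear in the offset.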

Details

	Subject	Repo	Branch	Lines +/-
	Convert TSR/DSR to UTF-8 byte indices	mediawiki/services/parsoid	master	+313 -100

Related Objects

Mentioned In: T219072: Extend JS/PHP hybrid testing to other Parsoid components

Event Timeline

ssastry created this task. Mar 23 2019, 2:08 PM

Restricted Application added a subscriber: Aklapper. Mar 23 2019, 2:08 PM

ssastry renamed this task from "Reconcile byte offsets coming from Tokenizer with unicode char offsets used by rest of porte code" to "Reconcile byte offsets coming from Tokenizer with unicode char offsets used by rest of ported code". Mar 23 2019, 2:09 PM

ssastry triaged this task as High priority.

ssastry moved this task from Backlog to Porting on the Parsoid-PHP board. Mar 23 2019, 2:16 PM

ssastry updated the task description. Mar 25 2019, 1:45 PM

ssastry mentioned this in T219072: Extend JS/PHP hybrid testing to other Parsoid components. Mar 25 2019, 3:04 PM

We definitely want strategy 1, at least long-term. The mb_* functions all take O(length of string) time, since you need to scan the string from the beginning in order to count codepoints.

ssastry assigned this task to cscott. Apr 9 2019, 10:23 PM

Change 502868 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Convert TSR/DSR to UTF-8 byte indices

https://gerrit.wikimedia.org/r/502868

gerritbot added a project: Patch-For-Review. Apr 10 2019, 7:27 PM

Pastakhov subscribed. Apr 15 2019, 8:32 PM

Change 502868 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Convert TSR/DSR to UTF-8 byte indices

https://gerrit.wikimedia.org/r/502868

ssastry closed this task as Resolved. Jul 4 2019, 3:26 PM

Maintenance_bot removed a project: Patch-For-Review. Jul 4 2019, 4:10 PM
