The PEG tokenizer emits byte offsets. The rest of the ported code uses mb_* string functions, which work in Unicode character (code point) offsets. JS also uses character offsets, but those are UCS2 offsets, i.e. counts of 16-bit code units.
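To make the three offset units concrete, here is a minimal Python sketch (Python is used purely for illustration; it is not part of the ported codebase, and the helper name `offsets_of` is hypothetical). It measures the same position in a string as a byte offset, a code point offset, and a UCS2 offset.

```python
# Illustrative only: one position in a string, measured three ways.
# The sample contains 'a' (1 UTF-8 byte), U+00E9 (2 bytes),
# U+1D4B3 (4 bytes, a surrogate pair in UTF-16), and 'b'.
text = "a\u00e9\U0001D4B3b"


def offsets_of(s: str, needle: str) -> dict:
    """Return the position of `needle` in `s` in each offset unit."""
    char_offset = s.index(needle)  # Python str indexes by code point
    prefix = s[:char_offset]
    return {
        "byte": len(prefix.encode("utf-8")),           # what a byte-based tokenizer reports
        "char": char_offset,                           # what mb_*-style functions report
        "ucs2": len(prefix.encode("utf-16-le")) // 2,  # what JS string indexing reports
    }


print(offsets_of(text, "b"))  # {'byte': 7, 'char': 3, 'ucs2': 4}
```

The three values agree only for text made up entirely of ASCII, which is why mixing them silently goes wrong on real wikitext.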
One strategy to reconcile these is to use byte offsets everywhere internally in the ported codebase and, at the boundaries where Parsoid offsets meet the world, translate them to JS offsets via the PHPUtils::convertOffsets routine.
However, using byte offsets internally requires at least a minimal audit to identify fixes that might be needed in the codebase around character accesses, strlen, and substring operations, i.e. whether any code depends on substrings returning full characters. If so, byte offsets would have to be aligned at character boundaries (which they are likely to be). Offhand, without doing a real code audit, the only place that might have gotchas with byte offsets is the diffing code in bin/roundtrip-test.js, but given that it is not performance critical, there might be ways to work around that.
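The byte-offset strategy above can be sketched as follows, again in Python for illustration. The function names `is_char_boundary` and `byte_to_ucs2` are hypothetical and do not reflect the actual signature of PHPUtils::convertOffsets; the sketch assumes the input is valid UTF-8.

```python
def is_char_boundary(data: bytes, offset: int) -> bool:
    """True if `offset` does not fall inside a multi-byte UTF-8 sequence."""
    if offset < 0 or offset > len(data):
        return False
    if offset == 0 or offset == len(data):
        return True
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


def byte_to_ucs2(data: bytes, offset: int) -> int:
    """Translate a byte offset into a UCS2 (16-bit code unit) offset."""
    if not is_char_boundary(data, offset):
        raise ValueError(f"byte offset {offset} is not on a character boundary")
    prefix = data[:offset].decode("utf-8")
    return len(prefix.encode("utf-16-le")) // 2


data = "a\u00e9\U0001D4B3b".encode("utf-8")
print(byte_to_ucs2(data, 7))      # 4
print(is_char_boundary(data, 2))  # False: inside the 2-byte sequence for U+00E9
```

A check like `is_char_boundary` is the kind of assertion an audit could add around substring operations, since a slice taken at an unaligned byte offset yields an invalid partial character.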
The alternative strategy is to convert tokenizer offsets to Unicode character offsets and keep using mb_* functionality elsewhere, converting offsets to UCS2 offsets on demand wherever they need to be compared with JS-generated offsets.
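For this second strategy, the on-demand conversion from character offsets to UCS2 offsets is comparatively simple. A minimal sketch, using a hypothetical helper `char_to_ucs2` in Python for illustration: every code point above U+FFFF before the offset occupies two 16-bit units instead of one, so it adds one to the offset.

```python
def char_to_ucs2(s: str, offset: int) -> int:
    """Translate a code point offset into a UCS2 (16-bit code unit) offset."""
    # Code points above U+FFFF are encoded as surrogate pairs in UTF-16,
    # so each one before `offset` shifts the UCS2 offset by one extra unit.
    astral = sum(1 for ch in s[:offset] if ord(ch) > 0xFFFF)
    return offset + astral


sample = "a\u00e9\U0001D4B3b"
print(char_to_ucs2(sample, 3))  # 4
```

The trade-off is that the tokenizer's byte offsets must first be converted to character offsets, which is the more expensive step for long inputs.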