byteoffset field in action=parse results is misnamed
Open, Needs Triage, Public, BUG REPORT

Description

In action=parse, when getting prop=sections, each section has a byteoffset field. However, that field does not contain a byte offset, but rather a Unicode codepoint offset.

This is extremely misleading! There are two complaints (one by me just now) about this on the action=parse talk page: https://www.mediawiki.org/w/index.php?title=API_talk:Parsing_wikitext&oldid=5499602#%22byteoffset%22_is_a_misnomer

In an ideal world we would rename the field, but I think all we can do now is add a note to the documentation for action=parse.
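To illustrate the mismatch described above, here is a minimal sketch in Python. The wikitext sample is made up, and the assumption that a section offset points at the start of the heading line is mine rather than something confirmed in this task. Python `str` objects are indexed by code point, so the two kinds of offset can be compared directly.

```python
# Hypothetical wikitext with non-ASCII text before the section heading.
wikitext = "讨论页\n== 存档 ==\n内容"
heading = "== 存档 =="

# Python strings are indexed by Unicode code point.
codepoint_offset = wikitext.index(heading)

# The same position measured in UTF-8 bytes.
byte_offset = wikitext.encode("utf-8").index(heading.encode("utf-8"))

# The two values agree only while everything before the heading is ASCII.
print("codepoint offset:", codepoint_offset)
print("byte offset:", byte_offset)
```

With pure-ASCII wikitext both numbers are identical, which is probably why the misnaming goes unnoticed on many wikis.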

Event Timeline

Func subscribed.

Years ago, a friend of mine told me they had no idea how to use this "byteoffset" field while developing a talk page archiving tool for a Chinese site. It turns out the field is measured in codepoints; once you know that, everything makes sense.

A deployed extension also misuses this field of the section data: https://codesearch.wmcloud.org/deployed/?q=byteoffset&repos=mediawiki/extensions/Flow

Maybe we can introduce a correct field named "codepoint" first.
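For tools that consume the existing field, such as an archiving script, a rough sketch of working with codepoint offsets might look like the following. The helper names `codepoint_to_byte_offset` and `split_at_offsets`, and the sample text, are hypothetical and not part of MediaWiki; the offsets are assumed to be codepoint offsets into the page wikitext as discussed in this task.

```python
from typing import List


def codepoint_to_byte_offset(text: str, cp_offset: int) -> int:
    """Convert a codepoint offset into the matching UTF-8 byte offset."""
    return len(text[:cp_offset].encode("utf-8"))


def split_at_offsets(text: str, cp_offsets: List[int]) -> List[str]:
    """Slice text into sections that start at the given codepoint offsets."""
    bounds = sorted(cp_offsets) + [len(text)]
    return [text[start:end] for start, end in zip(bounds, bounds[1:])]


# Hypothetical page: a lead paragraph followed by two sections.
sample = "前言\n== 一 ==\n甲\n== 二 ==\n乙\n"
# Assumed offsets, as they would be reported in codepoints for this text.
offsets = [3, 13]

sections = split_at_offsets(sample, offsets)
byte_offsets = [codepoint_to_byte_offset(sample, o) for o in offsets]
```

Slicing by codepoint works directly here because Python strings are codepoint-indexed. In a language whose strings are byte arrays, the conversion helper (or a multibyte-aware substring function) would be needed before slicing.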

Change 837254 had a related patch set uploaded (by Func; author: Func):

[mediawiki/core@master] Parser: Introduce a "codepoint" field for section data

https://gerrit.wikimedia.org/r/837254

kostajh subscribed.

Thank you for filing this task and thanks also for the patch. Moving to Triaged, as it's not something Growth can work on in Structured Discussions given our current priorities.

Change 885882 had a related patch set uploaded (by C. Scott Ananian; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] SectionMetadata: Rename byteOffset to cpOffset to reflect reality

https://gerrit.wikimedia.org/r/885882

cscott subscribed.

Should be fixed when T328605 is; the field is being properly (re)named in the new SectionMetadata objects which represent the section data internally.

Change 886077 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/extensions/Flow@master] Fix import of non-ASCII sections

https://gerrit.wikimedia.org/r/886077

Change 886077 merged by jenkins-bot:

[mediawiki/extensions/Flow@master] Fix import of non-ASCII sections

https://gerrit.wikimedia.org/r/886077

Change 885882 merged by jenkins-bot:

[mediawiki/services/parsoid@master] SectionMetadata: Rename byteOffset to codepointOffset to reflect reality

https://gerrit.wikimedia.org/r/885882

Change 888790 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.17.0-a15

https://gerrit.wikimedia.org/r/888790

Change 888790 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.17.0-a15

https://gerrit.wikimedia.org/r/888790

Change 837254 merged by jenkins-bot:

[mediawiki/core@master] Parser: Section offsets are in codepoints, not in bytes

https://gerrit.wikimedia.org/r/837254

Change 891594 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] Report Parsoid section offsets in codepoints, not bytes

https://gerrit.wikimedia.org/r/891594

Change 891594 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Report Parsoid section offsets in codepoints, not bytes

https://gerrit.wikimedia.org/r/891594

Change 896413 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@master] Bump parsoid to 0.17.0-a20 and zest-css to 3.0.0

https://gerrit.wikimedia.org/r/896413

Change 896413 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.17.0-a20 and zest-css to 3.0.0

https://gerrit.wikimedia.org/r/896413