Deprecate prop=sections in action=parse
Open, Needs Triage | Public | BUG REPORT

Description

In action=parse, when getting prop=sections, each section has a byteoffset field. However, that field does not contain a byte offset, but rather a Unicode codepoint offset.
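To illustrate the difference, here is a minimal sketch in Python (whose strings are indexed by codepoint). The helper name `codepoint_to_byte_offset` and the sample wikitext are made up for illustration, and UTF-8 storage is assumed; this is not code from MediaWiki itself.

```python
def codepoint_to_byte_offset(text: str, cp_offset: int) -> int:
    """Convert a codepoint offset into the equivalent UTF-8 byte offset.

    Hypothetical helper for illustration; assumes the text is stored as UTF-8.
    """
    if not 0 <= cp_offset <= len(text):
        raise ValueError("offset out of range")
    # Encode only the prefix: its encoded length is the byte offset.
    return len(text[:cp_offset].encode("utf-8"))


# Sample wikitext with non-ASCII headings (each CJK character here is
# one codepoint but three bytes in UTF-8).
wikitext = "== 標題 ==\n内容\n== Next ==\n"

cp = wikitext.index("== Next")                   # codepoint offset of the 2nd heading
byte = codepoint_to_byte_offset(wikitext, cp)    # the real byte offset

print(cp, byte)  # 12 20
```

For pure-ASCII pages the two numbers coincide, which is probably why the misnaming is easy to miss until non-ASCII content appears.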

This is extremely misleading! There are two complaints (one by me just now) about this on the action=parse talk page: https://www.mediawiki.org/w/index.php?title=API_talk:Parsing_wikitext&oldid=5499602#%22byteoffset%22_is_a_misnomer

In an ideal world we would rename the field, but I think all we can do now is add a note to the documentation for action=parse.

@cscott: The correct information is available via TOCData; we should export that and deprecate sections.

Event Timeline

Func subscribed.

A friend of mine told me years ago that they had no idea how to use this "byteoffset" field when developing a talk archiving tool for a Chinese site. It turns out the field is measured in codepoints, and with that in mind everything makes sense.
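For a tool like that, the practical consequence is that the offsets must be applied to a codepoint-indexed string, not to raw bytes. A rough sketch, assuming each section entry is a dict carrying the "byteoffset" value and that entries with a missing or `None` offset should be skipped; the function name `split_sections` and the sample data are hypothetical:

```python
def split_sections(wikitext: str, sections: list) -> list:
    """Cut wikitext into chunks at each section's offset.

    Treats "byteoffset" as a codepoint offset, which is what it really
    contains. Python's str slicing is codepoint-based, so the values can
    be used directly as indices.
    """
    offsets = sorted(
        s["byteoffset"] for s in sections if s.get("byteoffset") is not None
    )
    bounds = offsets + [len(wikitext)]
    return [wikitext[start:end] for start, end in zip(bounds, bounds[1:])]


wikitext = "== 一 ==\nfoo\n== 二 ==\nbar\n"
sections = [{"byteoffset": 0}, {"byteoffset": 12}]  # codepoint offsets

for chunk in split_sections(wikitext, sections):
    print(repr(chunk))
```

In a language whose string functions count bytes, the same offsets would first have to be converted, or a codepoint-aware substring function used instead.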

And a deployed extension also misused this field of section data: https://codesearch.wmcloud.org/deployed/?q=byteoffset&repos=mediawiki/extensions/Flow

Maybe we can introduce a correct field named "codepoint" first.

Change 837254 had a related patch set uploaded (by Func; author: Func):

[mediawiki/core@master] Parser: Introduce a "codepoint" field for section data

https://gerrit.wikimedia.org/r/837254

kostajh subscribed.

Thank you for filing this task and thanks also for the patch. Moving to Triaged, as it's not something Growth can work on in Structured Discussions given our current priorities.

Change 885882 had a related patch set uploaded (by C. Scott Ananian; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] SectionMetadata: Rename byteOffset to cpOffset to reflect reality

https://gerrit.wikimedia.org/r/885882

cscott subscribed.

Should be fixed when T328605 is; the field is being properly (re)named in the new SectionMetadata objects which represent the section data internally.

Change 886077 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/extensions/Flow@master] Fix import of non-ASCII sections

https://gerrit.wikimedia.org/r/886077

Change 886077 merged by jenkins-bot:

[mediawiki/extensions/Flow@master] Fix import of non-ASCII sections

https://gerrit.wikimedia.org/r/886077

Change 885882 merged by jenkins-bot:

[mediawiki/services/parsoid@master] SectionMetadata: Rename byteOffset to codepointOffset to reflect reality

https://gerrit.wikimedia.org/r/885882

Change 888790 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.17.0-a15

https://gerrit.wikimedia.org/r/888790

Change 888790 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.17.0-a15

https://gerrit.wikimedia.org/r/888790

Change 837254 merged by jenkins-bot:

[mediawiki/core@master] Parser: Section offsets are in codepoints, not in bytes

https://gerrit.wikimedia.org/r/837254

Change 891594 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] Report Parsoid section offsets in codepoints, not bytes

https://gerrit.wikimedia.org/r/891594

Change 891594 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Report Parsoid section offsets in codepoints, not bytes

https://gerrit.wikimedia.org/r/891594

Change 896413 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@master] Bump parsoid to 0.17.0-a20 and zest-css to 3.0.0

https://gerrit.wikimedia.org/r/896413

Change 896413 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.17.0-a20 and zest-css to 3.0.0

https://gerrit.wikimedia.org/r/896413

BPirkle subscribed.

Older issue, not sure if it is still relevant. Looks more like Content Transform than MediaWiki Interfaces, so tagging them and untagging us. Please retag if I'm wrong about that.

cscott renamed this task from "byteoffset field in action=parse results is misnamed" to "Deprecate prop=sections in action=parse". Oct 30 2025, 2:22 PM
cscott updated the task description.
cscott removed a project: MediaWiki-Parser.
cscott moved this task from Backlog to Later on the Content-Transform-Team board.
cscott moved this task from Unsorted to In Dev on the MediaWiki-Action-API board.

Change #1203071 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] ApiParse: deprecate prop=sections

https://gerrit.wikimedia.org/r/1203071

Change #1203150 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] ApiParse: remove deprecated 'prop=sections' from default properties

https://gerrit.wikimedia.org/r/1203150

Change #1203071 merged by jenkins-bot:

[mediawiki/core@master] ApiParse: deprecate prop=sections

https://gerrit.wikimedia.org/r/1203071

Noting this here because it took me some time to notice: toclevel is spelled tocLevel in the new API.

You forgot to announce this to mediawiki-api-announce.