
byteoffset of action=parse is broken when manually specifying headers using <h1> syntax
Closed, Resolved (Public)

Description

The byteoffsets reported for the page [[de:Wikipedia:Testseite]] are all identical and point to the end of the wiki page (see URL).

That is useless. Is there a way to get the right byteoffsets?

Thanks.


Version: unspecified
Severity: minor
URL: http://de.wikipedia.org/w/api.php?action=parse&page=Wikipedia:Testseite&prop=sections

Details

Reference
bz25203

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 11:21 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz25203.
bzimport added a subscriber: Unknown Object (MLST).

It appears to stop giving correct byte offsets after encountering the first header made using <h1> (or <h2>, <h3>, etc.) syntax instead of the normal ==header== syntax.

Yes, it seems to have trouble with the fact that the page uses an <h2> followed by a =header=.

Changing component to Page Rendering/Parsing

The API isn't at fault here; it's only displaying what the parser output says is there.

Anomie added a comment. Jan 2 2013, 4:31 PM

It is sort of an API issue, in that this feature in the parser seems to have been added solely to support returning this information in the API.

Odd that "byteoffset" is actually the offset in Unicode codepoints.
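To illustrate the difference noted above, here is a minimal Python sketch (not MediaWiki code) that computes both a code point offset and a UTF-8 byte offset for a heading. The helper name `offsets` and the sample wikitext are made up for illustration.

```python
def offsets(text: str, needle: str) -> tuple[int, int]:
    """Return (code point offset, UTF-8 byte offset) of needle in text."""
    cp = text.index(needle)                 # Python str indexes by code point
    byte = len(text[:cp].encode("utf-8"))   # bytes needed for the prefix
    return cp, byte


# Hypothetical sample: two non-ASCII characters precede the heading.
wikitext = "Größe\n== Überschrift ==\n"
cp_offset, byte_offset = offsets(wikitext, "== Überschrift ==")
print(cp_offset, byte_offset)
```

The two values differ whenever non-ASCII characters precede the heading, so a consumer that slices raw UTF-8 bytes using a code point offset would land in the wrong place.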

The problem is that the code pulls out all the <h#> tags from the parsed HTML, but uses the parsed-to-DOM representation of the original wikitext to try to calculate the byteoffset. This parsed-to-DOM representation, however, doesn't include DOM structure for any raw <h#> tags from the original wikitext, so when the code tries to find the DOM node for one of those, it searches to the end of the wikitext without finding it. That also throws off the offsets for all subsequent headers.

Roan, it looks like you added this back in 2009; any ideas here? Otherwise I'll just put together a patch that skips trying to calculate byteoffset when $sectionIndex === false.
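The failure mode and the proposed skip can be sketched with a toy model in Python. This is not the parser's actual code: `Node`, `section_offsets`, and the sample data are hypothetical, and `None` stands in for `$sectionIndex === false`. The assumption is a forward-only scan over the wikitext DOM that accumulates an offset until it reaches the node for the requested heading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    text: str
    heading_index: Optional[int] = None  # set only for ==heading== nodes


def section_offsets(nodes, toc, skip_raw):
    """toc holds section indexes taken from the HTML; None marks a raw <h#>."""
    result = []
    pos = 0      # next DOM node to examine (the scan never rewinds)
    offset = 0   # code points consumed so far
    for index in toc:
        if index is None and skip_raw:
            result.append(None)  # proposed behaviour: report no offset
            continue
        # A raw <h#> has no heading node, so this loop runs to the end.
        while pos < len(nodes) and not (
            index is not None and nodes[pos].heading_index == index
        ):
            offset += len(nodes[pos].text)
            pos += 1
        result.append(offset)
    return result


# Hypothetical page: ==A==, then a raw <h2> inside plain text, then ==B==.
nodes = [
    Node("intro\n"),
    Node("== A ==\n", 1),
    Node("text\n<h2>Raw</h2>\nmore\n"),
    Node("== B ==\n", 2),
    Node("end\n"),
]
toc = [1, None, 2]

buggy = section_offsets(nodes, toc, skip_raw=False)
skipped = section_offsets(nodes, toc, skip_raw=True)
print(buggy, skipped)
```

In this model the unguarded scan reports the end-of-page offset for the raw heading and for every heading after it, while the guarded version leaves the raw heading without an offset and keeps the later ones correct.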

Anomie added a comment. Jan 2 2013, 4:31 PM

Bug 43584 has been marked as a duplicate of this bug.

Change 88750 had a related patch set uploaded by Anomie:
Handle raw <h#> when calculating $rawtoc

https://gerrit.wikimedia.org/r/88750

Change 88750 merged by jenkins-bot:
Handle raw <h#> when calculating $rawtoc

https://gerrit.wikimedia.org/r/88750