action=parse does not detect headers in templates correctly
Closed, ResolvedPublic

Description

Bug detected at: http://de.wikipedia.org/w/api.php?format=yamlfm&action=parse&page=BD:Label5&prop=section

Two 3rd-level headers are embedded in a template call, and the parsed results are messed up:

From number=1.1 onward, the byteoffsets give the end-of-page offset. There is no index and no fromtitle.

May have the same cause as 25203#c3 (“The api isn't at fault here, its only displaying what the parser output says there is.”).


Version: 1.18.x
Severity: major

Details

Reference
bz32753

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 12:07 AM)
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz32753.
bzimport added a subscriber: Unknown Object (MLST).

Hmm...

{
	"warnings": {
		"parse": {
			"*": "Unrecognized value for parameter 'prop': section"
		}
	},
	"parse": {
		"title": "Benutzer Diskussion:Label5"
	}
}
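
A response like this is easy to misread as "no sections", so a client may want to surface warnings before looking at the payload. The following is a minimal sketch, assuming the response has already been decoded into a Python dict with the shape quoted above; `collect_warnings` is a hypothetical helper name, not part of any MediaWiki client library.

```python
def collect_warnings(response):
    """Return (module, message) pairs found under the "warnings" key."""
    found = []
    for module, detail in response.get("warnings", {}).items():
        # In the output quoted in this thread, the text sits under the "*" key.
        message = detail.get("*", "") if isinstance(detail, dict) else str(detail)
        found.append((module, message))
    return found


# Sample data copied from the response quoted in this comment.
response = {
    "warnings": {"parse": {"*": "Unrecognized value for parameter 'prop': section"}},
    "parse": {"title": "Benutzer Diskussion:Label5"},
}

for module, message in collect_warnings(response):
    print(f"{module}: {message}")
```

Checking for warnings first would have flagged the misspelled `prop` value immediately.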

(In reply to comment #1)
Awww, somehow the s at the end of the URL got lost. Correct link: http://de.wikipedia.org/w/api.php?format=yamlfm&action=parse&page=BD:Label5&prop=sections

Ok I can confirm your results there.

The first two sections (first one is 'regular', second is in the templated text):

			{
				"toclevel": 1,
				"level": "2",
				"line": "Gr\u00fc\u00df Gott und Herzlich Willkommen auf meiner Benutzer-Diskussionsseite",
				"number": "1",
				"index": "1",
				"fromtitle": "Benutzer_Diskussion:Label5",
				"byteoffset": 3417,
				"anchor": "Gr.C3.BC.C3.9F_Gott_und_Herzlich_Willkommen_auf_meiner_Benutzer-Diskussionsseite"
			},
			{
				"toclevel": 2,
				"level": "3",
				"line": "Meine WP-W\u00fcnsche f\u00fcr 2011",
				"number": "1.1",
				"index": "",
				"fromtitle": false,
				"byteoffset": 7897,
				"anchor": "Meine_WP-W.C3.BCnsche_f.C3.BCr_2011"
			},

Since this second one comes from within a template, the current parser can't really assign it a byte position within the article text. I'm not too familiar with how this output is generated so will have to take a peek to say more. Ideally it at least shouldn't mess up the later sections, but I'm not sure how a "byteoffset" helps if you don't have a "bytelength"... possibly this is just a bad data structure that's not really suitable for how sections are handled. :(

(In reply to comment #3)

Why is it actually called byteoffset when it is a character offset and not a byte offset? I propose renaming it to charoffset, maybe. I understand that the parser has no notion of sections in templates, and I don't really mind that. What I do care about is the byteoffsets, or rather where a section starts (and thus implicitly where it ends), so that I can take the sections apart.
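
To illustrate what "taking the sections apart" means here, the following is a rough sketch, assuming (as this comment claims) that the offsets index characters of the wikitext rather than bytes. `split_sections` is a hypothetical helper, and entries with an empty index or a non-integer offset are simply skipped. If the offsets turned out to be true byte offsets, the slicing would have to be done on the UTF-8 encoded text instead.

```python
def split_sections(text, sections):
    """Return (heading, chunk) pairs, slicing text at each section start."""
    # Keep only entries that carry a non-empty index and an integer offset.
    usable = [
        s for s in sections
        if s.get("index") and isinstance(s.get("byteoffset"), int)
    ]
    usable.sort(key=lambda s: s["byteoffset"])

    chunks = []
    for current, following in zip(usable, usable[1:] + [None]):
        start = current["byteoffset"]
        # A section implicitly ends where the next usable one starts.
        end = following["byteoffset"] if following else len(text)
        chunks.append((current["line"], text[start:end]))
    return chunks


# Made-up page text and section list, for illustration only.
text = "intro\n== A ==\nfoo\n== B ==\nbar\n"
sections = [
    {"line": "A", "index": "1", "byteoffset": 6},
    {"line": "From template", "index": "", "byteoffset": len(text)},
    {"line": "B", "index": "2", "byteoffset": 18},
]

for heading, chunk in split_sections(text, sections):
    print(heading, repr(chunk))
```

An entry that reports the end-of-page offset, as described in this report, would otherwise cut the preceding section short or yield an empty chunk, which is why it is filtered out before slicing.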

This affects not only templates but also tables (Benutzer Diskussion:Caliban@dewiki), <div> elements (Benutzer Diskussion:Elchbauer@dewiki), and parser functions (Benutzer Diskussion:4Frankie@dewiki).

dr.trigon wrote:

Can confirm this bug on de:wiki 1.18mwf.

(In reply to comment #3)

The point is that the byteoffset field should be "" for such entries in order to be recognized correctly, e.g. by DrTrigonBot. Look at [1]; there you have e.g.

index="T-7" byteoffset=""

for all template entries, except the level 3 headings, where you get e.g.

index="" byteoffset="137405"

which confuses my bot a little bit! My workaround is to catch the empty index string, but since this is considered to be a bug I cannot rely on the fact that there will always be an empty index string...

[1] http://de.wikipedia.org/w/api.php?action=parse&page=Wikipedia:L%C3%B6schkandidaten/12.%20Dezember%202009&prop=sections
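
The workaround described in this comment could look roughly like the sketch below. It is not DrTrigonBot's actual code; `has_usable_offset` is a hypothetical name. It treats an entry as unusable when the index is empty or starts with "T-" (the template marker seen in [1]), and it checks the offset as well, so that it does not depend on the empty index alone.

```python
def has_usable_offset(entry):
    """Guess whether a section entry points into the page's own wikitext."""
    index = str(entry.get("index") or "")
    # An empty index is the buggy case; a "T-" prefix marks template entries.
    if not index or index.startswith("T-"):
        return False

    offset = entry.get("byteoffset")
    if isinstance(offset, bool):
        return False
    if isinstance(offset, int):
        return True
    # Some output formats deliver the offset as a string of digits.
    return isinstance(offset, str) and offset.isdigit()


# Entries modelled on the values quoted in this thread.
entries = [
    {"index": "1", "byteoffset": 3417},
    {"index": "", "byteoffset": 7897},
    {"index": "T-7", "byteoffset": ""},
    {"index": "", "byteoffset": "137405"},
]

for entry in entries:
    print(entry, has_usable_offset(entry))
```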

Greetings

Anomie claimed this task.
Anomie subscribed.

This was probably fixed by rMW12bc4f1c5b7c: Handle raw <h#> when calculating $rawtoc (back in 2013), which also fixed T27203. I note that https://de.wikipedia.org/w/api.php?format=jsonfm&action=parse&oldid=97765341&prop=sections does not currently exhibit the bug described here, but it does show signs that it would have been affected in the same way as the wikitext described in T27203. The empty 'index' and false value for 'fromtitle' shown in T34753#380721 are also indicative.