Page MenuHomePhabricator

Incorrect section numbering after unclosed subst
Open, MediumPublicBUG REPORT

Description

Steps to Reproduce:

  1. Inspect the section element for https://en.wikipedia.org/api/rest_v1/page/html/User_talk%3ABrion_VIBBER/891637716#Quick_bar
  2. Note that the id of the previous section ("Substituting surname") was 6.

Actual Results:

<section data-mw-section-id="154" id="mwdw"><h2 id="Quick_bar">Quick bar</h2>

Expected Results:
The section id should have been 7.

<section data-mw-section-id="7" id="mwdw"><h2 id="Quick_bar">Quick bar</h2>

I suspect this has to do with the unclosed {{subst: in section 6. The syntax highlighter (and possibly the legacy parser) seems to autoclose it after the newline but Parsoid does not.

Impact: this leads clients that use Parsoid output to build their section edit links based on that to try to edit the wrong section.

Event Timeline

Reproduced:

$ (echo '==Foo==' ; echo '{{subst: a surname' ; echo '==Bar==') | bin/parse.js --wrapSections --normalize=parsoid
<section data-mw-section-id="0"></section><section data-mw-section-id="1">
<h2 id="Foo">Foo</h2>
<p>{{subst: a surname</p>
</section><section data-mw-section-id="3">
<h2 id="Bar">Bar</h2>
</section>

I suspect what is happening is that we are incrementing section ID when we attempt to parse, and then failing to decrement when we backtrack out of the parse.

Change 507966 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Use wikipeg rule variables to ensure headingIndex is correct after backtrack

https://gerrit.wikimedia.org/r/507966

ssastry triaged this task as Medium priority.May 8 2019, 3:41 PM
ssastry edited projects, added Parsoid-Read-Views-Deprecated-Project; removed Parsoid.

Can we just leave off the headingIndex when the token is created, and assign it when the chunk is emitted? Or in an early TT pass?

Can we just leave off the headingIndex when the token is created, and assign it when the chunk is emitted? Or in an early TT pass?

Originally section ids were assigned in the section wrapping dom pass and scott moved it to the tokenizer to handle some edge cases, but backtracking bit him here. But, as you indicate, no reason for it to be done in the tokenizer. I think it can be done in the HTML5 Tree Builder pass, since it has to inspect every token anyway, and at that time, tokens are in the desired order as well.

The PHP parser does it in the preprocessor, and there are edge cases that mean if we want compatibility it's best to continue to do so. I moved it to the tokenizer to fix these edge cases, if we move it back we're going to regress on those.

In particular, the heading index is incremented by heading-like constructs that appear in template arguments *irregardless* of whether they appear in the output. Ie: {{echo|foo|==this is ignored==}} increments the heading index. You can't emulate that in the tree builder, because once we do template expansion to HTML the argument in question is no longer present. (Similar issue if the argument is used multiple times.)

Now, you could make the argument that we shouldn't be doing this in the preprocessor (T214538: MediaWiki shouldn't assign section ids during tokenization, but instead only when headings are generated, see also T23844) but changing it would be an expensive migration, with bad consequences if parser caches cause edits to apply to the wrong section.

I think that the tokenizer fix using rule variables is simple and elegant, matches PHP behavior so doesn't require migration, and is clearly the way rule variables "ought" to work. As I wrote inhttps://gerrit.wikimedia.org/r/508037 , otherwise rule variables can be corrupted by failed parses and are very complicated to reason about -- some bogus parse nested arbitrarily far away can corrupt your variable with no way to fix it during the backtrack. If I work hard enough, I can probably come up with an example where our existing uses of rule variables for broken templates, etc, can be fooled the same way.

Change 508037 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[wikipeg@master] Rule reference variables should not be affected by failed rules

https://gerrit.wikimedia.org/r/508037