Page MenuHomePhabricator

NOTOC causing API action=parse prop=sections to return nothing
Closed, ResolvedPublicBUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  1. Make an action=parse prop=sections API request with three headers and no NOTOC (execute)
  2. sections returns a proper array
  3. Add a NOTOC anywhere in the text (execute)
  4. sections returns a blank array

What happens?:
sections returns a blank array.

What should have happened instead?:
sections should be an array with section data.

Other information (browser name/version, screenshots, etc.):
This is currently causing issues with Ultraviolet, and AIV reports were no longer sending on February 23, the Thursday after [ 544a479 ] Parser: don't set TOCData if __NOTOC__ is used on the page got merged into mediawiki/core. I haven't checked if the MediaWiki version deployed on that day includes this commit explicitly.

The issue was only made evident to me today when debugging (with the assistance of User:Roundish) after finding out that UV fell into a should-never-happen case absolutely no sections were being returned. The related lines can be found here and they both depend on the sections returned by the parser. I'll be placing _another_ safeguard against this on Ultraviolet, but this is evidently a bug as there were no API deprecation warnings that went out that would have indicated that this behavior was planned.

Event Timeline

Adding in the author of the aforementioned commit @cscott, hopefully you might know what could have happened here. I'm not that well versed in core code so I don't feel confident writing a patch myself.

Legoktm triaged this task as Unbreak Now! priority.Mar 16 2023, 1:23 AM
Legoktm added a project: MediaWiki-Action-API.
Chlod reopened this task as Open.
Chlod claimed this task.
Chlod removed Chlod as the assignee of this task.
Chlod moved this task from Backlog to Done on the Ultraviolet board.

oops

Legoktm lowered the priority of this task from Unbreak Now! to High.Mar 16 2023, 2:30 AM
Legoktm subscribed.

Sorry I misread, if it's been broken since February 23 it's not UBN by definition, but definitely API breakage that still should be fixed.

Change 900314 had a related patch set uploaded (by Samtar; author: Samtar):

[mediawiki/core@master] [WIP] Parser.php: Ignore __NOTOC__ if API parse

https://gerrit.wikimedia.org/r/900314

I am not confident in my understanding of the ramifications of doing this:

diff --git a/includes/parser/Parser.php b/includes/parser/Parser.php
index 04e4b0382d1..7b9b504076d 100644
--- a/includes/parser/Parser.php
+++ b/includes/parser/Parser.php
@@ -4095,7 +4095,11 @@ class Parser {
                if ( isset( $this->mDoubleUnderscores['nogallery'] ) ) {
                        $this->mOutput->setNoGallery( true );
                }
-               if ( isset( $this->mDoubleUnderscores['notoc'] ) && !$this->mForceTocPosition ) {
+               # __NOTOC__ is only respected if this is not an API parse request
+               if ( isset( $this->mDoubleUnderscores['notoc'] )
+                       && !$this->mForceTocPosition
+                       && self::getOptions()->getRenderReason() !== 'api-parse'
+               ) {
                        $this->mShowToc = false;
                }
                if ( isset( $this->mDoubleUnderscores['hiddencat'] )

I'm leaning toward won't fix, although I'm open to hearing otherwise. The TOC can have no sections for a number of reasons, including that the author has explicitly requested it (as in this case), or any seen sections are bogus (ie, content model is javascript), etc. Assuming that a page will always have sections seems to be a misunderstanding, and from the above it seems there's already been a fix committed to the affected project.

The sections feature of the action API has had a /number/ of fundamental issues, including the fact that it was never properly language converted, T319141: byteoffset field in action=parse results is misnamed, etc. T330232: Export TOCData via action API should help somewhat, but that's orthogonal to this issue.

Change 901690 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Add ParserOutputFlags::NO_TOC

https://gerrit.wikimedia.org/r/901690

Change 900314 abandoned by Samtar:

[mediawiki/core@master] Parser.php: Ignore __NOTOC__ if API parse

Reason:

Flawed implementation, see alt. Ib41e6e4926cb752826ad75d10e8692125fc0b064

https://gerrit.wikimedia.org/r/900314

@cscott The fix to Ultraviolet was not really a fix but rather a safety net for cases where the actual section couldn't be found. I never made a case for when the sections array turns empty but the section actually exists. With the safety net in place (see the exact changes here), Ultraviolet warns the user that the section couldn't be found—which isn't really a fix, more like a properly-handled error. This currently doesn't appear on the English Wikipedia as I disabled the use of sections to modify AIV and UAA entirely. But for another wiki, be it a WMF wiki or some other wiki which plans to use Ultraviolet, this error will likely appear if the page didn't hit the minimum amount of sections required for the array to get populated.

The workflow of using the output of sections to edit the page comes from action=edit's ability to use prependtext and appendtext with section to only edit specific sections. This way, the script doesn't have to download the wikitext for the entire page, search through the wikitext to find the appropriate section, and apply changes. I find this better for three reasons: section parsing is not recreated in JavaScript, as there may be differences with how sections are parsed with the actual MediaWiki parser and re-parsing in JavaScript means adding in support for special edge cases (such as duplicate section names); there's less network overhead since wikitext that essentially remains unchanged doesn't have to be queried and sent back (with minimal modifications); and probably most important for Ultraviolet, the chance of an edit conflict is significantly reduced when using section with prependtext and appendtext, important for when the last section is not the section where the report should be placed and also when dealing with a page that absorbs a significant amount of edits per minute on a busy day (like in AIV) lest we risk having to re-query and re-process the received wikitext after an edit conflict is detected (or, if edit conflicts weren't accounted for at all, be it for UV or some other script, risk overwriting reports entirely).

Granted, much of this would be solved eventually when Parsoid becomes the default parser, but we're a bit far off from that right now and in any case, support for older MediaWiki versions is important, as we can't expect everyone to upgrade MediaWiki instantly when Parsoid as default eventually does get rolled out.

I'm leaning toward won't fix, although I'm open to hearing otherwise. The TOC can have no sections for a number of reasons, including that the author has explicitly requested it (as in this case), or any seen sections are bogus (ie, content model is javascript), etc. Assuming that a page will always have sections seems to be a misunderstanding, and from the above it seems there's already been a fix committed to the affected project.

It's an undocumented and unannounced API change which seems revert worthy on its own (I would've done so myself if it was straightforward to do so, but there were too many cross dependencies and other considerations that didn't make it practical). If we're going to break the API, that should be a conscious and pre-announced decision.

I'm leaning toward won't fix, although I'm open to hearing otherwise. The TOC can have no sections for a number of reasons, including that the author has explicitly requested it (as in this case), or any seen sections are bogus (ie, content model is javascript), etc.

The API documentation states that sections gives the sections in the parsed wikitext. Removing the TOC does not remove the sections from the wikitext.

It's not okay, there are Bots, LUA models and JavaScripts working with parsing action returning the sections and working on it. How else can I get a list of sections other than API action parse? Before February 22nd (+/- 1 day) the parse action worked properly. Please undo the API changes.

This change seems wrong to me. The sections should be present regardless of magic words or other conditions that cause the TOC to be shown or not. The API output includes a showtoc: true/false property that can be used to determine whether the TOC should be shown, which should be used by clients, instead of us removing the actual data from the result.

Yes, I fully agree with matmarex.

You can override it with: <div style="display:none;">__TOC__</div> but this
can never be serious indeed.

doesn't appear on the English Wikipedia as I disabled the use of sections to modify AIV and UAA entirely. But for another wiki, be it a WMF wiki or some other wiki which plans to use Ultraviolet, this error will likely appear if the page didn't hit the minimum amount of sections required for the array to get populated.

This is a misunderstanding. There is no "minimum amount of sections", the section data will always be present unless (a) it is invalid (ie, JavaScript or CSS content) or (b) the user has explicitly suppressed it via __NOTOC__. I'm willing to consider the possibility that the editors might distinguish the ToC from "sections", aka things like __NOEDITSECTION__ as @JJMC89 and others point out: hence https://gerrit.wikimedia.org/r/c/mediawiki/core/+/901690.

As I understand it, this property is meant to expose headings that exist in the content. Similar to how e.g. edit section and other mechanisms work by section. This works regardless of whether that or all headings are included in a TOC.

It seems the tool in question is using this to find and validate a particular portion of the page. Both the API property and the tool in question are not documented or named in relation to a graphical interface for a table of contents. If we retroactively treat it as such, I expect a feature request will be filed to introduce support for exposing "section" information, in which case the property sections would seem like a natural fit, bringing us back to where we are.

I agree that when tools interact with a arbitrary page, they should account for there being no sections indeed. It seems that is not the case, however. The tool in question is interacting with known and specific sections and looking for a specific N-th section. The removing of the visible TOC by an editor of the underlying page, led to the section metadata no longer being returned as unexpected side-effect.

and how can we replace the parse data with sections like section title and toclevel and level and section number and index as API action to repair java scripts and lua models using the sections property?

This is a misunderstanding. There is no "minimum amount of sections", the section data will always be present unless (a) it is invalid (ie, JavaScript or CSS content) or (b) the user has explicitly suppressed it via __NOTOC__. I'm willing to consider the possibility that the editors might distinguish the ToC from "sections", aka things like __NOEDITSECTION__ as @JJMC89 and others point out: hence https://gerrit.wikimedia.org/r/c/mediawiki/core/+/901690.

Sorry about that, I had a fever while writing this and I must have gotten my words wrong. My point was that the the error would likely appear if __NOTOC__ was set, or any other issue causes sections to become unpopulated. Granted, there is currently no method in UV of setting the section number which would lead to that "section-finding" path to be used, but this is purely due to the fact that the tool hasn't been completed yet.

@Krinkle put it best in the above comment. The sections prop in mw:API:Parsing wikitext writes "Gives the sections in the parsed wikitext.", not "Gives the sections in the parsed wikitext only if a table of contents is present.", and a change from the former to the latter was definitely not communicated beforehand (and even if it did, as Krinkle said, a request would have been put in to reintroduce it as such). You are correct that developers interpret "sections" as "the headings on the wikitext" rather than things in the Table of Contents. Considering our current documentation and its practical use in tools, this has been the case for a long while now.

After thinking about this for a bit, I've remembered that another one of my tools depends on sections much more than UV does (which only uses it for reporting).

Screenshot 2023-03-24 at 07-10-33 User Chlod_sandbox - Wikipedia.png (768×1 px, 272 KB)

Attached is a screenshot of InfringementAssistant on latest version (v0.4.2, tree) running on my sandbox copy of en:Dog with a __NOTOC__ applied. A user should expect to see 36 sections that can be wrapped by the {{copyvio}} template so that some sections can be hidden. Instead the option is disabled entirely, which is the behavior of IA when it hits a page with no sections. There are only two outcomes to this: the editor notices, manually transcludes the template, and informs me of the issue (leading us back here); or the editor doesn't notice, and suddenly an entire article—no matter the size—gets blanked out by template.

This brings the unexpected breakages to two tools, albeit both from me. I'm sure there's others tools or bots that use sections this way, and they'd probably be surprised and confused to find out that sections isn't spat out because of __NOTOC__ (which itself took me a while to debug, considering how obscure the change was).

One more is getting bites from users who expect the archiving of their pages and miss that job on bot breakages working with section titles and levels and index finding out the sections who has to be archived according to individually set archive parameters. And these archive jobs are running on thousands of pages.

Changing API behavior without announcing or warning should be a nogo because of breaking a lot of tools and bots and scripts running their job every day without problems on several wikis until such a change is coming. This is not acceptable and should be made undone. And then there is an opportunity to discuss about the problem and to find a solution that is less harmful to the world of all the wikis.

These three or four lines can be made undone very easily. Maybe I don't know the reason why you don't do this. So please explain it. Thank you ...

FYI: @Jdforrester-WMF ; @Jdlrobson

@cscott is currently working on a fix. It'll take a bit to get it right, get it reviewed and merged. See T332243#8716049 above.

Hi @cscott : I'm pretty sure, you're still working on a fix. But there are still bots and LUA templates down because of missing sections content. Is it possible to give this task more priority? Thank you and kind regards.

I fixed some comments and poked @Jdlrobson for a review. The issue with the Graph extension is lighting the world on fire for both of us right now but i think we can get this merged this week so it will go out on next week's train.

Change 913204 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] Match core NO_TOC logic

https://gerrit.wikimedia.org/r/913204

Change 901690 merged by jenkins-bot:

[mediawiki/core@master] Add ParserOutputFlags::NO_TOC

https://gerrit.wikimedia.org/r/901690

matmarex assigned this task to cscott.

Please can we get a statement in Tech News about what happened here?

I think anyone would be welcome to write one. It's not clear to me what the user impact of this bug was, so I'm not the best person to do it.

@doctaxon I see the section data in your API query now. I think you were seeing cached parser output (since we didn't invalidate the cache, only fixed the bug for future parses), and when you edited the page it got purged and parsed again, correctly.

Hi, I'm looking at this in the context of the User-notice and Tech News and I'm uncertain whether this still needs an entry, and if so what it would say. Is there any followup action required from any tool maintainers, or were you imagining a purely-informational entry? Either way, do you have suggestions on what the content needs to include? (Drafts or suggestions always help!)
As far as I understand it, the summary is something like:

"For a few months earlier this year, making an action=parse prop=sections API request for a page that included a __NOTOC__ did not correctly return the list of section names. This has now been fixed."

If that's accurate, then I understand this was a frustrating bug for some tool maintainers and the editors of a few pages, but it seems like enough of an edge-case that it might not be worth adding to the massive piles of global communications we all have to sort through. Please advise, thanks!

Change 913204 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Match core NO_TOC logic

https://gerrit.wikimedia.org/r/913204

Change 946615 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump parsoid to 0.18.0-a21

https://gerrit.wikimedia.org/r/946615

Change 946615 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.18.0-a21

https://gerrit.wikimedia.org/r/946615