Page MenuHomePhabricator

Improve section validation
Closed, ResolvedPublic

Description

The 12/12/17 deployment was derailed by a huge number of unsupported_sections errors thrown by throwIfPreviousSectionIsIncomplete (link). This is a function to check that all sections have titles and anchors. However, the assumption that every section should have a title and anchor doesn't seem to be valid for the project main page or for non-content namespaces. We should figure out how to handle these situations gracefully.

Proposed plan

Do nothing for Parsoid sections with IDs of -1 and -2; log warnings rather than throwing errors for sections with IDs >= 0 that fail the check for presence of heading and anchor. (See T182774#3834556 and T182774#3834568 for discussion.)

Event Timeline

Mholloway added a subscriber: bearND.

A quick and probably mostly effective fix would be to exclude sections with IDs of -1 and -2 from this checking. I'm trying to think through whether this is correct behavior in light of what those IDs represent.

Per https://www.mediawiki.org/wiki/Parsing/Notes/Section_Wrapping#Pseudo-sections psuedo-sections (represented by ID -2) are by definition created where there is no heading and should be suppressed in presentations such as a table of contents. These clearly shouldn't be included in the heading/anchor check.

A section ID of -1 indicates that Parsoid has created a section that it considers "uneditable," meaning that it may not correspond to a PHP parser section. Based on my quick read of https://www.mediawiki.org/wiki/Parsing/Notes/Section_Wrapping the situation here is less clear. However, looking at the enwiki main page reveals at least some cases of sections with ID -1 that are template-created and strictly presentational, and that we wouldn't expect to have headings. I propose disabling the check for ID -1 and then refinining behavior as needed based on product-identified bugs. In practice I'd expect these to be few, as virtually all of the unsupported_section errors we were seeing were in non-content namespaces (such as Module).

I'm also skeptical that throwing an error is the right way of going about things in the situations we do care about.

  • For the Android app, the result for an unlucky user who stumbles upon an affected page will be a page load error, then falling back to mobileview loading for the next 100 page requests.
  • For other unknown clients, it's probably not safe to assume that they will consider a section without a heading and/or anchor invalid.

Wouldn't it be better just to log a warning in the cases we're interested in, for our own diagnostic use?

Mholloway renamed this task from Figure out how to handle non-content "sections" to Improve section validation.Dec 13 2017, 3:59 PM
Mholloway updated the task description. (Show Details)

A quick and probably mostly effective fix would be to exclude sections with IDs of -1 and -2 from this checking. I'm trying to think through whether this is correct behavior in light of what those IDs represent.

Per https://www.mediawiki.org/wiki/Parsing/Notes/Section_Wrapping#Pseudo-sections psuedo-sections (represented by ID -2) are by definition created where there is no heading and should be suppressed in presentations such as a table of contents. These clearly shouldn't be included in the heading/anchor check.

A section ID of -1 indicates that Parsoid has created a section that it considers "uneditable," meaning that it may not correspond to a PHP parser section. Based on my quick read of https://www.mediawiki.org/wiki/Parsing/Notes/Section_Wrapping the situation here is less clear. However, looking at the enwiki main page reveals at least some cases of sections with ID -1 that are template-created and strictly presentational, and that we wouldn't expect to have headings. I propose disabling the check for ID -1 and then refinining behavior as needed based on product-identified bugs. In practice I'd expect these to be few, as virtually all of the unsupported_section errors we were seeing were in non-content namespaces (such as Module).

I didn't look at what a title and anchor correspond to within a section. But, yes, you are right about sections with data-mw-section-id=-2 ... but, for -1 ids, even if templated, they should always have a heading embedded in them. If you find an instance where that isn't the case, let us know.

And I agree with the recommendation to emit a warning / log-entry instead of throwing an exception.

Change 398082 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/services/mobileapps@master] Update section validation logic

https://gerrit.wikimedia.org/r/398082

@ssastry, https://fr.wikipedia.org/api/rest_v1/page/html/Mod%C3%A8le:Donn%C3%A9es%2F6117%2F%C3%A9volution_population is an illustrative example of a page that caused an error yesterday. Of interest to your comment, the first section identified contains only a <p> and a <span> as children, no headings. The second section contains headings but only in child sections.

The old logic expected an <h2>-<h6> element as a direct descendant of any <section> element (after the first on the page), which we clearly can't assume in many cases.

As updated, we'll log a warning only if a direct <h2>-<h6> descendant is not found for a <section> with ID > 0.

@ssastry, https://fr.wikipedia.org/api/rest_v1/page/html/Mod%C3%A8le:Donn%C3%A9es%2F6117%2F%C3%A9volution_population is an illustrative example of a page that caused an error yesterday. Of interest to your comment, the first section identified contains only a <p> and a <span> as children, no headings. The second section contains headings but only in child sections.

The old logic expected an <h2>-<h6> element as a direct descendant of any <section> element (after the first on the page), which we clearly can't assume in many cases.

As updated, we'll log a warning only if a direct <h2>-<h6> descendant is not found for a <section> with ID > 0.

The first one is a lead section ... but for the second one, your observation stands ... turns out in our code, if a section is both pseudo and uneditable, we assign it a -1 preferentially ... I should take a look at that in terms of what makes for better semantics ... but for now, your MCS logic LGTM.

Change 398082 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Update section validation logic

https://gerrit.wikimedia.org/r/398082

I agree it makes sense to assign -2 to sections with no actual heading element, even if the section is also uneditable.

Change 398113 had a related patch set uploaded (by C. Scott Ananian; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Pseudo-section marker takes precedence over uneditable marker

https://gerrit.wikimedia.org/r/398113

Change 398113 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Pseudo-section marker takes precedence over uneditable marker

https://gerrit.wikimedia.org/r/398113

Change 398181 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/services/mobileapps@master] Update section validation logic, take 2

https://gerrit.wikimedia.org/r/398181

Change 398182 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/services/mobileapps@master] Include H1 tags when evaluating sections

https://gerrit.wikimedia.org/r/398182

Change 398184 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Add unit test for logging the warning if section validation fails

https://gerrit.wikimedia.org/r/398184

Change 398182 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Include H1 tags when evaluating sections

https://gerrit.wikimedia.org/r/398182

Change 398184 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Add unit test for logging the warning if section validation fails

https://gerrit.wikimedia.org/r/398184

Stashbot subscribed.

Mentioned in SAL (#wikimedia-operations) [2017-12-14T18:48:33Z] <bsitzmann@tin> Started deploy [mobileapps/deploy@bf85a55]: Update mobileapps to ff74bb1 (T182868 T182774)

Mentioned in SAL (#wikimedia-operations) [2017-12-14T19:00:59Z] <bsitzmann@tin> Finished deploy [mobileapps/deploy@bf85a55]: Update mobileapps to ff74bb1 (T182868 T182774) (duration: 12m 26s)