Page MenuHomePhabricator

<section> tags for MediaWiki sections
Closed, ResolvedPublic

Description

Problem

Sections are just sort of there on a page, with no usable distinguishing beginning/endpoints.

We would like to output semantic <section> markup for article sections, but currently are prevented from doing so because sections are not guaranteed to contain balanced content. Any outcome with this is guaranteed to break things.

  • Currently skins like minerva and wikihow need to do weird, horribly things for section handling.
  • Sections cannot be given labels/ids and only have numbers, which change all the time. (T60541, T116350).
  • No internal way to refer to sections besides numbers. (DPL)

Expected outcome

We need to determine a path forward with acceptable breakage for major use cases (on-wiki divs, archived discussions, extensions, styled pages, etc), and appropriate fallbacks.

Per @GWicke, There are two ideas being discussed:

  1. conservative wrapping: Only consider headings that are direct children of <body> in the first iteration, and don't break up existing DOM structures.
  2. aggressive wrapping: Break up all structures containing headings, and wrap every heading in its own section.

The "conservative wrapping" approach would be the first step towards a behavior similar to the one @daniel and @matmarex describe. Nested sections for wrapped headings can be added in a later iteration, without breaking up existing DOM structures.

See also

Related Objects

View Standalone Graph
This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@ssastry, @cscott: It would be great if we could make this happen soon. The lack of sensible section definition seriously undermines the usefulness of section editing and -retrieval in the REST API, which affects VisualEditor and others.

See https://jakearchibald.com/2017/do-we-need-a-new-heading-element/ for background on how current browsers interpret <section> tags and headings, as well as proposals to add a <h> element that would get its levels from the nesting of section elements.

@GWicke mentioned this above, but the mobile web and apps have been doing this as a matter of course for literally years now. I'm not aware of anything breaking from adding sections to the HTML.

A note, mobile front end does this with media wiki output, and Android does this with MCS/Parsoid output.

To follow up on what @Fjalapeno said, the only issue I have seen with this is on portals where users do

<div>
== Foo ==
</div>
<div>body</div>

When we encounter a heading inside a div we don't add any sections to the page.

TheDJ added a comment.EditedMar 7 2017, 12:11 PM

I just realised something... This structure will also break floating elements like infobox, images etc from overflowing into the next section.... That is gonna be a major problem I think.

It also exists on MF, but there you only notice if you use an iPad or something (phones need to vertically stack due to lack of width), so I'm guessing that is why we have not had many problem reports about it. But I foresee a lot of pushback from editors on that on Desktop.

Try this article on an iPad (or desktop with responsive design mode enabled and iPad User-Agent).

Hmm. maybe that can be fixed.. Definitely requires attention however.

GWicke added a comment.EditedMar 7 2017, 3:01 PM

I just realised something... This structure will also break floating elements like infobox, images etc from overflowing into the next section.... That is gonna be a major problem I think.

Assuming we don't set clear: both on sections, they should behave about the same as paragraphs do now wrt floated content.

Try this article on an iPad (or desktop with responsive design mode enabled and iPad User-Agent).

The following heading has clear: both set. Looks like a conscious styling choice.

This got buried a bit in the thread, so re-quoting my earlier post for easier reference:

I have created a simple section wrapper implementation using the conservative approach in https://github.com/gwicke/parsoid-dom-utils. See https://github.com/gwicke/parsoid-dom-utils/blob/master/test/sections.yaml for some input/output pairs, and https://github.com/gwicke/parsoid-dom-utils/blob/master/lib/sections.js for the implementation.
This algorithm does take nested headings into accounts, but currently only wraps at the top level. After a run, all direct children of body are <section> elements.
Wrapping of nested lower-level headings can be added in a next step, but we should first get more clarity on how edge cases should be resolved without restructuring the content.

To illustrate,

Lead content

== h2 ==

=== h3 ===

Section content

== h2 again ==

Another section content

is translated to

<section>
  Lead content
</section>
<section>
  <h2> h2 </h2>
  
  <h3> h3 </h3>
  
  Section content
</section>
<section>
  <h2> h2 again </h2>

  Another section content
</section>

Indentation & spacing added for clarity.

This doesn't need to be on our workboard, cscott is leading the project.

@cscott any updates on progress here? Would be great to see. Will this also be added to PHP parser? (I think that's why I added MediaWiki-platform-team)

This comment was removed by ssastry.

FYI, the section wrapping algorithm implemented in https://github.com/wikimedia/parsoid-dom-utils is now used in the Mobile Content Service. We found a bug in the handling of articles without a lead section, which is now fixed & covered by additional tests.

[[ Deleted previous comment since I accidentally hit submit in the middle of typing. ]]

@cscott any updates on progress here? Would be great to see. Will this also be added to PHP parser? (I think that's why I added MediaWiki-platform-team)

This is currently stalled. But, we'll pick it up again soon. Our team is responsible for both parsers. The reason this is stalled is because of this requirement to support this in both parsers. We are trying to bring the two parsers together, adopt Parsoid output as the default for wikimedia wikis, and eventually make it the default parser for Mediawiki. So, we definitely don't want to implement subtly different section-wrapping solutions in each parser.

It maybe simpler to start with just Parsoid for now. With RemexHTML (that will replace Tidy), it might be simpler to do an equivalent DOM pass on the PHP parser output as well, and at that time, make necessary tweaks to the section wrapping code to handle edge cases and consolidate around an identical notion of a section for both section editing and wrapping (vs. maintaining two different interpretations and notions). Without DOM parsing (I am using this as a proxy for any mechanism that provides such a structural view of a page - SAX based, or whatever specific mechanism we use for well-balanced output), this is hard to do easily correctly. Since Parsoid already has DOM parsing, this is less of an issue in Parsoid right now. We can tweak behaviour, output, and fix bugs in Parsoid as part of using this output. As for the PHP parser, we can either (a) not provide this solution there ever, or (b) figure out a solution to implement this once we have Tidy replaced.

bearND added a comment.EditedJun 20 2017, 5:12 PM

As part of the work on T162179 and once the patch is merged, the mobile-sections and the new read-html endpoints in MCS are using the parsoid-dom-utils library @GWicke mentioned. Once the sections are added by Parsoid I'm going to remove this from the MCS layer.

Thanks for the update @ssastry - it's reassuring to hear! @bearND that's awesome!

Nirmos added a subscriber: Nirmos.Jun 28 2017, 6:20 AM

I finally got around to consolidating notes in one place @ https://www.mediawiki.org/wiki/Parsing/Notes/Section_Wrapping . Please edit that page or leave comments on the talk page there as appropriate. But, that page should let us take concrete steps here.

I finally got around to consolidating notes in one place @ https://www.mediawiki.org/wiki/Parsing/Notes/Section_Wrapping . Please edit that page or leave comments on the talk page there as appropriate. But, that page should let us take concrete steps here.

See WIP Parsoid patch at https://gerrit.wikimedia.org/r/364933. Please discuss the strategy on the wiki page. Please discuss patch-specific issues (or compatibility with wiki-documented strategy) on gerrit.

I finally got around to consolidating notes in one place @ https://www.mediawiki.org/wiki/Parsing/Notes/Section_Wrapping . Please edit that page or leave comments on the talk page there as appropriate. But, that page should let us take concrete steps here.

See WIP Parsoid patch at https://gerrit.wikimedia.org/r/364933. Please discuss the strategy on the wiki page. Please discuss patch-specific issues (or compatibility with wiki-documented strategy) on gerrit.

We discussed the semantics of section wrapping quite a bit in this task. The implementation at https://github.com/wikimedia/parsoid-dom-utils/blob/master/lib/sections.js implements the semantics described in T114072#1850840, which is conservative in that it doesn't break up existing wrappers, but within those constraints also tries hard to keep sections as aligned with rendering and edit behavior as possible by considering headings wrapped in <div>s or <table>s. In the output, a page is sequence of sections (including the lead section), which makes it easy to work with for mobile & page summary use cases. This solution is primarily driven by product requirements, and less concerned with bug-for-bug compatibility with existing section edit code.

The wiki page and corresponding Parsoid implementation seems to be more concerned with compatibility, and drops some of the product requirements. I think it would be good to clarify the reasons behind those changes.

@ssastry it would be helpful if you could contrast your proposals vs the existing implementation of the library that @GWicke mentioned above:
https://github.com/wikimedia/parsoid-dom-utils/blob/master/lib/sections.js

I believe that implementation captures everything we need for the Page Content Service (@bearND can you confirm?)

If we have to make changes from the implementation, it would be helpful to evaluate them on a case by case basis.

ssastry added a comment.EditedJul 17 2017, 4:26 PM

@ssastry it would be helpful if you could contrast your proposals vs the existing implementation of the library that @GWicke mentioned above:
https://github.com/wikimedia/parsoid-dom-utils/blob/master/lib/sections.js
I believe that implementation captures everything we need for the Page Content Service (@bearND can you confirm?)
If we have to make changes from the implementation, it would be helpful to evaluate them on a case by case basis.

I don't have a specific implementation proposal right now. I am trying to extract what the requirements are from different perspectives (parsing team, editing, reading, anyone else?) and see if they are in agreement or not, and if not, how we can find a solution that works for all. If the dom utils library is the solution that works best, I am happy with it. We just need to resolve that it is indeed the case.

The wiki page and corresponding Parsoid implementation seems to be more concerned with compatibility, and drops some of the product requirements. I think it would be good to clarify the reasons behind those changes.

See https://www.mediawiki.org/wiki/Parsing/Notes/Section_Wrapping#Consistency_requirements_and_why_they_matter

I believe that implementation captures everything we need for the Page Content Service (@bearND can you confirm?)

Yes, this is my understanding as well.

+1 to wrapping the lead section in a <section> tag as well. I'm not sure what the other differences between the two implementations are. That's the only difference one I've noticed so far but I admit I haven't tried any exotic test cases.

I believe that implementation captures everything we need for the Page Content Service (@bearND can you confirm?)

Yes, this is my understanding as well.
+1 to wrapping the lead section in a <section> tag as well. I'm not sure what the other differences between the two implementations are. That's the only difference one I've noticed so far but I admit I haven't tried any exotic test cases.

Yes, besides the lead section requirement (which I didn't incorporate in my notes on the wiki page), the only differences will be around edge cases where tables and divs wrapping sections are involved. Unless reading has a very specific requirement about how those cases should behave, I'll go with what I document on the wiki page.

The current plan is to implement the proposal outlined here. There is a section on the page that explains why we care about these constraints.

In the common case, this will match what MCS currently does (including adding the lead section wrapper). In edge cases where there are divs wrapping multiple sections or divs overlapping incomplete sections or sections in tables, there might be divergences in some cases from what MCS is currently doing. These are edge cases and I don't think they are a issue necessarily ... but I am not sure how Reading / MCS relies on these section tags. Please take a look and leave comments on the talk page (or here, but talk page ideally) as appropriate.

@ssastry Within apps these sections are used to:

  1. Create a Styled Table of Contents with proper indentation
  2. Link from the Table of Contents to the actual section within the page

Within the Mobile Web, the top level sections can be collapsed and expanded.

@phuedx @Jdlrobson is there any other use case for the sections in MobileFrontend?

@bearND any other use cases for the MCS/Apps that I missed?

@ssastry Within apps these sections are used to:

  1. Create a Styled Table of Contents with proper indentation
  2. Link from the Table of Contents to the actual section within the page

Within the Mobile Web, the top level sections can be collapsed and expanded.

Thanks @Fjalapeno. This is helpful. While the edge cases don't really affect this functionality greatly, this did get me thinking a bit more and I added a new alternative proposal that tries to separate display / rendering requirements from editability requirements (along with a different save API in MediaWiki that addresses the conflict). See https://www.mediawiki.org/wiki/Parsing/Notes/Section_Wrapping#Alternative_Proposal_2

bearND added a comment.EditedJul 19 2017, 8:33 PM

@bearND any other use cases for the MCS/Apps that I missed?

No, that's pretty much it. MCS uses the <section> tags to turn the HTML into a JSON structure. In this case the sections in JSON are structurally similar to how MW API (action=mobileview) renders them. To get the section text MCS code starts concatenating the innerHTML of the child elements after the <h[1-6]> of the current <section> tag until a new <section> tag is encountered (for subsections) or the current section ends.
[see line 121ff in https://gerrit.wikimedia.org/r/#/c/352963/13/lib/parsoid-access.js]

@ssastry Within apps these sections are used to:

  1. Create a Styled Table of Contents with proper indentation
  2. Link from the Table of Contents to the actual section within the page

Within the Mobile Web, the top level sections can be collapsed and expanded.

Thanks @Fjalapeno. This is helpful. While the edge cases don't really affect this functionality greatly, this did get me thinking a bit more and I added a new alternative proposal that tries to separate display / rendering requirements from editability requirements (along with a different save API in MediaWiki that addresses the conflict). See https://www.mediawiki.org/wiki/Parsing/Notes/Section_Wrapping#Alternative_Proposal_2

I merged that proposal with the existing one. The save api proposal had a flaw so I got rid of it. Anyway, I think we now have a plan that supports MCS needs as well as addresses consistency issues we were worried about. Please follow the wiki page for updates.

In web, the most important use case we have is section collapsing. The current implementations suggest that the heading would be inside the section tag however in mobile web we put it outside.

e.g.

<h2>Heading<h2>
<section>Section text goes here</section>
<h2>Heading 2<h2>
<section>Section 2 text goes here</section>

So if you are wrapping the heading we'd also request you wrap the content. e.g.

<section>
<h2>Heading<h2>
<div>Section text goes here</div>
</section>
<section>
<h2>Heading 2<h2>
<div>Section 2 text goes here</div>
</section>

We tried to do section collapsing where all sibling elements collapse but it wasn't feasible with the existing mobile design.

Of course, wrapping everything in a <section> is better than the status quo - we can always pull the heading out with JS if we need to, but it would be great if we can avoid that and the potentially associated reflows.

phuedx added a comment.EditedJul 21 2017, 9:42 AM

We tried to do section collapsing where all sibling elements collapse but it wasn't feasible with the existing mobile design.

Thanks, @Jdlrobson. I remember this being the case as well but I can remember whether we (Reading Web) documented it.

Might it be worth re-opening the investigation and including the suggested format without the container element that you suggest? My thinking is, if that there's a minor design tradeoff to make so that we can all consume the suggested format, then we should probably make it.

@phuedx @Jdlrobson For documentation, could you post a snippet of the mobile website html demonstrating the current wrapping strategy?

@Fjalapeno
MobileFrontend markup currently looks like this (sorry if I wasn't clear):

<h2>Heading<h2>
<div>Section text goes here</div>

Right now our main interest is in the PHP parser. We do not use RESTBase yet... using mobile-content-service makes this somewhat redundant as sections are treated as first class citizens :)

Might it be worth re-opening the investigation and including the suggested format without the container element that you suggest?

So the problem before was we animated the closing of sections. Animating various elements inside a section to slide up is not possible. We don't do this right now so it's not as important, but it's a use case that we'd lose if heading is not separated from the rest of the content.

I don't think an investigation is needed we can work with this constraint if we have to.
We could work around this via:

$( 'section h2' ).each( (i, h) => { var $h = $(h); $h.insertBefore($h.parent()) } )

in our toggling code.

@Jdlrobson but what about the sections in MFE?

Your example doesn't show how you use sections currently.

<h2>Heading<h2>
<div>Section text goes here</div>

Just trying to get a dif on what you have and what you want.

That's what we have... Before MobileFrontend runs its MobileFormatter the HTML would be

<h2>Heading<h2>
Section text goes here

We wrap any content between headings with a div. It's that simple.

Change 364933 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] T114072: WIP: Add <section> wrappers to Parsoid output

https://gerrit.wikimedia.org/r/364933

ssastry moved this task from Backlog to In Progress on the Parsoid board.Nov 6 2017, 11:57 PM

Change 392695 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] T114072: Always emit a lead <section> tag

https://gerrit.wikimedia.org/r/392695

Change 364933 merged by jenkins-bot:
[mediawiki/services/parsoid@master] T114072: Add <section> wrappers to Parsoid output

https://gerrit.wikimedia.org/r/364933

Change 392695 merged by jenkins-bot:
[mediawiki/services/parsoid@master] T114072: Always emit a lead <section> tag

https://gerrit.wikimedia.org/r/392695

Parsoid-side support for sections is being deployed tomorrow. FYI. We are also bumping the Parsoid HTML version number to 1.6.0

brion added a subscriber: brion.Jan 8 2018, 6:58 PM

With the parsoid updates deployed last month, are there remaining tasks here (eg, updating the PHP parser output to match, dealing with any fallout from changes)?

I think we could close this task -- we have no immediate plans for adding this to the PHP parser output right now. This will be Parsoid-only at this point. If that is really a big concern, someone can open a task for PHP parser support to match Parsoid behavior.

ssastry closed this task as Resolved.Jan 8 2018, 7:25 PM
ssastry claimed this task.

@ssastry: Would you mind summarizing how this ended up being implemented. Did we end up doing conservative or aggressive wrapping? How did we deal with the unbalanced HTML issues? Thanks!

@ssastry: Would you mind summarizing how this ended up being implemented. Did we end up doing conservative or aggressive wrapping? How did we deal with the unbalanced HTML issues? Thanks!

Let me know if https://www.mediawiki.org/wiki/Parsing/Notes/Section_Wrapping#Plan_of_Record:_Implementation_proposal is helpful or not.

@ssastry: That's perfect! Thanks for the pointer!

Rua added a subscriber: Rua.Jan 30 2018, 12:06 PM

How will this information be accessible from Scribunto scripts? This is a much-wanted feature for en.wiktionary.

@Rua since parsoid isn't used for the normal page rendering I don't expect this to be available in Scribunto, which is unfortunate. Maybe @GWicke or @ssastry could explain?

We are working to make Parsoid be the parser for everything, but it will take some time to get there. At that point, section tags will be available for everyone.

Ok, thanks. For reference, T55784 seems to be the ticket to follow.

Izno added a comment.Sep 7 2018, 4:16 PM

There is a sort-of task for sections in the PHP parser at T8104: Wrap each wiki page section contents in a container (which is naturally ancient 😃 ). I've added it to MediaWiki-Parser .