
Document Wikisource uses of LST
Open, Needs Triage, Public

Description

To add Parsoid support for Wikisource, it would be useful to create a list of the major uses of LST (Labeled Section Transclusion) on Wikisource and identify examples of each. This will help Parsoid/VisualEditor designers and engineers see the complexity, and let testers build test plans that cover the 'crazy' uses of LST.

The most common case is dividing pages into consecutive non-overlapping sections, in order to create "chapters" or "dictionary entries".

This usage will cover most "encyclopedia entry" cases; however, many encyclopedias lay out a single entry in overlapping sections. Newspaper articles are very often laid out in many sections all over the place, with two articles overlapping in every possible way.

Event Timeline

jayvdb raised the priority of this task from to Needs Triage.
jayvdb updated the task description.
jayvdb added subscribers: jayvdb, GWicke, Tpt.

An idea that came up in our discussion at the hackathon was to use id or class attributes to mark up / address content. Class attributes in particular could potentially support overlapping or multi-element sections. While it's easy to parse content to a DOM and match IDs or classes, there are performance and interface issues to be worked out there.
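As a rough illustration of the class-based idea, the sketch below parses a well-formed XHTML fragment and collects the text of every element carrying a given class. The class names (`sec-a`, `sec-b`) and the function name are hypothetical, and nothing here reflects how Parsoid actually addresses content.

```python
import xml.etree.ElementTree as ET


def select_by_class(xhtml: str, class_name: str) -> list[str]:
    """Return the text content of every element whose class list contains class_name.

    Assumes the input is well-formed XML with a single root element.
    """
    root = ET.fromstring(xhtml)
    return [
        "".join(element.itertext())
        for element in root.iter()
        if class_name in element.get("class", "").split()
    ]


# Hypothetical markup: one element belongs to two "sections" at once,
# which a single id attribute could not express.
SAMPLE = (
    "<div>"
    '<p class="sec-a">One</p>'
    "<p>unrelated</p>"
    '<p class="sec-a sec-b">Two</p>'
    "</div>"
)
```

Because an element can carry several classes, the same paragraph can belong to more than one section, which is the property that makes classes attractive for overlapping or multi-element sections.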

GWicke renamed this task from document Wikisource uses of LST to Document Wikisource uses of LST. Oct 23 2015, 10:23 PM

@Billinghurst gave @Ijon and myself more background on LST uses in wikisource. Here are my notes:

In the vast majority of cases, <section begin="somename"/> ... <section end="somename"/> actually matches up pairwise, and is properly nested. There is even a JS edit tool that converts consecutive sections into heading-like syntax:

## Somesection ##

Section content

## Another section ##
...

This is actually stored as the following wikitext:

<section begin="Somesection"/>
Section content
<section end="Somesection"/>
<section begin="Another section"/>
...
<section end="Another section"/>
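The conversion performed by the edit tool can be sketched roughly as follows. This is not the tool's actual code (which is JavaScript and not shown in this thread); it assumes that a heading is exactly a line of the form `## Name ##`, and it emits the `<section begin="..."/>` / `<section end="..."/>` form quoted earlier.

```python
import re

# Assumed shape of a heading line: "## Name ##" alone on its line.
HEADING = re.compile(r"^##\s*(.+?)\s*##\s*$")


def headings_to_sections(text: str) -> str:
    """Convert '## Name ##' heading lines into paired section begin/end tags."""
    out = []
    current = None  # name of the section that is currently open, if any
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the previous section before opening its own.
            if current is not None:
                out.append(f'<section end="{current}"/>')
            current = match.group(1)
            out.append(f'<section begin="{current}"/>')
        else:
            out.append(line)
    # Close the last section at the end of the page.
    if current is not None:
        out.append(f'<section end="{current}"/>')
    return "\n".join(out)
```

Sections produced this way are consecutive and never overlap, which is why this style of markup always matches up pairwise.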

If Parsoid parsed <section begin="somename"/> as the equivalent of <section id="<pagename>-<somename>">, and <section end="somename"/> as just </section> (and emitted this as HTML section tags), things would probably mostly work out fine already. Treating these as individual tags would also handle multi-page transclusions, where <noinclude> sections are used to close sections when looking at individual pages, but let a section span two pages when transcluding them consecutively.
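A minimal sketch of that mapping, using regular expressions rather than a real parser, might look like the following. The function name and the regexes are illustrative assumptions and are not Parsoid code; only double-quoted attribute values are handled.

```python
import re
from html import escape

# Assumed tag shapes: self-closing, with a double-quoted section name.
BEGIN = re.compile(r'<section\s+begin\s*=\s*"([^"]*)"\s*/>')
END = re.compile(r'<section\s+end\s*=\s*"[^"]*"\s*/>')


def lst_to_html_sections(wikitext: str, pagename: str) -> str:
    """Rewrite LST begin/end markers as HTML <section id="..."> ... </section>."""

    def open_tag(match: re.Match) -> str:
        # The id is "<pagename>-<somename>", as proposed in the comment above.
        section_id = escape(f"{pagename}-{match.group(1)}", quote=True)
        return f'<section id="{section_id}">'

    html = BEGIN.sub(open_tag, wikitext)
    # Each end marker becomes a plain closing tag; its name is not checked.
    return END.sub("</section>", html)
```

Note that each tag is rewritten on its own, so the output is only well-formed HTML when the markers were properly nested to begin with. Section names containing spaces would also need some encoding, since HTML id values must not contain whitespace.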

I confirm @GWicke's summary: I believe that more than 90% (wet finger estimation) of LST uses in Wikisource are properly nested and match pairwise.
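An estimate like this could in principle be checked by scanning page wikitext with a stack-based test such as the sketch below. It looks at a single page in isolation and compares section names exactly; both are simplifying assumptions, and the function name is hypothetical.

```python
import re

# Matches both begin and end markers and captures the kind and the name.
TAG = re.compile(r'<section\s+(begin|end)\s*=\s*"([^"]*)"\s*/?>')


def is_properly_nested(wikitext: str) -> bool:
    """Return True if every begin has a matching end and sections never overlap."""
    stack = []
    for kind, name in TAG.findall(wikitext):
        if kind == "begin":
            stack.append(name)
        elif not stack or stack.pop() != name:
            # An end marker with no open section, or one closing the wrong section.
            return False
    # Anything left on the stack was opened but never closed on this page.
    return not stack
```

Sections that deliberately span two pages would be reported as unbalanced by this per-page check, so a real survey would need to look at the transcluded page range instead.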

I don't know about the numbers being quoted but I know many people avoid using Easy LST altogether and prefer to manually add (or script in) section tags. A majority of them at times tend to put the opening and/or closing section tags inline with the content rather than place opening and closing section tags alone and 'on their own lines' as described above.
Example:

<section begin="John Adams" />First section content start

Paragraph.

Last paragraph.
<section end="John Adams" />

<hr />

<section begin="John Smith" />Second section content start

Paragraph.

Last paragraph cut off midway<section end="John Smith" />

Still others always place their opening and closing sections "inline" with the content regardless of where a paragraph starts or ends.

I wanted to point this out so there is a better understanding of the existing nuance around exact tag placement and the [currently?] undefined display level (block or inline) of the <section begin= /> & <section end= /> tags - which I assume makes them behave inline by default - versus the block-level behavior of the HTML5 <section> tag.

In other terms, the HTML5 <section> tag behaves more like a traditional <div> tag does, while our <section begin= /> tag scheme behaves more like a traditional <span> tag does, so we might need to reconsider usurping the HTML5 <section> tag for any new Parsoid/LST purposes.

@GOIII, just having the closing tag last on the same line as the last paragraph of a section isn't a problem, but having it in the middle of a sentence would be, assuming default styling. Are there many uses like this?

<section begin="John Smith" />Second section content start

Paragraph.

Last paragraph cut off midway<section end="John Smith" />, but really continues.

I would think that if it was being done that way, it would be in a very small number of cases.

The thing is that we simply wouldn't know what is there (in a numerical sense). The nature of the WSes is that we edit most pages the minimal number of times, with many having just two edits in Page: ns (proofread and validate), and transcluded once to main ns. Then maybe never to be opened again.

We can say that tagging that way is neither usual, occasional, nor even rare; it is far less frequent than that. Tagging like that is hardcore and particular. IMO, if that edge use is required and prevents a required schema migration, it should be ignored as a blocker, and an alternate coding introduced for the edge cases. Any such active user would be a power power ... user and reachable.

> @GOIII, just having the closing tag last on the same line as the last paragraph of a section isn't a problem, but having it in the middle of a sentence would be, assuming default styling. Are there many uses like this?
>
> <section begin="John Smith" />Second section content start
>
> Paragraph.
>
> Last paragraph cut off midway<section end="John Smith" />, but really continues.

More like this......

PAGE 104
<section begin="John Smith" />Second section content start

Paragraph.

Last paragraph cut off midway<section end="John Smith" />
PAGE 105
<section begin="John Smith" />, but really continues

Paragraph.

Last paragraph proper end.<section end="John Smith" />

I'm glad you threw in a comma where you did, because once that range of pages is transcluded, the resulting sentence in the paragraph split across a page break would be ...

Last paragraph cut off midway , but really continues

Note the insertion of the extra space before the comma in the final rendering. That would not be a problem (normally) because most page breaks end with a full word so an extra space inserted automatically upon transclusion would be stripped by the parser/Parsoid.

The downside to that auto insertion is when a page break ends in a hyphenated word rather than a full word, which is not all that uncommon either.
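The effect can be reproduced with a toy model that simply joins page fragments with a single space, as described above. The second function is a deliberately naive, hypothetical workaround for the hyphen case; it is not something LST or Parsoid is known to do, and it would wrongly merge genuinely hyphenated compounds that happen to break at a page boundary.

```python
def join_pages(fragments: list[str]) -> str:
    """Toy model of multi-page transclusion: one space between page fragments."""
    return " ".join(fragments)


def join_pages_dehyphenating(fragments: list[str]) -> str:
    """Naive variant: a trailing hyphen glues the next fragment on directly."""
    joined = ""
    for fragment in fragments:
        if joined.endswith("-"):
            # Drop the hyphen and join with no space (assumes a split word).
            joined = joined[:-1] + fragment
        elif joined:
            joined = joined + " " + fragment
        else:
            joined = fragment
    return joined
```

Joining `"Last paragraph cut off midway"` and `", but really continues"` with the first function reproduces the stray space before the comma, and joining `"hyphen-"` with `"ated"` leaves a space in the middle of the word.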

Follow ?

Fwiw... here's another page that depicts rendering behavior in relation to in-line/own-line tag placement.

In the first three examples, note the differences in additional line-feed(s)/carriage-return(s) for each.

@GOIII: Thank you, those examples are very helpful. There might be ways to address the multi-page section issue in Parsoid, perhaps by merging / stripping adjacent & matching open / close tags. However, the best solution for rendering and editing might differ in that case.

The different behavior of the trailing section tag illustrated on the test page might be an artifact of how MediaWiki's paragraph parser interacts with stripped lines. I would expect Parsoid to not suffer from the same issue.
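One way to picture the "merging / stripping adjacent & matching open / close tags" idea is the regex sketch below, which removes an end marker that is followed (apart from whitespace) by a begin marker of the same name. This is an illustration under assumed tag shapes, not a description of anything Parsoid implements.

```python
import re

# An end marker, optional whitespace, then a begin marker with the SAME name
# (the \1 backreference enforces the name match).
ADJACENT = re.compile(
    r'<section\s+end\s*=\s*"([^"]*)"\s*/>'
    r"(\s*)"
    r'<section\s+begin\s*=\s*"\1"\s*/>'
)


def merge_adjacent_sections(wikitext: str) -> str:
    """Strip end/begin marker pairs of the same name, keeping the whitespace between them."""
    return ADJACENT.sub(r"\2", wikitext)
```

Applied to the concatenation of two pages that both wrap a continued paragraph in a "John Smith" section, this leaves a single section spanning the page break. Whether the whitespace between the two tags should be kept, dropped, or normalized is exactly the rendering question raised above.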

> @GOIII: Thank you, those examples are very helpful. There might be ways to address the multi-page section issue in Parsoid, perhaps by merging / stripping adjacent & matching open / close tags. However, the best solution for rendering and editing might differ in that case.

I only brought it up so others can also be aware of the current behavior - how to proceed is part of this process.

> The different behavior of the trailing section tag illustrated on the test page might be an artifact of how MediaWiki's paragraph parser interacts with stripped lines. I would expect Parsoid to not suffer from the same issue.

"artifact" is putting it nicely. I wish somebody had the balls to remove that entire approach in light of css3 so empty P tags were no longer a strippable/ BR insertion issue for either

....I Know... I Know... I'm dreaming.

In plwikisource we sometimes use <section> tags to:
A. transclude references:

1. <section begin="ref1" />Text of the 1st reference<section end="ref1" /><br>
2. <section begin="ref2" />Text of the 2nd reference<section end="ref2" /><br>
...
N. <section begin="refN" />Text of the Nth reference<section end="refN" /><br>

B. merge multi-page table content (or extract parts of a table, e.g. a section from a table of contents):

Page 1:

<section begin="table" /><section begin="part1" />
{|
|-
|...
|...
<section end="part1" />
|-
|...
|...
<section end="table" />
|}

Page 2:

{|
<section begin="table" />
|-
|...
|...
<section begin="part1" />
|}
<section end="part1" /><section end="table" />

C. multi-part sections on a page:

<section begin="pl" />
A paragraph.
...
A paragraph.
<section end="pl" /><section begin="fr" />
A paragraph.
...
A paragraph.
<section end="fr" /><section begin="pl" />
A paragraph.
...
A paragraph.
<section end="pl" /><section begin="fr" />
A paragraph.
...
A paragraph.
<section end="fr" />

In the first case <section> tags are used linearly, in the second some sections overlap, and in the third a section contains multiple, separated blocks (mainly for merging multi-column tables or for reasonable presentation of multilingual texts).
Usage "B" still exists but is deprecated (discouraged in new texts: div-based templates are preferred over tables in such contexts).
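The overlapping (B) and multi-part (C) usages can be modelled with a small extraction sketch. It assumes that all blocks sharing a name are concatenated in document order, which is what usage C relies on, and that markers of other sections falling inside an extracted block are simply dropped. The function name is hypothetical and this is not the LST extension's code.

```python
import re

# Any LST begin or end marker, whatever its name.
ANY_MARKER = re.compile(r'<section\s+(?:begin|end)\s*=\s*"[^"]*"\s*/>')


def extract_section(wikitext: str, name: str) -> str:
    """Return the concatenated content of every block labeled `name`."""
    quoted = re.escape(name)
    block = re.compile(
        rf'<section\s+begin\s*=\s*"{quoted}"\s*/>(.*?)<section\s+end\s*=\s*"{quoted}"\s*/>',
        re.DOTALL,
    )
    # Join all blocks with this name, stripping markers of other sections.
    return "".join(ANY_MARKER.sub("", part) for part in block.findall(wikitext))
```

With the page from usage C, extracting "pl" returns the first and third blocks back to back, while extracting "fr" returns the second and fourth. With usage B, extracting "table" also includes the content of "part1", because the two sections overlap.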