Page MenuHomePhabricator

Consolidate migration passes into something generic and unified
Open, MediumPublic

Description

We have three passes (migrateTemplateMetas, migrateTrailingNLs, migrateTrailingCategories) that are effectively doing something identical .. moving rendering transparent content from beginning / end of some subset of nodes.

Since moving paragraph wrapping to the dom in https://gerrit.wikimedia.org/r/#/c/mediawiki/services/parsoid/+/443744/
we've had to make numerous adjustments to these passes that most likely apply to all three.

https://gerrit.wikimedia.org/r/#/c/mediawiki/services/parsoid/+/446985/
https://gerrit.wikimedia.org/r/#/c/mediawiki/services/parsoid/+/446208/
https://gerrit.wikimedia.org/r/#/c/mediawiki/services/parsoid/+/446322/

The latter of which probably needs another tweak for,

Seems like the call to DU.lastNonSepChild in Subbu's fix should also account for DU.isSolTransparentLink (or just categories).
Similar to the above.

'uzwiki/Pablo Neruda',

The following is a bunch of notes from investigating new semantic failures after 443744.

This is another case where sol transparent node are being included at the end of a paragraph where previously
they were excluded. The old algorithm would search the line in reverse order for where to put the end tag and
it would skip this content. It's unclear whether we should try to replicate that. This particular instance was broken wt though,
https://de.wikipedia.org/w/index.php?title=Seetaktische_Lehrgruppe&type=revision&diff=179240127&oldid=165339287
https://sv.wikipedia.org/w/index.php?title=Ulrik_Plesner&type=revision&diff=43289184&oldid=32227738
https://it.wikipedia.org/w/index.php?title=Cattedrale_di_Metz&type=revision&diff=98534211&oldid=98272937
https://ko.wikipedia.org/w/index.php?title=%ED%99%8D%EC%88%98%EC%A0%84&type=revision&diff=21808118&oldid=21387860
https://ko.wikipedia.org/w/index.php?title=%EB%B0%9C%ED%95%B4_%EB%AC%B4%EC%99%95&type=revision&diff=21808146&oldid=21308398
https://nl.wikipedia.org/w/index.php?title=Good_hearted_woman_(lied)&type=revision&diff=51968655&oldid=46721681&diffmode=source

'dewiki/Seetaktische_Lehrgruppe',
'svwiki/Ulrik_Plesner',
'itwiki/Cattedrale di Metz',
'kowiki/홍수전',
'kowiki/발해 무왕',
'nlwiki/Good hearted woman (lied)',

Another case where the quick semantic check now fails, this time because of trailing space included in data-mw parts.
These things roundtrip and render the same though. The semantic diff it points out is a rendering failure that's
filed as https://phabricator.wikimedia.org/T199845

'frwiki/Union des démocrates et indépendants'

This case is an argument for having a migrate leading (as opposed to traililng) newlines. For example,

test

123<div>hi</div>

Here the leading newlines will be included in the paragraph with "123". Of course, this only happens at the
edges between doBlockLevels and Remex p-wrapping. This is because the newlines are included in the text node.
Finding a way to keep the emitted nltks as separate textnodes and then moving the call to normalize the body before
pwrapping would also prevent it.

'ruwiki/Союз Советских Социалистических Республик',

Event Timeline

The difference here seems to be that previously a trailing space was excluded from p-wrapping, so that when
the serializer looked at the constrains to determine if a newline was necessary, it suppressed it. The new
behaviour seems preferable, since this is a block. The rt classification is broken since the two aren't really
semantically different.

'dewiki/Klingspor (Unternehmen)',

T196931 is a case where making the encapsulation marker transparent to p-wrapping is hindering instead of helping. We probably want to migrate them *in* for some scenarios.

T196931 is a case where making the encapsulation marker transparent to p-wrapping is hindering instead of helping. We probably want to migrate them *in* for some scenarios.

Unless I am misunderstanding, you are basically suggesting DOM templating by opening/closing paragraphs on template boundaries. Doing that just in Parsoid will lead to different rendering in PHP parser and Parsoid.

If so, I think you should treat that as a different issue from this ticket.

No, that's not what I'm suggesting. Look at openPTag in the ParagraphWrapper.js, it's careful about expanding template ranges unnecessarily. The decision to unilaterally treat template start metas as transparent in the dom based pass is suspect. It should depend on whether that template gets closed within the paragraph.

No, that's not what I'm suggesting. Look at openPTag in the ParagraphWrapper.js, it's careful about expanding template ranges unnecessarily. The decision to unilaterally treat template start metas as transparent in the dom based pass is suspect. It should depend on whether that template gets closed within the paragraph.

As far as PHP parser semantics are concerned, these start/end meta tags are indeed transparent to p-wrapping.

But, in Parsoid land, we add additional logic to push these off subtree edges ... so in this case, what you are recommending is that we should start/end paragraphs on these meta tags (don't start on an end tag; don't end on a start tag). That make sense to me. That also continues to fit the core model of what this phab task is about.

ssastry triaged this task as Medium priority.Jul 23 2018, 6:33 PM

Yes, the semantics as is are correct, but the outcome is less than desirable.

As in the linked task, there're cases where including the start tag in the paragraph will result in more precise ranges.