VE should indicate when it wants external links "WikiLink-ified"
Open, Normal · Public

Description

With the logical patch to fix T94723, Parsoid can't properly round trip wikitext of the form:

[https://en.wikipedia.org/wiki/Foo Foo]

We currently skate by because this is preserved via selser. But if this link is modified in any way, then the external link gets auto-converted to a mw:WikiLink:

[[Foo]]

This was a reasonable behavior because VE wanted to support cut-and-paste of WP URLs, without importing all of Parsoid's complexity which attempts to recognize all the possible interwiki link URLs and prefixes and convert to a wikilink.

But we need to be more careful about how we apply this wikilink-ification.

I propose that we *only* auto-linkify anchors with type mw:WikiLink. Anything with type mw:ExtLink will be left alone and not wikilinkified.

That is, some links which VE passes to Parsoid and claims are mw:WikiLinks we will serialize as mw:ExtLink (because the URLs do not actually point at interwiki sites). But we will never do the conversion in the other direction, and convert something labelled as a mw:ExtLink into a mw:WikiLink.

As a side effect, this means that edits to a mw:ExtLink in VE must reset the type of mw:WikiLink if you want to enable auto-wiki-linkification. Otherwise, once an mw:ExtLink always a mw:ExtLink.
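A minimal sketch of the proposed asymmetric rule (not actual Parsoid code; the single-wiki regex and function names are invented here for illustration — the real serializer consults the wiki's configuration):

```javascript
// Sketch: anchors typed mw:WikiLink may be downgraded to external-link
// syntax when the href is not actually a wiki page URL, but anchors typed
// mw:ExtLink are never upgraded to wikilink syntax.
const WIKI_URL_RE = /^https?:\/\/en\.wikipedia\.org\/wiki\/([^?#]+)$/;

function serializeLink(rel, href, text) {
  if (rel === 'mw:WikiLink') {
    const m = WIKI_URL_RE.exec(href);
    if (m) {
      const title = decodeURIComponent(m[1]);
      return title === text ? `[[${title}]]` : `[[${title}|${text}]]`;
    }
    // Claimed to be a wikilink, but the URL is not a wiki page:
    // fall back to external-link syntax.
    return `[${href} ${text}]`;
  }
  // mw:ExtLink: left alone, never wikilink-ified.
  return `[${href} ${text}]`;
}
```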

This is the opposite of how VE currently behaves (see comment below).

Alternatively, the entire WikiLink-ification pass could be moved to the Parsoid "scrubWikitext" module of VE-specific cleanups.

cscott created this task. · Jun 15 2015, 10:06 PM
cscott updated the task description. (Show Details)
cscott raised the priority of this task from to Normal.
cscott added projects: Parsoid, VisualEditor.
cscott added a subscriber: cscott.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jun 15 2015, 10:06 PM

Current VE behavior seems to be the opposite of the above: VE applies mw:WikiLink when it *knows* it's a wikilink, and uses mw:ExtLink when it "might" be an extlink -- but would prefer Parsoid to convert it to a wikilink if possible.

Hm.

cscott updated the task description. (Show Details) · Jun 16 2015, 5:00 PM
cscott set Security to None.
GWicke added a subscriber: GWicke. · Jun 16 2015, 7:28 PM

At some point we were discussing getting rid of those link types altogether, on the grounds that this is a syntactic rather than a semantic category. I still think that handling this transparently in Parsoid is an attractive option.

@cscott, do you see strong reasons for providing more control over the serialization format?

Currently we use the difference between ExtLink and WikiLink to provide user-visible semantics: ExtLinks are decorated with the external-link icon (https://www.mediawiki.org/static/1.26wmf9/skins/Vector/images/external-link-ltr-icon.png) to indicate that the named resource is "not trusted".

That said, I can't imagine a case where we'd want to mark an interwiki link as "not trusted".

GWicke added a comment. (Edited) · Jun 16 2015, 10:30 PM

The trust level of a link seems to be fairly orthogonal to the wikitext syntax used to mark up the link. It would not be very consistent to 'trust' [[:en:Foo|Foo]], but not [//en.wikipedia.org/wiki/Foo Foo].

Elitre added a subscriber: Elitre. · Jul 12 2015, 5:22 PM

In https://gerrit.wikimedia.org/r/#/c/292078/, I am adding code to use the [[..]] format independent of the rel attribute. So, effectively, this eliminates the possibility of VE preferring an extlink-based syntax. @Jdforrester-WMF doesn't think there are strong use cases for that, so it seems like a good direction to go.

In the HTML -> WT direction, this basically means that Parsoid will move towards ignoring rel and data-parsoid information in most scenarios and automatically picking the right link wikitext syntax based on url, wiki config, etc.

Please comment here / on that gerrit patch if there are strong concerns / objections.

Elitre added a comment. · Jun 7 2016, 3:23 PM

So the question here is:
is there a scenario where [http://en.wikipedia.org/wiki/Foobar Test] is actually preferred to [[Foobar|Test]]?

In my experience, my edit gets corrected when I use the former.

You would prefer [http://en.wikipedia.org/wiki/Foobar Test] on a Help: page that describes how to make external links. (However, you could use [http://www.example.com/wiki/Foobar Test] instead).

You might want the spelled-out version for a page that you intend to export to another wiki. I can't think of any other reasonable use offhand.

The only cases I can quickly think of where I'd be concerned about auto-wikification of internal (or cross-wiki) links are ones with technical parameters, which are used a lot in documentation and discussion, such as:

Standard links with added parameters, e.g.:

index.php URLs, e.g.:

How will those two types of links be affected by this?

How will those two types of links be affected by this?

They are handled properly.

[subbu@earth parsoid] echo "<a href='http://en.wikipedia.org/wiki/foobar?a=b'>Foo</a>" | parse.js --html2wt
[http://en.wikipedia.org/wiki/foobar?a=b Foo]

[subbu@earth parsoid] echo "<a href='http://en.wikipedia.org/wiki/foobar'>Foo</a>" | parse.js --html2wt
[[foobar|Foo]]
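A sketch of the distinction the two examples above illustrate: a URL with a query string (or fragment) has no plain [[..]] equivalent, so it must stay in external-link syntax. The hostname check is hard-coded here for illustration; Parsoid actually consults the wiki's configuration.

```javascript
// Sketch: decide whether an href could be expressed as a plain wikilink.
// Uses the standard WHATWG URL class (global in Node.js).
function canBecomeWikilink(href) {
  let url;
  try {
    url = new URL(href);
  } catch (e) {
    return false; // not a parseable absolute URL
  }
  return url.hostname === 'en.wikipedia.org' &&
         url.pathname.startsWith('/wiki/') &&
         url.search === '' &&   // query parameters force extlink syntax
         url.hash === '';       // so do fragments
}
```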
ssastry added a comment. (Edited) · Jun 10 2016, 5:45 PM

The other scenario we need to resolve is URL links. How should the following be serialized? Arguably, http://it.wikipedia.org/wiki/Luna is better output than what is currently being emitted.

[subbu@earth parsoid] echo "<a href='http://it.wikipedia.org/wiki/Luna'>http://it.wikipedia.org/wiki/Luna</a>" | parse.js --html2wt
[[:it:Luna|http://it.wikipedia.org/wiki/Luna]]

Bare link (http://it.wikipedia.org/wiki/Luna) is what we want in those cases, I think.

Bare link (http://it.wikipedia.org/wiki/Luna) is what we want in those cases, I think.

Right. Given that, the more general question that I want to ask is: Is it okay to start emitting canonical forms of wikitext for A-tags that are newly added or that have been modified?

Is it okay to eliminate the possibility of generating (via Parsoid, that is) non-canonical wikitext like these?

[http://foo.com http://foo.com] --- always generate http://foo.com
[[:it:Luna|http://it.wikipedia.org/wiki/Luna]] -- always generate http://it.wikipedia.org/wiki/Luna
[http://en.wikipedia.org/wiki/Foobar Test]  -- always generate [[Foobar|Test]]
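The three canonicalizations listed above could be sketched as follows (hypothetical helper; a real implementation would consult the interwiki map rather than a single hard-coded regex):

```javascript
// Sketch: map an <a> tag's (href, link text) pair to canonical wikitext.
const WIKI_PAGE_RE = /^https?:\/\/en\.wikipedia\.org\/wiki\/([^?#]+)$/;

function canonicalLink(href, text) {
  if (href === text) {
    // [http://foo.com http://foo.com] and url-labelled wikilinks
    // both collapse to a bare URL.
    return href;
  }
  const m = WIKI_PAGE_RE.exec(href);
  if (m) {
    // A wiki page URL with a distinct label becomes a piped wikilink.
    return `[[${decodeURIComponent(m[1])}|${text}]]`;
  }
  // Everything else stays an external link.
  return `[${href} ${text}]`;
}
```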

We started off discussing the last form, but I realized that this is a more general question about canonical forms of wikitext for links.

I can create a separate phab ticket for this if necessary since I realize the discussion now has gone beyond the original narrower question which we seem to have resolved.

I can create a separate phab ticket for this if necessary since I realize the discussion now has gone beyond the original narrower question which we seem to have resolved.

It does raise the broader question: was trying to losslessly round-trip wikitext (wt2wt) a mistake in the original Parsoid design? With 20/20 hindsight, perhaps we should have relied more heavily on selser from the beginning, and concentrated on generating *idiomatic* wikitext for new/edited wikitext. ("Idiomatic" encompasses all the work we did on nowiki removal, proper single quotes, templatedata formatting, etc.)

This might provide a coherent design philosophy going forward, and perhaps point the way toward further simplifications in the code base (or slimming down of data-parsoid/data-mw).

It's an interesting idea. I'm not sure.

cscott closed this task as Declined. · Jul 1 2016, 2:11 PM

In any case, I'm going to change this to a "known but wont fix" and +2 subbu's https://gerrit.wikimedia.org/r/292078 patch, with the above philosophy as a tentative rationale going forward. We can always revert later if we decide that we actually want/need precise wt2wt even of nonidiomatic constructions like [https://en.wikipedia.org/wiki/Foo Foo].

I can create a separate phab ticket for this if necessary since I realize the discussion now has gone beyond the original narrower question which we seem to have resolved.

It does raise the broader question: was trying to losslessly round-trip wikitext (wt2wt) a mistake in the original Parsoid design? With 20/20 hindsight, perhaps we should have relied more heavily on selser from the beginning, and concentrated on generating *idiomatic* wikitext for new/edited wikitext. ("Idiomatic" encompasses all the work we did on nowiki removal, proper single quotes, templatedata formatting, etc.)

No, it was not a mistake. The original strategy worked on the assumption that selective serialization might not get everything right, and/or that full serialization might kick in in scenarios where we didn't have access to the original source or where selser's focus was not narrow enough. I added the first version of selser around Feb 2013, and over time it has had lots of tweaks and bug fixes applied (either to DSR computation or to how narrow its scope is).

But, now that it has been solid for the last couple of years, and we now have RESTBase storage that guarantees access to the previous revision, we can start relying on selser more and have the core serialization code emit more idiomatic wikitext. We've already deployed a few patches based on this strategy, and I think it makes sense to consider these changes one at a time.

Separately, our testing strategy continues to rely on examining dirty diffs after roundtripping, so we cannot push ahead too far with this approach yet.

cscott added a comment. · Jul 1 2016, 2:34 PM

Sorry, I didn't mean "mistake" in a pejorative sense. We obviously learned a lot as we implemented Parsoid, and couldn't foresee the full success of selser. And, as you mention, precise wt2wt often helps us in testing scenarios, and was instrumental in getting the project off the ground.

But, going forward, it may be worth embracing the new "age of selser" and continuing to move away from testing strategies and bits of our codebase that depend on precise wt2wt. For example, we could extend parsertests with an "idiomatic wikitext" section alongside the "!! wikitext" section that already exists. We would expect precise wt2wt of idiomatic wikitext, and expect wt2wt of non-idiomatic wikitext to match the idiomatic wikitext section.
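A parserTests entry along these lines might look like the following sketch; the "!! wikitext/idiomatic" section name is invented here for illustration (it does not exist in the current parserTests format), and the HTML body is abbreviated:

```
!! test
Non-idiomatic external link to a local wiki page
!! wikitext
[https://en.wikipedia.org/wiki/Foo Foo]
!! wikitext/idiomatic
[[Foo]]
!! html
<p><a href="https://en.wikipedia.org/wiki/Foo">Foo</a></p>
!! end
```

The idea would be that "!! wikitext" must round-trip precisely, while anything matching "!! wikitext" that differs from the idiomatic form is normalized to "!! wikitext/idiomatic" on edit.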

Round-trip tests are more difficult without precise wt2wt, but perhaps visual diffs are a more robust mechanism for catching the same sorts of bugs (although currently there is a large practical performance difference between the two). wt2wt2wt could be used to catch gross round-trip errors: the output of the first wt2wt pass is assumed to be "idiomatic wikitext", which is then tested to (a) ensure it round-trips 100% cleanly from that point, and (b) render exactly the same as the original wikitext, using visual diff methods.

ssastry reopened this task as Open. (Edited) · Aug 16 2016, 3:42 PM

Okay, I am reopening the discussion because, well, in rt-testing we uncovered lots of dirty diffs with the normalized output being emitted as part of T71207: Parsoid: Interwiki links are halfway converted to external links, and completely broken. As it turns out, external links to creativecommons, imdb, archive.org, and many other websites now get emitted as wikilinks, because the interwiki map has entries for all of these and many, many more external websites.

In production this is not necessarily an issue, since selective serialization will leave these links untouched. However, on edits, all these links to external websites will be forced into wikilink form. That strategy is probably fine for interwiki links to Wikimedia wikis, but I am not sure it is desirable behaviour for all these other sites like imdb, doi, archive.org, creativecommons, etc. I checked the enwiki interwiki map, and there are entries for many other websites, such as arxiv, doi, creaturewiki, and Google cache.

So, I guess we need to revisit this discussion and see what is desirable / appropriate. In the meantime, I am going to revert my patch.
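The distinction raised above could be addressed by filtering on which interwiki prefixes point at Wikimedia-family wikis. A hypothetical sketch — the map entries and the `wikimedia` flag are illustrative, not the real interwiki map format:

```javascript
// Sketch: only wikilink-ify URLs whose interwiki prefix is marked as
// belonging to a Wikimedia-family wiki; arbitrary external sites in the
// interwiki map (imdb, doi, archive.org, ...) are left as external links.
const interwikiMap = {
  it:   { url: 'http://it.wikipedia.org/wiki/', wikimedia: true },
  imdb: { url: 'http://www.imdb.com/title/',    wikimedia: false },
  doi:  { url: 'https://doi.org/',              wikimedia: false },
};

function shouldWikilinkify(prefix) {
  const entry = interwikiMap[prefix];
  return !!entry && entry.wikimedia;
}
```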

ssastry moved this task from Backlog to Links on the Parsoid board. · Sep 18 2017, 5:28 PM