
VE should indicate when it wants external links "WikiLink-ified"
Closed, Declined · Public

Description

With the logical patch to fix T94723, Parsoid can't properly round trip wikitext of the form:

[https://en.wikipedia.org/wiki/Foo Foo]

We currently skate by because this is preserved via selser. But if this link is modified in any way, then the external link gets auto-converted to a mw:WikiLink:

[[Foo]]

This was reasonable behavior because VE wanted to support cut-and-paste of WP URLs without importing all of Parsoid's complexity, which attempts to recognize all the possible interwiki link URLs and prefixes and convert them to wikilinks.

But we need to be more careful about how we apply this wikilink-ification.

I propose that we *only* auto-linkify anchors with type mw:WikiLink. Anything with type mw:ExtLink will be left alone and not wikilinkified.

That is, some links which VE passes to Parsoid and claims are mw:WikiLinks we will serialize as mw:ExtLink (because the URLs do not actually point at interwiki sites). But we will never do the conversion in the other direction, and convert something labelled as a mw:ExtLink into a mw:WikiLink.

As a side effect, this means that edits to a mw:ExtLink in VE must reset the type to mw:WikiLink if you want to enable auto-wiki-linkification. Otherwise, once an mw:ExtLink, always an mw:ExtLink.

This is opposite from how VE currently behaves (see comment below).

Alternatively, the entire WikiLink-ification pass could be moved to the Parsoid "scrubWikitext" module of VE-specific cleanups.
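For illustration, here is a rough sketch of the proposed one-way rule in JavaScript. This is not Parsoid's actual serializer; the function and the hrefToWikiTarget() helper are hypothetical, and the helper is assumed to return a wiki page title when the URL points at a known (inter)wiki page and null otherwise.

// Sketch of the proposed rule; not actual Parsoid code.
function serializeAnchor(node, hrefToWikiTarget) {
    const rel = node.getAttribute('rel') || '';
    const href = node.getAttribute('href');
    const text = node.textContent;
    if (rel.includes('mw:WikiLink')) {
        // Only mw:WikiLink anchors are candidates for [[..]] syntax.
        const target = hrefToWikiTarget(href);
        if (target !== null) {
            return target === text ? `[[${target}]]` : `[[${target}|${text}]]`;
        }
        // The URL doesn't actually point at a wiki page: fall back to extlink syntax.
        return `[${href} ${text}]`;
    }
    // mw:ExtLink (and anything else) is never promoted to a wikilink.
    return `[${href} ${text}]`;
}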

Event Timeline

cscott raised the priority of this task from to Medium.
cscott updated the task description. (Show Details)
cscott added projects: Parsoid, VisualEditor.
cscott subscribed.

Current VE behavior seems to be the opposite of the above: VE applies mw:WikiLink when it *knows* it's a wikilink, and uses mw:ExtLink when it "might" be an extlink -- but would prefer Parsoid to convert it to a wikilink if possible.

Hm.

cscott set Security to None.

At some point we were discussing getting rid of those link types altogether, on the grounds that this is a syntactic rather than a semantic category. I still think that handling this transparently in Parsoid is an attractive option.

@cscott, do you see strong reasons for providing more control over the serialization format?

Currently we use the difference between ExtLink and WikiLink to provide user-visible semantics; namely, ExtLinks are decorated with https://www.mediawiki.org/static/1.26wmf9/skins/Vector/images/external-link-ltr-icon.png to indicate that the named resource is "not trusted".

That said, I can't imagine a case where we'd want to mark an interwiki link as "not trusted".

The trust level of a link seems to be fairly orthogonal to the wikitext syntax used to mark up the link. It would not be very consistent to 'trust' [[:en:Foo|Foo]], but not [//en.wikipedia.org/wiki/Foo Foo].

In https://gerrit.wikimedia.org/r/#/c/292078/, I am adding code to use the [[..]] format independently of the rel attribute. So, effectively, this will eliminate the possibility for VE to prefer an extlink-based syntax. @Jdforrester-WMF doesn't think there are strong use cases for doing that, so it seems like a good direction to go.

In the HTML -> WT direction, this basically means that Parsoid will move towards ignoring rel and data-parsoid information in most scenarios and automatically picking the right link wikitext syntax based on url, wiki config, etc.
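As a very rough sketch of what "picking the right link wikitext syntax based on url, wiki config, etc." could look like: the pattern table below is illustrative and stands in for the real interwiki map / article-path configuration; none of it is Parsoid's actual code.

// Illustrative stand-in for the interwiki / article-path config lookup.
const interwikiPatterns = [
    { prefix: '',    re: /^https?:\/\/en\.wikipedia\.org\/wiki\/(.*)$/ }, // local wiki
    { prefix: ':it', re: /^https?:\/\/it\.wikipedia\.org\/wiki\/(.*)$/ }, // interwiki
];

// Returns a [[..]] target for hrefs that match the wiki config, else null.
function hrefToWikiTarget(href) {
    for (const { prefix, re } of interwikiPatterns) {
        const m = re.exec(href);
        if (m) {
            const title = decodeURIComponent(m[1]).replace(/_/g, ' ');
            return prefix ? `${prefix}:${title}` : title;
        }
    }
    return null;
}

// hrefToWikiTarget('http://en.wikipedia.org/wiki/Foobar') => 'Foobar'
// hrefToWikiTarget('http://it.wikipedia.org/wiki/Luna')   => ':it:Luna'
// hrefToWikiTarget('http://example.com/page')             => null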

Please comment here / on that gerrit patch if there are strong concerns / objections.

So the question here is: is there a scenario where [http://en.wikipedia.org/wiki/Foobar Test] is actually preferred to [[Foobar|Test]]?

In my experience, my edit gets corrected when I use the former.

You would prefer [http://en.wikipedia.org/wiki/Foobar Test] on a Help: page that describes how to make external links. (However, you could use [http://www.example.com/wiki/Foobar Test] instead).

You might want the spelled-out version for a page that you intend to export to another wiki. I can't think of any other reasonable use offhand.

The only cases I can quickly think of where I'd be concerned about auto-wikification of internal (or cross-wiki) links are ones with technical parameters, which are used a lot in documentation and discussion, such as:

Standard links with added parameters, e.g.:

index.php URLs, e.g.:

How will those two types of links be (or how are they currently) affected by this?

> How will those two types of links be (or how are they currently) affected by this?

They are handled properly.

[subbu@earth parsoid] echo "<a href='http://en.wikipedia.org/wiki/foobar?a=b'>Foo</a>" | parse.js --html2wt
[http://en.wikipedia.org/wiki/foobar?a=b Foo]

[subbu@earth parsoid] echo "<a href='http://en.wikipedia.org/wiki/foobar'>Foo</a>" | parse.js --html2wt
[[foobar|Foo]]

The other scenario where we need to resolve this is URL links. How should the following be serialized? Arguably, http://it.wikipedia.org/wiki/Luna is better output than what is being output right now.

[subbu@earth parsoid] echo "<a href='http://it.wikipedia.org/wiki/Luna'>http://it.wikipedia.org/wiki/Luna</a>" | parse.js --html2wt
[[:it:Luna|http://it.wikipedia.org/wiki/Luna]]

Bare link (http://it.wikipedia.org/wiki/Luna) is what we want in those cases, I think.

> Bare link (http://it.wikipedia.org/wiki/Luna) is what we want in those cases, I think.

Right. Given that, the more general question that I want to ask is: Is it okay to start emitting canonical forms of wikitext for A-tags that are newly added or that have been modified?

Is it okay to eliminate the possibility of generating non-canonical wikitext (via Parsoid, that is) like these?

[http://foo.com http://foo.com] -- always generate http://foo.com
[[:it:Luna|http://it.wikipedia.org/wiki/Luna]] -- always generate http://it.wikipedia.org/wiki/Luna
[http://en.wikipedia.org/wiki/Foobar Test] -- always generate [[Foobar|Test]]

We started off discussing the last form, but I realized that this is a more general question about canonical forms of wikitext for links.
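To make the three cases above concrete, here is a hedged sketch of that canonicalization, reusing the hypothetical hrefToWikiTarget() helper from the earlier sketch; this is not the actual serializer logic.

// Sketch of the proposed canonical link output; not actual Parsoid behaviour.
function canonicalLinkWikitext(href, text, hrefToWikiTarget) {
    if (text === href) {
        // [http://foo.com http://foo.com] and a wikilink whose label is just
        // the full URL both become a bare URL.
        return href;
    }
    const target = hrefToWikiTarget(href);
    if (target !== null) {
        // [http://en.wikipedia.org/wiki/Foobar Test] becomes [[Foobar|Test]].
        return `[[${target}|${text}]]`;
    }
    return `[${href} ${text}]`;
}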

I can create a separate phab ticket for this if necessary since I realize the discussion now has gone beyond the original narrower question which we seem to have resolved.

> I can create a separate phab ticket for this if necessary since I realize the discussion now has gone beyond the original narrower question which we seem to have resolved.

It does raise the broader question: was trying to losslessly wt2wt a mistake in the original Parsoid design? In 20/20 hindsight, perhaps we should have relied more heavily on selser from the beginning, and concentrated on generating *idiomatic* wikitext for new/edited wikitext. ("Idiomatic" encompasses all the work we did on nowiki removal, proper single quotes, TemplateData formatting, etc.)

This might provide a coherent design philosophy going forward, and perhaps point the way toward further simplifications in the code base (or slimming down of data-parsoid/data-mw).

It's an interesting idea. I'm not sure.

In any case, I'm going to change this to a "known but won't fix" and +2 subbu's https://gerrit.wikimedia.org/r/292078 patch, with the above philosophy as a tentative rationale going forward. We can always revert later if we decide that we actually want/need precise wt2wt even of nonidiomatic constructions like [https://en.wikipedia.org/wiki/Foo Foo].

> I can create a separate phab ticket for this if necessary since I realize the discussion now has gone beyond the original narrower question which we seem to have resolved.

> It does raise the broader question: was trying to losslessly wt2wt a mistake in the original Parsoid design? In 20/20 hindsight, perhaps we should have relied more heavily on selser from the beginning, and concentrated on generating *idiomatic* wikitext for new/edited wikitext. ("Idiomatic" encompasses all the work we did on nowiki removal, proper single quotes, TemplateData formatting, etc.)

No, it was not a mistake. The original strategy worked with the assumption that selective serialization might not get everything right, and/or that full serialization might kick in in scenarios where we didn't have access to the original source, or where its scope was not narrow enough. I added the first version of selser around Feb 2013, and over time it has had lots of tweaks and bug fixes applied (either to DSR computation or to how narrowly its scope is constrained).

But, now that we've had it solid for the last couple years, and we now have RESTBase storage that guarantees access to the previous revision, we can start relying on selser more and have the core serialization code emit more idiomatic wikitext. We've already deployed a few patches based on this strategy and I think it makes sense to consider these changes one at a time.

Separately, our testing strategy continues to rely on examining dirty diffs after roundtripping, so we cannot push ahead too far with this approach yet.

Sorry, I didn't mean "mistake" in a pejorative sense. We obviously learned a lot as we implemented Parsoid, and couldn't foresee the full success of selser. And, as you mention, precise wt2wt does often help us in testing scenarios, and was instrumental in getting the project off the ground.

But, going forward, it may be worth embracing the new "age of selser" and continuing to move away from testing strategies and bits of our codebase that depend on precise wt2wt. For example, we could extend parsertests to include an "idiomatic wikitext" section alongside the "!! wikitext" section that already exists. We would expect precise wt2wt of idiomatic wikitext, and we would expect wt2wt of non-idiomatic wikitext to end up matching the idiomatic wikitext section.

Round trip tests are more difficult w/o precise wt2wt, but perhaps visual diffs are a more robust mechanism to catch the same sorts of bugs (although currently there is a large practical performance difference between the two). wt2wt2wt could be used to catch gross round-trip errors (where the output of the first wt2wt pass is assumed to be "idiomatic wikitext" which is then tested to (a) ensure it wt2wt roundtrips 100% cleanly from that point, and (b) renders exactly the same as the original wikitext, using visual diff methods).
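For concreteness, a minimal sketch of that wt2wt2wt check might look like the following; wt2wt() and visualDiff() are hypothetical stand-ins for the round-trip and rendering-comparison machinery, not existing APIs.

// Sketch only; wt2wt() and visualDiff() are hypothetical stand-ins.
async function checkRoundTrip(originalWikitext, { wt2wt, visualDiff }) {
    // First pass: whatever comes out is treated as the "idiomatic" form.
    const idiomatic = await wt2wt(originalWikitext);
    // (a) From that point on, round-tripping must be 100% clean.
    const stable = (await wt2wt(idiomatic)) === idiomatic;
    // (b) The idiomatic form must render the same as the original.
    const rendersSame = await visualDiff(originalWikitext, idiomatic);
    return { stable, rendersSame };
}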

ssastry reopened this task as Open. (Edited) Aug 16 2016, 3:42 PM

Okay, I am reopening the discussion because, well, in rt-testing we uncovered lots of dirty diffs with the normalized output that is being emitted as part of T71207: Parsoid: Interwiki links are halfway converted to external links, and completely broken. As it turns out, external links to creativecommons, imdb, archive.org, and many other websites now get emitted as wikilinks because the interwiki map has entries for all of these and many, many more external websites.

In production, this is not necessarily an issue since selective serialization will leave these links untouched. However, on edits, all these links to external websites will be forced into wikilink form. That strategy is probably fine for interwiki links to wikimedia wikis, but I am not sure that is necessarily desirable behaviour for all these other sites like imdb, doi, archive.org, creativecommons, etc. I checked the enwiki interwiki map and there are entries for many other websites like arxiv, doi, creaturewiki, google cache, etc.

So, I guess we need to revisit this discussion and see what is desirable / appropriate. In the meantime, I am going to revert my patch.

Ping.

Specifically, if someone adds a link to archive.org, imdb, etc, is it okay to emit wikilinks like these instead of the corresponding external link syntax?

  • [[oldwikisource:Author:Uwe_Kils|High resolution images]]
  • [[iarchive:chineseclassics02legggoog|Tom I, 1861]]
  • [[rfc:7159|The JavaScript Object Notation (JSON) Data Interchange Format (RFC 7159)]]

.. and so on .. (enwiki's interwiki map has lots and lots ... floralwiki, flickr, gentoo, hackerspace, dreamhost (!), doomwiki, .. etc.) See https://github.com/wikimedia/parsoid/blob/master/lib/config/baseconfig/enwiki.json#L2072-L5898

If not, and if we don't want VE to give us this hint (on what basis would VE do it, except by maintaining some map or asking users to flag it, which is probably not such a good idea for arcana like this), we will have to add logic to selectively apply parts of the interwiki map, by maintaining a separate regex / config of projects for which to apply the wikilink syntax.

@Deskana, @Jdforrester-WMF, @Elitre, @Whatamidoing-WMF, @Quiddity ... input welcome here.

IMO the examples you give would probably be okay (especially the RFC one), but if editors see [[google:this]] or [[flickr:that]], then they will be surprised and angry.

Yup, that seems reasonable to me :-) which is why I reverted my previously merged patch in 2016. So, I suppose the question is how we pick the subset of websites (wikis + other restricted ones) for which Parsoid should auto-wiki-linkify, since we can no longer blindly use the interwiki map for this purpose. Ideally, this would be a global config across all wikis that we would use in Parsoid.

> .. and so on .. (enwiki's interwiki map has lots and lots ... floralwiki, flickr, gentoo, hackerspace, dreamhost (!), doomwiki, .. etc.) See https://github.com/wikimedia/parsoid/blob/master/lib/config/baseconfig/enwiki.json#L2072-L5898

IIUC, that list is global, not specific to Enwiki, per https://meta.wikimedia.org/wiki/Special:Interwiki - (I assume the name of that .json file, or the inclusion of that list in that file, is a historic artifact, or something similarly/more complicated! Or it replicates a global file, or something.)

> .. and so on .. (enwiki's interwiki map has lots and lots ... floralwiki, flickr, gentoo, hackerspace, dreamhost (!), doomwiki, .. etc.) See https://github.com/wikimedia/parsoid/blob/master/lib/config/baseconfig/enwiki.json#L2072-L5898

> IIUC, that list is global, not specific to Enwiki, per https://meta.wikimedia.org/wiki/Special:Interwiki - (I assume the name of that .json file, or the inclusion of that list in that file, is a historic artifact, or something similarly/more complicated! Or it replicates a global file, or something.)

Aha! Maybe the forward property is our magic wand, i.e., apply the wikilink transformation only to those entries with the forward (is_local) property enabled.
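A minimal sketch of that filter, assuming interwiki map entries shaped roughly like the siteinfo/baseconfig data; the field names 'prefix', 'url', and 'local' are assumptions and may differ in the real config.

// Sketch: restrict auto-wikilink-ification to entries with the forward/local flag.
// Field names are assumed; the real interwiki map shape may differ.
function forwardablePrefixes(interwikiMap) {
    return interwikiMap
        .filter((entry) => Boolean(entry.local)) // forward (is_local) enabled
        .map((entry) => ({ prefix: entry.prefix, url: entry.url }));
}

// Example with made-up entries:
// forwardablePrefixes([
//     { prefix: 'wikt',   url: 'https://en.wiktionary.org/wiki/$1', local: true },
//     { prefix: 'flickr', url: 'https://www.flickr.com/$1' },
// ])
// => [ { prefix: 'wikt', url: 'https://en.wiktionary.org/wiki/$1' } ]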

> IMO the examples you give would probably be okay (especially the RFC one), but if editors see [[google:this]] or [[flickr:that]], then they will be surprised and angry.

See https://meta.wikimedia.org/wiki/Talk:Interwiki_map/Archives/2009-04#Google and https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)/Archive_46#Jumbled_up_mess_of_link_types and https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_AF#[[MemoryAlpha:articlename]]_links:_how,_when,_why? for some previous discussions on this point.

It actually seems like the problem there is whether or not the link is *styled* as an external link, not whether the interwiki link syntax is used. It would be consistent for us to use interwiki *syntax* for all links where that's possible (it makes the wikitext more compact and readable), but only to add external link *icons* to those where forwarding is not enabled. That would be consistent with the intention of the forward property (i.e., if forward=false, the link is "scarier" -- i.e., external -- than an internal forwarded link would be).

I guess subbu's proposal to only use interwiki syntax for forwardable links is consistent with this as well, but changing the CSS styling seems like it would be a more general fix.

LGoto lowered the priority of this task from Medium to Low. Mar 6 2020, 5:14 PM
LGoto moved this task from Link syntax (links & media) to Needs Investigation on the Parsoid board.