
ISBN links not internationalised
Open, Medium, Public

Description

https://fr.wikipedia.org/api/rest_v1/page/html/Utilisateur%3AEd_g2s%2FSandbox

In the PHP parser this links to Spécial:Ouvrages_de_référence/NNN. Parsoid always generates links to Special:BookSources/NNN.

Event Timeline

Esanders raised the priority of this task from to Needs Triage.
Esanders updated the task description.
Esanders added a project: Parsoid.
Esanders subscribed.

Parsoid should probably be generating an intermediate representation that's machine-readable, not trying to link to MediaWiki special: pages directly...

cscott added a subscriber: Arlolra.
cscott subscribed.

The href in the Parsoid output should be considered the machine-readable IR. It's not necessarily meant for humans to see. The only reason that humans can tell that the link is to the English language special page is that VE doesn't yet have a proper link inspector for magic links.

Although it would probably be straightforward to use the parent wiki's localization, it would complicate the ability of code to inspect the link href and identify magic links. (We have a data-parsoid property, but not a (user-visible) data-mw marker, so the user is expected to look at the href to identify magic links.) So for an IR, I think using a fixed URL string is preferable, and I'm -0 to localizing it in the DOM output. I'd be more inclined to localize the href if we added some other sort of magic link marker; maybe something like rel="mw:WikiLink/ISBN" (which we already use internally, then sanitize to mw:WikiLink for export).

We should probably allow the localized string for round-tripping, though. I'll have to look into whether we do that yet.

IIRC there's some idea that eventually we'll be able to use the parsoid output for regular display rendering, and be able to activate VisualEditor directly on that HTML without loading up a second copy...

If so, I would caution strongly against using the a element's href attribute for machine-readable metadata if it has to be replaced with the correct link for proper UI and linking in the readable rendering. This would at a minimum mean you have to reverse the reading transformations before activating the editor...

There are always going to be some transformations done to the Parsoid DOM
before it is useful for reading. For example: red links, stubs (which are
in the user preferences). Even default thumbnail sizes are a user
preference.

And we can still do a correct html2wt conversion with a localized href,
without requiring that the wt2html output have a localized href. (We are
very generous in our html2wt pass.)

Basically: there's an inherent tension here between Parsoid DOM as a useful
intermediate representation for machine processing, and using the DOM
directly for reading. We are already doing significant manipulation of the
Parsoid DOM for reading on the mobile front end.

It is true that we would like to minimize the amount of manipulation needed
for reading. But we would also like to make the DOM as useful as possible
as an IR.

Which is just to say that neither "we need this for reading" nor "we need
this for a good IR" is a sufficient argument by itself. Both of these are
weighed against each other.

All that said, I'm leaning toward localizing the href, but at the same
time adding a data-mw property explicitly marking magic links. (Or
altering the rel string, but that might break more things; investigation
needed.) That would seem to preserve the easy machine identification of
magic links while also being better for direct reading purposes. But I
want to hear from arlo and subbu.

*nod* explicit marking makes sense to me; allows letting the link be localized (or transformed to localized form only when used for reading) without breaking the machine-readability needed for editing.

Yes, I'd much rather look for a specific attribute/value than pattern match the href.

So I suspect that the main problem with changing the rel to mw:ExtLink/RFC, mw:ExtLink/PMID, or mw:WikiLink/ISBN is that VE's current link inspector might break, because the rel attribute won't be something it's looking for.

You could still use the ^= CSS selector to find "all external links" or "all internal links" irrespective of the rel suffix.
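To illustrate the prefix-match idea, here is a minimal sketch in Python that mimics what a selector such as a[rel^="mw:ExtLink"] or a[rel^="mw:WikiLink"] would pick out. The sample markup and the helper names (RelPrefixCollector, links_with_rel_prefix) are made up for this illustration, and the suffixed rel values are the ones proposed in this thread, not something Parsoid is confirmed to emit.

```python
from html.parser import HTMLParser


class RelPrefixCollector(HTMLParser):
    """Collect hrefs of <a> elements whose rel value starts with a prefix.

    This mirrors the CSS attribute selector a[rel^="<prefix>"].
    """

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        rel = attr.get("rel") or ""
        if rel.startswith(self.prefix):
            self.hrefs.append(attr.get("href"))


def links_with_rel_prefix(html, prefix):
    """Return the hrefs of all links whose rel starts with `prefix`."""
    collector = RelPrefixCollector(prefix)
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Hypothetical sample markup using the suffixed rel values discussed above.
SAMPLE = (
    '<a rel="mw:ExtLink/RFC" href="https://tools.ietf.org/html/rfc1234">RFC 1234</a>'
    '<a rel="mw:ExtLink" href="https://example.org/">example</a>'
    '<a rel="mw:WikiLink/ISBN" href="./Special:BookSources/0123456789">ISBN 0123456789</a>'
    '<a rel="mw:WikiLink" href="./Main_Page">Main Page</a>'
)

# Both the plain and the suffixed rel values match the shared prefix.
external = links_with_rel_prefix(SAMPLE, "mw:ExtLink")
internal = links_with_rel_prefix(SAMPLE, "mw:WikiLink")
```

A client that only cares about "internal vs. external" keeps working with the prefix match, while a client that wants magic links specifically can match the full rel value.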

IIRC Parsoid used to generate rel attributes of this form. I'd like to know why this was changed before considering changing it back. @GWicke? @ssastry? Do either of you remember?

We can coordinate and keep them working. We'll need a separate inspector for these types anyway.

IIRC Parsoid used to generate rel attributes of this form. I'd like to know why this was changed before considering changing it back. @GWicke? @ssastry? Do either of you remember?

T55432#559744 has some reasoning about this and https://gerrit.wikimedia.org/r/#/c/84871/ is the cleanup patch in question.

There were other considerations that cropped up in other scenarios and I think those were mostly around whether Parsoid can trust the attribute markup that it gets from clients (i.e. on edits, should clients be burdened with the task of updating the markup when Parsoid can easily infer that information). So, Parsoid has, over the last couple years, moved toward treating the typeof attributes on links as hints rather than as being authoritative.

The larger question this brings up (and it has come up in other contexts -- see T100225#1315556) is about what the DOM spec means. We definitely treat it as a guarantee that Parsoid makes about the HTML it generates. However, we have been more lenient about what Parsoid accepts (for the html -> wt transformation) in certain areas, but not so lenient about others. I think @GWicke was primarily making changes to get rid of that distinction and tightening it so that the attribute markers serve a contract both ways.

So, this rel markup discussion indicates that perhaps there is some value in the additional markup, but something to discuss further as to how to accommodate it in light of the above remarks.

It's odd that ISBNs are the only "interesting" case here. No one is proposing internationalizing the (external) hrefs we emit for RFC or PMID links, so looking at the href is still a fine way to identify magic links in those two cases.

I'd feel better about marking up the magic links if the need wasn't so specific to ISBNs.

Well, potentially RFC/PMID links could be i18ned in the future; it just so happens that those point to an external site which is all English at the moment. Even if we don't i18n them, it would be useful if we could identify them without a regex.

Previous discussion on simplifying our link markup by making the href authoritative:

http://thread.gmane.org/gmane.science.linguistics.wikipedia.wikitext/852

ISBN links were the potentially problematic case we called out then as well.

The main issue in this task seems to be about the VE handling, which I think we all agree is largely orthogonal to how an ISBN link is marked up. The other issue while reading would be seeing a link to Special:BookSources (non-localized) when hovering over a link, and going through a redirect to the localized version when following it. To me that issue seems to be relatively minor, but I can see that others might feel strongly about consistently localized href attributes.

Overall, I think there are good reasons for using hrefs as a single source of truth, not least of them a reduction in the page size, and reduced complexity especially while editing. Matching a fixed href prefix should be no harder than matching a fixed attribute value.

URLs are quite central in the semantic web. I think we can trust them to carry their own weight a little more.

To be clear, my current position is to be generous in what we emit (extra "rel" types to disambiguate magic links for clients) but also be generous in what we accept (ignore "rel" attributes on serialization, so that clients aren't forced to maintain irrelevant syntactic distinctions).

The thread that @GWicke linked above does lay out the arguments on all sides pretty well. To elaborate some:

Internal/External links are problematic because we would really like to be able to use the browser's CSS selector machinery to efficiently pick out DOM nodes of interest. Currently we can do a rel="..." selector. Without rel information, you'd have to rely on a href^="http:",href^="https:",... selector to identify external links, which has three main problems: (1) the list of valid protocols is not short, (2) the list of valid protocols varies by wiki, and (3) this CSS selector cannot be negated to identify internal links.

ISBNs are a particularly thorny case, since there is no simple fixed regexp that can identify a *localized* isbn link. You'd have to do an alternation including every wiki language, and track changes to the localization on an ongoing basis as well. (Or, more likely, use a dynamically-generated regexp based on a fetch of /w/api.php?action=query&meta=siteinfo&format=json&siprop=specialpagealiases.)
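A rough sketch of the dynamically-generated regexp, in Python. It assumes the siteinfo response has the shape query.specialpagealiases = [{realname, aliases}, ...] and that the BookSources page is listed under the realname "Booksources"; both should be checked against a real response. The namespace names are passed in separately (they would come from a different siprop), and SAMPLE_SITEINFO is hand-written illustration data, not a real fetch. Namespace case-insensitivity is ignored here.

```python
import re
from urllib.parse import unquote


def build_isbn_href_regex(siteinfo, special_ns_names):
    """Build a regexp matching localized ISBN hrefs for one wiki.

    siteinfo: parsed JSON from
        action=query&meta=siteinfo&siprop=specialpagealiases
    special_ns_names: names of the Special namespace on that wiki
        (assumed to be supplied by the caller).
    """
    aliases = []
    for entry in siteinfo["query"]["specialpagealiases"]:
        # Assumption: the canonical name of the page is "Booksources".
        if entry["realname"] == "Booksources":
            aliases = entry["aliases"]
            break
    if not aliases:
        raise ValueError("no aliases found for the BookSources special page")

    def alternation(names):
        # Titles in URLs use underscores where the display form has spaces.
        return "|".join(re.escape(n.replace(" ", "_")) for n in names)

    pattern = (
        r"^\./(?:" + alternation(special_ns_names) + r")"
        r":(?:" + alternation(aliases) + r")"
        r"/(?P<isbn>[0-9Xx]+)$"
    )
    return re.compile(pattern)


# Hand-written sample shaped like the assumed API response (French wiki).
SAMPLE_SITEINFO = {
    "query": {
        "specialpagealiases": [
            {
                "realname": "Booksources",
                "aliases": ["Ouvrages_de_référence", "BookSources"],
            }
        ]
    }
}

isbn_re = build_isbn_href_regex(SAMPLE_SITEINFO, ["Spécial", "Special"])

# hrefs may arrive percent-encoded, so decode before matching.
localized = isbn_re.match(
    unquote("./Sp%C3%A9cial:Ouvrages_de_r%C3%A9f%C3%A9rence/0123456789")
)
canonical = isbn_re.match("./Special:BookSources/0123456789")
```

Even in this reduced form the client needs per-wiki configuration and a refresh strategy for when aliases change, which is the cost being pointed out.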

It seems to me that the complexity of identifying the type of a link solely by its href goes up quite a lot if we emit localized hrefs for ISBNs.

Change 233655 had a related patch set uploaded (by Cscott):
WTS support for localized ISBN magic links

https://gerrit.wikimedia.org/r/233655

Change 233656 had a related patch set uploaded (by Cscott):
Emit localized ISBN magic links

https://gerrit.wikimedia.org/r/233656

ISBNs are a particularly thorny case, since there is no simple fixed regexp that can identify a *localized* isbn link.

This is only true if the href is indeed localized. If it is not, then recognizing ISBN links is a matter of matching [href^="./Special:BookSources/"].

A blanket match for external links is a:not([href^="./"]).
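For comparison, the href-only approach can be sketched in a few lines of Python. This assumes non-localized hrefs and that every internal link starts with "./", as the two selectors above imply; the function name classify_href is made up for this example.

```python
ISBN_PREFIX = "./Special:BookSources/"


def classify_href(href):
    """Classify a link purely by its href.

    Mirrors a[href^="./Special:BookSources/"] for ISBN links and
    a:not([href^="./"]) for external links.
    """
    if href.startswith(ISBN_PREFIX):
        return "isbn"
    if href.startswith("./"):
        return "internal"
    # Everything else is treated as external, exactly as the
    # blanket :not() selector would do.
    return "external"


kinds = [
    classify_href("./Special:BookSources/0123456789"),
    classify_href("./Main_Page"),
    classify_href("https://tools.ietf.org/html/rfc1234"),
]
```

Note that this only holds while the ISBN href stays fixed; a localized href such as ./Spécial:Ouvrages_de_référence/NNN would fall through to "internal".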

Change 233655 merged by jenkins-bot:
WTS support for localized ISBN magic links

https://gerrit.wikimedia.org/r/233655

Arlolra triaged this task as Medium priority. Jan 16 2016, 2:46 AM