Page MenuHomePhabricator

Add rel="nofollow" to Parsoid links
Closed, ResolvedPublic

Description

If Parsoid is to be used for web page views (as opposed to "internal to editors" and "mobile web" or "app" views), it should respect the WP $wgNoFollowLinks configuration option. This may have to be exported in the MW siteinfo if it is not already present.

We already use rel for link type. The proper way to combine rel attributes is by space-separating them:

To determine which link types apply to a link, a, or area element, the element's rel attribute must be split on ASCII whitespace. The resulting tokens are the keywords for the link types that apply to that element.

https://html.spec.whatwg.org/#linkTypes

There are probably CSS selectors in various places which look like a[rel="...."]. These should be changed to a[rel~="..."] (see https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors ) -- in fact, they should always have been written that way, but it wouldn't have broken before we added nofollow.

Event Timeline

ssastry triaged this task as Medium priority.Feb 1 2018, 5:16 PM
ssastry moved this task from Needs Triage to Read Views on the Parsoid board.

Repeating a comment from T58756:

I'd *strongly* prefer to see this added as a post-processing pass, instead of directly to the wt2html pass. We should maintain a good discipline about separating out redundant information which is only used for read views or styling, so that (in the future) this can be generated on-demand and we don't waste storage or transmission bandwidth on unnecessary bits.

Some IRC discussion and related phab tasks:
<cscott> subbu: rel="nofollow" is part of that "redundant information needed for read views but not editing"
<cscott> in fact, it's only needed for web read views, not mobile read views, since no one indexes our mobile content (....not that that's necessarily a good thing....)
<subbu> rel="nofollow" is a directive to crawlers, isn't it?
<sbailey> yes, nofollow is for google etc.
<subbu> and also affects page rank computation i thought .. or maybe it might have at one point .. i assume google's algo is much smarter now.
<cscott> it's a disincentive to put spam links on wikipedia, since we use nofollow so they won't get any google juice
<cscott> but yeah, i think google's algorithm is smarter
<cscott> and it's not entirely clear that we *don't* want to give google juice to authoritative references cited in our articles
<sbailey> yes, actually increases page rank supposedly
<subbu> cscott, that is not a decision for us to make?
<cscott> yes, exactly
<cscott> it's a bit over our level
<cscott> and not actually relevant until/unless parsoid is used for web read views
<subbu> right .. so, that is what i meant by .. it is something that needs resolving in our output for read views .. unless someone determines it is no longer relevant.
<cscott> technically i think there's a wp config option for whether to add nofollow or not
<cscott> we should honor that, so that it is up to the individual wiki whether to use nofollow or not
<cscott> cf T8153: Disable nofollow tag for hyperlinks on de.wikipedia for instance, where dewiki wanted to remove nofollow
<subbu> right .. so, the qn. is how do we tackle this given that parsoid is already using it for the link type.
<cscott> check the spec, you can [white space separate] rel i think.
<subbu> sbailey, ^
<cscott> cf T68289: A16. Flow: external links should have rel="nofollow" property
<cscott> IMO it should be a separate post-processing phase, since it is clearly not needed for anything but read views
<sbailey> hmm, since we don't emit nofollow yet, we can make that a configuration option.
<cscott> see also https://gerrit.wikimedia.org/r/206344
<subbu> sbailey, also reg. the external link class phab ticket, you could reopen it till all pieces are done .. magic links, for ex.
<cscott> it's $wgNoFollowLinks
<subbu> sbailey, cscott is saying it is a wiki-config option.
<cscott> it *should* be exported in the MW siteinfo. if it's not already, that needs to be done first, so we can then honor the setting properly.
<cscott> (shades of the ATWG discussion about api-driven whatevers)
<cscott> and it's not just us, wikidata also has this issue: T175230: Wikidata identifier links don't respect nofollow configuration
<cscott> here's a task for "more precisely apply nofollow": T44599: Add option to apply nofollow only to external links added in revisions marked unpatrolled
<subbu> T44594#471946
<cscott> subbu: yeah, just pointing out that it's been controversial and continues to be so, which is why we originally punted on it
<cscott> subbu: i've got no problem with adding code to maintain parity with PHP, but i'd prefer it be done as part of the "further annotate output for read views" post-pass.
<cscott> if we care about storage costs, eg, we'd store the parsoid HTML w/o the post-passes which add all this cruft, and then add it on the fly into a memcache when a read view is requested
<subbu> not sure that is the reason we punted on it ... at least i am not aware of previous discussion around this .. but, definitely makes it awkward since we have repurposed that attr for link types based on rdfa
<subbu> rel="mw:ExtLink nofollow" would be odd ...
<subbu> i suppose not so odd if i stare at it long enough :)
<sbailey> The argument in T44594#471946
<sbailey> states that nofollow is generally helpful for all wikis to avoid them being used to manipulate pagerank and spamming.
<cscott> subbu: see the discussion of WP at https://en.wikipedia.org/wiki/Nofollow#Use_on_other_websites
<cscott> subbu: it was always controversial, but i think the real reason we didn't do nofollow is just the "avoid adding redundant information to read views" thing.
<cscott> which we've backtracked on
<subbu> cscott, in general true, but not sure if that holds for the nofollow property .. i don't remember discussions around this previously.
<cscott> subbu: i do, but then i wasted a lot of time debating gabriel about things like this. ;)
<cscott> rel is space separated, i believe. and any css rules that target rel should be changed to use the space-separated attribute selector.
<cscott> https://html.spec.whatwg.org/#linkTypes
<cscott> the element's rel attribute must be split on ASCII whitespace.

If this is implemented in Parsoid, the Flow code added for T68289: A16. Flow: external links should have rel="nofollow" property can/should be removed.

There's all of Parser::getExternalLinkAttribs() that needs considering here

Change 804317 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/core@master] Export nofollow settings in siteinfo API

https://gerrit.wikimedia.org/r/804317

Change 804318 had a related patch set uploaded (by C. Scott Ananian; author: Isabelle Hurbain-Palatin):

[mediawiki/services/parsoid@master] Add attributes to Parsoid external links

https://gerrit.wikimedia.org/r/804318

Change 804317 merged by jenkins-bot:

[mediawiki/core@master] Ensure core compatibility with Parsoid external link attributes support

https://gerrit.wikimedia.org/r/804317

Change 814741 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/services/parsoid@master] Allow multivalues in rel attributes

https://gerrit.wikimedia.org/r/814741

Change 814741 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Allow multivalues in rel attributes

https://gerrit.wikimedia.org/r/814741

Change 816216 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.16.0-a17

https://gerrit.wikimedia.org/r/816216

Change 816216 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.16.0-a17

https://gerrit.wikimedia.org/r/816216

Change 820512 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/services/mobileapps@master] Allows for multiple values in the "rel" attribute

https://gerrit.wikimedia.org/r/820512

Change 820512 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Allows for multiple values in the "rel" attribute

https://gerrit.wikimedia.org/r/820512

Change 822655 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/services/parsoid@master] Add attributes to Parsoid external links

https://gerrit.wikimedia.org/r/822655

Change 823594 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/services/mobileapps@master] Add spaces around rel attribute checks

https://gerrit.wikimedia.org/r/823594

Change 823594 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Split the "rel" attribute along whitespace and check for value

https://gerrit.wikimedia.org/r/823594

Change 804318 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Add nofollow support to SiteConfig & disable test checks for nofollow

https://gerrit.wikimedia.org/r/804318

Change 827533 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@master] Bump parsoid to 0.16.0-a20

https://gerrit.wikimedia.org/r/827533

Change 827533 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.16.0-a20

https://gerrit.wikimedia.org/r/827533

Change 822655 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Add attributes to Parsoid external links

https://gerrit.wikimedia.org/r/822655

Change 831531 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/core@master] Add support of "check-referrer" in legacy parser tests

https://gerrit.wikimedia.org/r/831531

Change 831533 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/services/parsoid@master] Fix review comments from previous patch

https://gerrit.wikimedia.org/r/831533

Change 831602 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@master] Bump parsoid to 0.17.0-a1 for MW 1.40

https://gerrit.wikimedia.org/r/831602

Change 831531 merged by jenkins-bot:

[mediawiki/core@master] Add support of ParserOptions::setExternalLinkTarget() in legacy parser tests

https://gerrit.wikimedia.org/r/831531

Change 831602 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.17.0-a1 for MW 1.40

https://gerrit.wikimedia.org/r/831602

There are probably CSS selectors in various places which look like a[rel="...."]. These should be changed to a[rel~="..."] (see https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors ) -- in fact, they should always have been written that way, but it wouldn't have broken before we added nofollow.

Given that this appears to have broken some clients "in the wild", specifically mobileapps and mwbot-rs, should this have been announced somewhere? Or gotten a spec version bump?

Or gotten a spec version bump?

I should've checked first, https://www.mediawiki.org/wiki/Specs/HTML/2.6.0 now exists (but no announcement?). Though I still don't really understand what the point of the Accept header is if I get the breaking changes anyways...

Technically 2.5.0 already stated that rel attributes were multi valued: https://mediawiki.org/wiki/Specs/HTML/2.5.0#Note_about_typeof_and_rel_attributes

Which is why we didn't consider this a major version bump over 2.5.0... although we could reconsider this if it makes life easier in the real world.

And unfortunately due to the logistic complexities of supporting N different HTML versions, I think clients sending an Accept header should be thought of as "best practices to support a future downgrade mechanism we might implement" rather than an actual promise we're going to implement a downgrade mechanism.

In the olden days of parsoid as a standalone service, one thought was that a deployment could just spin up N different parsoid services, running different releases of the code, if they wanted to support N different HTML versions. In the modern integrated codebase we don't really have good mechanisms to make it efficient to maintain multiple HTML versions without adding a lot of dev work.

Technically 2.5.0 already stated that rel attributes were multi valued: https://mediawiki.org/wiki/Specs/HTML/2.5.0#Note_about_typeof_and_rel_attributes

That's nice, except I think most people would've entirely overlooked that change since it wasn't called out in the list of changes - https://www.mediawiki.org/wiki/Specs/HTML/2.5.0#Changes_since_Specs/HTML/2.4.0.

Which is why we didn't consider this a major version bump over 2.5.0... although we could reconsider this if it makes life easier in the real world.

I think it's more about how it's announced and coordinated. Looking at T315209: Parsoid clients should handle multi-valued rel attributes (h/t @Arlolra), it's clear that enough clients we closely track needed fixes, which should imply/indicate that other clients would also need time to fix and change.

And unfortunately due to the logistic complexities of supporting N different HTML versions, I think clients sending an Accept header should be thought of as "best practices to support a future downgrade mechanism we might implement" rather than an actual promise we're going to implement a downgrade mechanism.

In the olden days of parsoid as a standalone service, one thought was that a deployment could just spin up N different parsoid services, running different releases of the code, if they wanted to support N different HTML versions. In the modern integrated codebase we don't really have good mechanisms to make it efficient to maintain multiple HTML versions without adding a lot of dev work.

https://www.mediawiki.org/wiki/Parsoid/API#Content_Negotiation, https://www.mediawiki.org/wiki/API_versioning#Content_format_stability_and_negotiation, etc. all need to be updated then.

Change 831533 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Fix review comments from previous patch

https://gerrit.wikimedia.org/r/831533

Change 837692 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.17.0-a2

https://gerrit.wikimedia.org/r/837692

Change 837692 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.17.0-a2

https://gerrit.wikimedia.org/r/837692

Change 971889 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/services/parsoid@master] Remove dead code path from WikiLinkText::fromSelSerImpl

https://gerrit.wikimedia.org/r/971889

Change 971889 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Remove dead code path from WikiLinkText::fromSelSerImpl

https://gerrit.wikimedia.org/r/971889

Change 988723 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a12

https://gerrit.wikimedia.org/r/988723

Change 988723 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a12

https://gerrit.wikimedia.org/r/988723