Page MenuHomePhabricator

Add rel="nofollow" to Parsoid links
Open, NormalPublic

Description

If Parsoid is to be used for web page views (as opposed to "internal to editors" and "mobile web" or "app" views), it should respect the WP $wgNoFollowLinks configuration option. This may have to be exported in the MW siteinfo if it is not already present.

We already use rel for link type. The proper way to combine rel attributes is by space-separating them:

To determine which link types apply to a link, a, or area element, the element's rel attribute must be split on ASCII whitespace. The resulting tokens are the keywords for the link types that apply to that element.

https://html.spec.whatwg.org/#linkTypes

There are probably CSS selectors in various places which look like a[rel="...."]. These should be changed to a[rel~="..."] (see https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors ) -- in fact, they should always have been written that way, but it wouldn't have broken before we added nofollow.

Event Timeline

cscott created this task.Feb 1 2018, 5:15 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 1 2018, 5:15 PM
ssastry triaged this task as Normal priority.Feb 1 2018, 5:16 PM
ssastry moved this task from Backlog to Read Views on the Parsoid board.
cscott added a comment.Feb 1 2018, 5:25 PM

Repeating a comment from T58756:

I'd *strongly* prefer to see this added as a post-processing pass, instead of directly to the wt2html pass. We should maintain a good discipline about separating out redundant information which is only used for read views or styling, so that (in the future) this can be generated on-demand and we don't waste storage or transmission bandwidth on unnecessary bits.

Some IRC discussion and related phab tasks:
<cscott> subbu: rel="nofollow" is part of that "redundant information needed for read views but not editing"
<cscott> in fact, it's only needed for web read views, not mobile read views, since no one indexes our mobile content (....not that that's necessarily a good thing....)
<subbu> rel="nofollow" is a directive to crawlers, isn't it?
<sbailey> yes, nofollow is for google etc.
<subbu> and also affects page rank computation i thought .. or maybe it might have at one point .. i assume google's algo is much smarter now.
<cscott> it's a disincentive to put spam links on wikipedia, since we use nofollow so they won't get any google juice
<cscott> but yeah, i think google's algorithm is smarter
<cscott> and it's not entirely clear that we *don't* want to give google juice to authoritative references cited in our articles
<sbailey> yes, actually increases page rank supposedly
<subbu> cscott, that is not a decision for us to make?
<cscott> yes, exactly
<cscott> it's a bit over our level
<cscott> and not actually relevant until/unless parsoid is used for web read views
<subbu> right .. so, that is what i meant by .. it is something that needs resolving in our output for read views .. unless someone determines it is no longer relevant.
<cscott> technically i think there's a wp config option for whether to add nofollow or not
<cscott> we should honor that, so that it is up to the individual wiki whether to use nofollow or not
<cscott> cf T8153: Disable nofollow tag for hyperlinks on de.wikipedia for instance, where dewiki wanted to remove nofollow
<subbu> right .. so, the qn. is how do we tackle this given that parsoid is already using it for the link type.
<cscott> check the spec, you can [white space separate] rel i think.
<subbu> sbailey, ^
<cscott> cf T68289: A16. Flow: external links should have rel="nofollow" property
<cscott> IMO it should be a separate post-processing phase, since it is clearly not needed for anything but read views
<sbailey> hmm, since we don't emit nofollow yet, we can make that a configuration option.
<cscott> see also https://gerrit.wikimedia.org/r/206344
<subbu> sbailey, also reg. the external link class phab ticket, you could reopen it till all pieces are done .. magic links, for ex.
<cscott> it's $wgNoFollowLinks
<subbu> sbailey, cscott is saying it is a wiki-config option.
<cscott> it *should* be exported in the MW siteinfo. if it's not already, that needs to be done first, so we can then honor the setting properly.
<cscott> (shades of the ATWG discussion about api-driven whatevers)
<cscott> and it's not just us, wikidata also has this issue: T175230: Wikidata identifier links don't respect nofollow configuration
<cscott> here's a task for "more precisely apply nofollow": T44599: Add option to apply nofollow only to external links added in revisions marked unpatrolled
<subbu> T44594#471946
<cscott> subbu: yeah, just pointing out that it's been controversial and continues to be so, which is why we originally punted on it
<cscott> subbu: i've got no problem with adding code to maintain parity with PHP, but i'd prefer it be done as part of the "further annotate output for read views" post-pass.
<cscott> if we care about storage costs, eg, we'd store the parsoid HTML w/o the post-passes which add all this cruft, and then add it on the fly into a memcache when a read view is requested
<subbu> not sure that is the reason we punted on it ... at least i am not aware of previous discussion around this .. but, definitely makes it awkward since we have repurposed that attr for link types based on rdfa
<subbu> rel="mw:ExtLink nofollow" would be odd ...
<subbu> i suppose not so odd if i stare at it long enough :)
<sbailey> The argument in T44594#471946
<sbailey> states that nofollow is generally helpful for all wikis to avoid them being used to manipulate pagerank and spamming.
<cscott> subbu: see the discussion of WP at https://en.wikipedia.org/wiki/Nofollow#Use_on_other_websites
<cscott> subbu: it was always controversial, but i think the real reason we didn't do nofollow is just the "avoid adding redundant information to read views" thing.
<cscott> which we've backtracked on
<subbu> cscott, in general true, but not sure if that holds for the nofollow property .. i don't remember discussions around this previously.
<cscott> subbu: i do, but then i wasted a lot of time debating gabriel about things like this. ;)
<cscott> rel is space separated, i believe. and any css rules that target rel should be changed to use the space-separated attribute selector.
<cscott> https://html.spec.whatwg.org/#linkTypes
<cscott> the element's rel attribute must be split on ASCII whitespace.

cscott added a comment.Feb 1 2018, 5:26 PM

If this is implemented in Parsoid, the Flow code added for T68289: A16. Flow: external links should have rel="nofollow" property can/should be removed.

Reedy edited projects, added Parsoid-Read-Views; removed Parsoid.Sep 17 2018, 7:25 PM