
URLs with action=render should not be indexed by search engines
Closed, Resolved · Public

Description

URLs with action=render are indexed by external search engines like Google. For example, see https://www.google.com/search?q=site:en.wikipedia.org+inurl:action%3Drender .

I'm not sure of the best approach to fix this. The pages are not well-formed (there's no <head> element at all), so I don't know whether a <meta> robots directive (http://www.robotstxt.org/meta.html) would work. It might be necessary to use robots.txt instead.
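For reference, a robots.txt rule along these lines could exclude such URLs. This is only a sketch: the exact pattern depends on how the wiki's URLs are structured, and `*` wildcards are a widely supported extension (honored by Google and Bing) rather than part of the original robots.txt standard.

```
User-agent: *
Disallow: /*action=render
```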


Version: 1.23.0
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=46424

Event Timeline

bzimport raised the priority of this task from to Normal. Nov 22 2014, 3:21 AM
bzimport set Reference to bz63891.

<meta> will likely work, since <html>, <head>, and <body> are optional in HTML. Browsers automatically create a head and body for text/html documents, and relevant tags are hoisted into the <head> accordingly.

However, that would be undesirable for a more important reason: action=render is used to retrieve partial documents. If those responses started including non-content markup, some applications would treat the <meta> tag as part of the content and could incorrectly treat articles as non-indexable.

This sounds like a perfect case for an http header.

(In reply to Krinkle from comment #1)

This sounds like a perfect case for an http header.

Specifically,

X-Robots-Tag: noindex

This is also the common way on the web to exclude internal APIs that don't respond with HTML (e.g. JSON responses, or images) when robots.txt hacking is not desired.
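The idea can be sketched as follows. This is a minimal, hypothetical WSGI app in Python, not MediaWiki's actual PHP implementation; the point is that the noindex signal travels in an HTTP response header, leaving the returned HTML fragment untouched.

```python
# Sketch: send "X-Robots-Tag: noindex" for action=render responses,
# keeping the noindex directive out of the document body.
from urllib.parse import parse_qs

def app(environ, start_response):
    params = parse_qs(environ.get("QUERY_STRING", ""))
    headers = [("Content-Type", "text/html; charset=utf-8")]
    # For partial-document renders, mark the response non-indexable
    # via a header instead of injecting a <meta> tag into the fragment.
    if params.get("action") == ["render"]:
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"<p>Rendered article fragment</p>"]
```

Crawlers that honor X-Robots-Tag (Google and Bing document support for it) will drop the page from their index, while consumers of the fragment see only the content they asked for.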

Change 134996 had a related patch set uploaded by devunt:
Add 'X-Robots-Tag: noindex' header in action=render pages

https://gerrit.wikimedia.org/r/134996

Change 134996 merged by jenkins-bot:
Add 'X-Robots-Tag: noindex' header in action=render pages

https://gerrit.wikimedia.org/r/134996

Merged by Mattflaschen.