Special:Export xml responses should not be indexed by Google
Closed, Resolved (Public)

Description

https://www.google.co.uk/search?q=inurl:special:export+site:en.wikipedia.org

Template:Navbox - Wikipedia
en.wikipedia.org/?title=Special:Export&history=1&action=submit...
Wikipedia enwiki http://en.wikipedia.org/wiki/Main_Page MediaWiki 1.25wmf7 first-letter Media Special Talk User User talk Wikipedia Wikipedia talk File File talk ...

Event Timeline

Krinkle raised the priority of this task from to Needs Triage.
Krinkle updated the task description. (Show Details)
Krinkle subscribed.
MZMcBride subscribed.

This task seems like it would be fairly easy to resolve. Marking it with the good first task tag accordingly.

Note that the output is an XML page. You can't just output <meta name="robots" content="noindex,nofollow" /> on it. The only way to do that would be sending an HTTP header:

X-Robots-Tag: noindex

See https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
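To illustrate the point about sending the directive as an HTTP header rather than a meta tag, here is a minimal Python sketch. It is not the MediaWiki change itself: the WSGI app export_app and the placeholder body EXPORT_XML are hypothetical names used only for illustration.

```python
# Hypothetical stand-in for an XML export; not real Special:Export output.
EXPORT_XML = b'<?xml version="1.0"?><mediawiki></mediawiki>'


def export_app(environ, start_response):
    """Minimal WSGI app that serves XML and asks crawlers not to index it."""
    headers = [
        ("Content-Type", "application/xml; charset=utf-8"),
        # An XML body has no <head> for a robots meta tag, so the
        # directive is sent as a response header instead.
        ("X-Robots-Tag", "noindex"),
        ("Content-Length", str(len(EXPORT_XML))),
    ]
    start_response("200 OK", headers)
    return [EXPORT_XML]
```

Any WSGI server can host this, for example wsgiref.simple_server.make_server("", 8000, export_app) from the standard library.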

On WMF wikis it may not be enough, since the front-end caches may filter out non-standard HTTP headers. This may need to be tested once the fix goes live, unless someone else knows if that could be a problem.

Also, Google notoriously ignores the noindex,nofollow meta tags on several pages, so I wonder if that would even stop Google from indexing them. See T48424

Is there a constraint for Google only, or in general for any search engine? In the former case the solution would be
X-Robots-Tag: googlebot: noindex, nofollow
and in the latter case it would be
X-Robots-Tag: noindex, nofollow
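The two header forms differ only in the optional crawler prefix. A small Python sketch of building either value follows; robots_tag_value is a hypothetical helper name and not part of any library.

```python
def robots_tag_value(directives, user_agent=None):
    """Build the value part of an X-Robots-Tag header.

    With user_agent set, the directives apply only to that crawler;
    without it, they apply to every crawler that honors the header.
    """
    rules = ", ".join(directives)
    return f"{user_agent}: {rules}" if user_agent else rules


# Google-only form versus the form that applies to all crawlers.
google_only = robots_tag_value(["noindex", "nofollow"], user_agent="googlebot")
all_crawlers = robots_tag_value(["noindex", "nofollow"])
```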

Is there a constraint for Google only or in general any search engine

Any, if a shared solution exists.

Has something changed? The google search linked in the task description no longer seems to show extraneous entries.

I think they're not visible by default anymore:

In order to show you the most relevant results, we have omitted some entries very similar to the 2 already displayed.
If you like, you can repeat the search with the omitted results included.

So, click on the "repeat the search with the omitted results included." link or use:
https://www.google.co.uk/search?q=inurl:special:export+site:en.wikipedia.org#q=inurl:special:export+site:en.wikipedia.org&filter=0

Has something changed? The google search linked in the task description no longer seems to show extraneous entries.

Not directly, but it displays this message for me: In order to show you the most relevant results, we have omitted some entries very similar to the 2 already displayed.
If you like, you can repeat the search with the omitted results included.

Clicking on that link I see this:

googlesearch.png (666×579 px, 73 KB)

Not many results, though.

Bing has a lot more: https://www.bing.com/search?q=special%3aexport+site%3aen.wikipedia.org

Change 285087 had a related patch set uploaded (by Unicornisaurous):
Add X-Robots-Tag header to Special:Export dumps

https://gerrit.wikimedia.org/r/285087

As previously mentioned, this change needs to be tested on the WMF cluster in case non-standard headers are filtered out.
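One way to run that test is to inspect the headers that actually reach the client. The sketch below assumes the response headers are available as (name, value) pairs. has_noindex is a hypothetical helper, and it deliberately simplifies: it ignores which crawler a prefixed directive targets.

```python
def has_noindex(headers):
    """Return True if any X-Robots-Tag header carries a noindex directive.

    headers is an iterable of (name, value) pairs as received by the client.
    """
    for name, value in headers:
        if name.lower() != "x-robots-tag":
            continue
        for token in value.lower().split(","):
            # Drop an optional "googlebot:"-style prefix before comparing.
            directive = token.rsplit(":", 1)[-1].strip()
            if directive == "noindex":
                return True
    return False
```

With the standard library, the pairs can be taken from urllib.request.urlopen(url).headers.items() for a Special:Export URL. A False result would suggest the header was dropped somewhere between the application and the client.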

Change 285087 merged by jenkins-bot:
Add X-Robots-Tag header to Special:Export dumps

https://gerrit.wikimedia.org/r/285087

This is working correctly on the beta cluster.