
[Task] Disallow Special:GoToLinkedPage in wikidata.org/robots.txt
Closed, Resolved · Public

Description

We just realized that Google started indexing tens of thousands of Special:GoToLinkedPage redirects.

Tasks:

Event Timeline

Mbch331 subscribed.

Added the link to the page.
Leaving it open for a few days in case more pages are found.
Might take some time before it's actually visible in the robots.txt

What about URLs like title=Special:GoToLinkedPage&site=dewiki&itemid=Q123456?

thiemowmde moved this task from incoming to in progress on the Wikidata board.

@hoo, doesn't seem to be a problem at the moment: https://www.google.com/search?q=site:wikidata.org+inurl:title=Special:GoToLinkedPage&filter=0

I played around with different Google queries and found that these three are an actual problem:

My conclusions:

  • ItemByTitle is a pure search-and-redirect page, similar to GoToLinkedPage.
  • EntityData is either an RDF file (not sure if we want to disallow these) or a redirect when no file extension is given.
  • All exclusions should end in a slash so they do not exclude the special page itself, only the results it produces.
  • All URLs with %3A instead of : are duplicates anyway; let's exclude them all.

@Mbch331, please edit https://www.wikidata.org/wiki/MediaWiki:Robots.txt as follows. Thanks.

Disallow: /wiki/Special%3A
Disallow: /wiki/Special:EntityData/
Allow: /wiki/Special:EntityData/*.
Disallow: /wiki/Special:GoToLinkedPage/
Disallow: /wiki/Special:ItemByTitle/
Disallow: /wiki/Special:SetAliases/
Disallow: /wiki/Special:SetDescription/
Disallow: /wiki/Special:SetLabel/
Disallow: /wiki/Special:SetLabelDescriptionAliases/
Disallow: /wiki/Special:SetSiteLink/
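For illustration only (assuming Google's rule that the most specific matching pattern wins), this should mean:

  • /wiki/Special:EntityData/Q42 (a plain redirect) is excluded, while /wiki/Special:EntityData/Q42.json still matches the longer Allow line ending in "*.".
  • /wiki/Special:GoToLinkedPage/dewiki/Q42 is excluded, but the form page /wiki/Special:GoToLinkedPage itself stays crawlable, because every Disallow line ends in a slash.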

Updated as requested

You'll probably want to consider adapting the extension to enforce this in a better way for all users.

Perhaps the X-Robots-Tag HTTP header can be used to prevent indexing of the redirect... Not sure, redirects can be a bit problematic in that way.

Right, this will still be a problem on Wikibase installations other than wikidata.org. How does MediaWiki core do this?
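A minimal sketch of what the X-Robots-Tag idea could look like, assuming a special page class that performs the redirect (the MediaWiki method calls exist, but the surrounding code and the way the target URL is resolved are assumptions):

// Hypothetical excerpt from a special page such as Wikibase's SpecialGoToLinkedPage.
$url = $this->getTargetUrl( $site, $itemId ); // assumption: helper that resolves the sitelink
if ( $url !== null ) {
	// X-Robots-Tag is an HTTP header, so it also applies to this 302 redirect response.
	$this->getRequest()->response()->header( 'X-Robots-Tag: noindex' );
	$this->getOutput()->redirect( $url );
	return;
}
// No target found: fall through to the HTML form, which should stay indexable.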

Google still indexes these pages: https://www.google.com/search?q=African+wild+dog+site:wikidata.org (notice that the first result has a cached version of yesterday)

Probably due to the trailing slashes in the rules requested in T144308#2597079. The URL for the cached entry is http://www.wikidata.org/wiki/Special:GoToLinkedPage?site=enwiki&itemid=Q173651 (which has no slash after GoToLinkedPage).

@thiemowmde: Maybe we should remove the trailing slashes in the robots.txt?

thiemowmde reopened this task as Open. (Edited Sep 23 2016, 4:16 PM)

Note that these Special:GoToLinkedPage?… URLs have their parameters attached with a question mark, not a slash. I believe the easiest way to fix this is to remove most of the trailing slashes from all lines in https://www.wikidata.org/wiki/MediaWiki:Robots.txt, or to add another line with a question mark. I would like to do a bit of research first on which URLs appear in the wild.

FYI: you disallowed crawling; that doesn't mean you disallowed indexing for modern search engines. If another indexed page links to the URL, Google will still index it.

To quote:

When you block URLs from being indexed in Google via robots.txt, they may still show those pages as URL-only listings in their search results. A better solution for completely blocking the index of a particular page is to use a robots noindex meta tag on a per-page basis. You can tell them to not index a page, or to not index a page and to not follow outbound links by inserting either of the following code bits in the HTML head of your document that you do not want indexed.
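The code bits that quote refers to are presumably the standard robots meta tags, roughly:

<meta name="robots" content="noindex">
<meta name="robots" content="noindex, nofollow">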

This is why we have NOINDEX and setRobotPolicy on OutputPage etc.; it's just that these are redirects and/or not necessarily HTML. That's why I pointed at X-Robots-Tag.
Another thing to pay attention to is indicating canonical URLs whenever possible.
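For pages that actually render HTML, a minimal sketch of those two mechanisms using MediaWiki's OutputPage (illustrative only; $canonicalUrl is a placeholder for whatever the preferred URL would be):

$out = $this->getOutput();
// Equivalent of a robots "noindex, nofollow" meta tag / NOINDEX.
$out->setRobotPolicy( 'noindex,nofollow' );
// Point search engines at the preferred URL for this content.
$out->setCanonicalUrl( $canonicalUrl );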

As you said, Special:GoToLinkedPage redirects and does not output HTML (except for the form, which is a single page, and the reason why Disallow: /wiki/Special:GoToLinkedPage without a trailing slash should not be used). The target pages of these redirects are Wikipedia articles. They should be indexed, and they already have canonical tags.

I'm not sure what an X-Robots-Tag will do when used with a redirect.

  • I believe there is no point in crawling Special:GoToLinkedPage URLs, because they are guaranteed to do nothing but redirect to Wikipedia articles. We know each redirect represents a sitelink, and sitelinks are already accessible on the ordinary item page. We know all this. Google does not.
  • The same goes for Special:ItemByTitle, which redirects to a Wikidata item. The exact same links already exist in the sidebars of the connected Wikipedia articles.

@Mbch331, please add the following lines to https://www.wikidata.org/wiki/MediaWiki:Robots.txt:

Disallow: /wiki/Special:GoToLinkedPage?
Disallow: /wiki/Special:ItemByTitle?
Disallow: /wiki/Special:SetSiteLink?

Do not remove the slashes from the existing lines, because that would exclude the special page forms themselves. We want these to appear in a Google search.
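For illustration, with both the existing and the new lines in place (the slash form below is just an example of a subpage-style URL):

/wiki/Special:GoToLinkedPage → still crawlable (the form itself)
/wiki/Special:GoToLinkedPage/enwiki/Q173651 → excluded by the existing slash rule
/wiki/Special:GoToLinkedPage?site=enwiki&itemid=Q173651 → excluded by the new question mark rule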

thiemowmde moved this task from Backlog to Done on the Wikidata-Sprint-2016-10-12 board.

We changed our robots.txt two and a half weeks ago. Re-visiting possibly millions of URLs in such a short time is something neither we nor Google wants. At the moment there are 28,000 left, it seems.

The links we want to exclude are tools and were never meant to be indexed. On the one hand, Google can't know this. On the other hand, I wonder why an existing canonical tag is basically ignored and Google acts as if it found Wikipedia articles on Wikidata.

Let's check again in another two weeks.