
[Task] Disallow Special:GoToLinkedPage in wikidata.org/robots.txt
Closed, Resolved · Public

Description

We just realized that Google started indexing tens of thousands of Special:GoToLinkedPage redirects.

Tasks:

Event Timeline

Mbch331 subscribed.

Added the link to the page.
Leaving it open for a few days in case more pages are found.
Might take some time before it's actually visible in the robots.txt

What about URLs like title=Special:GoToLinkedPage&site=dewiki&itemid=Q123456?

thiemowmde moved this task from incoming to in progress on the Wikidata board.

@hoo, doesn't seem to be a problem at the moment: https://www.google.com/search?q=site:wikidata.org+inurl:title=Special:GoToLinkedPage&filter=0

I played around with different Google queries and found that these three are an actual problem:

My conclusions:

  • ItemByTitle is a pure search-and-redirect page, similar to GoToLinkedPage.
  • EntityData is either an RDF file (not sure if we want to disallow these) or a redirect when no file extension is given.
  • All exclusions should end in a slash so they do not exclude the special page itself, only the results it produces.
  • All URLs with %3A instead of : are duplicates anyway; let's exclude them all.

@Mbch331, please edit https://www.wikidata.org/wiki/MediaWiki:Robots.txt as follows. Thanks.

Disallow: /wiki/Special%3A
Disallow: /wiki/Special:EntityData/
Allow: /wiki/Special:EntityData/*.
Disallow: /wiki/Special:GoToLinkedPage/
Disallow: /wiki/Special:ItemByTitle/
Disallow: /wiki/Special:SetAliases/
Disallow: /wiki/Special:SetDescription/
Disallow: /wiki/Special:SetLabel/
Disallow: /wiki/Special:SetLabelDescriptionAliases/
Disallow: /wiki/Special:SetSiteLink/
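For illustration only (assuming Google's rule that the most specific matching pattern wins), this should mean:

  • /wiki/Special:EntityData/Q42 (a plain redirect) is excluded, while /wiki/Special:EntityData/Q42.json still matches the longer Allow line ending in "*.".
  • /wiki/Special:GoToLinkedPage/dewiki/Q42 is excluded, but the form page /wiki/Special:GoToLinkedPage itself stays crawlable, because every Disallow line ends in a slash.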

Updated as requested

You'll probably want to consider adapting the extension to enforce this in a better way for all users.

Perhaps the X-Robots-Tag HTTP header can be used to prevent indexing of the redirect... Not sure, redirects can be a bit problematic in that way.

Right, this will still be a problem on Wikibase installations other than wikidata.org. How does MediaWiki core do this?
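A minimal sketch of what the X-Robots-Tag idea could look like, assuming a special page class that performs the redirect (the MediaWiki method calls exist, but the surrounding code and the way the target URL is resolved are assumptions):

// Hypothetical excerpt from a special page such as Wikibase's SpecialGoToLinkedPage.
$url = $this->getTargetUrl( $site, $itemId ); // assumption: helper that resolves the sitelink
if ( $url !== null ) {
	// X-Robots-Tag is an HTTP header, so it also applies to this 302 redirect response.
	$this->getRequest()->response()->header( 'X-Robots-Tag: noindex' );
	$this->getOutput()->redirect( $url );
	return;
}
// No target found: fall through to the HTML form, which should stay indexable.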

Google still indexes these pages: https://www.google.com/search?q=African+wild+dog+site:wikidata.org (notice that the first result has a cached version of yesterday)

Probably due to the trailing slashes in the rules requested in T144308#2597079. The URL for the cached entry is http://www.wikidata.org/wiki/Special:GoToLinkedPage?site=enwiki&itemid=Q173651 (which has no slash after GoToLinkedPage).

@thiemowmde: Maybe we should remove the trailing slashes in the robots.txt?

thiemowmde reopened this task as Open. (Edited Sep 23 2016, 4:16 PM)

Note that these Special:GoToLinkedPage?… URLs have their parameters attached with a question mark, not a slash. I believe the easiest way to fix this is to remove most of the trailing slashes from all lines in https://www.wikidata.org/wiki/MediaWiki:Robots.txt, or to add another line with a question mark. I would like to do a bit of research first on which URLs appear in the wild.

FYI: you disallowed crawling; that doesn't mean you disallowed indexing for modern search engines. If another indexed page links to the URL, Google will still index it.

To quote:

When you block URLs from being indexed in Google via robots.txt, they may still show those pages as URL-only listings in their search results. A better solution for completely blocking the index of a particular page is to use a robots noindex meta tag on a per-page basis. You can tell them to not index a page, or to not index a page and to not follow outbound links by inserting either of the following code bits in the HTML head of your document that you do not want indexed.
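The code bits that quote refers to are presumably the standard robots meta tags, roughly:

<meta name="robots" content="noindex">
<meta name="robots" content="noindex, nofollow">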

This is why we have NOINDEX and setRobotPolicy on OutputPage etc.; it's just that these are redirects and/or not necessarily HTML. That's why I pointed at X-Robots-Tag.
Another thing to pay attention to is indicating canonical URLs whenever possible.
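For pages that actually render HTML, a minimal sketch of those two mechanisms using MediaWiki's OutputPage (illustrative only; $canonicalUrl is a placeholder for whatever the preferred URL would be):

$out = $this->getOutput();
// Equivalent of a robots "noindex, nofollow" meta tag / NOINDEX.
$out->setRobotPolicy( 'noindex,nofollow' );
// Point search engines at the preferred URL for this content.
$out->setCanonicalUrl( $canonicalUrl );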

As you said, Special:GoToLinkedPage redirects and does not output HTML (except for the form, which is a single page, and the reason why Disallow: /wiki/Special:GoToLinkedPage without a trailing slash should not be used). The target pages of these redirects are Wikipedia articles. They should be indexed, and they already have canonical tags.

I'm not sure what an X-Robots-Tag will do when used with a redirect.

  • I believe there is no point in crawling Special:GoToLinkedPage URLs, because they are guaranteed to do nothing but redirect to Wikipedia articles. We know each redirect represents a sitelink, and sitelinks are already accessible on the ordinary item page. We know all this. Google does not.
  • The same goes for Special:ItemByTitle, which redirects to a Wikidata item. The exact same links already exist in the sidebars of the connected Wikipedia articles.

@Mbch331, please add the following lines to https://www.wikidata.org/wiki/MediaWiki:Robots.txt:

Disallow: /wiki/Special:GoToLinkedPage?
Disallow: /wiki/Special:ItemByTitle?
Disallow: /wiki/Special:SetSiteLink?

Do not remove the slashes from the existing lines, because that would exclude the special page forms themselves. We want these to appear in a Google search.
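For illustration, with both the existing and the new lines in place (the slash form below is just an example of a subpage-style URL):

/wiki/Special:GoToLinkedPage → still crawlable (the form itself)
/wiki/Special:GoToLinkedPage/enwiki/Q173651 → excluded by the existing slash rule
/wiki/Special:GoToLinkedPage?site=enwiki&itemid=Q173651 → excluded by the new question mark rule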

thiemowmde moved this task from Backlog to Done on the Wikidata-Sprint-2016-10-12 board.

We changed our robots.txt two and a half weeks ago. Re-visiting possibly millions of URLs in such a short time is something neither we nor Google wants. At the moment there are 28,000 left, it seems.

The links we want to exclude are tools and were never meant to be indexed. On the one hand, Google can't know this. On the other hand, I wonder why an existing canonical tag is basically ignored and Google acts as if it found Wikipedia articles on Wikidata.

Let's check again in another two weeks.