robots.txt prevents indexing of Special:EntityData
Open, Lowest, Public

Description

We currently disallow indexing of any /entity/Q... URI via our robots.txt because we disallow indexing of Special:EntityData. I've been asked if we can change this because it leads to the following search result for example:

vizquery1.png (174×923 px, 23 KB)

Event Timeline

@daniel @hoo @aude: Can you say if this rule is needed or if we can get rid of it?

robots.txt is controlled by WMF ops, not by Wikidata. Special pages are blocked there because they generally contain dynamic data and should not be indexed.
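To illustrate how a blanket rule for special pages also catches the entity data URLs, here is a minimal sketch using Python's standard `urllib.robotparser`. The rule text is an assumption made up for this example, not a copy of the live robots.txt, and the Q42 URLs are only sample values.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: a blanket block on special pages.
# This is NOT the actual content of the production robots.txt.
ASSUMED_RULES = """\
User-agent: *
Disallow: /wiki/Special:
"""

parser = RobotFileParser()
parser.parse(ASSUMED_RULES.splitlines())

# Sample URLs, chosen for illustration only.
entity_data_url = "https://www.wikidata.org/wiki/Special:EntityData/Q42.json"
item_page_url = "https://www.wikidata.org/wiki/Q42"

# The entity data URL falls under the Special: prefix, the item page does not.
entity_data_allowed = parser.can_fetch("*", entity_data_url)
item_page_allowed = parser.can_fetch("*", item_page_url)
```

Under that assumed rule, a crawler may fetch the regular item page but not anything under Special:EntityData, which is also where an /entity/Q... URI ends up.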

Special:EntityData could be indexable, but it's a bit awkward. Depending on the request (particularly, the query string and Accept header), it may produce JSON or RDF, or a redirect to the regular HTML page. I think it would be fine to allow crawlers to index these, but I also see little added value in doing so.
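A rough sketch of the kind of decision described here, to show why a crawler may see different things at the same URL. This is not the actual Wikibase code: the function name, the `format_param` argument, and the media type table are all assumptions for illustration.

```python
from typing import Optional

# Hypothetical mapping from media type to output; the real endpoint
# supports its own set of formats.
MEDIA_TYPES = {
    "application/json": "json",
    "application/rdf+xml": "rdf",
    "text/turtle": "rdf",
    "text/html": "redirect-to-html",
}


def pick_representation(accept: str, format_param: Optional[str] = None) -> str:
    """Return "json", "rdf", or "redirect-to-html" for a request."""
    # An explicit format in the request wins over the Accept header.
    if format_param in ("json", "rdf"):
        return format_param

    candidates = []
    for position, item in enumerate(accept.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        quality = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        # Sort key: highest quality first, then original order.
        candidates.append((-quality, position, media_type))

    for _, _, media_type in sorted(candidates):
        if media_type in MEDIA_TYPES:
            return MEDIA_TYPES[media_type]

    # Nothing recognised (including */*): fall back to the HTML page.
    return "redirect-to-html"
```

A crawler that sends a generic Accept header would, under these assumptions, only ever get the redirect to the HTML page.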

Well, we could allow this, I guess… but we should at least set a canonical URL (or one per output?) as a header (we can't put it in the HTML here, as there is none).
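For reference, a canonical URL can be sent as an HTTP `Link` header with `rel="canonical"` instead of an HTML tag. A minimal sketch of what that could look like; the helper name is made up, and pointing every output at the /wiki/Q… page is only one of the options raised above, assumed here for the example.

```python
from typing import Tuple

# Assumed base URL of the regular item pages.
ITEM_PAGE_BASE = "https://www.wikidata.org/wiki/"


def canonical_link_header(entity_id: str) -> Tuple[str, str]:
    """Build a (name, value) pair for a Link header with rel="canonical".

    Hypothetical helper: it assumes the canonical target is always the
    regular item page, not a per-format EntityData URL.
    """
    return ("Link", f'<{ITEM_PAGE_BASE}{entity_id}>; rel="canonical"')


header_name, header_value = canonical_link_header("Q42")
```

Whether the JSON and RDF outputs should all point at the HTML page or each get their own canonical URL is exactly the open question.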

This is probably especially relevant as we already put the various EntityData URLs into our regular pages as <link rel="alternate" … type="…">.

thiemowmde moved this task from incoming to needs discussion or investigation on the Wikidata board.
thiemowmde subscribed.

I also wonder what the benefit of indexing these URLs would be. Currently, there are 5 million of them in Google's index: https://www.google.com/search?q=site:wikidata.org+inurl:entity&filter=0. But all of them, when clicked by a user, redirect to the corresponding …/wiki/Q… page.

If the search result in the screenshot is the only reason this ticket was created, I strongly suggest closing it, because stripping duplicates from search indexes is actually intended.

Which should be done by setting a canonical URL, though… I suppose?

There are canonical URLs on all these pages, e.g. <link rel="canonical" href="https://www.wikidata.org/wiki/Q2833"/>. But this will not stop Google from indexing redirects. We do provide these redirects. They exist. They are real. Users are allowed to use them.

What we see on search result pages like https://www.google.com/search?q=site:wikidata.org+inurl:entity is entirely normal behavior. I would like to understand what people think needs "fixing" there.

@Denny could you have a look please? IIRC this came from a conversation the two of us had.