robots.txt prevents indexing of Special:EntityData
Open, Lowest, Public

Assigned To: None

Authored By: Lydia_Pintscher, Oct 30 2017, 5:01 PM

Description

We currently disallow indexing of any /entity/Q... URI via our robots.txt because we disallow indexing of Special:EntityData. I've been asked if we can change this because it leads to the following search result for example:

Related Objects

		Status	Subtype	Assigned	Task
		Open	Feature	None	T16720 robots.txt (tracking)
		Open		None	T179312 robots.txt prevents indexing of Special:EntityData

Event Timeline

Lydia_Pintscher created this task. Oct 30 2017, 5:01 PM

Restricted Application added a subscriber: Aklapper. Oct 30 2017, 5:01 PM

@daniel @hoo @aude: Can you say if this rule is needed or if we can get rid of it?

robots.txt is controlled by WMF ops, not by Wikidata. Special pages are blocked there because they generally contain dynamic data and should not be indexed.
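To make the effect of such a rule concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The rule text is an assumption for illustration (a blanket `Disallow` on the `Special:` prefix); the robots.txt actually served for wikidata.org may be worded differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule, assumed for illustration; not copied from the live robots.txt.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /wiki/Special:
"""


def is_crawlable(url: str, robots_txt: str = ASSUMED_ROBOTS_TXT) -> bool:
    """Return True if a generic crawler may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", url)


if __name__ == "__main__":
    for url in (
        "https://www.wikidata.org/wiki/Q42",
        "https://www.wikidata.org/wiki/Special:EntityData/Q42.json",
        "https://www.wikidata.org/entity/Q42",
    ):
        print(url, is_crawlable(url))
```

Under this assumed rule, the /entity/Q42 path itself is not matched, but a crawler that follows its redirect to Special:EntityData would be stopped at the target, which matches the behavior described in the task.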

Special:EntityData could be indexable, but it's a bit awkward. Depending on the request (particularly, the query string and Accept header), it may produce JSON or RDF, or a redirect to the regular HTML page. I think it would be fine to allow crawlers to index these, but I also see little added value in doing so.
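The dependence on the Accept header can be sketched roughly as follows. This is not Wikibase's actual implementation: the function name, the media-type table, the 303 status code, and the decision to ignore q-values are all assumptions made to keep the example short.

```python
from typing import Tuple

# Assumed mapping from media type to file extension (illustrative only).
SUPPORTED = {
    "application/json": "json",
    "text/turtle": "ttl",
    "application/rdf+xml": "rdf",
}


def choose_response(entity_id: str, accept: str) -> Tuple[int, str]:
    """Return (status, path) for a request to Special:EntityData.

    Walks the Accept header in order (q-values are ignored in this sketch)
    and either serves a data format or redirects to the HTML page.
    """
    for part in accept.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in SUPPORTED:
            ext = SUPPORTED[media_type]
            return 200, f"/wiki/Special:EntityData/{entity_id}.{ext}"
        if media_type == "text/html":
            break
    # Browsers and unknown clients are sent to the regular entity page.
    return 303, f"/wiki/{entity_id}"
```

A crawler sending a browser-like Accept header would therefore only ever see the redirect, which is part of why indexing these URLs adds little.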

Well, we could allow this, I guess… but we should at least set a canonical URL (or one per output?) as header (we can't put it in the html here, as there's none).

This is probably interesting especially as we already put the various EntityData URLs into our regular URLs as <link rel="alternate" … type="…">.
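For responses without an HTML body, a canonical URL can be expressed as an HTTP `Link` header instead of a `<link>` element. A minimal sketch of building such a header follows; the helper name and the default base URL are assumptions, and nothing here reflects how Wikibase would actually emit it.

```python
from typing import Tuple


def canonical_link_header(
    entity_id: str, base: str = "https://www.wikidata.org/wiki/"
) -> Tuple[str, str]:
    """Build a (name, value) pair for a rel="canonical" HTTP Link header."""
    return ("Link", f'<{base}{entity_id}>; rel="canonical"')


if __name__ == "__main__":
    name, value = canonical_link_header("Q2833")
    print(f"{name}: {value}")
```

The same header could be sent on every output format (JSON, RDF) of one entity, so that all of them point at the single /wiki/Q… page.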

I also wonder what the benefit of indexing these URLs would be. Currently, there are 5 million: https://www.google.com/search?q=site:wikidata.org+inurl:entity&filter=0. But all are, when clicked by a user, redirects to the corresponding …/wiki/Q… page.

If the search result in the screenshot is the only reason this ticket was created, I strongly suggest closing it, because stripping duplicates from search indexes is actually intended.

In T179312#3754286, @thiemowmde wrote:

If the search result in the screenshot is the only reason this ticket was created, I strongly suggest closing it, because stripping duplicates from search indexes is actually intended.

Which should be done by setting a canonical URL, though… I suppose?

There are canonical URLs on all these pages, e.g. <link rel="canonical" href="https://www.wikidata.org/wiki/Q2833"/>. But this will not stop Google from indexing redirects. We do provide these redirects. They exist. They are real. Users are allowed to use them.

What we see on search result pages like https://www.google.com/search?q=site:wikidata.org+inurl:entity is entirely normal behavior. I would like to understand what people think needs "fixing" there.

Liuxinyu970226 added a parent task: T16720: robots.txt (tracking). Jul 13 2018, 2:18 PM

@Denny could you have a look please? IIRC this came from a conversation the two of us had.
