
Fix or remove robots.txt code for Internet Archive exclusion of user pages
Closed, ResolvedPublic

Description

https://en.wikipedia.org/robots.txt currently contains the following lines:

# Don't allow the Wayback Machine to index user-pages
#User-agent: ia_archiver
#Disallow: /wiki/User
#Disallow: /wiki/Benutzer

This does not work as intended, and probably never did, see e.g.
https://web.archive.org/web/*/https://en.wikipedia.org/wiki/User:Jimbo_Wales

While this corresponds to the Internet Archive's own advice here, that page may be outdated (it appears to have been created in 2002 or earlier). It may be that the user agent "ia_archiver" needs to be changed to "archive.org_bot" (see this IA posting from 2009).

Even if this fixes it, a colon should be appended to the #Disallow patterns; since Disallow rules are prefix matches, the current patterns would also affect mainspace pages like [[User space]].
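The prefix-match problem can be demonstrated with Python's standard-library robots.txt parser. This is a sketch: the rule lines below are the hypothetical corrected directives (uncommented, with the newer user agent), not what is actually deployed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical uncommented rules, still missing the trailing colon
rules = """\
User-agent: archive.org_bot
Disallow: /wiki/User
Disallow: /wiki/Benutzer
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Blocks the user page, as intended...
assert not rp.can_fetch("archive.org_bot", "https://en.wikipedia.org/wiki/User:Jimbo_Wales")
# ...but also blocks the unrelated mainspace article [[User space]]
assert not rp.can_fetch("archive.org_bot", "https://en.wikipedia.org/wiki/User_space")

# With the colon appended, only the User: namespace is excluded
rp2 = RobotFileParser()
rp2.parse(["User-agent: archive.org_bot", "Disallow: /wiki/User:"])
assert not rp2.can_fetch("archive.org_bot", "https://en.wikipedia.org/wiki/User:Jimbo_Wales")
assert rp2.can_fetch("archive.org_bot", "https://en.wikipedia.org/wiki/User_space")
```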

Event Timeline

Tbayer raised the priority of this task from to Needs Triage.
Tbayer updated the task description. (Show Details)
Tbayer subscribed.

That's from robots.txt in operations/mediawiki-config

Even if these values were correct, the reason it isn't working is most likely that the lines are prefixed with # (hash), which is comment syntax, so crawlers ignore them entirely.

Right, I'll make it a blocking task of T104251: Move wiki-specific robots.txt out of the global file to Mediawiki:Robots.txt on specific wikis as well ;)
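For reference, if the directives were to be fixed rather than removed, combining the three changes discussed above (dropping the # comment markers, switching to the newer user agent from the 2009 IA posting, and appending the colons) would yield something like the following, assuming "archive.org_bot" is indeed the correct agent string:

User-agent: archive.org_bot
Disallow: /wiki/User:
Disallow: /wiki/Benutzer: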

Change 240065 had a related patch set uploaded (by Dereckson):
Tidy robots.txt

https://gerrit.wikimedia.org/r/240065

Luke081515 triaged this task as Medium priority. Nov 14 2015, 9:11 PM
Luke081515 subscribed.

A change has just been deployed, but I'm not marking this as done yet.

Code removed; this appears to have resolved the underlying issue. If anything else comes up, please open a new bug to request that the code be reinstated.