
Disallow ia_archiver on user and user_talk pages (robots.txt)
Closed, DeclinedPublic

Description

Author: wiki.bugzilla

Description:
Somewhat related to bug 4937:
We've got several complaints about http://www.archive.org/web/web.php
accidentally storing specific offensive revisions of Wikipedia's pages, especially
of user and user_talk pages (mainly a privacy issue because of the publication of
personal data). The Wayback Machine stores years-old data and may keep it
even if the original page is gone, for those few "bug readers" who don't know ...

So revisions have already been deleted on Wikipedia, but are still stored by the Wayback
crawler (unlike e.g. Google's cache, which simply updates/overwrites old data).

According to
http://www.sims.berkeley.edu:8000/research/conferences/aps/removal-policy.html
our users normally don't have an easy and suitable way to request the removal
of "their" data, because they have to prove both their own identity and their ownership of a
Wikipedia account (possible, but complicated).

This is also a more general problem, because most users are not aware of the
Wayback Machine at all. A common argument in the discussion about this topic is
that there is no need for such external storage, because we've got our own history,
which is widely distributed ...

To exclude the Internet Archive's crawler (and remove old documents there)
robots.txt should say:

User-agent: ia_archiver
Disallow: /

I don't see any disadvantages in adding this, at least for NS:2 and NS:3, which
nearly all requests referred to, afaik.


Version: unspecified
Severity: normal

Related Objects

Status     Subtype    Assigned    Task
Open       Feature    None
Declined              None

Event Timeline

bzimport raised the priority of this task from to Lowest. · Nov 21 2014, 9:11 PM
bzimport set Reference to bz5582.
bzimport added a subscriber: Unknown Object (MLST).

marco wrote:

It should be
User-agent: ia_archiver
Disallow: /wiki/User
Disallow: /wiki/Benutzer
etc.

Marco

We don't have our own history. We need the Internet Archive to protect us
against deletionist admins who would erase the early history of Wikipedia
without a second thought, on the basis that it's unencyclopedic or not part of
our mission or whatever.

wiki.bugzilla wrote:

If that is generally regarded as unwanted: what would then be a feasible
method for removing specific user pages, and probably some others too
– pages that were already deleted for good reason on Wikipedia, and on
particular request, of course – from the Archive?

wiki.bugzilla wrote:

Right, this is the option to use if we do not want to exclude ns:2 and ns:3 in general.

But this would mean that you have to change robots.txt pretty often, and also
that this file would become rather long. Thinking in the long run, it is likely that
you would have to handle such requests on a daily basis. Would that be feasible?
In addition, if this method becomes better known at some point, robots.txt could serve
as a focus for others who want to seize on such issues and republish exactly those
pages elsewhere ... odd.

Hmm, another option could be to authorise the Foundation/the Office to handle
such issues on individual request, presumably per email to archive.org (perhaps
with a batch of specific pages from all the different projects in one mail, if sending out
many mails for single requests would keep them too busy).

Note: We had a case recently where a user tried to get his pages removed through a
very complicated and long correspondence with archive.org. In the end he failed because
he was not able (from archive.org's point of view, afaik) to prove that he is really the owner of
the Wikipedia account, and also because it was hard to communicate about non-English
matters.

So if someone (or a position) could be named as contact here
(either "please poke a dev to add it" or "please contact X, WMF"),
I suggest closing this bug.

jeluf wrote:

robots.txt can now be edited on-wiki by editing Mediawiki:robots.txt => closing this bug
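
(As an illustrative sketch of what that on-wiki mechanism allows – not a rule that was actually deployed – rules added to the MediaWiki:Robots.txt page are merged into the wiki's generated robots.txt, so per-namespace exclusions like the ones discussed above could be maintained there. The path prefixes below assume the English Wikipedia's /wiki/ URL scheme and would differ on other languages.)

# Hypothetical addition via the on-wiki MediaWiki:Robots.txt page (assumed en.wikipedia-style paths)
User-agent: ia_archiver
Disallow: /wiki/User:
Disallow: /wiki/User_talk: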

Restricted Application added subscribers: JEumerus, Matanya.

Change 358171 had a related patch set uploaded (by Framawiki; owner: Framawiki):
[operations/mediawiki-config@master] robots.txt: Remove old and disabled archive.org_bot rule

https://gerrit.wikimedia.org/r/358171

Change 358171 merged by jenkins-bot:
[operations/mediawiki-config@master] robots.txt: Remove old and disabled archive.org_bot rule

https://gerrit.wikimedia.org/r/358171

Mentioned in SAL (#wikimedia-operations) [2019-09-30T21:42:06Z] <jforrester@deploy1001> Synchronized robots.txt: Remove old InternetArchive bot rule that's been disabled since 2008 T7582 (duration: 00m 57s)