Let public archives be indexed and archived
Closed, DeclinedPublic

Assigned To
Authored By
Nemo_bis
Feb 22 2015, 10:27 PM

Description

https://lists.wikimedia.org/robots.txt

# robots.txt for lists.wikimedia.org
#
# Disabled crawling for several lists 2005-11-26 to
# discourage people from complaining about items they
# post on public mailing lists being the first Google
# search result about them.
#
# Note that list archives remain public.
#

User-agent: *
Disallow: /pipermail/
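
To illustrate what these two rules mean for a crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `is_crawlable` and the two example URLs are assumptions chosen for illustration; only the robots.txt rules themselves come from the file quoted above.

```python
from urllib.robotparser import RobotFileParser

# The active rules from the robots.txt quoted above (comments omitted).
ROBOTS_TXT = """\
User-agent: *
Disallow: /pipermail/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if a well-behaved crawler may fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Illustrative URLs (assumed paths): one under /pipermail/, one outside it.
archive_url = "https://lists.wikimedia.org/pipermail/wikitech-l/"
listinfo_url = "https://lists.wikimedia.org/mailman/listinfo/wikitech-l"

print(is_crawlable(ROBOTS_TXT, archive_url))   # archive pages are disallowed
print(is_crawlable(ROBOTS_TXT, listinfo_url))  # pages outside /pipermail/ are not
```

Note that robots.txt only asks compliant crawlers to stay away; it does not make the archives private, which matches the "list archives remain public" remark in the file.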

This is very silly. Ten years later, we should come up with a sensible process/policy rather than surrender our goals to complaint-trolls.

As someone who contributed about 2900 messages to lists.wikimedia.org (about 30 % of my mailing list history), I realise that this robots.txt policy may be a gentle way to tell so-called power posters that they're wasting their time and their graphomania will not leave traces in history. However, as a free knowledge advocate I'm not comfortable with hundreds of thousands of knowledge base items locked into a domain which actively discourages discoverability.

While our mailing lists get increasingly unusable, people move their contributions to places where they feel more visible and useful in the long term, like scattered blogs or Q&A websites or even Facebook. It would be easy to make our mailing lists a better publishing platform than Facebook and most random blogs, if only they were not isolated from the World Wide Web. But perhaps we actually *want* them to be isolated?

(Notified owners of all the 175 known public mailing lists. But forgot the link, *facepalm*.)

Event Timeline

Nemo_bis raised the priority of this task from to Needs Triage.
Nemo_bis updated the task description.
Nemo_bis subscribed.
Nemo_bis set Security to None.
Nemo_bis updated the task description.

Dear Federico Leva, spamming all mailman admins is not appreciated. Please don't do that again.

(for the people who wonder Nemo_bis == Federico Leva)

Was it really necessary to inform EVERY mailing list? This is clearly a problem that only OPs can solve, not us mailing list admins.

As someone who also contributed quite a number of messages (but I won't bother counting them), I can very much sympathize with people who do not want their message to be number one in Google. Simply blaming 'complaint trolls' is too easy. Mailing lists are not supposed to be highly visible; they are meant for meaningful exchange of arguments and thoughts. The same goes for deletion discussions etc. While they don't have to be fully hidden, being a little more obscure won't hurt.

There is actually a policy (or guideline or procedure, whatever you want to call it) that ops follow regarding removal of mailman archive content, which is more or less "unless someone asks you with a legal order or it is the Foundation's legal team, decline". It is also documented at https://wikitech.wikimedia.org/wiki/Remove_a_message_from_mailing_list_archive

Regarding removing the robots.txt, I've added Mark and Faidon to make that decision, as they are the ops members most closely associated with mailman.

@Nemo_bis, also in future please use listadmins@lists.wikimedia.org for announcements as they go to all lists and are vetted by TheHelpfulOne and/or myself so as to prevent these cases.

To be fair, if he had _not_ mailed all list admins and just asked ops to change global settings, I would have definitely expected list admins to complain about not being informed about it (and rightfully).

@Jalexander and @Philippe-WMF, as legal & co., will likely have a comment about this regarding public data. Ops will as well, as it complicates the removal of archived information.

Re-poke, mostly for legal. Mailman has awkward methods for removing private content, which is sometimes (well, all of the time) done per legal request, so indexing the lists may leave this content stored in search engines and so on. The comment in the robots.txt (https://lists.wikimedia.org/robots.txt) seems ambiguous, and since Nemo has asked for it to be removed, this should ideally be given a look over by legal.

In short: do either legal or ops have any concerns about allowing public indexing of list archives?

I'm not sure this request even has enough support from the mailing list administrators to warrant enabling indexing by the various search engines.

Additionally, folks have noted that mailing lists are primarily for discussion (other than announce lists), not documentation. Enabling indexing would make all public list archives more transparent, since they could be searched with third-party search engines, but at the cost of higher scrutiny. We get a fair number of 'remove this from list posting x' requests annually, and they are slightly painful to work on; add in search engine indexing and caching, and any content removal becomes even more futile.

As such, I'm not taking a stance on the above, merely noting that this issue seems far from consensus.

My personal viewpoint is that mailing lists are for announcements (linking to tasks, blog postings, patchsets, etc.) or discussion; wikis and the like are for documentation.

Aren't all the public archives on gmane.org anyways and get indexed there?

Gmane is opt-in (you have to request inclusion manually) so it is not the case for all lists, but most of the public lists are indexed anyway.

chasemp claimed this task.
chasemp added subscribers: Dzahn, chasemp.

Thanks for this input.

There was a large discussion on IRC spanning wikimedia-mailman and -ops that boils down to this: no one is comfortable indexing lists that explicitly, historically, have not been indexed, or feels it is good practice. Users participated with a clear expectation and we should honor that en masse. List admins are free to request mbox archives or file tasks to make their own lists public with SRE support (T59246).

We are adding an outcome to https://wikitech.wikimedia.org/wiki/Mailman#Step-by-step_procedure with this.

This is a per list consideration.

This comment was removed by Dzahn.

There was a large discussion on IRC spanning wikimedia-mailman and -ops that boils down to this: no one is comfortable indexing lists that explicitly, historically, have not been indexed, or feels it is good practice.

Historically the mailing lists have been indexed. Their not being indexed is only a recent thing, imposed by the sysadmins on the users when there was no clarity on how to handle takedown requests, and it wasn't by any means explicitly said or promised to users.

There is still interest in mirroring most mailing lists, as confirmed not only by mailing list owners (they all agreed) but by the fact that people are starting to make mirrors mushroom elsewhere, which can only worsen the issues the non-indexing is supposed to prevent.

The indexing can be opt-out or opt-in, but it's hardly defensible to have a blanket ban against the users' will.
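
As a rough sketch of what the opt-out variant could look like in practice, the snippet below generates a robots.txt that blocks only the archives of lists that asked not to be indexed, then checks it with Python's standard `urllib.robotparser`. The function name `build_robots_txt` and the list names are hypothetical; this is an illustration under those assumptions, not how the Wikimedia servers are actually configured.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(opted_out_lists):
    """Build a robots.txt that disallows only the archives of opted-out lists."""
    lines = ["User-agent: *"]
    if not opted_out_lists:
        # An empty Disallow value means nothing is disallowed.
        lines.append("Disallow:")
    for name in sorted(opted_out_lists):
        lines.append(f"Disallow: /pipermail/{name}/")
    return "\n".join(lines) + "\n"


# Hypothetical list names, used only for this example.
robots_txt = build_robots_txt({"example-optout-l"})

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

base = "https://lists.wikimedia.org/pipermail/"
print(parser.can_fetch("*", base + "example-optout-l/"))  # opted out: not crawlable
print(parser.can_fetch("*", base + "example-open-l/"))    # everything else: crawlable
```

An opt-in variant would invert this: keep the blanket `Disallow: /pipermail/` and list per-list `Allow:` lines ahead of it, keeping in mind that crawlers differ in how they resolve overlapping Allow and Disallow rules.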