
RSS/Atom feeds prohibited by robots.txt
Closed, Declined (Public)

Description

Author: me

Description:
http://en.wikipedia.org/wiki/Main_Page lists its RSS feeds as:

<link rel="alternate" type="application/rss+xml" title="Wikipedia RSS Feed" href="http://en.wikipedia.org/w/index.php?title=Special:RecentChanges&amp;feed=rss" />
<link rel="alternate" type="application/atom+xml" title="Wikipedia Atom Feed" href="http://en.wikipedia.org/w/index.php?title=Special:RecentChanges&amp;feed=atom" />

Both of these are under the /w/ directory, which http://en.wikipedia.org/robots.txt disallows for the default robot. This means that clients which obey robots.txt can't read Wikipedia's RSS feed.

http://en.wikipedia.org/wiki/Special:RecentChanges?feed=atom is presumably permitted, but the page doesn't link to it.
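
The claim above can be checked with Python's standard urllib.robotparser. This is a minimal sketch, not a copy of the live file: ASSUMED_ROBOTS_LINES is an assumed two-line stand-in for the rule described in this report (the default robot is disallowed from /w/), and is_allowed is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumption: a stand-in for the relevant rule in
# http://en.wikipedia.org/robots.txt, not the real file's contents.
ASSUMED_ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /w/",
]

# The advertised feed URL (with &amp; decoded) and the alternative URL.
RSS_URL = "http://en.wikipedia.org/w/index.php?title=Special:RecentChanges&feed=rss"
ALT_URL = "http://en.wikipedia.org/wiki/Special:RecentChanges?feed=atom"


def is_allowed(url, user_agent="*"):
    """Return True if the assumed rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of robots.txt lines; no network access.
    parser.parse(ASSUMED_ROBOTS_LINES)
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (RSS_URL, ALT_URL):
        print(url, "allowed" if is_allowed(url) else "blocked")
```

Under these assumed rules the /w/ URL is refused and the /wiki/ URL is accepted, because the matching is a plain prefix test on the path.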


Version: unspecified
Severity: enhancement
URL: http://en.wikipedia.org/wiki/Main_Page

Details

Reference
bz16007

Event Timeline

bzimport raised the priority of this task to Lowest. (Nov 21 2014, 10:19 PM)
bzimport set Reference to bz16007.
bzimport added a subscriber: Unknown Object (MLST).

My understanding is that feed readers should be acting as user-agents, not robots, so this _ought_ not to be a problem unless you want, e.g., search engines to index the feed contents. (Which we probably don't.)

Can you turn up examples of feed readers that obey robots.txt and are affected by this?

me wrote:

I ran into this because the Python feedfinder library follows robots.txt and thus won't find the Wikipedia RSS feed.

Yahoo does the same thing: http://jeremy.zawodny.com/blog/archives/001474.html
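
For illustration, a feed finder that obeys robots.txt might behave roughly as sketched below. This uses only the Python standard library and is not feedfinder's actual code; FeedLinkParser, find_feeds, and ASSUMED_ROBOTS_LINES are hypothetical names, and the robots.txt lines are an assumed stand-in for the /w/ rule described in this report.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

# The two <link> elements quoted in the report.
MAIN_PAGE_HEAD = """
<link rel="alternate" type="application/rss+xml" title="Wikipedia RSS Feed" href="http://en.wikipedia.org/w/index.php?title=Special:RecentChanges&amp;feed=rss" />
<link rel="alternate" type="application/atom+xml" title="Wikipedia Atom Feed" href="http://en.wikipedia.org/w/index.php?title=Special:RecentChanges&amp;feed=atom" />
"""

# Assumption: stand-in for the relevant robots.txt rule, not the real file.
ASSUMED_ROBOTS_LINES = ["User-agent: *", "Disallow: /w/"]


class FeedLinkParser(HTMLParser):
    """Collect hrefs of <link rel="alternate"> elements with a feed type."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        href = attr.get("href")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def find_feeds(html, base_url, robots_lines, user_agent="*"):
    """Return advertised feed URLs that the given robots rules permit."""
    link_parser = FeedLinkParser(base_url)
    link_parser.feed(html)
    robots = RobotFileParser()
    robots.parse(robots_lines)
    return [u for u in link_parser.feeds if robots.can_fetch(user_agent, u)]


if __name__ == "__main__":
    page = "http://en.wikipedia.org/wiki/Main_Page"
    print(find_feeds(MAIN_PAGE_HEAD, page, ASSUMED_ROBOTS_LINES))
```

With the assumed rules, both advertised feeds are filtered out, which matches the behaviour reported here; passing an empty rule list returns both URLs.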

I'm not convinced the feeds are really search-friendly... each page has a history feed, and the whole site has a number of feeds (RC and otherwise) which tend to change very quickly. In addition, generating diffs etc. for the feeds could cause nasty load when a spider crawls them. I'm marking this WONTFIX for now.