
Only include the last e.g. 6 months of news
Closed, Resolved | Public

Description

The https://en.planet.wikimedia.org/rss20.xml feed currently has 117,422 lines of XML in it, with the oldest item dating from Wed, 11 Jan 2006. It's 11.4 MB (and my Firefox live bookmark isn't loading it).

The included items should be limited to some max number (100?), or maybe age (six months old?).
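For illustration only, here is a minimal Python sketch of both limits proposed above (a maximum item count and a maximum age) applied to an RSS 2.0 document. It is not how rawdog does it; `trim_feed` and its parameters are hypothetical names, and it assumes the feed lists items newest first and that each `pubDate` is in the usual RFC 822 style.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def trim_feed(rss_xml, max_items=100, max_age_days=None, now=None):
    """Return the RSS 2.0 document with old or surplus <item> elements removed.

    Hypothetical helper: assumes items are ordered newest first.
    Items without a <pubDate> are never dropped by the age check.
    """
    root = ET.fromstring(rss_xml)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: no <channel> element")

    now = now or datetime.now(timezone.utc)
    cutoff = None if max_age_days is None else now - timedelta(days=max_age_days)

    kept = 0
    # findall() returns a list, so removing items while looping is safe.
    for item in channel.findall("item"):
        too_old = False
        pub = item.findtext("pubDate")
        if cutoff is not None and pub:
            published = parsedate_to_datetime(pub)
            if published.tzinfo is None:
                # A "-0000" offset parses as a naive datetime; treat it as UTC.
                published = published.replace(tzinfo=timezone.utc)
            too_old = published < cutoff
        if too_old or kept >= max_items:
            channel.remove(item)
        else:
            kept += 1

    return ET.tostring(root, encoding="unicode")
```

Calling `trim_feed(xml_text, max_items=100)` applies only the count limit; adding `max_age_days=183` also drops anything older than roughly six months.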

Event Timeline

Change 439897 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] Planet: Set xmlmaxarticles to 100 in rawdog config

https://gerrit.wikimedia.org/r/439897

Change 439897 merged by Dzahn:
[operations/puppet@production] Planet: Set xmlmaxarticles to 100 in config

https://gerrit.wikimedia.org/r/439897

Thanks for reporting this!

We are now limiting to 100 posts and the file is a little under 1MB. It loads much faster.

(I had to check https://en.planet.wikimedia.org/rss20.xml? to work around caching)
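The trailing `?` trick can be generalised by appending a throwaway query parameter, so that each check uses a URL the cache has not seen before. A small sketch, assuming the cache keys on the full URL including the query string (which the workaround above suggests, but which depends on the server setup); `cache_busted` and the `nocache` parameter name are made up for this example.

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_busted(url, token=None):
    """Return url with an extra, throwaway query parameter appended.

    Hypothetical helper: 'nocache' is an arbitrary name that the server
    is assumed to ignore.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    # Default token: the current Unix time, which changes every second.
    query.append(("nocache", token if token is not None else str(int(time.time()))))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, `cache_busted("https://en.planet.wikimedia.org/rss20.xml", "1")` returns `https://en.planet.wikimedia.org/rss20.xml?nocache=1`.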

Dzahn claimed this task.

@Paladox @Samwilson Upon further thought... that file was kind of cool for a different purpose. A single file with ALL posts since 2006 was a nice archive of content. Does it make sense to get these large XML files for each language once and release them on the dumps servers? I was thinking it's a nice file for dumps or archive.org.

Interesting idea! Might as well, could be useful for someone. Here's the rss20.xml from yesterday:

Does Rawdog have any system for storing posts locally without purging them over time?

(Oh, and the other thing I wondered, although it hardly matters, is that at the moment /atom.xml redirects to an Atom feed at /rss20.xml... it sort of makes more sense to do it the other way around.)

Thanks for the file! I was already thinking "I wish I had made a copy".

Regarding the archive feature, I see this:

Name | Maintainer | Purpose
archive | Adam Sampson | Write incoming articles in Atom format to a local archive (needs my atomwriter module)

http://offog.org/code/rawdog/
http://offog.org/git/rawdog-plugins/archive.py

but haven't tried it yet.

It seems we could also just have a cron job that copies the file somewhere every once in a while.
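A rough sketch of what such a periodic copy could look like in Python, run from cron. The function name and all paths are placeholders, not the real locations on the planet host. Note that with the feed now capped at 100 posts, consecutive snapshots would overlap and would still need de-duplicating to form a full archive.

```python
import shutil
from datetime import date
from pathlib import Path


def snapshot(src, dest_dir, today=None):
    """Copy src into dest_dir under a date-stamped name and return the new path.

    Hypothetical helper: e.g. rss20.xml becomes rss20-2018-06-19.xml.
    A second run on the same day overwrites that day's snapshot.
    """
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = (today or date.today()).isoformat()
    dest = dest_dir / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also preserves the file's timestamps
    return dest
```

A cron entry would then call something like `snapshot("/path/to/rss20.xml", "/path/to/archive")` once a day or once a week, with both paths adjusted to the actual setup.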

Regarding the redirect, @Paladox might have some thoughts on it. We discussed whether the new format is Atom or RSS 2.0 and came to the conclusion that this way is more accurate than the other way around.

Oh yes, I didn't look closely enough at the source before; it is indeed RSS (as much as anything ever is)! :-) It's that old thing of the description tag not being very well defined... some people say it shouldn't contain any markup and that the article body should go in the <content:encoded> element from the RSS Content module...
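To make the `description` versus full-body distinction concrete, here is a small Python sketch that reads both from RSS 2.0 items. It prefers `<content:encoded>` (from the RSS Content module, namespace `http://purl.org/rss/1.0/modules/content/`) and falls back to `<description>`. The sample feed and the `item_bodies` helper are invented for the example and say nothing about what the planet feed actually emits.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS Content module, which defines <content:encoded>.
NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

# Invented sample feed: one item with a full body, one with only a summary.
SAMPLE = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example planet</title>
    <item>
      <title>With full body</title>
      <description>Short summary</description>
      <content:encoded><![CDATA[<p>Full <em>HTML</em> body</p>]]></content:encoded>
    </item>
    <item>
      <title>Summary only</title>
      <description>Just a summary</description>
    </item>
  </channel>
</rss>"""


def item_bodies(rss_xml):
    """Return (title, body) pairs, preferring content:encoded over description."""
    root = ET.fromstring(rss_xml)
    result = []
    for item in root.iter("item"):
        body = item.findtext("content:encoded", namespaces=NS)
        if body is None:
            # Fall back to the loosely defined <description> element.
            body = item.findtext("description", default="")
        result.append((item.findtext("title"), body))
    return result
```

A reader that only looks at `<description>` would see the short summary for the first item, which is why feeds that put the whole article there behave differently across clients.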

Anyway, yeah, the archive plugin looks interesting. I've often wondered about a tool to import RSS articles into MediaWiki (along with images etc.)... of course, there are copyright problems with storing all articles, aren't there? Or is it okay if it's for an archive?

I've often wondered about a tool to import RSS articles into MediaWiki (along with images etc.)...

There is the MediaWiki RSS extension, which yours truly once started and others have enhanced a lot:

https://www.mediawiki.org/wiki/Extension:RSS

of course, there are copyright problems with storing all articles, aren't there? Or is it okay if it's for an archive?

That's a good question that I don't know the answer to.

Vvjjkkii renamed this task from Only include the last e.g. 6 months of news to 27aaaaaaaa. (Jul 1 2018, 1:05 AM)
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Paladox as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii edited subscribers, added: Paladox; removed: gerritbot, Aklapper.
CommunityTechBot renamed this task from 27aaaaaaaa to Only include the last e.g. 6 months of news. (Jul 2 2018, 2:09 PM)
CommunityTechBot closed this task as Resolved.
CommunityTechBot assigned this task to Paladox.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot edited subscribers, added: gerritbot, Aklapper; removed: Paladox.