Provide database dumps of just article namespace and/or remove project-space from "articles" dump
Open, Low, Public

Description

Author: matthew.britton

Description:
At the moment I can download "pages-meta-current", a dump of all pages, or "pages-articles", which is articles, templates, image descriptions and "primary meta pages". The latter is nice if I want to redistribute Wikipedia's content, but if I'm just trying to gather some data about articles, and I don't want to try to download them all individually, I only need the articles.

Since for en.wikipedia the "pages-articles" dump contains 8559359 pages, and there are only 2892000 articles, I'm obviously getting a lot of stuff I don't actually need. Seems it would save GBs of bandwidth (and processing time for users) if there was just a dump of article text.


Version: unspecified
Severity: enhancement

Details

Reference
bz18919

Event Timeline

bzimport raised the priority of this task to Normal. Nov 21 2014, 10:38 PM
bzimport set Reference to bz18919.
bzimport created this task. May 25 2009, 7:13 PM

I suspect that the amount of actual page content in the template, project, image, etc pages is much smaller than the content in the article pages, so it may be a much smaller difference in download size than it looks like from the page counts.

Tomasz, can you take a peek and see if it looks like it might be worth creating such dumps?

matthew.britton wrote:

(In reply to comment #1)

I suspect that the amount of actual page content in the template, project,
image, etc pages is much smaller than the content in the article pages

On most wikis I suspect it is, but you should see some of the templates en.wikipedia (and possibly others) has come up with recently. :)

The dump includes all non-talk namespaces except user:

From the toolserver, for enwiki, mainspace makes up about 62% of the page content (SUM(page_len)) and 69% of the number of pages (including redirects).

So you could probably save about 35-40% by making a mainspace-only dump.
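For anyone wanting to reproduce that estimate, here is a rough sketch of the calculation. It assumes a MediaWiki-style `page` table with `page_namespace` and `page_len` columns, restricts to the namespaces the dump covers (non-talk, i.e. even-numbered, minus User, which is namespace 2), and uses an in-memory SQLite table with made-up rows in place of the real toolserver replica. The function names and the toy data are illustrative only, and the result ignores compression, so the real saving in download size may differ.

```python
import sqlite3

# Namespaces in "pages-articles" as described above: non-talk (even-numbered)
# namespaces, excluding User (namespace 2).
QUERY = (
    "SELECT page_namespace, SUM(page_len), COUNT(*) "
    "FROM page "
    "WHERE page_namespace % 2 = 0 AND page_namespace <> 2 "
    "GROUP BY page_namespace"
)

def mainspace_share(conn):
    """Return (share of SUM(page_len), share of page count) held by namespace 0."""
    rows = conn.execute(QUERY).fetchall()
    total_len = sum(r[1] for r in rows)
    total_pages = sum(r[2] for r in rows)
    main = next((r for r in rows if r[0] == 0), (0, 0, 0))
    return main[1] / total_len, main[2] / total_pages

def demo_connection():
    """Build a toy table; the rows are invented for illustration, not real enwiki data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page (page_namespace INTEGER, page_len INTEGER)")
    conn.executemany(
        "INSERT INTO page VALUES (?, ?)",
        [(0, 400), (0, 220), (10, 180), (4, 200), (1, 500), (2, 300)],
    )
    return conn

if __name__ == "__main__":
    len_share, page_share = mainspace_share(demo_connection())
    # Uncompressed bytes saved by dropping everything except mainspace.
    print(f"mainspace content share: {len_share:.0%}, saving: {1 - len_share:.0%}")
```

With a mainspace content share of 62%, the uncompressed saving works out to 1 - 0.62 = 38%, which is where the 35-40% range comes from.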

matthew.britton wrote:

(In reply to comment #3)

The dump includes all non-talk namespaces except user:

Ah... so by "primary meta-pages" it actually means "all project pages", which on en.wikipedia includes rather a lot of junk.

Given that, even with the extra stuff, the "pages-articles" dump is only meant for redistributors of projects' content, project-space should probably be removed. Category, template and portal are all needed because they're part of the actual content, but project isn't. That would probably account for a fair bit of the reduction in size.

brion added a comment. Jun 17 2009, 4:54 PM

Project space includes licensing & credit information which shouldn't be removed. There's not really a good separation between "syndicate me" and "for internal use"...

matthew.britton wrote:

(In reply to comment #5)

Project space includes licensing & credit information which shouldn't be
removed. There's not really a good separation between "syndicate me" and "for
internal use"...

Too true... my efforts to get en.wikipedia's template and category namespaces properly organized into "part of content, intended for redistribution" and "for internal use only" were shot down by people who didn't really know what they were talking about. You are right that project-space is the same (though the licenses are only a couple of pages).

My original request for a dump of articles only stands if it's considered worthwhile, but I guess nothing better can be provided for redistributors with things as they are.

A database dump of only articles would likely be very useful for bot operators and people doing statistics. For people who don't plan on redistributing or republishing the content outside of Wikipedia, the actual content of templates, category pages, image pages, etc. may not be necessary for what they want to do, nor the license information (and couldn't we just put a link to http://wikimediafoundation.org/wiki/Terms_of_Use on the download page?)

matthew.britton wrote:

Another thing to add to this request: when the end user is only interested in article content (which is most of the time, especially for 'internal' uses) having the non-article namespaces there as well means that this extra data also has to be parsed and then excluded, adding a significant amount of time to the already lengthy process of searching through one of these dumps. So the saving would not just be on bandwidth.
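To illustrate the extra work involved, here is a minimal sketch of the filtering a user currently has to do themselves: stream through the dump, look at every page, and throw away anything outside the article namespace. It assumes the export XML carries an `<ns>` element on each page (true of newer export schema versions; older dumps would need the namespace inferred from the title prefix instead). The helper names and the sample document are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the '{namespace-uri}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def child(elem, name):
    """Return the first direct child with the given local tag name, or None."""
    for c in elem:
        if local(c.tag) == name:
            return c
    return None

def iter_articles(stream):
    """Yield (title, text) for each page whose <ns> is 0, streaming the input."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        ns = child(elem, "ns")
        if ns is not None and ns.text == "0":
            title = child(elem, "title")
            revision = child(elem, "revision")
            text = child(revision, "text") if revision is not None else None
            yield (
                title.text if title is not None else None,
                (text.text or "") if text is not None else "",
            )
        # Every page is parsed either way; drop its contents to limit memory use.
        elem.clear()

# Tiny hand-written sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Apple</title><ns>0</ns><revision><text>A fruit.</text></revision></page>
  <page><title>Template:Infobox</title><ns>10</ns><revision><text>{{{1}}}</text></revision></page>
  <page><title>Wikipedia:About</title><ns>4</ns><revision><text>Project page.</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    for title, text in iter_articles(io.BytesIO(SAMPLE)):
        print(title, len(text))
```

Note that the non-article pages still have to be read and parsed in full before they can be skipped, which is the cost an articles-only dump would avoid.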

Giving dump bugs to Ariel.

Nemo_bis removed ArielGlenn as the assignee of this task. Apr 9 2015, 7:33 AM
Nemo_bis lowered the priority of this task from Normal to Low.
Nemo_bis set Security to None.
Restricted Application added a subscriber: Aklapper. Sep 28 2015, 12:06 PM