
Dump format could include some statistics on content (e.g. counts by namespace)
Open, Low, Public, Feature

Description

The XML dump files released by Wikimedia contain a <namespaces> section which declares namespace names and numbers for the wiki it was dumped from.

But it does not tell you which of those namespaces are actually covered by the dump files.
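
To illustrate the gap, here is a minimal sketch of what a processing tool has to do today: scan the entire dump just to learn which of the declared namespaces actually occur. It assumes each <page> carries an <ns> child, as in recent versions of the export schema; the function names and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Strip the '{uri}' XML-namespace prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def namespace_coverage(source):
    """Return (declared, counts) for a MediaWiki XML dump.

    declared maps namespace number -> name, read from <namespaces>;
    counts maps namespace number -> number of <page> elements seen.
    """
    declared = {}
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "namespace":
            # The main namespace has an empty name, hence the fallback.
            declared[int(elem.get("key"))] = elem.text or ""
        elif name == "page":
            for child in elem:
                if local(child.tag) == "ns":
                    counts[int(child.text)] += 1
                    break
            elem.clear()  # drop the page's children to limit memory use
    return declared, counts


# Tiny made-up sample: three namespaces declared, only one used.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <siteinfo>
    <namespaces>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
    </namespaces>
  </siteinfo>
  <page><title>Foo</title><ns>0</ns><id>1</id></page>
  <page><title>Bar</title><ns>0</ns><id>2</id></page>
</mediawiki>"""

declared, counts = namespace_coverage(io.BytesIO(SAMPLE))
unused = sorted(set(declared) - set(counts))  # declared but never used
```

On a real dump this means reading every page before the tool can tell that, say, "Talk" is declared but absent.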

For instance, the *-*-pages-articles.xml dumps do not contain any "Talk", "* talk", or "User" entries, not even page titles or redirect information.

This is fine, but wiki dumps are now also being produced in the same format outside Wikimedia, covering different subsets of namespaces (for example http://devtionary.org/w/dump/xmlu/), so the dump format has become an interchange format of sorts. It would therefore be nice if this information, which is currently metadata external to the dump files, could be made internal so that the files are self-contained. This would be quite useful to tools designed to process dump files.

Perhaps a new section of the dump files named <dumpinfo> could be added to complement the <siteinfo> section.


Version: unspecified
Severity: enhancement

Details

Reference
bz21200

Event Timeline

bzimport raised the priority of this task from to Low. Nov 21 2014, 10:48 PM
bzimport set Reference to bz21200.

jcsahnwaldt wrote:

Similar to bug 34218 and bug 31955.

jcsahnwaldt wrote:

Similar to bug 36178.

Nemo_bis renamed this task from "dump format could declare which namespaces it covers" to "Dump format could include some statistics on content (e.g. counts by namespace)". Aug 12 2016, 7:20 AM
Nemo_bis subscribed.

IMHO the way pages were selected is not information about the dump itself, only about what came before it. The covered namespaces could be usefully described by some short statistics on the content (useful for validation too?), though these would need to be appended at the end of the XML.
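
A rough sketch of why the statistics would land at the end: a writer that streams pages only knows the totals once the last page has been written. Everything here is hypothetical, including the write_dump helper, the simplified page markup, and the shape of the trailing <dumpinfo> element (the name is borrowed from the description above); it is not how the actual dump scripts work.

```python
import io
from collections import Counter
from xml.sax.saxutils import escape


def write_dump(pages, out):
    """Stream (title, ns) pairs to out, then append per-namespace counts."""
    counts = Counter()
    out.write("<mediawiki>\n")
    for title, ns in pages:
        out.write(f"  <page><title>{escape(title)}</title><ns>{ns}</ns></page>\n")
        counts[ns] += 1
    # Hypothetical trailer: the totals are only known after the last page.
    out.write("  <dumpinfo>\n")
    for ns in sorted(counts):
        out.write(f'    <namespace key="{ns}" pages="{counts[ns]}" />\n')
    out.write("  </dumpinfo>\n</mediawiki>\n")
    return counts


# Made-up input: two main-namespace pages and one talk page.
buf = io.StringIO()
write_dump([("Foo", 0), ("Talk:Foo", 1), ("Bar", 0)], buf)
dump_text = buf.getvalue()
```

A reader could compare its own page counts against such a trailer to check that a download is complete, which is the validation use mentioned above.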

Aklapper added subscribers: ArielGlenn, Aklapper.

@ArielGlenn: Hi, I'm resetting the task assignee due to inactivity. Please feel free to reclaim this task if you plan to work on this - it would be welcome! Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for more information - thanks!

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:01 AM
Aklapper removed a subscriber: Tfinc.