
Divide XML dumps by page.page_namespace (and figure out what to do with the "pages-articles" dump)
Open, LowPublic

Description

Looking at https://dumps.wikimedia.org/enwiki/20170101/ currently, I see:

2017-01-02 21:16:30 done Articles, templates, media/file descriptions, and primary meta-pages, in multiple bz2 streams, 100 pages per stream
enwiki-20170101-pages-articles-multistream.xml.bz2 13.5 GB

I personally always avoid this dump because it's so infuriatingly ambiguous about what it does and does not include. Does "templates" include Scribunto/Lua modules? Does "primary meta-pages" include pages such as customized MediaWiki messages?

Instead of dividing seemingly arbitrarily, we should divide by page.page_namespace. For people interacting with the dumps, I think such a technical division would be quite useful. I somewhat regularly get asked to do dump scans of just a particular namespace. Even when a request covers multiple namespaces, I could more easily pick and choose what I need and don't need.
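For illustration, this is roughly the one-off namespace filter people end up writing against the current monolithic dump today. It's only a sketch: the filename, namespace number, and export schema version are placeholders, not anything the dumps infrastructure actually provides.

```python
# Rough sketch: stream a pages-articles dump and keep only pages in one
# namespace, using the <ns> element. Filename, namespace number, and the
# export schema version in NS_URI are placeholders.
import bz2
import xml.etree.ElementTree as ET

NAMESPACE = "10"  # e.g. the Template: namespace
NS_URI = "{http://www.mediawiki.org/xml/export-0.10/}"  # schema version may differ

with bz2.open("enwiki-pages-articles.xml.bz2", "rb") as dump:
    for event, elem in ET.iterparse(dump, events=("end",)):
        if elem.tag == NS_URI + "page":
            if elem.findtext(NS_URI + "ns") == NAMESPACE:
                print(elem.findtext(NS_URI + "title"))
            elem.clear()  # keep memory use flat while streaming
```

Per-namespace files would make this kind of throwaway scripting unnecessary for the common case.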

For all namespaces, we have "meta-current" of course. This is sufficient.

This task is vaguely related to T20919: Provide database dumps of just article namespace and/or remove project-space from "articles" dump.

It may make sense to do away with the "pages-articles" dump altogether, but it's unclear how much of a breaking change that would be. We could just leave it alone for now, but that would take up more disk space.

Event Timeline

MZMcBride raised the priority of this task from to Needs Triage.
MZMcBride updated the task description. (Show Details)
MZMcBride renamed this task from "pages-articles" dump needs further thought and consideration to "pages-articles" XML dumps need further thought and consideration. May 18 2015, 2:33 AM
MZMcBride set Security to None.

Division by page.page_namespace seems like a great idea. For example, most replacements never go outside the main namespace, so we would save a lot of time.

Vito

I use the pages-articles dump very often to get a local copy of Wikipedia. It contains everything except user pages and talk pages (for all namespaces).

I use the enwiki pages-articles dumps for finding typos and grammar errors. Articles, of course, but also templates, portal pages, file descriptions, help pages, project-space policy pages... nearly everything except user pages and talk pages.


So would it make a big difference for you to use a dump including all namespaces instead?


You mean, enwiki-20150602-pages-meta-current.xml.bz2 (21.8 GB) instead of enwiki-20150602-pages-articles.xml.bz2 (11.2 GB)? I would be able to manage with that, though obviously that's a factor of two in download time, disc space, and disc I/O each time I use it.

The pages-articles files are the ones that get turned into torrents and listed at https://meta.wikimedia.org/wiki/Data_dump_torrents . My μTorrent statistics tell me I've downloaded six dumps this year and then uploaded each one between 9 and 33 times. There are a lot of users of dumps out there.

Would you like to bring this issue up on the xmldatadumps-l list and point folks to this ticket? A change like this should have input from more than a handful of people. I'm particularly interested in bot users, as they might be the hardest hit by any change.

Hmm seems I typed this comment a while back and never saved it, but phab or ff remembered it...

Since my last post here I've discovered that burnbit.com won't create a Torrent of a "pages-meta-current" file: "Sorry! but that file is too big". If "pages-articles" is discontinued we'll need some new recommendations for downloading a "pages-meta-current".

I still haven't seen this discussion on the xmldatadumps-l mailing list, and it should be cross-posted to wikitech-l and to research-l as well.


Perhaps @Nemo_bis would be interested in posting such a message? I'm not subscribed to two of those lists.

I wouldn't know what to write, but I can forward.

MZMcBride renamed this task from "pages-articles" XML dumps need further thought and consideration to Divide XML dumps by page.page_namespace (and figure out what to do with the "pages-articles" dump). Jan 18 2017, 12:27 AM


Posted to wikitech-l: https://lists.wikimedia.org/pipermail/wikitech-l/2017-January/087393.html. I included a request to forward.

Hi Nemo. Thanks for adding me as a subscriber.

Regarding the task, I think it would be great to go the "à la carte" approach and divide the dump into separate namespace files. If that happens, then I'd agree that there would be no real need for the pages-articles dump. Interested users can just get what they need from the namespace dumps. Arguably, the meta-current dump could also be dropped as it too would be consuming unnecessary space.

With that said, I don't think that the pages-articles dump should be dropped until these namespace dumps are available. Although the meta-current dump is a superset of the pages-articles data, it is generally double the size. Most users I've dealt with don't use the talk and user namespaces of the meta-current dump, and would be downloading / processing all that data for no purpose.

In other words, adding namespace dumps, keeping meta-current, and dropping pages-articles would be good. Not adding namespace dumps, keeping meta-current, but dropping pages-articles would not be good.

Finally, I'd vote for generating dumps for each namespace, and not grouping any together. For English Wikipedia, that may mean as many as 32 namespace files (https://en.wikipedia.org/wiki/Wikipedia:Namespace) even though several (like Gadget or Timed Text) would probably be only a few KB.
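To put a number on that, the full namespace list is available from the public siteinfo API, so a throwaway script like the one below (just a sketch; the output filename pattern is invented) would list the files an à-la-carte scheme implies:

```python
# Sketch: enumerate the namespaces a wiki actually has via the siteinfo API,
# to see how many per-namespace dump files there would be. The filename
# pattern printed below is made up for illustration.
import json
import urllib.request

URL = ("https://en.wikipedia.org/w/api.php"
       "?action=query&meta=siteinfo&siprop=namespaces&format=json")
req = urllib.request.Request(URL, headers={"User-Agent": "namespace-list-sketch"})
with urllib.request.urlopen(req) as resp:
    namespaces = json.load(resp)["query"]["namespaces"]

for ns_id, info in sorted(namespaces.items(), key=lambda kv: int(kv[0])):
    if int(ns_id) < 0:
        continue  # skip the virtual Special/Media namespaces
    name = info.get("*") or "(main)"
    print(f"enwiki-YYYYMMDD-pages-ns{ns_id}.xml.bz2  # {name}")
```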

The AutoWikiBrowser dump scanner can only read from one XML file, so if pages-articles is dropped I would probably have to switch to pages-meta-current even though it would be twice as big to download and process. I work with about nine namespaces, so could only work with per-namespace dumps if someone provided me with scripting/programming help to download nine dumps, uncompress them, and combine them into a single file for AWB.

Alternatively, AWB could be modified to read in several XML files.
Pinging @Magioladitis and @Reedy for that.

On what platform are you running?

Also, bots might need to use different namespaces. Anyway, piping the files together locally shouldn't be so hard, though it should be supported explicitly by pywiki/AWB scripts (basically by ignoring any further <siteinfo> section); see the sketch below.
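For example, something along these lines. This is only a rough sketch: it assumes hypothetical uncompressed per-namespace files (which don't exist yet) and relies on </siteinfo> and </mediawiki> sitting on their own lines, as they do in today's dumps.

```python
# Rough sketch of combining several per-namespace dump files into one XML
# file, keeping only the first file's <mediawiki>/<siteinfo> header.
# The script name and file names in the usage comment are hypothetical.
import sys

def combine(output_path, input_paths):
    with open(output_path, "w", encoding="utf-8") as out:
        for i, path in enumerate(input_paths):
            in_header = i > 0  # skip the header of every file after the first
            with open(path, encoding="utf-8") as dump:
                for line in dump:
                    if in_header:
                        if "</siteinfo>" in line:
                            in_header = False
                        continue
                    if "</mediawiki>" in line:
                        break  # the closing tag is written once, at the very end
                    out.write(line)
        out.write("</mediawiki>\n")

if __name__ == "__main__":
    # e.g. combine_dumps.py combined.xml enwiki-ns0.xml enwiki-ns10.xml ...
    combine(sys.argv[1], sys.argv[2:])
```

AWB would then see a single well-formed XML file, at the cost of duplicating the data on disk.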