Data dumps need better documentation
Closed, Resolved (Public)

Description

I've just started exploring the XML data dumps available in /public/dumps/public/. It is difficult to understand what's available because the documentation is incomplete.

I'm looking specifically at https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download#XML_files. It gives a general description of what's available but, unlike the Database tables section above it, gives no concrete mapping to file names.

Could somebody who is familiar with the dump process please go through and provide more detailed documentation of which file name patterns correspond to which XML data sets? I started recording some of my own observations at https://meta.wikimedia.org/wiki/Talk:Data_dumps/Dump_format, but that's both laborious and error-prone; I might simply be memorializing my misunderstanding of how things actually are.

Event Timeline

Hi, dumps maintainer here.

The database schema for each sql table is linked to from the docs for those files, at https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download#Available_for_download_from_XML/Sql_dumps_per_project

A description of the XML contents for the stubs/page content files is linked to from the docs for those files as well; see https://meta.wikimedia.org/wiki/Data_dumps/Dump_format, especially the section "Format of the XML files".

Is it the file naming scheme that is confusing, and in particular the numbering? I can certainly add some information about that.

I have added some information about filenames here: https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download#XML_files if that is helpful. Please let me know if there is other information missing!

Thanks for the quick response. That looks like it covers most of the questions I had.

I assume the md5sums and sha1sums files are just checksumming the corresponding data files for transmission verification?

Yes, sometimes downloads are corrupted, causing any number of weird errors, and the hashes are a quick way to check that.
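
For anyone wanting to script that check, here is a minimal Python sketch. It assumes the md5sums/sha1sums files use the common `md5sum`-style layout of one "<hex digest>  <file name>" pair per line; that layout is an assumption here, not something confirmed in this thread. The helper names (`file_digest`, `parse_checksums`, `verify`) are made up for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so large dump files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse lines of the assumed form '<hex digest>  <file name>'."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            # Some tools prefix the name with '*' to mark binary mode.
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify(data_path, checksums_path, algorithm="sha1"):
    """Return True if data_path matches its entry in the checksums file."""
    expected = parse_checksums(Path(checksums_path).read_text())
    name = Path(data_path).name
    if name not in expected:
        raise KeyError(f"{name} is not listed in {checksums_path}")
    return file_digest(data_path, algorithm) == expected[name]
```

A call such as `verify("some-dump-file.xml.bz2", "sha1sums.txt", "sha1")` (both file names are placeholders) returns False when the download does not match the published hash, in which case re-downloading is the usual remedy. Pass "md5" as the algorithm when checking against an md5sums file.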

If you have no further questions, I'll close this task. You might also consider subscribing to the xmldatadumps-l mailing list (https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l). It is a relatively low-traffic list where announcements about dumps are sent and where people familiar with using the dumps discuss them with each other.

ArielGlenn claimed this task.