
Wikibase Cloud data export tool
Open, Needs Triage · Public · Feature

Description

Feature summary (what you would like to be able to do and where):

I'd like to be able to download the data (Qs/Ps) in a wikibase as a dataset, and I would like this feature to be supported as part of Wikibase Cloud. There may also be a need for history, discussion pages, etc., but I'm focused on getting the statements/triples out in some kind of plain-text form (RDF, TTL, YAML, etc.).

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):

I work in cultural heritage. I work on the principle that interactive sites we build as part of projects, like a wikibase, will at some point cease to be updated as projects lose momentum, team members, etc., and will eventually go offline. However, the data remains of value *after* the interactive website is gone. So my model of preservation involves creating dumps of the data produced and depositing them somewhere for reuse at a later date by somebody who wants that data.

Benefits (why should this be implemented?):

There are inefficient, compute-heavy workarounds for this problem - scraping a site with wget, or using dumpgenerator (https://github.com/WikiTeam/wikiteam/issues/395) - but they are unsupported by Wikibase Cloud. This feature would encourage Wikibase Cloud creators to carefully consider the preservation of the data they produce, and give them a supported tool for exporting their data at the end of a project (which is often a requirement of a research project).
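For context on what a supported tool would wrap: Wikibase already exposes per-entity exports via Special:EntityData in several serializations (.json, .ttl, .rdf, .nt); what's missing is a sanctioned way to do this in bulk. A minimal sketch, assuming a hypothetical instance at example.wikibase.cloud:

```python
import requests

BASE = "https://example.wikibase.cloud"  # hypothetical instance; substitute your own

def fetch_entity(entity_id, fmt="ttl"):
    """Fetch one entity via Special:EntityData; fmt can be json, ttl, rdf, or nt."""
    resp = requests.get(f"{BASE}/wiki/Special:EntityData/{entity_id}.{fmt}")
    resp.raise_for_status()
    return resp.text

print(fetch_entity("Q1"))  # Turtle serialization of item Q1
```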

Event Timeline

Addshore subscribed.

Some wbstack.com context: I always said this would be desired, and I did provide the odd JSON or RDF dump to people who requested one (creating them manually).
My past goal would have been to make this self-serve,
but without letting folks create unlimited dumps arbitrarily, as creating the dumps is not free.

Thanks for chipping in @Addshore. To amend my feature request:

  • JSON or RDF is sensible.
  • Agree on self-serve and some kind of dump limit (I'd be looking to do this roughly every 1-3 months); a rough sketch of what self-serve could wrap follows below.
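To make the self-serve idea concrete, here is a rough sketch of one way a bulk export could work under the hood: enumerate entity pages through the standard MediaWiki list=allpages API and fetch each entity's Turtle. The host is a placeholder, and the namespace numbers assume Wikibase's defaults (Item = 120, Property = 122), which should be verified per wiki; this is not a description of how Wikibase Cloud actually implements anything.

```python
import requests

BASE = "https://example.wikibase.cloud"  # hypothetical instance
API = f"{BASE}/w/api.php"
ITEM_NS, PROPERTY_NS = 120, 122  # Wikibase's default namespaces; verify on your wiki

session = requests.Session()

def entity_ids(namespace):
    """Yield entity IDs (Q.../P...) by paging through list=allpages."""
    params = {"action": "query", "list": "allpages",
              "apnamespace": namespace, "aplimit": "max", "format": "json"}
    while True:
        data = session.get(API, params=params).json()
        for page in data["query"]["allpages"]:
            yield page["title"].split(":")[-1]  # "Item:Q42" -> "Q42"
        if "continue" not in data:
            break
        params.update(data["continue"])

# Concatenating per-entity Turtle is legal: Turtle allows re-declared @prefix lines.
with open("dump.ttl", "w", encoding="utf-8") as out:
    for ns in (ITEM_NS, PROPERTY_NS):
        for eid in entity_ids(ns):
            resp = session.get(f"{BASE}/wiki/Special:EntityData/{eid}.ttl")
            resp.raise_for_status()
            out.write(resp.text + "\n")
```

A real self-serve feature would presumably run something like this server-side on a schedule (hence the 1-3 month cadence and rate limit discussed above), rather than hammering the public endpoints per entity.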

Morning. I note that since this ticket went in, a more stable Python 3 version of dumpgenerator has emerged (huge thanks to all involved): https://github.com/elsiehupp/wikiteam3/ This solution works for me for now, but I suspect there are users for whom this use case remains valid.

Sharing the data and using the exported data for import into our UI are primary use cases for us. We definitely need RDF and may need JSON as well, but the dumpgenerator @Drjwbaker cites can be used to get the JSON.
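For the JSON side specifically, the stock Wikibase API already serves entity JSON in batches via wbgetentities (up to 50 IDs per call), which a dump tool could build on. A minimal sketch, again against a hypothetical instance:

```python
import json
import requests

API = "https://example.wikibase.cloud/w/api.php"  # hypothetical instance

def get_entities(ids):
    """Fetch up to 50 entities' JSON in one wbgetentities call."""
    resp = requests.get(API, params={"action": "wbgetentities",
                                     "ids": "|".join(ids),
                                     "format": "json"})
    resp.raise_for_status()
    return resp.json()["entities"]

print(json.dumps(get_entities(["Q1", "P1"])["Q1"], indent=2))
```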