
Option to produce RDF dump on demand
Open, Needs Triage · Public · Feature

Description

Feature summary (what you would like to be able to do and where):

Basically I want dumpRdf.php but for Wikibase.Cloud

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):

For importing the data into another graph database, such as my local Blazegraph instance

Benefits (why should this be implemented?):

While I *can* use the query service as-is, I would save on a lot of network traffic if I kept the data locally, as I do for Wikidata's graph. Also, being able to export data into a common format is a good feature of SaaS products in general.

Event Timeline

From merged ticket:

I'd be interested to learn more about what the use case would be for the full TTL-encoded dump. Maybe there's a cool usage pattern I could use. It seems like the SPARQL query service can be used to generate a full graph, and it supports offset/pagination to handle scale. Trying to work against a massive TTL encoding means a lot of data being loaded into memory somewhere just to operate on. The SPARQL/Blazegraph capability also has its own scalability issues, especially on queries requiring a filter approach.
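What I mean by the pagination approach, roughly (a sketch only; the endpoint URL is a placeholder, the assumption that CONSTRUCT results come back as N-Triples may not hold, and LIMIT/OFFSET paging without ORDER BY isn't guaranteed to be stable):

```python
import requests

# Sketch of windowed CONSTRUCT paging against the query service.
# The endpoint URL is a placeholder, and the Accept header assumes the
# service will serialize CONSTRUCT results as N-Triples.
ENDPOINT = "https://example.wikibase.cloud/query/sparql"
PAGE_SIZE = 10_000

def dump_graph(out_path):
    offset = 0
    with open(out_path, "w", encoding="utf-8") as out:
        while True:
            query = (
                "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } "
                f"LIMIT {PAGE_SIZE} OFFSET {offset}"
            )
            resp = requests.get(
                ENDPOINT,
                params={"query": query},
                headers={"Accept": "application/n-triples"},
                timeout=300,
            )
            resp.raise_for_status()
            page = resp.text.strip()
            if not page:
                break  # past the last window; done
            out.write(page + "\n")
            offset += PAGE_SIZE

dump_graph("full-graph.nt")
```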

I have been thinking about spinning off aspects of what will end up being a pretty massive knowledge graph into modules in an optimized binary format for some other kinds of direct computation work on the cloud. I've looked at the GAR project, though I haven't gotten into experimentation yet, but it seems like some kind of binary format with high-level metadata and chunking might actually give us a better archival platform to operate against than serialized text triples that can only be dealt with in memory.

Assuming it will take some time to develop the proper plumbing between the Wikibase Cloud frontend and dumpRdf.php, I wonder if an acceptable workaround would be for me to go to Special:EntityData for every single property and item on my Wikibase, in TTL format, concatenate that into a single file, and then import it like a Wikidata dump.
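Something like the following is what I'm picturing (a sketch only; the wiki URL is a placeholder, and the Item/Property namespace IDs of 120/122 are an assumption based on a default Wikibase install):

```python
import requests

# Rough sketch of the concatenation workaround. WIKI and the namespace IDs
# (120 = Item, 122 = Property on a default Wikibase install) are assumptions.
WIKI = "https://example.wikibase.cloud"
SESSION = requests.Session()

def all_entity_ids(namespace):
    """Yield entity IDs (Q.../P...) by walking the allpages API list."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": namespace,
        "aplimit": "max",
        "format": "json",
    }
    while True:
        data = SESSION.get(f"{WIKI}/w/api.php", params=params, timeout=60).json()
        for page in data["query"]["allpages"]:
            # Titles look like "Item:Q42" / "Property:P31"; keep the ID part.
            yield page["title"].split(":")[-1]
        if "continue" not in data:
            break
        params.update(data["continue"])

with open("all-entities.ttl", "w", encoding="utf-8") as out:
    for ns in (120, 122):  # items, then properties
        for entity_id in all_entity_ids(ns):
            ttl = SESSION.get(
                f"{WIKI}/wiki/Special:EntityData/{entity_id}.ttl", timeout=60
            )
            ttl.raise_for_status()
            out.write(ttl.text + "\n")
```

Repeated @prefix declarations and the ontology/reference triples that every per-entity document carries are legal in concatenated Turtle and deduplicate on load, so the result is just bulkier than a proper dump would be.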

I have my use case now. After seeing Hannah Bast and Johannes Kalmbach present on Qlever today, I went and spun up an instance of the back-end indexer/SPARQL service and the UI tool on a cloud VM. It's so cool when software actually works like it's supposed to! The approach to indexing and query federation there is fantastic, and I think I'll be able to jettison some of what I reorganized from sources that already conform to RDF fairly well and keep our wb.c instance focused on messier data. The Qlever SPARQL performance is truly outstanding compared to anything else I've tried.

I'd actually love to see us replace Blazegraph in WBStack with Qlever, but that's another discussion. In the meantime, we do need a) a way to get a Wikibase dump in RDF and b) some reasonable means of tapping an update stream. Once baselined, a combination of an API call for changes and serial/parallel retrieval of changed items in TTL encoding would likely work, but it will be highly inefficient. I'm a huge fan of cutting out, or at least minimizing, HTTP transfer of data wherever possible; that's a massive bottleneck in any process. A little trace routing doesn't tell me exactly where wikibase.cloud sits, but the ideal situation would be some dump process running locally to the wb.c infrastructure that drops files on somebody's cloud for access. We might then replicate files elsewhere for practical use, but if we can shove them onto a cloud storage infrastructure somewhere for direct access, then we can deploy capabilities like Qlever onto that same infrastructure.
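By "an API call for changes" I mean something as crude as polling recentchanges and re-fetching whatever entities were touched; a sketch (the wiki URL and the 120/122 namespace IDs are assumptions), which also shows why this baseline is so HTTP-heavy:

```python
import requests

# Minimal polling sketch; WIKI and the Item/Property namespace IDs (120/122)
# are assumptions for a default Wikibase install. This is the inefficient
# HTTP-based baseline described above, not a real update stream.
WIKI = "https://example.wikibase.cloud"

def changed_entities(since_iso):
    """Return entity IDs edited since the given ISO 8601 timestamp."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcnamespace": "120|122",
        "rcprop": "title|timestamp",
        "rcend": since_iso,  # recentchanges enumerates newest -> oldest
        "rclimit": "max",
        "format": "json",
    }
    changed = set()
    while True:
        data = requests.get(f"{WIKI}/w/api.php", params=params, timeout=60).json()
        for rc in data["query"]["recentchanges"]:
            changed.add(rc["title"].split(":")[-1])
        if "continue" not in data:
            return changed
        params.update(data["continue"])

# e.g. re-fetch Special:EntityData/<id>.ttl for each ID in
# changed_entities("2024-01-01T00:00:00Z") and upsert into the local store
```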

A full dump of geokb.wikibase.cloud to Wikibase's dump format, including full history, took almost a week to run, with two hiccups. That's about a half million edits. That's fine for a process that maintains full history and something we will employ as a future-proofing strategy. But we need a current-state RDF dump capability for wikibase.cloud instances that gives us close-to-real-time state that can be spun up in any other infrastructure as needed.

I've been able to dump a subset of items from geokb.wikibase.cloud as NT files, index them using Qlever's indexing engine (Docker on a cloud-based Ubuntu machine), spin up the Qlever SPARQL back-end on that index, and then query it using the Qlever UI (on the same machine). Both the back-end and UI are proxied via Nginx, and it seems to be working quite well. I don't yet know what my resource load is going to look like and what memory I'll need to allocate, but I noticed the Qlever instance of U. Freiburg going down a fair bit after they publicized it at the Wikidata modeling conference.

I wrote a Python script that runs a query for what I want and then hits /Special:EntityData/<qid>.nt to retrieve and cache individual files for indexing. Blazegraph's query service is supposed to respond with N-Triples or Turtle, but I've not been able to get that to work; I only get the standard XML response back, whether I use the format parameter or an Accept header. The NT format has worked with Qlever so far, and I've had some trouble with TTL that I've not worked out yet.
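In outline the script looks something like this (a simplified reconstruction rather than the actual code; the endpoint URLs, entity IRI prefix, and selection query are placeholders):

```python
import os
import requests

# Simplified reconstruction of the fetch-and-cache script; URLs, the entity
# IRI prefix, and the selection query are placeholders. Assumes the endpoint
# honours format=json for SELECT results.
SPARQL = "https://geokb.wikibase.cloud/query/sparql"
WIKI = "https://geokb.wikibase.cloud"
CACHE_DIR = "nt_cache"

# Placeholder selection query: grab item IRIs by filtering on the concept URI.
QUERY = """
SELECT DISTINCT ?item WHERE {
  ?item ?p ?o .
  FILTER(STRSTARTS(STR(?item), "https://geokb.wikibase.cloud/entity/Q"))
}
"""

def qids_from_query(query):
    resp = requests.get(SPARQL, params={"query": query, "format": "json"}, timeout=300)
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        yield row["item"]["value"].rsplit("/", 1)[-1]  # IRI -> Q123

def cache_entity(qid):
    path = os.path.join(CACHE_DIR, f"{qid}.nt")
    if not os.path.exists(path):
        nt = requests.get(f"{WIKI}/wiki/Special:EntityData/{qid}.nt", timeout=60)
        nt.raise_for_status()
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(nt.text)
    return path

os.makedirs(CACHE_DIR, exist_ok=True)
for qid in qids_from_query(QUERY):
    cache_entity(qid)
```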

Even running individual item pulls in parallel takes a bunch of time, and this seems like a super inefficient and duplicative method of pulling data. I'd imagine there's got to be a better way to go about it.

The good news is, I think I can offload some of what I had been working to reorganize into Wikibase's specific knowledge representation to other collections of RDF data federated through Qlever indexing to give me a ton of reference sources I need. For instance, I also spun up a process to build an index from select DOI metadata from both Crossref and DataCite, using their native support for RDF from their APIs. That gives me the basis for linking on DOIs where I have other content associated with those. I should also be able to leverage the sameAs relationships I've built in claims to Wikidata entities, either federating with the University of Freiburg instance of the full Wikidata store or building a selective index on my instance.
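The Crossref/DataCite RDF retrieval is just content negotiation against doi.org, roughly like this (the DOI below is only a placeholder):

```python
import requests

# Content negotiation on doi.org; Crossref and DataCite both serve RDF
# serializations of their DOI metadata this way. The DOI below is a
# placeholder.
def doi_to_turtle(doi):
    resp = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "text/turtle"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text

print(doi_to_turtle("10.5066/EXAMPLE"))
```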

I realize there would be a ton of engineering involved in retuning WBStack to use Qlever in place of Blazegraph, but I really like how it is operating thus far. It sounds like the current infrastructure for wikibase.cloud uses a message queue process on changes with multiple subscribers, including whatever is writing to Blazegraph. If that could be tapped to write triples to another store, that could be very cool!

Just adding my specific use case here: the reason I think this would be useful for me is so that my instance can be made into an ontology posted quarterly or yearly to GitHub as a TTL file. This is because my instance is, in many ways, a continually updated successor to an ontology I created for my thesis. Being able to pipeline this out in a semi-automated manner 1-4 times a year would be ideal, since BioPortal, Ontobee, etc. automatically pull and curate from GitHub.

The only alternative right now is to use dumpgenerator (https://github.com/mediawiki-client-tools/mediawiki-dump-generator) to download the XML (which takes many hours), re-import that dump into a local Wikibase instance, use the local maintenance scripts to produce the RDF, and then upload that to GitHub.
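The local half of that round trip ends up being roughly the following (a sketch only; paths are assumptions about a standard local install, and the exact dumpgenerator invocation is left out since its flags are best taken from its README):

```python
import subprocess

# Sketch of the local half of the pipeline, assuming the XML dump produced by
# dumpgenerator is already on disk and a throwaway local Wikibase is running.
# Paths are assumptions; dumpRdf.php lives under
# extensions/Wikibase/repo/maintenance/ in a standard install.
MEDIAWIKI = "/var/www/html"   # MediaWiki root of the local Wikibase
XML_DUMP = "site-dump.xml"    # output of dumpgenerator
RDF_OUT = "ontology.ttl"

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Re-import the XML dump into the local wiki.
run("php", f"{MEDIAWIKI}/maintenance/importDump.php", XML_DUMP)

# 2. Serialize the current state of all entities as Turtle.
run("php", f"{MEDIAWIKI}/extensions/Wikibase/repo/maintenance/dumpRdf.php",
    "--format", "ttl", "--output", RDF_OUT)

# 3. Commit and push the TTL to GitHub (assumes the working directory is
#    already a checkout with a configured remote).
run("git", "add", RDF_OUT)
run("git", "commit", "-m", "Scheduled RDF export")
run("git", "push")
```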

If the dumps were made available on request, with some rate limiting to prevent overly frequent downloads, then I could simply download them and automate the GitHub portion of the pipeline, which would be significantly less overhead.