Rebuilding the WDQS index from the entire Wikibase history
Closed, Resolved · Public

Description

As part of my weekend experiments with the Wikibase Release Strategy, I bumped into a tricky issue with preserving data across versions of the query service. I pulled the latest code and updated the Docker images to see whether an issue with the WDQS UI (T186467) had been fixed (and indeed it had)!

However, I noticed that when I created new items and statements in Wikibase, WDQS stopped picking them up. A quick chat with @Addshore suggested that, since I had updated the query service image, the WDQS index would need to be rebuilt. As Adam explained, "the issue is the updated WDQS version will have differences in the way it stores the data in the index file".

The suggested fix was to delete the volume (or the data in the corresponding directory) and restart Blazegraph to start over: the index will then be rebuilt from recent changes.
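In a docker-compose setup, that reset might look something like the following sketch; the wdqs service name and the query-service-data volume name are assumptions, so check docker volume ls for the actual names in your project:

docker-compose stop wdqs
docker-compose rm -f wdqs                        # remove the container so the volume is no longer referenced
docker volume rm <project>_query-service-data    # exact volume name via `docker volume ls`
docker-compose up -d wdqs                        # recreates the service with a fresh, empty volume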

The problem is that some of my oldest changes are no longer in RC, since they were made more than 90 days ago and I never changed the default $wgRCMaxAge on my local instance.
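For reference, $wgRCMaxAge is measured in seconds and set in LocalSettings.php; extending it to a year (an arbitrary example value) would look like this:

// LocalSettings.php: keep recent changes for one year instead of the 90-day default
$wgRCMaxAge = 365 * 24 * 60 * 60;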

It looks like the WDQS index update mechanism depends on the RC API, and as a result I won't be able to recover my older changes unless I go through a full JSON dump export and ingestion.

@Smalyshev, can you think of a way to force an index rebuild from the entire revision table, as opposed to RC, for standalone Wikibase instances? This is an issue that every user of the Wikibase Release Strategy will sooner or later bump into. It might also be wise to change the default RC purging period for local Wikibase instances to mitigate this problem.

(This may also be related to T186161; cc'ing @Andrawaag for visibility.)

Event Timeline

Restricted Application added a subscriber: Aklapper.

Theoretically it should be possible to dump the entire DB (with a CONSTRUCT query or otherwise) as RDF and then reload it into the new instance. For docker containers, which are usually used for smaller installs, having a very long $wgRCMaxAge is probably also good.
But if you just need to take data out of an existing Wikibase instance, you don't need the RC feed: you can produce an RDF dump with extensions/Wikibase/repo/maintenance/dumpRdf.php and load it into the database.

> Theoretically it should be possible to dump the entire DB (with a CONSTRUCT query or otherwise) as RDF and then reload it into the new instance. For docker containers, which are usually used for smaller installs, having a very long $wgRCMaxAge is probably also good.

Yep, I'm going to modify some of the readmes etc. to suggest a longer RC period.

> But if you just need to take data out of an existing Wikibase instance, you don't need the RC feed: you can produce an RDF dump with extensions/Wikibase/repo/maintenance/dumpRdf.php and load it into the database.

@Smalyshev @Addshore I generated an RDF dump using Adam's instructions:

docker-compose exec wikibase php ./extensions/Wikibase/repo/maintenance/dumpRdf.php

How do I use this to populate the WDQS index?

I guess we need the "load the dump" part of https://github.com/wikimedia/wikidata-query-rdf/blob/master/docs/getting-started.md :)
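For what it's worth, the loading flow in that guide looks roughly like the sketch below. It assumes dumpRdf.php writes to stdout so the dump can be redirected to a file, and that munge.sh and loadData.sh are run from a wikidata-query-rdf service directory; the wdq namespace and the data/split path are the guide's defaults, and wikibase-dump.ttl is just an illustrative filename:

# capture the RDF dump to a file
docker-compose exec wikibase php ./extensions/Wikibase/repo/maintenance/dumpRdf.php > wikibase-dump.ttl
# preprocess ("munge") the dump into chunks the loader understands
./munge.sh -f wikibase-dump.ttl -d data/split
# load the munged chunks into the running Blazegraph instance
./loadData.sh -n wdq -d `pwd`/data/split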

@Smalyshev, how would you feel about pointing to the docker images in that readme?

> how would you feel about pointing to the docker images in that readme?

Sure, please feel welcome to submit a patch.