
[Story] Blazegraph support for owl:sameAs and redirects
Closed, Resolved (Public)

Description

One of the features of wikidata.org is redirects: one entity ID is "redirected" to mean another. This usually happens when data items are merged but can also happen for other reasons. From the semantic standpoint, that should mean Q1 and Q2 are now IDs for the same thing, completely equal and used interchangeably, though in practice there may be technical challenges in implementing it as such.

This task is meant to collect information on the capabilities of Blazegraph in implementing aliasing - probably via OWL as owl:sameAs, though if other ways are available we are open to using something else - and to figure out whether it is the right way to represent redirect semantics. Main questions to answer:

  1. Does Blazegraph have a facility to handle OWL statements like owl:sameAs with the semantics we need?
  2. What are the requirements for using it, i.e., do we need any OWL definitions to exist, or can it work off owl:sameAs alone? What are the potential performance costs?
  3. What would actually happen in the DB when using owl:sameAs? Would it achieve complete Q-id unification, and if not, what would work and what wouldn't?
  4. If we can't make it work with existing facilities, is there a possible customization (e.g. a custom IRI->IV mapping) that would allow us to implement the same semantics with reasonable overhead?

Event Timeline

Smalyshev raised the priority of this task from to Medium.
Smalyshev updated the task description.

One approach is to update the forward dictionary to map the IRIs onto one of the existing IVs and then rewrite all of the statements. However, this sort of destructive rewrite does not help if you need the "redirects" to be reversible.

I am pretty sure that we only support this through eager materialization of the owl:sameAs inferences. While there is some code to dynamically expand access paths (BackchainAccessPath), I think that there are some issues with this code and it is disabled by default. See InferenceEngine.Options for the relevant options and comments.
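To make "eager materialization" of the limited sameAs entailments concrete, here is a minimal Python sketch over an in-memory set of triples. It is not Blazegraph's InferenceEngine; the function name and the tuple representation are assumptions made for illustration, and it only snaps together subjects and objects (instances), not predicates.

```python
from itertools import product

SAME_AS = "owl:sameAs"

def materialize_same_as(triples):
    """Return the instance-level owl:sameAs closure of a set of (s, p, o) triples."""
    parent = {}  # union-find forest over resources linked by owl:sameAs

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)

    # Group every aliased resource under its representative.
    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)

    def aliases(x):
        return classes[find(x)] if x in parent else {x}

    closed = set()
    for s, p, o in triples:
        if p == SAME_AS:
            continue
        # Copy each statement to every alias of its subject and object.
        for s2, o2 in product(aliases(s), aliases(o)):
            closed.add((s2, p, o2))
    # Make owl:sameAs symmetric and transitive within each class.
    for members in classes.values():
        for a, b in product(members, members):
            if a != b:
                closed.add((a, SAME_AS, b))
    return closed
```

The cost is visible here: every statement is stored once per alias combination, which is the price of answering queries without rewriting them.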

It sounds like you have a use case where all you want is the ability to snap together two resources. That is, just the limited set of sameAs entailments. Do you know if this would include only same-as for instances or also for properties and classes?

if you need the "redirects" to be reversible.

I don't think a redirect is reversible, but that's a good point; I need to verify that. Technically it can be reversed, like pretty much every action on the wiki, but I'm not sure if it's a normal occurrence. Will check.

Do you know if this would include only same-as for instances or also for properties and classes?

Probably just for regular nodes (instances) right now. Not sure about properties - I'm pretty sure it's not allowed in wikidata (instead, properties are just obsoleted and new ones are added). Will verify.

daniel raised the priority of this task from Medium to High. Apr 20 2015, 10:19 AM

Bumping prio, since it seems that we at least need to investigate & decide soon.

Do you know if this would include only same-as for instances or also for properties and classes?

Currently, property redirects are not allowed, though it may happen in the future, at least in theory.

@Smalyshev mentioned that Blazegraph has a rewrite mechanism we could use to resolve redirects when processing updates and queries. That works when we know that A redirects to B (we have <A> owl:sameAs <B>). Problems arise when a redirect is created or undone, that is, when <A> owl:sameAs <B> is added to or removed from the database:

When <A> owl:sameAs <B> is added, we have to replace all occurrences of <A> in the graph with <B> (and delete triples with <A> as the subject, since A's old content is now gone):

?s  ?p  <A>   becomes   ?s  ?p  <B>
?s  <A>  ?o   becomes   ?s  <B>  ?o
<A>  ?p  ?o   is deleted
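The three rules above can be sketched in Python as a pure function over a set of (s, p, o) tuples. This is an illustration only; `apply_redirect` is a hypothetical name, and keeping the redirect triple itself as A's only remaining statement is an assumption.

```python
def apply_redirect(triples, a, b):
    """Rewrite a set of (s, p, o) triples after <a> owl:sameAs <b> is added."""
    rewritten = set()
    for s, p, o in triples:
        if s == a:
            continue  # <A> ?p ?o is deleted: A's old content is gone
        # ?s <A> ?o becomes ?s <B> ?o, and ?s ?p <A> becomes ?s ?p <B>
        rewritten.add((s, b if p == a else p, b if o == a else o))
    # Assumption: the redirect itself stays as A's only remaining triple.
    rewritten.add((a, "owl:sameAs", b))
    return rewritten
```

For example, with A = Q1 and B = Q2, the triple (Q9, P50, Q1) becomes (Q9, P50, Q2) and (Q1, P31, Q5) is dropped.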

But when <A> owl:sameAs <B> is removed, we'd have to revert only some occurrences of <B> back to <A>, namely the ones that still reference A in the primary dataset on Wikidata. We'd need a way to find these.

Maybe named graphs could be used to solve this? When replacing <A> with <B>, we'd copy the original triple to a secondary graph <R> (R for "redirect"; and let's call the default graph <Q> for "query"):

?s  ?p  <A>  <Q>   becomes   ?s  ?p  <B>  <Q>   and   ?s  ?p  <A>  <R> 
?s  <A>  ?o  <Q>   becomes   ?s  <B>  ?o  <Q>   and   ?s  <A>  ?o  <R>
<A>  ?p  ?o  <Q>   is deleted

We can then undo a redirect by copying the redirects backed up in the <R> graph back into the <Q> graph.
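A minimal Python sketch of this backup-and-restore idea, using (s, p, o, g) tuples with the graph names "Q" and "R" from above. The function names are hypothetical. Removing the rewritten <B> quad on undo is an assumption on my part; with set semantics it would also remove a statement that referenced B independently before the redirect, so a real implementation would need to track that case.

```python
Q, R = "Q", "R"  # query graph and redirect-backup graph

def swap(term, a, b):
    return b if term == a else term

def add_redirect(quads, a, b):
    """Rewrite <a> to <b> in graph Q, backing up the originals in graph R."""
    out = set()
    for s, p, o, g in quads:
        if g != Q:
            out.add((s, p, o, g))
        elif s == a:
            continue  # <A> ?p ?o <Q> is deleted
        elif a in (p, o):
            out.add((s, swap(p, a, b), swap(o, a, b), Q))  # rewritten copy
            out.add((s, p, o, R))                          # backup of the original
        else:
            out.add((s, p, o, g))
    return out

def undo_redirect(quads, a, b):
    """Copy the backups that mention <a> from graph R back into graph Q."""
    backups = {(s, p, o) for s, p, o, g in quads if g == R and a in (p, o)}
    rewritten = {(s, swap(p, a, b), swap(o, a, b)) for s, p, o in backups}
    out = set()
    for s, p, o, g in quads:
        if g == R and (s, p, o) in backups:
            continue  # the backup is consumed
        if g == Q and (s, p, o) in rewritten:
            continue  # assumption: drop the rewritten form
        out.add((s, p, o, g))
    out |= {(s, p, o, Q) for s, p, o in backups}
    return out
```

Note that A's own statements (those with A as the subject) are not restored by this sketch; they would have to come from a fresh update of A.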

All this has to be done for all URIs derived from the redirected entity ID, not just the concept URI.

we have to replace all occurrences of <A> in the graph with <B> (and delete triples with <A> as the subject, since A's old content is now gone)

Deletion happens automatically as part of the update, since new data only has one triple with A as a subject. However, replacement is harder. Replacing all triples that mention <A> as an object can be an expensive operation.

But when <A> owl:sameAs <B> is removed, we'd have to revert only some occurrences of <B> back to <A>

That would be pretty hard to do, given that the data may have changed - i.e., even if we kept the backup of the pre-redirect data, some of it may not be relevant anymore, as the corresponding subjects may have been updated by the user.

All this has to be done for all URIs derived from the redirected entity ID, not just the concept URI.

What other URIs are there?

we have to replace all occurrences of <A> in the graph with <B> (and delete triples with <A> as the subject, since A's old content is now gone)

Deletion happens automatically as part of the update, since new data only has one triple with A as a subject. However, replacement is harder. Replacing all triples that mention <A> as an object can be an expensive operation.

Yes, but it would be done only once per redirect, and even that should be rare.

Do you see an alternative that would be less expensive?

But when <A> owl:sameAs <B> is removed, we'd have to revert only some occurrences of <B> back to <A>

That would be pretty hard to do, given that the data may have changed - i.e., even if we kept the backup of the pre-redirect data, some of it may not be relevant anymore, as the corresponding subjects may have been updated by the user.

You would have seen these modifications, and tracked them. Of course, if we have <X> <P> <B> <Q> and <X> <P> <A> <R>, an edit that changes X to have P:C instead of P:A would need to remove both <X> <P> <B> <Q> and <X> <P> <A> <R> before inserting <X> <P> <C> <Q>. You'd apply the update to the "backup" graph <R> without rewriting A to B, and the update of the "query" graph <Q> after rewriting A to B. Seems straightforward to me.
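A rough Python sketch of that double update, under the assumptions that an edit replaces all statements of one subject and that graph <R> only holds statements mentioning a redirected ID. The name `apply_edit` and the `redirects` dict (redirected ID to target) are hypothetical.

```python
def apply_edit(quads, subject, new_triples, redirects):
    """Replace all statements of `subject` in graphs "Q" and "R".

    quads:       set of (s, p, o, g) tuples, with g in {"Q", "R"}
    new_triples: the subject's statements as they appear on Wikidata
    redirects:   dict mapping a redirected ID to its target, e.g. {"A": "B"}
    """
    def rewrite(term):
        return redirects.get(term, term)

    # Drop the subject's old statements from both graphs.
    out = {q for q in quads if q[0] != subject}
    for s, p, o in new_triples:
        # Query graph: insert after rewriting A to B.
        out.add((s, rewrite(p), rewrite(o), "Q"))
        # Backup graph: keep the unrewritten form only if it mentions a redirect.
        if p in redirects or o in redirects:
            out.add((s, p, o, "R"))
    return out
```

With the example above, changing X from P:A to P:C leaves only <X> <P> <C> <Q>, while an edit that keeps P:A reproduces both <X> <P> <B> <Q> and <X> <P> <A> <R>.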

All this has to be done for all URIs derived from the redirected entity ID, not just the concept URI.

What other URIs are there?

As far as I can tell, there are currently no extra URIs for Items, but several variants of Property URIs (/prop/, /prop/statement/, /prop/statement/value/, etc). We currently don't have redirects for properties, but the fact that more than one URI may be derived from a given entity ID is something to keep in mind.

Yes, but it would be done only once per redirect, and even that should be rare

True, once per redirect, but redirects are not that rare... It may be an option though. We need to consider performance impact for this carefully.

As far as I can tell, there are currently no extra URIs for Items, but several variants of Property

I'd really hate redirecting properties, there's so much potential mess in it. Couldn't we just ban redirecting properties and, if someone really needs it, leave it to bots and the community to clean up the data?

Jonas renamed this task from Blazegraph support for owl:sameAs and redirects to [Story] Blazegraph support for owl:sameAs and redirects. Nov 2 2015, 3:51 PM
Smalyshev lowered the priority of this task from High to Low. Aug 22 2016, 5:55 PM

Seeing as we've run for a while with only owl:sameAs implemented and got no complaints, I don't think a priority of "High" is warranted. We still want to get to it, but other things for which we do have an idea how to proceed will have to come first.

Gehel claimed this task.
Gehel subscribed.

Given the previous comment from 2016, it looks like things are working well enough. Feel free to re-open as needed.