
Stop using blank nodes for encoding SomeValue and OWL constraints in WDQS
Open, High, Public

Description

Problem statement:

We are experiencing severe performance issues in the process that keeps Wikidata and the triple store behind WDQS in sync. These performance issues cause edits on Wikidata to be throttled. While reviewing the way we do updates on the store, we decided to move most of the synchronization/reconciliation process out of the triple store, with the objective of sending only the minimal amount of information needed to mutate the graph with a set of trivial operations (ADD/REMOVE triples). This is where blank nodes are problematic (to dig further into why, I suggest reading the proposal on TurtlePatch, which is an attempt to formalize a patching format for RDF backends).

Where blank nodes are currently used

In Wikibase we use blank nodes for two purposes:

  • denote the existence of a value (ambiguously named unknown value in the UI) (originally discussed in T95441)
  • OWL constraints of the wdno property

For the SomeValue use case we seem to use the blank node only as a way to filter such values.
For the OWL constraints it is unclear whether they are actually used/useful.

Suggested solution

One option is to do blank node skolemization as explained in RDF 1.1 3.5 Replacing Blank Nodes with IRIs.

@prefix genid: <http://www.wikidata.org/.well-known/genid/> .

 wd:Q3 a wikibase:Item ;
     wdt:P2 genid:a8d14fa93486370345412093add8f50c .
 wds:Q3-45abf5ca-4ebf-eb52-ca26-811152eb067c a wikibase:Statement ;
     ps:P2 genid:a49fd4307e7deef3b569568be8019566 ;
     wikibase:rank wikibase:NormalRank .

This way such triples would remain "reference-able", allowing us to patch the WDQS backend with simple INSERT DATA/DELETE DATA statements, without having to query the graph.
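For illustration, a minimal sketch of what such a patch could look like once the objects are skolemized (the first genid IRI is taken from the example above; the second is a made-up replacement value):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX genid: <http://www.wikidata.org/.well-known/genid/>

# Removing and re-adding a SomeValue triple needs no WHERE clause,
# because the skolem IRI can be named directly.
DELETE DATA {
  wd:Q3 wdt:P2 genid:a8d14fa93486370345412093add8f50c .
};
INSERT DATA {
  wd:Q3 wdt:P2 genid:0f9c8e7d6b5a4c3b2a19 .
}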

Problems induced with the approach in WDQS
  • Queries using isBlank() will be broken
    • Mitigate the issue by introducing a new function wikibase:isSomeValue() so that queries relying on isBlank() can be rewritten.
  • Conflating classic IRIs with SomeValue IRIs (use of isURI/isIRI)
    • Queries using isIRI/isURI risk conflating SomeValue IRIs with ordinary IRIs and will have to be verified.
  • Consumers of WDQS results expecting blank nodes in results:
    • will have to change to understand the skolem IRIs
Suggested migration plan
  1. Introduce a new wikibase:isSomeValue() function to ease the transition
  2. Start using stable and unique labels for blank nodes in wikibase RDF Dumps
  3. Do blank node skolemization in the WDQS update process [BREAKING CHANGE]
  4. Skolemize blank nodes in the RDF Dump [BREAKING CHANGE]
NOTE: step 4 is not strictly required to address the performance of the update process. It is added because there were concerns about introducing yet another difference between the dump format and WDQS.

There are more detailed discussions around this topic here as well.

Event Timeline


And

SELECT ?human
WHERE { ?human wdt:P106 ?o }

Would now mean: All entities with a known occupation
As opposed to All entities with a known or unknown occupation
which should be written as:

SELECT ?human
WHERE { {?human wdt:P106 ?o} union {?human a wdunk:P106} }

I strongly oppose this part. I wasn’t around when the Wikibase RDF model was designed, so I can’t say whether this was an intentional feature or a happy design accident (though I suspect it’s intentional), but the fact that ?subject wdt:P570 ?died matches both known and unknown values (but not missing values) is a very useful feature, and one that many of my own queries (and, I suspect, others’ as well) rely on for correctness.

If the problem is just the blank nodes themselves, why not use this new wdunk:P2 in the same way, as in wd:Q3 wdt:P2 wdunk:P2? That’s still worse than the blank nodes (multiple “unknown value” statements collapse into one triple, just as is currently the case for “no value” statements), but at least it shouldn’t break as many queries.

If the problem is just the blank nodes themselves, why not use this new wdunk:P2 in the same way, as in wd:Q3 wdt:P2 wdunk:P2? That’s still worse than the blank nodes (multiple “unknown value” statements collapse into one triple, just as is currently the case for “no value” statements), but at least it shouldn’t break as many queries.

Yes, the problem is the blank nodes themselves, as there is no way to mutate the graph without querying it.
I'm OK with your suggestion, but this makes two unrelated unknown values equal.

Would something like

wd:Q2 wdt:P2 wdunk:Q2-6657d0b5-4aa4-b465-12ed-d1b8a04ef658

be acceptable?

This would be very similar to the previous approach using blank nodes.

No two different unknown values could be collapsed; the drawback is that to extract unknown values one would have to rely on a URI prefix filter using STRSTARTS.

SELECT ?human
WHERE {
	?human wdt:P106 ?o
	FILTER isBLANK(?o) .
}

would become

PREFIX wdunk: <http://www.wikidata.org/prop/unknown/> 

SELECT ?human
WHERE {
	?human wdt:P106 ?o
	FILTER STRSTARTS( STR(?o), 'http://www.wikidata.org/prop/unknown/' ) .
}

Any other suggestions?
Ideally I'd like to find a structure that does not require running filters.

Yeah, I also thought of encoding the statement ID in it, but the STRSTARTS is a bit ugly. Also, this doesn’t provide an obvious path for unknown values used in qualifiers and references… what could those use? (I assume the main requirement for the updater is that, whatever these IDs look like, they should remain stable between RDF exports of different revisions?)

For the STRSTARTS, we could of course throw more triples at the wall to solve the problem –

wd:Q2 wdt:P2 wdunk:Q2-6657d0b5-4aa4-b465-12ed-d1b8a04ef658.
wdunk:Q2-6657d0b5-4aa4-b465-12ed-d1b8a04ef658 a wikibase:UnknownValue.

– but that seems wasteful, and there might also be queries relying on the fact that the current “unknown value” blank nodes have no outgoing triples (not sure). Perhaps we could hide the STRSTARTS() in a function? That could improve readability a lot:

?human wdt:P106 ?occupation.
FILTER(wikibase:isUnknownValue(?occupation))

I’m not sure how well that would perform, though. (Currently, I believe we only have one custom function: wikibase:decodeUri, introduced in T168923: Add urldecode function.)

Yes, the issue with blank nodes is that they are not "reference-able", and thus point delete queries (which are what we want to achieve with the next-gen updater) are impossible.

I did some tests and isBlank is a lot faster (I suppose because this information is inlined as opposed to the IRI that has to be fetched from its dictionary). So if we materialize the unknown value with the statement identifier, we risk encountering timeouts more frequently.

So unless we have a third alternative we have two choices:

  • use a constant value: probably very fast, but we would now be saying that all unknown values are equal.
  • use the statement identifier: very close to the previous semantics, but a lot slower.

I think I prefer the first approach you suggested; dealing with perf issues seems more annoying than a less precise graph.
The use cases that I can think of that could be affected are:

  • queries based on equality: find entities which share the same value. Such queries will have to filter out the "unknown value" explicitly (see the sketch below)
  • queries based on the number of unknown values on a particular property? Examples would help here I think.
  • other use cases?
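A minimal sketch of the equality concern, assuming Lucas's per-property constant placeholder and the hypothetical filter function floated above (property and function names are illustrative):

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>

# With a constant placeholder per property, two people whose date of death is
# "unknown" would share the same object and spuriously match on equality,
# so the placeholder has to be excluded explicitly.
SELECT ?a ?b WHERE {
  ?a wdt:P570 ?died .
  ?b wdt:P570 ?died .
  FILTER(?a != ?b)
  FILTER(!wikibase:isUnknownValue(?died))
}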
dcausse updated the task description. (Show Details) Feb 7 2020, 1:55 PM
dcausse updated the task description. (Show Details) Feb 7 2020, 2:11 PM

CCing @mkroetzsch and @Denny for input on the RDF model – they probably have some use cases in mind.

queries based on the number of unknown values on a particular property? Examples would help here I think.

Note that people who are counting unknown values with wdt:P106 ?blank already only count best-rank statements; and if they count all statements via p:P106 ?statement. ?statement ps:P106 ?blank, then they should still get the correct count. So I’m not sure how bad the collapsing of unknown values would be in practice.
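A sketch of the statement-level counting pattern described above; it should return the same number whether the objects are blank nodes or per-statement skolem IRIs, as long as each SomeValue snak keeps its own distinct object:

PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>

# Count all "unknown occupation" statements, regardless of rank.
SELECT (COUNT(?statement) AS ?unknownOccupations) WHERE {
  ?human p:P106 ?statement .
  ?statement ps:P106 ?value .
  FILTER(isBlank(?value))   # would become the proposed isSomeValue()/isUnknownValue() test after the switch
}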

The “find entities which share the same value” example is a very good point, though. That might be dangerous.

I did some tests and isBlank is a lot faster (I suppose because this information is inlined as opposed to the IRI that has to be fetched from its dictionary).

If we go with an wikibase:isUnknownValue function, then its implementation might also be able to be very efficient, also looking only at the inlined part of the information? Not sure. (I’m assuming that these wdunk: nodes would be inlined as a special “unknown value” type and the “entity ID + UUID” part, similar to how entity IDs, I believe, are inlined as their type plus the numeric ID part. Hopefully the UUID isn’t too long to be inlined? But maybe even if it is, the wikibase:isUnknownValue function wouldn’t need to load the not-inlined part? I don’t know enough Blazegraph internals.)

Hi,

Using the same value for "unknown" is a very bad idea and should not be considered. You already found out why. This highlights another general design principle: the RDF data should encode meaning in structure in a direct way. If two triples have the same RDF term as object, then they should represent relationships to the same thing, without any further conditions on the shape of that term. Otherwise, SPARQL does not work well. For example, the property paths you can write with * have no way of performing extra tests on the nodes you traverse, so the meaning of a chain must not be influenced by the shape of the terms on a property chain, if you want to use * in queries in a meaningful way.

This principle is also why we chose bnodes in the first place. OWL also has a standard way of encoding the information that some property has an (unspecified) value, but the encoding of this looks more like what we have in the case of negation (no value) now. If we had used this, one would need a completely different query pattern to find people with unspecified date of death than for people with specified date of death. In contrast, the current bnode encoding allows you to ask a query for everybody with a date of death without having to know if it is given explicitly or left unspecified (you don't even have to know that the latter is possible). This should be kept in mind: the encoding is not just for "use cases" where you are interested in the special situation (e.g., someone having unspecified date of death) but also for all other queries dealing with data of some kind. For this reason, the RDF structure for encoding unspecified values should as much as possible look like the cases where there are values.

I am not aware of any other option for encoding "there is a value but we know nothing more about it" in RDF or OWL besides the two options I mentioned. The proposal to use a made-up IRI instead of a bnode gives identity to the unknown (even if that identity has no meaning in our data yet). It works in many unspecified-value use cases where bnodes work, but not in all. The three main confusions possible are:

  1. confusing a placeholder "unspecified" IRI with a real IRI that is expected in normal cases (imagine using a FILTER on URL-type property values),
  2. believing that the data changed when only the placeholder IRI has changed (imagine someone deleting and re-adding a qualifier with "unspecified" -- if it's a bnode, the outcome is the same in terms of RDF semantics, but if you use placeholder IRIs, you need to know their special meaning to compare the two RDF data sets correctly)
  3. accidental or deliberate uses of placeholder IRIs in other places (imagine somebody puts your placeholders as value into a URL-type property)

Case 3 can probably be disallowed by the software (if one thinks of it).

Another technical issue with the approach is that you would need to use placeholder IRIs also with datatype properties that normally require RDF literals. RDF engines will tolerate this, and for SPARQL use cases it's not a huge difference from tolerating bnodes there. But it does put the data outside of OWL, which does not allow properties to be for literals and IRIs at the same time. Unfortunately, there is no equivalent of creating a placeholder IRI for things like xsd:int or xsd:string in RDF (in OWL, you can write this with a class expression, but it will be structurally different from other cases where this data is set).

For the encoding of OWL negation, I am not sure if switching this (internal, structure) bnode to a (generated, unique) IRI would make any difference. One would have to check with the standard to see if this is allowed. I would imagine that it just works. In this case, sharing the same auxiliary IRI between all negative statements that refer to the same property should also work.
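For reference, a sketch of the owl:complementOf construct as it appears in the current dump for a "no value" statement on a property like P40 (shape taken from the documented Wikibase RDF mapping; the exact triples may differ slightly):

@prefix wdno: <http://www.wikidata.org/prop/novalue/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# The restriction node is the internal, structural blank node discussed above;
# giving it a generated IRI (or sharing one IRI per property) is what is being considered.
wdno:P40 a owl:Class ;
    owl:complementOf _:restriction .
_:restriction a owl:Restriction ;
    owl:onProperty wdt:P40 ;
    owl:someValuesFrom owl:Thing .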

So: dropping in placeholder IRIs is the "second best thing" to encode bnodes, but it gives up several advantages and introduces some problems (and of course inevitably breaks existing queries). Before doing such a change, there should be a clearer argument as to why this would help, and in which cases. The linked PDF that is posted here for motivation does not speak about updates, and indeed if you look at Aidan's work, he has done a lot of interesting analysis with bnodes that would not make any sense without them (e.g., related to comparing RDF datasets; related to my point 2 above). I am not a big fan of bnodes either, but what we try to encode here is what they have genuinely been invented for, and any alternative also has its issues.

Jheald added a subscriber: Jheald. Edited Feb 8 2020, 9:52 PM

Please don't think or refer to the blank nodes as just "unknown values".

The term used by the wikibase software is "somevalue". The blank nodes are now commonly used where the information *is* known, but does not have a wikidata item. This is represented by giving the statement the magic "somevalue" status, plus adding a P1932 "stated as" qualifier to give the (known) information as a text string.

The fact that the UI reports the value as "unknown" is already a menace, an undesirable misrepresentation of how the value is being used. Please don't compound this by letting the characters "unk" or "unknown" anywhere near the RDF data model and the sparql interface.

Please don't think or refer to the blank nodes as "unknown values".

I fully agree. The use of the word "unknown" in the UI was a mistake that stuck. The intention was always to mean "unspecified" without any epistemic connotation. That is: an unspecified value only makes a positive statement ("there is a value for this property") and no negative one ("we [who exactly?] do not know this value").

Example of a Listeria tracking page, counting how many blank nodes are being used this way for the properties used on a particular set of items (in this case: a particular set of books, where the publisher (known) may not yet have an item, or at least not yet a matched item): https://www.wikidata.org/wiki/Wikidata:WikiProject_BL19C/titles_stmts

Yes, at the end of the day it's just using

FILTER(isBlank(?stmt_value)) .

and counting statements, so any of the routes above would work.

But please let's call them "blank values" rather than "unknown values", with functions called wikibase:isBlankValue() or wikibase:isSomeValue() rather than wikibase:isUnknownValue(). Thanks!

Why would we call them “blank values” if we’re transitioning away from blank nodes as the underlying mechanism?

Thanks for all the feedback.
I'll discard the "constant" option.

A note on the motivations:
we plan to redesign the update process as a set of trivial mutations to the graph. As far as I can see, updating a graph with blank nodes cannot be a "trivial operation"; citing
http://www.aidanhogan.com/docs/blank_nodes_jws.pdf (page 10 Issues with blank nodes):

Given a fixed, serialised RDF graph (i.e., a document), labelling of blank nodes can vary across parsers and across time. Checking if two representations originate from the same data thus often requires an isomorphism check, for which in general, no polynomial algorithms are known.

By making some assumptions about the Wikibase RDF model, I believe that generating a diff between two entity revisions should be relatively easy even if blank nodes are involved. The problem is applying this diff to the RDF backend: if it involves blank nodes it cannot be a set of trivial mutations (here trivial means using INSERT DATA/DELETE DATA statements). E.g. if the diff indicates that we need to remove:

wd:Q2 wdt:P576 _:genid1

because DELETE DATA is not possible with blank nodes we have to send something like

DELETE { ?s ?p ?o }
WHERE {
  wd:Q2 wdt:P576 ?o .
  FILTER(isBlank(?o))
  ?s ?p ?o
}

Which will delete all blank nodes attached to wd:Q2 by wdt:P576. I haven't checked, but I hope that at most one blank node can be attached to the same subject/predicate; if not, this makes the sync algorithm a bit more complex.

dcausse renamed this task from Wikibase RDF dump: stop using blank nodes for encoding unknown values and OWL constraints to Wikibase RDF dump: stop using blank nodes for encoding SomeValue and OWL constraints. Feb 17 2020, 1:29 PM
dcausse updated the task description. (Show Details)

I haven't checked, but I hope that at most one blank node can be attached to the same subject/predicate; if not, this makes the sync algorithm a bit more complex.

At least currently, this is not the case. I added a second “partner: unknown value” statement to the sandbox item, and now wd:Q4115189 wdt:P451 ?v produces two blank nodes as result.

Once we stop using blank nodes for OWL constraints, though, I believe you can at least assume that blank nodes are never the subject of a triple – would that help? (I feel like this ought to eliminate the need for a full isomorphism check from your quote.)

dcausse added a comment. Edited Feb 18 2020, 8:36 AM

I haven't checked, but I hope that at most one blank node can be attached to the same subject/predicate; if not, this makes the sync algorithm a bit more complex.

At least currently, this is not the case. I added a second “partner: unknown value” statement to the sandbox item, and now wd:Q4115189 wdt:P451 ?v produces two blank nodes as result.

Thanks for checking. This makes the diff process and the update query a bit more complex, as now we need to track the number of blank nodes attached to a particular subject/predicate. As for the update query, I believe this is still possible with:

DELETE { ?s ?p ?o }
WHERE {
  SELECT ?s ?p ?o {
 	wd:Q4115189 wdt:P451 ?o .
  	FILTER(isBlank(?o))
 	?s ?p ?o
  } LIMIT 1 # number of blank nodes to delete
}

But overall this makes updating a triple with a blank node a completely separate operation that cannot be batched with other INSERT DATA or DELETE DATA operations.

Once we stop using blank nodes for OWL constraints, though, I believe you can at least assume that blank nodes are never the subject of a triple – would that help? (I feel like this ought to eliminate the need for a full isomorphism check from your quote.)

Indeed, this and the fact that for SomeValue all blank nodes are unique: even the same statement's "SomeValue" used as wdt and ps is currently different.
From the point of view of a "simple diff operation" this is a fortunate situation, as it makes the update process simpler in the scenario where we decline this task and stick with blank nodes. If we decide to move forward with placeholder IRIs, the objects of the wdt and ps predicates of the same statement will become identical for SomeValue.
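To make the difference concrete, a sketch (statement and skolem identifiers are reused from the task description or made up; the final IRI scheme is not decided here):

# Today: two distinct blank nodes for the same SomeValue snak (labels are arbitrary).
wd:Q3 wdt:P2 _:genid1 .
wds:Q3-45abf5ca-4ebf-eb52-ca26-811152eb067c ps:P2 _:genid2 .

# After skolemization keyed on the statement: one shared IRI for both triples.
wd:Q3 wdt:P2 genid:45abf5ca-4ebf-eb52-ca26-811152eb067c .
wds:Q3-45abf5ca-4ebf-eb52-ca26-811152eb067c ps:P2 genid:45abf5ca-4ebf-eb52-ca26-811152eb067c .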

To move this forward I propose the following plan:

  1. add a wikibase:isSomeValue custom function configurable to work as a proxy to isBlank() or STRSTARTS( STR(?o), 'http://www.wikidata.org/prop/somevalue/' ) and announce it
  2. instead of changing the RDF representation generated by wikibase add a new option to the updater/munger to transform (on the fly) blank nodes as IRIs placeholders
  3. setup a test instance of the query service using this proposal and ask for feedback
  4. if no major blockers are encountered we can announce that the RDF representation is about to change
  5. start emitting deprecation warnings when seeing isBlank
  6. after a deprecation period activate placeholder IRIs everywhere
  7. change the wikibase RDF representation

Well, I’d like to see what the IRIs for unknown value in qualifiers and references look like before we move ahead with this plan.

I’m also not yet sold on the rename from “unknown value” to “some value” in this more user-facing location. @Jheald, I’m aware that the snak type is also used to encode “we know the value but can’t represent it”, but do you have a source for how common this is?

(Also, the snak type is somevalue as one word, so to me isSomevalue would make more sense than isSomeValue.)

Well, I’d like to see what the IRIs for unknown value in qualifiers and references look like before we move ahead with this plan.

Sure, I tried to add some but somehow I could not find my way in the UI; could you try to update the sandbox item so that we can have a look?

@Lucas_Werkmeister_WMDE The qualifier "stated as" (P1932) is currently used on 6.6 million statements. I couldn't get a query to complete to count how many of those statements have an object that's a blank node. My guess might be on the order of about 10,000 but that's just a number pulled out of the air, not based on anything. Could be a *lot* more, if this mechanism has been used e.g. for scientific papers with unmatched editors, publishers, etc.

(Maybe it will be easier to count under a new approach?)

The number of cases of “we know the value but can’t represent it” may soon be much bigger on Commons though, where the pattern is being used as part of an idiom for creators that don't have a Wikidata item, but are known -- including creators known only by their wiki user-names. The number of those cases -- eg self-taken pictures, self-made diagrams etc -- would probably go into the millions, once it's systematically applied.

@Lucas_Werkmeister_WMDE thanks!

Indeed, this becomes a bit more challenging as the statement identifier alone cannot be used to identify a bnode under a particular statement. I'll continue discussing this specific issue in T245541 to limit noise on this ticket.

@Jheald about blank nodes usage, in T239414 we investigated how blank nodes are currently used and extracted some numbers here: P9859 (count per predicate where a blank node is used as an object).

Sadly such counts won't be faster using this new proposed approach.

@Jheald about blank nodes usage, in T239414 we investigated how blank nodes are currently used and extracted some numbers here: P9859 (count per predicate where a blank node is used as an object).

Sorted TSV version: P10531 – the most common properties (apart from the owl:complementOf construct) are described by source (78k), publisher (58k), date of death (14k), given name (13k), and then the first qualifier use, end time (10k).

I had no luck investigating the qualifiers of those properties (assuming that some of the “unknown value” publishers, for instance, may specify the value in some qualifier, be it named as or something else) – T246238 will hopefully shed more light on this.

I've done a lot of work with GLAM data that often includes "unknown" for creator.
Getty ULAN has a whole slew of "unknowns" http://vocab.getty.edu/doc/#ULAN_Hierarchy_and_Classes (note: the counts are several years old, I imagine there are a few more thousands of those now):

  • 500355043 Unidentified Named People includes things like "the master of painting X"
  • 500125081 Unknown People by Culture includes things like "unknown Egyptian" (to be used in situations like "unknown creator, but Egyptian culture"). We've modeled those as gvp:UnknownPersonConcept and groups (schema:Organization) but users still think of them as "persons".
  • Further, there are things like "unknown but from the circle of Rembrandt" or "unknown but copy after Rembrandt" etc, about 20 varieties of them, see

https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Visual_arts/Item_structure#Attribution_Qualifiers and https://www.wikidata.org/wiki/Wikidata:Property_proposal/Attribution_Qualifier

Despite the special value "unknown", actual WD usage shows there are 62k creator|author using the item wd:Q4233718 Anonymous: https://w.wiki/JVr.

I think the two special values are unfortunate because:

  • they introduce special patterns that someone writing a query needs to cater for. Eg I couldn't remember the Novalue syntax to compare the query above to one that uses Novalue
  • they don't reflect the real-life complexity needed in some cases
  • they can't be fitted easily in faceted search interfaces or semantic search UIs: one needs special coding for these special values.

Coming from CIDOC CRM, I also used to worry about the ontological impurity of "makes two unrelated unknown values equal" and "find entities which share the same value". But in practical terms, people would like to be able to search for "anonymous" and "unknown Egyptian" and are smart enough to understand that even if "anonymous" may have the most items in a collection, that doesn't make him the most prolific creator of all times.

Cheers!

Luitzen added a subscriber: Luitzen. Mar 4 2020, 7:30 PM

In order to make it possible to update the graph without querying, you could probably adapt/tailor the com.bigdata.rdf.store.AbstractTripleStore.Options.STORE_BLANK_NODES Blazegraph option.

@Luitzen thanks for bringing this up but I haven't included this in the possible solutions because:

  • this feature does not seem to be fully integrated/finished/tested; while I was able to tell Blazegraph to store some specific bnode ids, I was never able to fully control what the id was. Sesame did seem to still generate its own id depending on the API being used (see https://jira.blazegraph.com/browse/BLZG-1915)
  • blank nodes are not allowed in DELETE/DELETE DATA SPARQL statements (even in Blazegraph with this option enabled), so I fear that low-level Blazegraph integration would have to be done to benefit from this option
  • it's blazegraph specific
Gehel moved this task from Scaling to RDF Model on the Wikidata-Query-Service board.
Gehel moved this task from Incoming to Epics on the Discovery-Search (Current work) board.

You should be aware that the functions isIRI or isLiteral (depending on the property type) and datatype can also be used, and probably are used, to test whether a value is a somevalue or a real value.

isLiteral should still work, right? Blank nodes aren’t literals, the replacement IRIs won’t be literals either, no change.

isIRI and datatype is a good point, though – such queries will have to be updated.

Yes, isLiteral should still work for properties where the real values are literals. Without knowing the internal workings of Blazegraph I would guess that it is more efficient than STRSTARTS( STR(?o), 'http://www.wikidata.org/prop/somevalue/' ) . Maybe that could be used in some way?

Yes, isLiteral should still work for properties where the real values are literals. Without knowing the internal workings of Blazegraph I would guess that it is more efficient than STRSTARTS( STR(?o), 'http://www.wikidata.org/prop/somevalue/' ) . Maybe that could be used in some way?

What we will implement internally for the isSomeValue function won't do exactly STRSTARTS( STR(?o), 'http://www.wikidata.org/prop/somevalue/' ) but will use Blazegraph vocabulary and inlining facilities; not sure if this answers your question though.

What we will implement internally for the isSomeValue function won't do exactly STRSTARTS( STR(?o), 'http://www.wikidata.org/prop/somevalue/' ) but will use Blazegraph vocabulary and inlining facilities; not sure if this answers your question though.

Yes, thank you. I was wondering if it is better (faster) to use isLiteral than wikibase:isSomeValue where possible.

BTW, isNumeric can also be used to test whether a value is numeric or a blank node, and lang can be used to test whether a value is a monolingual text or a blank node. These should also still work.
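A small illustration of the point, assuming a date-valued property like P570 where real values are always literals:

PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# SomeValue placeholders are not literals, whether they are blank nodes or
# skolem IRIs, so negating isLiteral() keeps working as a SomeValue test here.
SELECT ?human WHERE {
  ?human wdt:P570 ?died .
  FILTER(!isLiteral(?died))
}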

Mmarx added a subscriber: Mmarx. Apr 16 2020, 6:31 PM

Many queries use the optimizer hint hint:Prior hint:rangeSafe true. when e.g. comparing date or number values with constants in a filter, as suggested at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization#Fixed_values_and_ranges. Is there a risk that such queries will fail or give wrong results when SomeValue becomes an IRI, and thus the values will be of different types?

Many queries use the optimizer hint hint:Prior hint:rangeSafe true. when e.g. comparing date or number values with constants in a filter, as suggested at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization#Fixed_values_and_ranges. Is there a risk that such queries will fail or give wrong results when SomeValue becomes an IRI, and thus the values will be of different types?

I cannot tell for sure; anything that involves query optimization via hints is by nature extremely fragile. But I believe that these kinds of queries will remain as dangerous as they were before the switch.

Multichill added a subscriber: Multichill.

This needs community consensus before moving forward.

dcausse added a comment. Edited Apr 30 2020, 12:39 PM

@Multichill the discussion seems to have stalled. Thanks to Peter, the pros and cons have been well summarized now. I also understand that part of the misunderstanding of this change was the lack of clarity on the motivations as to why we require a breaking change like this. I hope it has been addressed in the linked discussion.
Do you have additional comments to make here? Thanks!

Pfps added a comment. Apr 30 2020, 2:40 PM

I don't understand why it was considered necessary to make a breaking change to the RDF dump to improve WDQS performance when there is a solution that does not make a breaking change to the dump.

dcausse renamed this task from Wikibase RDF dump: stop using blank nodes for encoding SomeValue and OWL constraints to Stop using blank nodes for encoding SomeValue and OWL constraints in WDQS. Apr 30 2020, 5:12 PM
dcausse updated the task description. (Show Details)

I don't understand why it was considered necessary to make a breaking change to the RDF dump to improve WDQS performance when there is a solution that does not make a breaking change to the dump.

It was not considered "necessary"; it was considered "preferable" in discussions we had while drafting this 4-step plan. The sole reason was to limit the divergences between WDQS results and the dumps (current divergences are listed here).
I'm perfectly fine dropping this step (assuming others agree) if it causes too much annoyance, and only deprecating blank nodes at the WDQS level (I've updated the description of this ticket to reflect the discussions we had on the wiki page).

Pfps added a comment. Apr 30 2020, 6:29 PM

My view is that fewer breaking changes are to be preferred, and breaking changes in fewer "products" is to be even more preferred. So, again, I wonder why there is a breaking change proposed for the RDF dump instead of no breaking changes or limiting breaking changes to the WDQS only.

Gehel added a subscriber: Gehel. May 6 2020, 9:58 AM

My view is that fewer breaking changes are to be preferred, and breaking changes in fewer "products" is to be even more preferred. So, again, I wonder why there is a breaking change proposed for the RDF dump instead of no breaking changes or limiting breaking changes to the WDQS only.

While backward compatibility is important, it isn't the only consideration. As pointed out by @dcausse on wiki, having divergence between Wikidata and WDQS is also problematic. We already have a small number of documented divergences, and increasing this number is also problematic. Given the current discussion, it seems that keeping as much backward compatibility as possible at the cost of divergence between Wikidata and WDQS is the way to go.

Given the current discussion, it seems that keeping as much backward compatibility as possible at the cost of divergence between Wikidata and WDQS is the way to go.

I strongly disagree – we should apply the change to Wikibase as well. Increasing long-term divergence between the query service and the RDF output or dumps will make working with both of them harder.

Pfps added a comment. May 6 2020, 3:43 PM

If divergence between Wikidata and WDQS is bad, then this proposed change has another bad feature as it turns the some value snaks into something that is less like an existential. And this proposed change is for both the RDF dump and the WDQS.
And then there is the problem of the proposed change requiring changes to SPARQL queries - not just a change, but a change from how SPARQL queries are written in just about any other context.

And then there is the problem of the proposed change requiring changes to SPARQL queries - not just a change, but a change from how SPARQL queries are written in just about any other context.

In what other context do you write SPARQL queries about Wikibase SomeValue snaks?

Pfps added a comment. May 6 2020, 3:49 PM

Is anyone proposing a change to Wikibase (or Wikidata)?

I would view the proposed change as having the negative outcome that the RDF dump moves further from Wikidata. There are people (myself included) who use the RDF dump without using the WDQS (much).

Is anyone proposing a change to Wikibase (or Wikidata)?

Yes – the goal is that the RDF in the query service, the RDF dumps, and the output of Special:EntityData all change. (Special:EntityData isn’t explicitly mentioned in the task description, but I assume it should change together with the dumps.) Not all at the same time, but in the end they should be consistent again, at least with regards to their handling of SomeValue snaks and OWL constraints (notwithstanding other differences).

I would view the proposed change as having the negative outcome that the RDF dump moves further from Wikidata.

Can you clarify what you mean here by “Wikidata”?

Pfps added a comment. May 6 2020, 3:56 PM

The difference is not with other SPARQL queries in the WDQS but against SPARQL queries in general (including SPARQL queries that use Wikidata URLs).

Of course, there are already are a few important differences between WDQS queries and SPARQL queries against most other RDF KBs.

Pfps added a comment. May 6 2020, 4:32 PM

I would view the proposed change as having the negative outcome that the RDF dump moves further from Wikidata.

Can you clarify what you mean here by “Wikidata”?

From https://www.wikidata.org/wiki/Wikidata:Main_Page: the free knowledge base with 84,918,558 data items that anyone can edit.
So I don't count the RDF dump or WDQS, but I do count Wikibase and its data model.

@Multichill the discussion seems to have stalled. Thanks to Peter, the pros and cons have been well summarized now. I also understand that part of the misunderstanding of this change was the lack of clarity on the motivations as to why we require a breaking change like this. I hope it has been addressed in the linked discussion.
Do you have additional comments to make here? Thanks!

See the recent comments. You need to get community consensus before doing any (major) changes.

Is anyone proposing a change to Wikibase (or Wikidata)?

Yes – the goal is that the RDF in the query service, the RDF dumps, and the output of Special:EntityData all change.

Absolutely, in this context a change of the RDF dump implies a change on wikibase output for Special:EntityData and RDF formats.

If divergence between Wikidata and WDQS is bad, then this proposed change has another bad feature as it turns the some value snaks into something that is less like an existential. And this proposed change is for both the RDF dump and the WDQS.

Quoting RDF 1.1 Concepts and Abstract Syntax - 3.5 Replacing Blank Nodes with IRIs:

This transformation does not appreciably change the meaning of an RDF graph, provided that the Skolem IRIs do not occur anywhere else. It does however permit the possibility of other graphs subsequently using the Skolem IRIs, which is not possible for blank nodes.

One could also argue that this change may lead to a positive outcome as it allows the possibility to use these skolem IRIs.

Additionally "unskolemizing" these IRIs is a trivial step that could be added to any import process reading the wikibase RDF output and willing to switch back to blank nodes.
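A hedged sketch of what such an "unskolemizing" step could look like as a single SPARQL update (the genid prefix follows the example in the description; engines differ in how they handle blank nodes minted during an update, so a plain textual rewrite of the dump is an equally simple alternative):

# Replace every skolem IRI object with a blank node; within a single update,
# BNODE(str) returns the same blank node for the same string, so a skolem IRI
# shared between wdt: and ps: triples stays shared after the rewrite.
DELETE { ?s ?p ?skolem }
INSERT { ?s ?p ?bnode }
WHERE {
  ?s ?p ?skolem .
  FILTER(isIRI(?skolem) && STRSTARTS(STR(?skolem), "http://www.wikidata.org/.well-known/genid/"))
  BIND(BNODE(STR(?skolem)) AS ?bnode)
}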

Pfps added a comment. May 11 2020, 1:02 PM

If 'unskolemizing' is a trivial step then that should be implemented by WDQS, instead of pushing it to every consumer (including indirect consumers) of Wikidata information, if this change is simply a change to make WDQS work faster.

If, on the other hand, there are other reasons to make a breaking change to the Wikidata RDF dump then there should be a proposal to make such changes independent of making WDQS faster.

If 'unskolemizing' is a trivial step then that should be implemented by WDQS, instead of pushing it to every consumer (including indirect consumers) of Wikidata information, if this change is simply a change to make WDQS work faster.

WDQS needs to "skolemize", not "unskolemize", but in the end this is the same discussion as pondering whether or not we want WDQS to be close to the Wikibase RDF output by moving the "skolemization" before the RDF output is generated.

Yes, this change is only to make the update process faster by removing the complexity and cost induced by tracking blank nodes. Since all edits to Wikidata now depend directly on the efficiency of this process, we believe that it is worth this breaking change.

Pfps added a comment. May 11 2020, 2:06 PM

I was completely unaware that WDQS is so integrated into the inner workings of Wikidata. Where is this described? Was this mentioned in the announcement of the proposed change?

In any case there appears to be a reasonable path forward that makes fewer breaking changes.

I was completely unaware that WDQS is so integrated into the inner workings of Wikidata. Where is this described? Was this mentioned in the announcement of the proposed change?

Details on the motivations and the context around this change were clearly lacking in the initial announcement; this is something we will be careful about the next time we communicate about this.
The way WDQS integrates with Wikidata editing/tooling workflows is a bit out of our control, and I'm not sure that comprehensive and exhaustive documentation about it exists (far from ideal, but searching for WDQS lag in the Wikidata namespace might give some sense of the problems it can cause for contributors).

Pfps added a comment. May 12 2020, 1:48 PM

Based on a quick look at various Phabricator tickets and other information it appears to me that the only connection between the WDQS and Wikidata edit throttling is that a slowness parameter for the WDQS is used to modify a Wikidata parameter that is supposed to be checked by bots before they make edits. Further, it appears that the only reason for this connection is to slow down Wikidata edits so that the WDQS can keep up - the WDQS does not feed back into Wikidata edits, even edits by bots. So this connection could be severed by a trivial change to Wikidata and the only effect would be that the WDQS KB might lag behind Wikidata, either temporarily or permanently, and queries to the WDQS might become slow or even impossible without improvements to the WDQS infrastructure. I thus view it misleading to state in this Phabricator ticket that "performance issues [of the WDQS] cause edits on wikidata to be throttled", which gives the impression that the WDQS forms a part of the Wikidata editing process or some other essential part of Wikidata itself.

There needs to be a very strong rationale to make breaking changes to the Wikidata RDFS dump. Just improving the performance of the WDQS is not enough for me.

I thus view it misleading to state in this Phabricator ticket that "performance issues [of the WDQS] cause edits on wikidata to be throttled", which gives the impression that the WDQS forms a part of the Wikidata editing process or some other essential part of Wikidata itself.

Including WDQS lag in the Wikibase maxlag has been done for reasons; challenging those reasons is out of scope for this ticket and questions should be asked on T221774. De facto this makes WDQS an essential part of Wikidata itself, and it is one of the reasons our team's work has been prioritized toward redesigning the update process between WDQS and Wikidata.

TomT0m added a subscriber: TomT0m. Jun 17 2020, 8:12 AM
JMinor added a subscriber: JMinor. Jul 16 2020, 11:00 PM
Gehel triaged this task as High priority. Sep 15 2020, 8:01 AM