RDF mapping should not assert that .../entity/Q123 is-a Wikidata item
Closed, Invalid (Public)

Description

Currently, the output of Special:EntityData/Q23.ttl claims that George Washington is a Wikibase Item <<entity:Q23 a wikibase:Item>>, which is clearly false. Our description of George Washington is a Wikidata Item <<data:Q23 a wikibase:Item>>.

We should also consider (optionally) omitting these is-a statements, since they are extremely redundant.

Side note: this issue is reflected by the RDFS spec at the lowest level, which claims that everything is an rdfs:Resource:

2.1 rdfs:Resource

All things described by RDF are called resources, and are instances of the class rdfs:Resource.
This is the class of everything. All other classes are subclasses of this class. rdfs:Resource is an
instance of rdfs:Class.

http://www.w3.org/TR/rdf-schema/#ch_resource

There does not seem to be a distinction between the thing and the description of the thing built into the RDFS spec. Can we go back 15 years and fix this?...

Event Timeline

daniel raised the priority of this task from to Needs Triage.
daniel updated the task description.
daniel added a project: Wikidata.
daniel subscribed.
daniel set Security to None.
Lydia_Pintscher added a subscriber: mkroetzsch.

So, what should the dump look like? Nothing is a wikibase:Item, or something else is a wikibase:Item? If we think that entity:Q23 is the thing, should we have entity:Q23 a entity:Q5? Should we drop data:Q23 a schema:Dataset?

The RDF should certainly contain information about the entity type of exported data. This is essential to ensure that the RDF data contains all the information that is found in the JSON (other than the ordering). As I read it, things that are of rdf:type Item are things that are described by an item on Wikidata. If this is not obvious to anybody who uses the data (maybe somebody really thinks that Washington himself is an item?!), we can always emphasize this in the documentation of the Item class. I therefore suggest closing this issue as invalid. It's just a matter of how we document our ontology. In particular, it should not be assumed that any triple in RDF has a self-evident ground truth associated with it that one can grasp just by reading the URIs (or their labels), though I think confusion is very unlikely here since we do not export any RDF data about item documents.

The comment on rdfs:Resource seems to be a misinterpretation of the spec. In RDF, we certainly distinguish between a thing and its description, it is just that the description itself is yet another thing (in short: everything is a thing). I don't think that this has any bearing on how we want to encode the entity type in our RDF.

In general, I would suggest to stick to the RDF encoding that Denny and I have worked out and published, as it is used in the existing dumps. We can always discuss changes if really needed, but we should not start to re-discuss things that are already done. What is needed now is implementation, not design.

Thanks for your input, Markus!

things that are of rdf:type Item are things that are described by an item on Wikidata

So, what type would the description have? schema:Dataset seems a bit broad...

I think confusion is very unlikely here since we do not export any RDF data about item documents.

We do expose some limited information about item documents:

data:Q23
    schema:version 197346379 ;
    schema:dateModified "2015-02-17T14:27:33Z"^^xsd:dateTime ;
    a schema:Dataset ;
    schema:about entity:Q23 ;
    cc:license <http://creativecommons.org/publicdomain/zero/1.0/> .

I think it would be perfectly sensible to say that wikibase:Item is a subclass of schema:Dataset, and have "a wikibase:Item" on data:Q23 instead of entity:Q23. The latter seems pointless to me.
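To make this alternative concrete, here is a minimal sketch in Python that uses plain tuples instead of a real RDF library. The rdfs:subClassOf triple for wikibase:Item is the proposal under discussion, not part of the current mapping, and `types_of` is a hypothetical helper written only for this illustration.

```python
# Triples as (subject, predicate, object) tuples; prefixed names are plain strings.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

proposed = {
    # Assumption: this subclass triple is the proposal, not the current mapping.
    ("wikibase:Item", SUBCLASS_OF, "schema:Dataset"),
    # The type assertion sits on the document URI, not on the thing.
    ("data:Q23", RDF_TYPE, "wikibase:Item"),
    ("data:Q23", "schema:about", "entity:Q23"),
}


def types_of(node, triples):
    """Return the asserted types of node plus those implied by rdfs:subClassOf."""
    found = {o for s, p, o in triples if s == node and p == RDF_TYPE}
    frontier = set(found)
    while frontier:
        cls = frontier.pop()
        for s, p, o in triples:
            if s == cls and p == SUBCLASS_OF and o not in found:
                found.add(o)
                frontier.add(o)
    return found


document_types = types_of("data:Q23", proposed)
thing_types = types_of("entity:Q23", proposed)
```

Under this reading, data:Q23 comes out as both a wikibase:Item and a schema:Dataset, while entity:Q23 carries no type assertion at all.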

That said, I agree that we should not wantonly change our mapping. I do think however that we should re-consider this bit of the RDF mapping. It seems pointless at best, and potentially harmful in the case of items. In the case of properties, it makes more sense.

In RDF, we certainly distinguish between a thing and its description

Yes, of course you can. It just seems that the RDFS spec does not help with that distinction, nor is it careful to point it out. The statement that rdfs:Resource is the base class for everything, and then not providing a base class for descriptions, is an invitation to mix the identity of the description with the identity of the thing.

Our primary goal is to encode the JSON information in RDF, and possibly to enrich this information where it makes sense in an RDF-context (e.g., by adding links to other datasets). The JSON data includes the entity type, so it is clear that we want to encode it in RDF in some way. As I said, my understanding is that Q42 *is* an item for a suitable sense of "item", just as P31 is a "property" in this sense. In neither case are we referring to the HTML page or any other electronic document. The confusion arises from your preconception of the item class referring to a document or "description", which in turn is understandable given our lack of up-to-date documentation for this vocabulary.

We can just invent new vocabulary as needed for exporting data about the HTML item pages, e.g., by introducing an ItemDocument class to refer to the documents. I think including data in RDF that we do not even have in our JSON exports is secondary for now. In fact, creating RDF exports for data that is collected by MediaWiki for every page seems a much bigger task that is hardly addressed in a satisfactory way by the snippet pasted above. Something like the SIOC vocabulary should probably be used there, and a suitable linked-data interface for accessing all revisions would be needed. I am not in favour of creating a makeshift solution now that mixes data that is special to Wikibase with data that should be there for all MW installations. MW should have its own linked-data export for page metadata, and Wikidata should merely link to the relevant URIs from its data exports.

In neither case are we referring to the HTML page or any other electronic document.

That's our basic disagreement. We actually do have separate URIs for the document/description and the thing as such. We also make separate statements about these, in RDF. It would be a great mistake to mix them up, and wikibase's current RDF output keeps them nicely separate, except for the issue described in this ticket.

EDIT: Actually, I think I misunderstood you, Markus. Will think about it more tomorrow.

Of course, we can just say that George Washington actually is a wikidata:Item in "the real world". Then the RDF would be correct. But then we'd still not be talking about the description. In particular, statements apply to the real world thing, not the description. We could make RDF that says the description has or makes specific statements. That would be the "fully reified" interpretation, that makes no claim about the world at all...

Actually - this kind of ties in with what we are currently working on for the query engine. We plan to use a "truthy" projection, representing a world where all "trusted" statements are considered "true". There we always make claims about the real world thing, not the document.

I think both models and interpretations have merit. We need to make sure we don't mix the vocabularies inappropriately. In particular, we'd need to use separate types for the descriptions and the thing-as-such.

Thinking about it, I believe the fully reified version should use the /wiki/Special:EntityData-URIs. We defined the /entity/ URIs to refer to the thing as such.

/entity/Q23 rdf:type wikibase:Item .

wikibase:Item is the set of all things that have a QID. I.e. the real Adam Weish... George Washington is identified by /entity/Q23 and he is a Wikibase-Item.

I would also suggest to close this as WAI or invalid.

Denny: fine. But then we are not talking about the description, but the thing when using /entity/Q23. So if we want to say the description has a statement, we need to use /wiki/Special:EntityData/Q23.

No, /entity/Q23 is not the description, indeed, it is the item.

Refer also to http://korrekt.org/papers/Wikidata-RDF-export-2014.pdf (as per IRC chat).

daniel claimed this task.

The definitive resource is of course Markus' and Denny's paper http://korrekt.org/papers/Wikidata-RDF-export-2014.pdf

I'll have to re-read it, but I do think more discussion, or at least more documentation, is needed here. But I'll close the ticket as invalid, because it's invalid as initially filed.

Thanks for adding Denny. Long reply, but details matter here.

I agree that there are different things one could talk about (document, real thing). However, for now I am mainly interested in talking about the latter, since this should be our primary concern in Wikibase (the document is a thing MediaWiki has to care about).

Now you argue that certain triples should not be given for the real thing (whether related to the reified statements or to the item type). However, your arguments do not have an objective foundation: the only reason why you do not want to have certain triples is that you interpret them to mean something that would not be true, whereas I am interpreting the very same triples in a way that would be correct. In other words, we have a dispute about the meaning of triples.

Interestingly, the triples we discuss primarily refer to vocabulary that we created for the very purpose of being used in these triples. How could it mean the wrong thing? Only if we define it to mean the wrong thing. Therefore, let's just define those triples to mean the right thing and we are all set. There is no technical discussion to be had here; it's all about desired or undesired interpretations.

If you look at the RDF structures that we get if we want to represent all of our data (without even including its order), then it should be clear that these structures simply do not have any self-evident "natural" interpretation. We need to tell people what they mean. Let's just tell them what we think they should mean. The only thing to keep in mind (and this is also what you are saying) is that we cannot use the same URI to mean different things in different contexts, so we need different URIs for referring to the class of real-world items and for referring to the class of item documents. I do not see a problem with this.
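As a rough illustration of keeping the two classes apart, here is a small Python sketch using plain tuples. The class name wikibase:ItemDocument is hypothetical (it follows the ItemDocument idea mentioned earlier in this thread and is not existing vocabulary), and `nodes_typed_as_both` is a helper invented for this example.

```python
RDF_TYPE = "rdf:type"
THING_CLASS = "wikibase:Item"
DOCUMENT_CLASS = "wikibase:ItemDocument"  # hypothetical class URI, for illustration only

graph = {
    ("entity:Q23", RDF_TYPE, THING_CLASS),
    ("data:Q23", RDF_TYPE, DOCUMENT_CLASS),
    ("data:Q23", "schema:about", "entity:Q23"),
}


def nodes_typed_as_both(triples):
    """Return nodes that are typed as both a real-world item and an item document."""
    things = {s for s, p, o in triples if p == RDF_TYPE and o == THING_CLASS}
    documents = {s for s, p, o in triples if p == RDF_TYPE and o == DOCUMENT_CLASS}
    return things & documents


clean = nodes_typed_as_both(graph)
# Adding a document type to the thing URI is exactly the mix-up to avoid.
mixed = nodes_typed_as_both(graph | {("entity:Q23", RDF_TYPE, DOCUMENT_CLASS)})
```

A check like this only catches accidental mixing in the data; what each class means still has to be stated in the ontology documentation.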

An RDF document represents a graph. It is a purely abstract, mathematical model. Nothing is said there about the real world or about documents or about truth. It's all in our heads. The reason why we are so careful to distinguish documents from real things etc. is that we want to make sure that the data as a whole (taking all RDF from one site together) still makes sense. Yet, we are free to define this sense. We could define a property that means "is described on a Wikipedia page that was once edited by someone who was born in". Would this say something about the real George Washington? Sure.

For the same reason, please do not confuse reification in RDF (which represents a triple without stating that the triple is true) with reification in our export (which simply uses an auxiliary resource to make a statement). Using auxiliary nodes in RDF data is a common technique that does not have any impact on whether you are saying something about the real world or whether you are saying that a document made a certain claim. In particular, it is not any stronger or more direct to use one triple to represent a statement than to use a group of triples around an auxiliary node. You always need to document in your ontology what your RDF structures express.
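The difference can be sketched in a few lines of Python with plain tuples. Only the rdf: terms (rdf:Statement, rdf:subject, rdf:predicate, rdf:object) are standard vocabulary; every ex: name and the helper `values_via_aux` are hypothetical and do not reflect the actual property URIs of the export.

```python
# Standard RDF reification: describes a triple without asserting it.
rdf_reified = {
    ("ex:r1", "rdf:type", "rdf:Statement"),
    ("ex:r1", "rdf:subject", "entity:Q23"),
    ("ex:r1", "rdf:predicate", "ex:prop"),
    ("ex:r1", "rdf:object", "ex:value"),
}

# Auxiliary-node pattern: the entity links to a statement node that carries
# the value (and could also carry qualifiers or references).
aux_node = {
    ("entity:Q23", "ex:propStatement", "ex:s1"),
    ("ex:s1", "ex:propValue", "ex:value"),
}


def values_via_aux(graph, subject, statement_prop, value_prop):
    """Follow subject -> statement node -> value and return the values found."""
    nodes = {o for s, p, o in graph if s == subject and p == statement_prop}
    return {o for s, p, o in graph if s in nodes and p == value_prop}


# In the reified graph the entity only appears as the object of rdf:subject.
entity_is_subject_in_reified = any(s == "entity:Q23" for s, _, _ in rdf_reified)
# In the auxiliary-node graph the entity is the subject of an asserted triple.
entity_is_subject_in_aux = any(s == "entity:Q23" for s, _, _ in aux_node)
aux_values = values_via_aux(aux_node, "entity:Q23", "ex:propStatement", "ex:propValue")
```

In the first graph nothing is said with entity:Q23 as subject, whereas in the second the entity itself is connected to its value through the statement node. What that connection means is still a matter of documentation.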

Whether we need to use different subject URIs for simplified and for reified exports I don't know. Maybe it also depends on the exact way in which the simplified export is created. Already in our RDF exports, we are using different property URIs in both cases, so even in the union of the datasets there would never be any doubt as to which triple belongs to which view on the data. Moreover, many triples are the same in both views (e.g., labels). Therefore I am inclined to think that there is no need for different URIs there. (I don't see the connection to your truthy projection if you just use it for answering queries, unless of course you are returning query results in RDF so that these results would turn into another kind of RDF export that needs to be consistent with those we have now.)
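A minimal Python sketch of why the union stays unambiguous when the two views use different property URIs. The prefixes simple:, stmt: and value:, the property id P1, and ex:value are all placeholders chosen for this example, not the prefixes used in the actual exports.

```python
SIMPLE_PREFIX = "simple:"                # hypothetical prefix for the simplified view
REIFIED_PREFIXES = ("stmt:", "value:")   # hypothetical prefixes for the reified view

simple_view = {
    ("entity:Q23", "rdfs:label", "George Washington"),
    ("entity:Q23", "simple:P1", "ex:value"),
}
reified_view = {
    ("entity:Q23", "rdfs:label", "George Washington"),
    ("entity:Q23", "stmt:P1", "ex:s1"),
    ("ex:s1", "value:P1", "ex:value"),
}

# Same subject URI in both views; the label triple is shared and deduplicated.
union = simple_view | reified_view


def view_of(triple):
    """Classify a triple by the prefix of its property."""
    predicate = triple[1]
    if predicate.startswith(SIMPLE_PREFIX):
        return "simple"
    if predicate.startswith(REIFIED_PREFIXES):
        return "reified"
    return "shared"


by_view = {}
for triple in union:
    by_view.setdefault(view_of(triple), set()).add(triple)
```

Every triple in the union can be assigned to a view from its property alone, even though both views use entity:Q23 as the subject.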

Now my reply was so long that the ticket has already been closed in the meantime :-D Anyway, those are my two (or more) cents on this topic ;-) I don't think the paper goes into these topics very much (as they are not so much technical as philosophical).

unless of course you are returning query results in RDF so that these results would turn into another kind of RDF export that needs to be consistent with those we have now

We are, indeed, playing with the idea of a SPARQL endpoint now...

We are, indeed, playing with the idea of a SPARQL endpoint now...

Interesting. Virtuoso is your best bet in terms of performance. It should be able to handle the data volume without too many problems. Not sure about the query volume. It has an open source version but there is little chance that WMF could maintain this source if necessary.

Nik tells me that the HA features in Virtuoso are only available in the closed source enterprise version. That basically means WMF is not going to use it in production.

Nik tells me that the HA features in Virtuoso are only available in the closed source enterprise version. That basically means WMF is not going to use it in production.

Yes, I guessed that this would cause issues. I don't know which other tool could deliver the performance you need though. 4Store is free, too, but may not be active enough since 5Store became the main (closed) product; it may also lack some features you need. Beyond this, the only free options (beyond research prototypes) are Jena and Sesame (OpenRDF). I think they won't scale to what we need.

For reference:

What are the differences between the open source and the closed source version of Virtuoso?
The main differences are the following which are closed source only:
* Clustering & High Availability
* Virtual Database
* Replication
* ACL control over large numbers of graphs

High availability and replication are trouble for us. Just having an enterprise edition causes us to lower the score in general. Even though we don't need any of the CUDA stuff that Systap has for BlazeGraph it still lowers their score from the foundation's perspective. The trouble is that we anticipate conflict of interest issues if. We can always fork but that puts us in an undesirable position of being (probably) the only user of our fork.

I've still reached out to its authors to see what they say. If they open source their HA and replication code, then we'd seriously consider them.