
Document statement URI format for RDF
Open, Medium, Public

Description

As a user of the Wikidata Query Service, I want to be able to write reliable queries related to particular statements, which I know by their Wikibase statement ID.

Problem:
Currently, the RDF format documentation does not document the format of statement URIs after the wds: prefix:

There is no guaranteed format or meaning to the statement id.

In practice, however, it’s not difficult to see how the statement URI is derived from the statement ID: they are identical, except that the $ in the Wikibase representation is replaced by a - for RDF. We use this, for example, in Wikibase-Quality-Constraints ([SparqlHelper.php](https://gerrit.wikimedia.org/g/mediawiki/extensions/WikibaseQualityConstraints/+/2441a9059caf9e1cacd739df38679b5369ef5fb2/src/ConstraintCheck/Helper/SparqlHelper.php#240)).

This is not what the RDF export does, though: it actually uses preg_replace( '/[^\w-]/', '-', $statement->getGuid() ), which means that characters other than $ may also be replaced by hyphens if they ever occur in the future. In that case, tools that only inferred the "replace $ with -" rule from looking at the data might break.
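To make the difference concrete, here is a minimal Python sketch of both rules. It is a rendering of the PHP expression quoted above, not the Wikibase code itself; the function names and the sample IDs are made up for illustration, and re.ASCII is an assumption about how \w behaves in the PHP call.

```python
import re

# Python rendering of preg_replace( '/[^\w-]/', '-', ... ).
# Assumption: \w is limited to [a-zA-Z0-9_], hence re.ASCII.
_UNSAFE = re.compile(r"[^\w-]", re.ASCII)


def statement_id_to_rdf_local_name(statement_id: str) -> str:
    """Replace every character that is neither a word character nor '-' with '-'."""
    return _UNSAFE.sub("-", statement_id)


def naive_dollar_rule(statement_id: str) -> str:
    """The rule a tool might infer from the data: only '$' becomes '-'."""
    return statement_id.replace("$", "-")


# Hypothetical ID in the usual "<entity ID>$<GUID>" shape.
typical = "Q42$F078E5B3-F9A8-480E-B7AC-D97778CBBEF9"
# Made-up ID containing a character that does not occur in IDs today.
unusual = "Q42$abc.def"

print(statement_id_to_rdf_local_name(typical))  # same result as naive_dollar_rule
print(statement_id_to_rdf_local_name(unusual))  # Q42-abc-def
print(naive_dollar_rule(unusual))               # Q42-abc.def
```

For IDs of today's shape the two rules agree; they only diverge once some other non-word character shows up.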

Example:
As mentioned above, WBQC is one case that would benefit from documenting this relationship and making it part of the Stable Interface Policy. @ArthurPSmith also requested it on project chat.

Screenshots/mockups:
In RDF Dump Format#Full statements, change the current

There is no guaranteed format or meaning to the statement id.

to e. g.

The statement ID is the Wikibase statement ID, with all characters other than PCRE “word” characters and hyphens replaced by hyphens. In PHP, this is expressed as preg_replace( '/[^\w-]/', '-', $statementID )

Acceptance criteria:

  • the format of statement IDs in RDF is documented and part of the stable interface

Open questions:

  • I would actually prefer to slightly tweak the regex before we fix it in stone – \w is not really well-defined (the PHP documentation doesn’t clearly specify what a “word” character is, and apparently it can be locale-dependent?), so I’d make it more explicit with something like [^a-zA-Z0-9_\-] (only ASCII letters and digits, underscore, and hyphen).
  • Should we announce this once it’s done?
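As an illustration of the first open question, the sketch below compares \w with the explicit character class in Python. This shows Python's behaviour, where \w is Unicode-aware by default, and says nothing definite about PHP; it is only meant to show that the meaning of \w depends on the engine and its flags, while the explicit class does not. The sample ID is made up.

```python
import re

loose = re.compile(r"[^\w-]")               # \w depends on engine and flags
explicit = re.compile(r"[^a-zA-Z0-9_\-]")   # same meaning in every mode

sample = "Q1$\u00e9"  # made-up ID ending in a non-ASCII letter (e with acute)

loose_result = loose.sub("-", sample)        # the accented letter is kept
explicit_result = explicit.sub("-", sample)  # the accented letter becomes '-'

print(loose_result)
print(explicit_result)  # Q1--
```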

Event Timeline

Fine from my side.
About announcing: Something short can't hurt.

Thanks for creating this ticket! Actually, my use case is the opposite of Lucas's - I want to be able to go from the results of a WDQS query to fetch the full statement via the API, which requires the statement ID. So I would like to see the id conversion documented in BOTH directions - and in particular the arbitrary regex replace listed above (preg_replace( '/[^\w-]/', '-', $statementID )) would NOT work for that purpose. Rather can we just settle that the first $ or - is switched, and that's it? Or is there something else that's an issue here?
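A possible sketch of the reverse direction in Python, under the assumption (not guaranteed anywhere) that the part after the separator is always a canonical 8-4-4-4-12 hexadecimal GUID. The function name is hypothetical. Anchoring on the GUID instead of "the first -" matters for entity IDs that themselves contain a hyphen, such as the Form-style ID L1-F1 used below.

```python
import re

# Assumption: "<entity ID>-<GUID>", where the GUID is 8-4-4-4-12 hex digits.
_RDF_LOCAL_NAME = re.compile(
    r"(?P<entity>.+)-"
    r"(?P<guid>[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12})"
)


def rdf_local_name_to_statement_id(local_name: str) -> str:
    """Turn the part after the wds: prefix back into '<entity ID>$<GUID>'."""
    match = _RDF_LOCAL_NAME.fullmatch(local_name)
    if match is None:
        raise ValueError(f"not in the assumed '<entity ID>-<GUID>' shape: {local_name!r}")
    return f"{match['entity']}${match['guid']}"


# Hypothetical local names, as they might come back from a query.
item_statement = "Q42-F078E5B3-F9A8-480E-B7AC-D97778CBBEF9"
form_statement = "L1-F1-F078E5B3-F9A8-480E-B7AC-D97778CBBEF9"

print(rdf_local_name_to_statement_id(item_statement))
print(rdf_local_name_to_statement_id(form_statement))
# Switching only the first hyphen would give the wrong result for the Form:
print(form_statement.replace("-", "$", 1))
```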

Another thought - even better would be if the API could be adjusted so it accepts the WDQS statement ID format as it is (all -'s).

So I would like to see the id conversion documented in BOTH directions

This is a bit trickier. The reason is that RDF has particular requirements for URIs, which makes it much easier if the URI does not contain "bad" characters - among which is '$'. However, RDF export does not impose a specific format onto Wikibase core - and indeed, this would be wrong, export format should not control internal format. So there could be two strategies here:

  1. Ensure that the internal ID format is such that no export format would have issues with it. Purely alphanumeric IDs, or alphanumeric with dashes, have a good chance of success here, but there is no guarantee that no export format will ever have problems with dashes, for example.
  2. Have each export format transform the internal ID into what is safe for its particular case, recognizing that this transformation may not be one-to-one across the whole spectrum of possible IDs.
  3. Have each export format transform the internal ID as above, but promise it will always be one-to-one, no matter what happens with internal IDs.

(2) is what we have now, (3) is what you are proposing. The problem is that for (3) to work, core code would always have to be aware of all export formats and all transformations happening there; moreover, both the core internal ID format and each export transformation become part of the API. With the dependencies this creates, it is no less complex than (1), just done in a more convoluted way.

It is possible, of course, to say that the current situation is what is happening - after all, that's the truth and nothing but the truth. However, saying that this will always be the situation creates some dependencies inside the code that do not look good from the design perspective and would be rather hard to maintain in reality.

promise it will always be one-to-one, no matter what happens with internal IDs

Hmm - if it's NOT one-to-one, will that not break RDF? That is, if it's possible for 2 different statements to have the same ID, then you would have conflicting triples associated with the same URI. That's not good at all!

So I think a one-to-one mapping to the RDF format is important if we at all consider RDF support to be a fundamental piece of what wikidata provides...

Well, yes and no. For example, if we used sha1(statement ID) as the RDF statement URI, it would not technically be one-to-one, but in reality we would probably have no clashes, and thus it would be OK for RDF. In the same vein, something like preg_replace( '/[^\w-]/', '-', $statementID ) is technically not a 1-1 function, but in our circumstances it has no collisions. We could promise that it remains so forever, and even provide a reversing function (which would obviously be impossible if we used sha1), but then we'd have to be very careful if we ever change anything in statement ID generation in core. That's what I am talking about - having this documented as a feature has its costs.
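A small Python sketch of why the replacement is not one-to-one in general. The helper name and both IDs are made up; the second ID is not in a shape that is generated today, which is exactly why no collision is observed in practice. re.ASCII is again an assumption about \w.

```python
import re


def to_rdf(statement_id: str) -> str:
    """Python rendering of preg_replace( '/[^\\w-]/', '-', $statementID )."""
    return re.sub(r"[^\w-]", "-", statement_id, flags=re.ASCII)


usual = "Q1$abc"     # '$' separates entity ID and GUID part
unusual = "Q1-abc"   # hypothetical ID that already contains '-' in that position

# Two different inputs, one output: the function is not injective.
print(to_rdf(usual), to_rdf(unusual))
print(to_rdf(usual) == to_rdf(unusual))  # True
```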

Can you add a test to the statement ID generation code that ensures it has an RDF compatible format (except for the 1 character that's a problem now), and a note that this is required for RDF support?

Well, adding that test to Wikibase wouldn’t be very useful because any violations are most likely to be caused by extensions adding new entity types. The EntityId class in the Wikibase data model currently enforces a very liberal pattern:

const PATTERN = '/^:?(\w+:)*[^:]+\z/';

That is, any number of repository prefixes, and then anything that doesn’t look like a repository prefix. (\z is an escape sequence matching the end of the string, more or less like $.) Just about the only thing this blocks is consecutive runs of : characters or strings ending in them, as far as I can tell.
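To see how liberal this is, the pattern can be tried out in Python. This is only a sketch: \z is written as \A-style \Z here (Python's end-of-string anchor), re.ASCII is an assumption about \w, and the helper name and sample strings are made up.

```python
import re

# Python rendering of '/^:?(\w+:)*[^:]+\z/'.
ENTITY_ID_PATTERN = re.compile(r"^:?(\w+:)*[^:]+\Z", re.ASCII)


def is_accepted(serialization: str) -> bool:
    """True if the string passes the liberal EntityId pattern."""
    return ENTITY_ID_PATTERN.match(serialization) is not None


for candidate in ["Q42", "foo:Q42", ":Q42", "Q 42$!", "Q42:", "a::b", ""]:
    print(repr(candidate), is_accepted(candidate))
```

The first four candidates are accepted, including the one with a space, $ and !; only the trailing colon, the doubled colon, and the empty string are rejected.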

In the entity types we have so far (Item, Property, Lexeme, Sense, Form, MediaInfo), entity IDs conform to the pattern of one or more dash-separated components of one letter and a decimal number (I guess that would be [A-Z][1-9][0-9]*(-[A-Z][1-9][0-9]*)*). If we want the statement ID ⇔ URI conversion to be possible in both directions, perhaps we should think about making this stricter pattern part of the core Wikibase data model?
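A quick Python sketch of what such a stricter check could look like, with the * quantifier applied to the digits of every component. The constant and function names are hypothetical; nothing like this exists in the data model as far as this discussion shows.

```python
import re

# One or more dash-separated components: one uppercase letter, then a
# decimal number without leading zeros.
STRICT_ENTITY_ID = re.compile(r"[A-Z][1-9][0-9]*(?:-[A-Z][1-9][0-9]*)*")


def is_strict_entity_id(serialization: str) -> bool:
    """True only if the whole string matches the stricter pattern."""
    return STRICT_ENTITY_ID.fullmatch(serialization) is not None


for candidate in ["Q42", "P31", "L1-F1", "M10", "Q042", "q42", "L1-", "foo:Q42"]:
    print(candidate, is_strict_entity_id(candidate))
```

Under this pattern an entity ID can contain hyphens but never $, so the $ in a statement ID stays an unambiguous separator on the Wikibase side.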

Side note: @Smalyshev do you know if colons in statement URIs would cause problems? We would currently escape them as dashes, which would make the conversion of statement IDs for foreign entities lossy. (That said, I’m not sure if there’s any situation where such statement IDs would actually occur: in which case would a Wikibase repository emit statements for foreign entities? As far as I’m aware they can only occur as predicates or values, but not as subjects.)

Gehel triaged this task as Medium priority. Sep 15 2020, 7:59 AM