As a user of the Wikidata Query Service, I want to be able to write reliable queries related to particular statements, which I know by their Wikibase statement ID.
Problem:
Currently, the RDF format documentation does not specify the format of statement URIs after the wds: prefix. It only states:
There is no guaranteed format or meaning to the statement id.
In practice, however, it’s not difficult to see how the statement URI is derived from the statement ID: they are identical, except that the $ in the Wikibase representation is replaced by a - for RDF. We use this, for example, in Wikibase-Quality-Constraints ([SparqlHelper.php](https://gerrit.wikimedia.org/g/mediawiki/extensions/WikibaseQualityConstraints/+/2441a9059caf9e1cacd739df38679b5369ef5fb2/src/ConstraintCheck/Helper/SparqlHelper.php#240)).
This is not what the RDF export does, though: it actually uses preg_replace( '/[^\w-]/', '-', $statement->getGuid() ), which means that characters other than $ may also be replaced by hyphens if they ever occur in statement IDs in the future. In that case, tools that only inferred the $→- rule from looking at the data might break.
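To make the difference concrete, here is a minimal Python sketch that contrasts the inferred $→- rule with an approximation of the export's regex. The function names are hypothetical, the second sample ID is invented purely for illustration, and re.ASCII is an assumption about how \w behaves in the PHP call (ASCII letters, digits, and underscore only).

```python
import re


def naive_rdf_statement_id(statement_id: str) -> str:
    """The rule a tool might infer from the data: only "$" becomes "-"."""
    return statement_id.replace("$", "-")


def exported_rdf_statement_id(statement_id: str) -> str:
    """Python approximation of preg_replace('/[^\\w-]/', '-', $guid).

    Assumption: \\w is limited to [a-zA-Z0-9_], which re.ASCII enforces here.
    """
    return re.sub(r"[^\w-]", "-", statement_id, flags=re.ASCII)


# A statement ID in the usual "entity ID, $, UUID" shape: both rules agree.
typical = "Q42$F078E5B3-F9A8-480E-B7AC-D97778CBBEF9"
print(naive_rdf_statement_id(typical))
print(exported_rdf_statement_id(typical))

# Invented ID containing a "." (not a format Wikibase produces today):
# the two rules now yield different results.
unusual = "Q42$F078E5B3.F9A8"
print(naive_rdf_statement_id(unusual))
print(exported_rdf_statement_id(unusual))
```

For IDs in today's format the two functions return the same string, which is why the simpler rule appears to work; they only diverge once some other non-word character shows up.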
Example:
As mentioned above, WBQC is one case that would benefit from documenting this relationship and making it part of the Stable Interface Policy. @ArthurPSmith also requested it on project chat.
Screenshots/mockups:
In RDF Dump Format#Full statements, change the current
There is no guaranteed format or meaning to the statement id.
to, for example:
The statement ID is the Wikibase statement ID, with every character other than PCRE “word” characters and hyphens replaced by a hyphen. In PHP, this is expressed as preg_replace( '/[^\w-]/', '-', $statementID ).
Acceptance criteria:
- the format of statement IDs in RDF is documented and part of the stable interface
Open questions:
- I would actually prefer to tweak the regex slightly before we set it in stone. \w is not really well-defined (the PHP documentation doesn’t clearly specify what a “word” character is, and apparently it can be locale-dependent?), so I’d make it more explicit with something like [^a-zA-Z0-9_\-] (keeping only ASCII letters and digits, underscore, and hyphen, and replacing everything else).
- Should we announce this once it’s done?
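As a rough illustration of why the meaning of \w matters for the first open question, the following Python sketch compares a \w-based pattern with the explicit ASCII class. It demonstrates Python's own regex semantics (where \w is Unicode-aware for str patterns), not PHP's, and the sample ID is invented; the point is only that the result depends on how \w is interpreted, whereas the explicit class leaves no room for interpretation.

```python
import re

# In Python, \w in a str pattern also matches non-ASCII letters.
# This is Python's behaviour, used here as a stand-in for "some
# broader interpretation of \w"; it says nothing about PHP itself.
WORD_CLASS = re.compile(r"[^\w-]")

# The explicit class proposed above: ASCII letters, digits, "_" and "-".
EXPLICIT_CLASS = re.compile(r"[^a-zA-Z0-9_\-]")


def normalize(pattern: re.Pattern, statement_id: str) -> str:
    """Replace every character matched by the pattern with a hyphen."""
    return pattern.sub("-", statement_id)


# Invented ID containing a non-ASCII letter (U+00E9), for illustration only.
sample = "Q\u00e9$1"

print(normalize(WORD_CLASS, sample))      # the letter survives
print(normalize(EXPLICIT_CLASS, sample))  # the letter becomes "-"
```

With the explicit class, the output is the same under every regex engine and locale, which is the property a stable interface would want.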