
How are the hash values in Wikidata rdf generated?
Closed, Resolved · Public

Description

I am comparing the JSON output of Wikidata's API with its RDF equivalent. The RDF contains URIs that contain hash values, and those hashes are visible in the JSON output.

e.g.

s:Q35869-2E1F6A06-E14B-4533-B207-61DD01CB57D3 pqv:P580 v:0fe8fcb754ca9e8d828baee034479a75 .

The RDF, however, also contains "normalized" values that are not part of the JSON model. The URIs in those normalized values also contain hash values. How are these generated?

Event Timeline

I don’t think we make any guarantees about these hash values… the RDF Dump Format explicitly says that the hashes of reference (wdref:) and full value nodes (wdv:) have no guarantees on them, except “same hash iff same reference/value”, and if there are other hashes in the RDF, I think the same is true for those.

Addshore claimed this task.

I wasn't looking for guarantees about the hash values. They have value as a sanity check in a [[ https://github.com/Wikidata/triplify-json | reverse engineering project ]] we are doing to reproduce the Wikidata/Wikibase RDF outside Wikibase itself. We need this to be able to apply EntitySchemas pre-ingestion; currently, EntitySchemas can only be applied after the data has been ingested. The script, as it currently works, builds the RDF from the JSON. That JSON object is enriched, and the idea is to then verify that the new JSON object still fits the EntitySchema before it is submitted to the API of Wikidata.

In building that RDF script, the hash values have a role in verifying that the (reverse-engineered) script does indeed produce exactly the same RDF as Wikidata produces natively. For most snaks the hash values are given in the JSON that is produced by the API of Wikidata.

This is not the case for those values in the RDF that are not given by the JSON export, specifically the normalized values for time and globe coordinates. That is why I am interested in the algorithm that is used to produce those hash values internally.

Well, I don’t think being able to reproduce those hashes externally was a design goal for Wikibase… I’m not sure what else I can say about your question that you couldn’t also get from looking at the code directly:

Reference.php
	public function getHash() {
		// For considering the reference snaks' property order without actually manipulating the
		// reference snaks's order, a new SnakList is generated. The new SnakList is ordered
		// by property and its hash is returned.
		$orderedSnaks = new SnakList( $this->snaks );

		$orderedSnaks->orderByProperty();

		return $orderedSnaks->getHash();
	}
SnakList.php
	public function getHash() {
		$hasher = new MapValueHasher();
		return $hasher->hash( $this );
	}
MapValueHasher.php
	public function hash( $map ) {
		if ( !is_array( $map ) && !( $map instanceof Traversable ) ) {
			throw new InvalidArgumentException( '$map must be an array or an instance of Traversable' );
		}

		$hashes = [];

		foreach ( $map as $hashable ) {
			$hashes[] = $hashable->getHash();
		}

		if ( !$this->isOrdered ) {
			sort( $hashes );
		}

		return sha1( implode( '|', $hashes ) );
	}
SnakObject.php
	public function getHash() {
		return sha1( serialize( $this ) );
	}

The most important part for you is probably that the hash is based on the PHP serialization of the snak (see the serialize() call in the last code snippet), which means it will be pretty painful to reproduce faithfully.
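For references at least, the per-snak hashes are already present in the JSON API output, so the outer combining step can be approximated without reimplementing PHP serialization. Below is a minimal, unofficial Python sketch of what Reference::getHash() plus MapValueHasher::hash() do, assuming the snak hashes in the JSON are the same ones Wikibase computes internally; whether the hasher keeps the property-grouped order or additionally sorts the hashes lexicographically depends on the MapValueHasher constructor default, which is not shown above, so treat the ordering as something to verify.

import hashlib

def reference_hash(snak_hashes):
    # Approximation of SnakList::getHash() via MapValueHasher::hash():
    # SHA-1 of the individual snak hashes joined with '|'.
    # Reference::getHash() orders the snaks by property first; if the
    # hasher's isOrdered flag is false, the hashes are additionally
    # sorted lexicographically (the sort() branch above) -- try both.
    joined = '|'.join(snak_hashes)  # assumption: property order, no extra sort
    return hashlib.sha1(joined.encode('utf-8')).hexdigest()

# Hypothetical usage, with the "hash" fields of the reference's snaks
# taken from the wbgetentities JSON output:
# wdref_hash = reference_hash(['<snak hash 1>', '<snak hash 2>'])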

If your goal is to verify EntitySchemas – do you really need to produce the same hashes as Wikibase? Surely no EntitySchema depends on the specific value of a value or reference hash (prov:wasDerivedFrom [ wdref:0000~ ], reference hash starts with four zeroes?), so I would hope that you could use some other hashing method.

You are completely right, the same hashes are not needed to apply EntitySchemas in memory prior to ingestion into Wikidata. I need the hashes as a sanity check that my script creates exactly the same RDF as Wikidata produces natively. So the hashes are only needed in the development phase of the script.

Here is a notebook that contains the first prototype.

from rdflib import Graph
from rdflib.compare import graph_diff, to_isomorphic

# WDqidRDFEngine is the prototype class defined earlier in the notebook.
allRD = WDqidRDFEngine(qid="Q38", fetch_all=True)

# Fetch the RDF that Wikidata produces natively for the same item.
compareRDF = Graph()
compareRDF.parse("http://www.wikidata.org/entity/Q38.ttl")
inboth, left, right = graph_diff(to_isomorphic(compareRDF), to_isomorphic(allRD.rdf_item))
print(len(left))
print(len(compareRDF))

If my script works, there should be no difference between the two graphs. Currently, that is not the case. I checked various examples and, except for the hashes in those normalized statements, they seem equal. But if it is indeed difficult to reproduce those hashes, I should consider another test to verify.
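One alternative test, sketched below under the assumption that only the wdv: value-node IRIs (which embed the hashes) differ: rewrite those IRIs to blank nodes on both sides before the isomorphism comparison, so the diff no longer depends on reproducing the hashes. compareRDF and allRD are assumed to be defined as in the snippet above.

from rdflib import Graph, URIRef, BNode
from rdflib.compare import graph_diff, to_isomorphic

WDV = "http://www.wikidata.org/value/"

def mask_value_nodes(graph):
    # Copy the graph, replacing wdv: IRIs (which embed the hashes) with
    # blank nodes, so graph_diff only checks the structure around them.
    masked = Graph()
    bnodes = {}
    for s, p, o in graph:
        if isinstance(s, URIRef) and str(s).startswith(WDV):
            s = bnodes.setdefault(str(s), BNode())
        if isinstance(o, URIRef) and str(o).startswith(WDV):
            o = bnodes.setdefault(str(o), BNode())
        masked.add((s, p, o))
    return masked

# inboth, left, right = graph_diff(
#     to_isomorphic(mask_value_nodes(compareRDF)),
#     to_isomorphic(mask_value_nodes(allRD.rdf_item)),
# )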

In the actual validation script, not all of the RDF will be needed. Ignoring the labels, for example, slims down the RDF graph substantially. So I am currently building functionality into the WikidataIntegrator that allows selecting only certain parts (e.g. no truthy statements, or only truthy statements, no normalized values, etc.). A notebook with that code is here
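For illustration, a generic rdflib sketch of that kind of slimming (this is not the WikidataIntegrator API, just the idea of dropping label, description and alias triples by predicate):

from rdflib import Graph, Namespace
from rdflib.namespace import RDFS, SKOS

SCHEMA = Namespace("http://schema.org/")

# Predicates used for labels, descriptions and aliases in the Wikidata RDF.
LABEL_PREDICATES = {RDFS.label, SCHEMA.name, SCHEMA.description,
                    SKOS.prefLabel, SKOS.altLabel}

def strip_labels(graph):
    # Return a copy of the graph without label/description/alias triples.
    slimmed = Graph()
    for s, p, o in graph:
        if p not in LABEL_PREDICATES:
            slimmed.add((s, p, o))
    return slimmed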

My PHP skills are a bit rusty, but I will investigate and/or consider other test strategies.