
Reference hash is not stable
Closed, Invalid · Public

Description

I've discovered that in the RDF database two references with the same data apparently have different hashes. Specifically, <http://www.wikidata.org/reference/004ec6fbee857649acdbdbad4f97b2c8571df97b> and <http://www.wikidata.org/reference/9a24f7c0208b05d6be97077d855671d1dfdbc0dd> both carry the same data:

<http://www.wikidata.org/prop/reference/P143>	<http://www.wikidata.org/entity/Q48183>

This should not happen. It may indicate a change in the hash calculation (which also should not happen).

Digging deeper into this, I've found that PropertyValueSnak's hash is calculated as:

		return sha1( serialize( $this ) );

I don't think this is a good idea: it does not guarantee property order, and the serialized form contains the full names of the classes:

"C:41:\"Wikibase\\DataModel\\Snak\\PropertyValueSnak\":127:{a:2:{i:0;s:4:\"P143\";i:1;C:39:\"Wikibase\\DataModel\\Entity\\EntityIdValue\":50:{C:32:\"Wikibase\\DataModel\\Entity\\ItemId\":6:{Q48183}}}}"

This means that whenever a class is renamed (even just a change in capitalization) or moved to a different namespace, all hashes change. The same may be happening to value hashes too.
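To illustrate, here is a minimal, self-contained sketch (not Wikibase code; the class and field names are made up) of why a serialize()-based hash is tied to class names:

    <?php
    // Standalone illustration: two classes holding identical data,
    // differing only in their (namespaced) class name.

    namespace Old {
        class Snak {
            public $propertyId = 'P143';
            public $value = 'Q48183';
        }
    }

    namespace Renamed {
        class Snak {
            public $propertyId = 'P143';
            public $value = 'Q48183';
        }
    }

    namespace {
        $a = new Old\Snak();
        $b = new Renamed\Snak();

        // serialize() embeds the fully qualified class name ("Old\Snak" vs.
        // "Renamed\Snak"), so the hashes differ although the data is identical.
        var_dump(sha1(serialize($a)) === sha1(serialize($b))); // bool(false)
    }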

We need to find a method of generating the IDs that is truly stable and does not change due to changes in class names, etc.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Currently the code produces 9a24f7c0208b05d6be97077d855671d1dfdbc0dd, but somewhere in April it still produced the old value. I'm not sure what changed.

@Smalyshev after what we found yesterday, can this be closed?

Don't we still want to make the hash stable, at least for the future?

I see no good way to do that. I would rather get rid of it entirely and use UUIDs.

thiemowmde subscribed.

I'm going to close this ticket for now because it is not actionable and there is nothing we can do for now, except breaking all hashes again, which is what this ticket actually aims to avoid.

The issue is not new. We have been aware of this problem in the code for years now. We do have tests in place that make sure we are not breaking any hashes without at least noticing. Whenever there is a good reason to break one of these problematic hashes (which happened just a few weeks ago), we should (and will) try to get rid of hash calculations that are based on serialize() and replace them with something much more stable.

Replacing hashes with GUIDs is, even if I support the idea, an even bigger breaking change. I don't see us doing this in the near future. What we will do is try to avoid introducing new hashes, and use GUIDs instead. But I don't think we need a ticket for this.
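For comparison, statement GUIDs already work this way: they combine the entity ID with a random UUID, so they do not depend on the content at all. A rough sketch of what minting such an identifier for a reference could look like (the newReferenceGuid helper and the exact format are hypothetical, not existing Wikibase API):

    <?php
    // Hypothetical sketch: a content-independent identifier for a reference,
    // modelled loosely on statement GUIDs ("Q42$<uuid>"). Not existing API.

    /** Generate a random RFC 4122 version 4 UUID. */
    function uuid4(): string {
        $bytes = random_bytes(16);
        $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40); // set version 4
        $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80); // set RFC 4122 variant
        return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
    }

    /** Hypothetical helper: mint an identifier for a new reference on an entity. */
    function newReferenceGuid(string $entityId): string {
        return $entityId . '$' . uuid4();
    }

    echo newReferenceGuid('Q48183'); // e.g. "Q48183$0f8fad5b-d9cb-469f-a165-70867728950e"

Such an identifier would survive any later change to the serialization or hashing code, but switching to it is of course exactly the breaking change discussed above.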

I'm going to close this ticket for now because it is not actionable and there is nothing we can do for now, except breaking all hashes again, which is what this ticket actually aims to avoid.

I'd be OK with breaking the hashes one more time, if it's the last time. Because otherwise we *will* inevitably break them again, and that would happen without notification, mess up the database, and there would be no way to detect it and no way to avoid it.

The issue is not new. We have been aware of this problem in the code for years now.

And it is still not fixed. I'm not sure closing tickets for it is the best way to deal with it? If we have a problem, it should be documented as an existing problem.

Replacing hashes with GUIDs is, even if I support the idea, an even bigger breaking change

Again, I don't mind one-time intentional breakage, if it prevents many accidental breakages in the future.

if it's the last time.

It will never be "the last time". Right now the PHP class names are actually the most stable element in the hash calculation. All other details in there are much more likely to change the hash. Absolute stability can only be achieved by replacing them with GUIDs.

that'd happen without notification […] if it prevents many accidental breakages in the future.

I don't know what you are talking about. As I already explained above we do have tests in place that make sure we are never changing hash calculations without being aware of it. "Accidental breakages" are just not possible.

Databases that store hashes for longer than a few hours and use them for lookups are actually badly designed. We are not supporting such use cases.

I don't know what you are talking about. As I already explained above we do have tests in place that make sure we are never changing hash calculations without being aware of it.

And yet, it recently changed without any announcement, and I spent significant time trying to figure out why the database was getting out of sync.

Databases that store hashes for longer than a few hours and use them for lookups are actually badly designed. We are not supporting such use cases.

Great, the Wikidata Query Service is badly designed and not supported. So, what do you propose to do now? Close the service? Or can we try to find a solution that gives us IDs for references and values, instead of denying that the problem exists?

I might be ignorant, but is the problem really so big that it deserves this much attention? All we need to do is be aware of it (which we are now, after the incident you are referring to) and repopulate the Wikidata-Query-Service database when we think we need to change the hashes again.

The only problematic line of code I can find in the current code is this one: https://phabricator.wikimedia.org/diffusion/EWBA/browse/master/repo/includes/Rdf/FullStatementRdfBuilder.php;c230481ef2c87f85bcb9780327e98b902193b51f$154. I cannot think of a good alternative that would be worth the resources needed. The issue goes all the way down from SnakList to Snak to SnakObject to PropertyValueSnak to DataValueObject to every single class that implements the DataValue interface. Just looking for the remaining hash calculations that use serialize() would not solve the problem, as I tried to explain above, because the hashes will still break with every other change we make in the future. The only actual solution I can think of is GUIDs on all levels, which is going to be a huge breaking change to pretty much every single component we maintain. For what benefit?

@thiemowmde we can't sensibly make the hashes completely stable. But we can stop using hashes as identifiers for references. The RDF mapping still uses hashes to identify data values - still a problem, but not as big. Perhaps we can make these hashes more stable, so they no longer depend on things like the order of private fields.

@Smalyshev The RDF mapping code could implement its own hashing, perhaps based on the JSON serialization. That would be easy to do, but quite a bit of overhead...
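Roughly something like this (a sketch only; the rdfSnakHash name and the input array shape are illustrative, assuming the builder already has the snak's public JSON-style array form at hand):

    <?php
    // Sketch: hash the snak's public JSON-style array form instead of its PHP
    // serialize() output, so the hash no longer embeds PHP class names.

    function rdfSnakHash(array $snakData): string {
        // Sort keys recursively so the result does not depend on array key order.
        $normalize = function (array $data) use (&$normalize): array {
            foreach ($data as $key => $value) {
                if (is_array($value)) {
                    $data[$key] = $normalize($value);
                }
            }
            ksort($data);
            return $data;
        };
        // json_encode() with default options produces a compact, whitespace-free string.
        return sha1(json_encode($normalize($snakData)));
    }

    // Example: the P143 snak from the task description.
    echo rdfSnakHash([
        'snaktype' => 'value',
        'property' => 'P143',
        'datavalue' => ['type' => 'wikibase-entityid', 'value' => ['id' => 'Q48183']],
    ]);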

we can stop using hashes as identifiers for references.

If we find a way to do this without breaking everything, sure. Your suggestion sounds good. I can try to spend some time on this.

Perhaps we can make these hashes more stable, so they no longer depend on things like the order of private fields.

What do you mean when you say "more" stable? What details do you consider especially unstable? We are neither going to move these classes to other namespaces, nor change the order of private fields. Why should we do this? These are not actual problems we have. And again, whenever somebody accidentally tries to do something like this, a test will break and tell us.

Yes, we do do this, e.g. https://gerrit.wikimedia.org/r/#/c/353022/. And every time we do this, the RDF database has to be re-imported. It cannot be avoided completely, but the risk can be reduced.

But there's also a conceptual reason beyond the practical one: Conceptually, the hash should not depend on implementation details. It currently does. It would be nice to change that.

You did not answer my question. Sure, class names are implementation details and should not be in the hash calculation. But as the hash calculation stands right now, these details are actually the most stable ones. Other details are much more likely to change – like us discussing an entire rewrite of the hashing code.

So what details of the hash calculation do you, @daniel, consider especially unstable?

https://gerrit.wikimedia.org/r/353022 was a conscious decision. Yes, we made a mistake there by not realizing early enough that we had to rebuild the Wikidata-Query-Service database. But again, the change itself was a conscious decision and not some "accidental breakage" that happened "without any announcement".

The suggestion to let the RDF builder do its own hashing that does not call any of the existing getHash functions might sound like a nice idea, for a moment. But it does have a series of major problems, which is why I decided not to spend time on this for now:

  1. It would be quite a large and fragile chunk of code that more or less duplicates what the Snak and DataValue classes already do.
  2. The custom hashing might be able to traverse Statements, Snaks, and DataValues, but at some point it must consider the actual values stored in the DataValue classes. This means the custom hashing must still break whenever the slightest change is made to a value, including the addition of fields – something we do not consider a breaking change. Overall this makes such custom hashing as fragile as the original one.

I think this discussion is a bit confused because we are talking about multiple separate issues:

  1. Making the hash independent of implementation (and platform configuration!) details. This would be a good thing in principle.
  2. Reducing the chance of breakage of our RDF mapping by not using hashes to identify references. This would be a pretty major change to the data model, but it would allow references to be re-used, which we want anyway (T78688).
  3. Avoiding breaking hashes when we add fields to data values. For example, the hash of the value without the new field should be the same as the hash of the value with the new field defined but empty.
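For item 3, a rough sketch of the kind of normalization that would keep hashes stable across added-but-empty fields (the function name, array shape, and the "refinement" field are made up for illustration):

    <?php
    // Sketch: drop unset/empty attributes from a value's array form before
    // hashing, so adding a new optional field with an empty value does not
    // change the hash. Names and array shape are illustrative only.

    function stableValueHash(array $value): string {
        $normalize = function (array $data) use (&$normalize): array {
            foreach ($data as $key => $item) {
                if (is_array($item)) {
                    $item = $normalize($item);
                }
                if ($item === null || $item === '' || $item === []) {
                    unset($data[$key]); // absent and empty are treated the same
                } else {
                    $data[$key] = $item;
                }
            }
            ksort($data);
            return $data;
        };
        return sha1(json_encode($normalize($value)));
    }

    // A value without the (made-up) new field ...
    $old = ['time' => '+2017-00-00T00:00:00Z', 'precision' => 9];
    // ... and the same value after a new optional field was added but left empty.
    $new = $old + ['refinement' => ''];

    var_dump(stableValueHash($old) === stableValueHash($new)); // bool(true)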

All that being said, it has to be clear that hashes cannot be made completely stable. Any breaking change to the data structure MUST change the hashes. We should ensure that we announce such changes, to give people who run a triple store a chance to re-import the data in a timely manner.

It would be nice if we could make hashes at least reproducible by other tools... though it might be non-trivial, I admit. But right now, if some tool wants to generate the RDF representation of a Wikidata statement, there's basically no way besides using the same PHP code.

I agree this would be cool, and I have thought about it multiple times, but I was not able to come up with something obvious. I mean, it's quite trivial to say "the hash of a StringValue is just the SHA1 of the string". But how would you document this for every slightly more complex data structure? Where would this documentation live, if not in the code?

I think consistency is key. For example, we could say that the hash is always the SHA1 of the JSON representation of the data structure in question. We need to specify a few additional details, like spacing and indentation of the JSON (because this influences the string representation and the calculated SHA1), and most importantly the order of attributes, as well as the presence or absence of optional attributes.

Oh dear. Even this approach is not that obvious, and not that easy to reproduce in another language. What if we add a new attribute? This must change existing hashes, even if the value does not change. How do we deal with that? How do we ever update external hash calculation algorithms? What if one user calculates a hash based on a JSON encoding that adds spaces in "key": "value", and another one that doesn't?

What's your suggestion?

Just a note: I suspect we might end up improving on this situation as part of T138708: [Epic] Signed statements, since that will probably also require stable, externally reproducible (i. e. ideally without PHP serialize()) hashes of values.

Bugreporter renamed this task from Reference hash is not stable to Make snak hashes stable. Sep 30 2022, 8:14 PM
Bugreporter reopened this task as Open.
Bugreporter updated the task description.