Page MenuHomePhabricator

Main snak and reference snaks do not include hash in JSON output
Closed, ResolvedPublic

Description

In the JSON output format (Special:EntityData, wbgetentities and dump all agree here), snaks used as part of a reference don’t have a hash, and neither does the mainsnak of a statement. Snaks in the qualifiers, on the other hand, have a hash.

Do we want to add the hash to the reference snaks? Do we want to add it to the main snak? And do we want to take the opportunity to call it an “id” in those cases (and eventually rename it in the qualifiers as well)?


Event Timeline

It looks like @Addshore implemented this in July 2015 (datamodel-serialization@679774d120, Wikibase@3ae70d2b47, Wikibase@c6c4a5649a). Adam, do you know why the hashes were excluded?

I guess this was just caused by confusion during refactoring. I'm looking into it now. A good point to start the investigation is SerializerFactory::newStatementSerializer().

The following places currently construct statement serializers:

  • Scribunto_LuaWikibaseLibrary::newEntityAccessor()
  • WikibaseRepo::getApiHelperFactory()
  • SpecialEntityData::newDefaultRequestHandler()
  • DumpJson::execute()

All of them, set use the OPTION_SERIALIZE_MAIN_SNAKS_WITHOUT_HASH and OPTION_SERIALIZE_REFERENCE_SNAKS_WITHOUT_HASH flags. None of them sets the OPTION_SERIALIZE_QUALIFIER_SNAKS_WITHOUT_HASH.

I cannot remember any good reason for this. I suspect that the flags were simply copied around. Perhaps the OPTION_SERIALIZE_QUALIFIER_SNAKS_WITHOUT_HASH mode was added later. Anyway, it seems to me we should always be including all hashes, or none. Mixing these modes seems surprising and pointless.

It seems like we should:

  • never include snak hashes for the internal serialization
  • make the inclusion of snak hashes optional in the output of wbgetentities
  • omit snak hashes in the output of other API modules
  • omit snak hashes in JSON dumps, for compactness. Or include them, for completeness?
  • make the inclusion of snak hashes optional in the output of Special:EntityData. The default should be the same as for JSON dumps

Proposing this for discussion & decision during this or the next sprint.

  • omit snak hashes in the output of other API modules

As far as I can tell, we’ll need hashes in the output of wbeditentity if we want to have those hashes in the DOM (T171725). Or can the JavaScript view calculate the hashes?

So, this was part of the LARGE refactoring of the serialization stuff / removing lib serializers for those that remember that mess (it spans about 30 patches or something like that).

The options were introduced to keep the output in various areas of wikibase exactly the same through the refactoring / removing of the lib serializers.
We said that if we wanted to change something then we would discuss it further down the line, and it looks like we may now be at that point.

tldr, the only reason the options exist was to maintain the output format exactly as it was before refactoring.

As far as I can tell, we’ll need hashes in the output of wbeditentity if we want to have those hashes in the DOM (T171725). Or can the JavaScript view calculate the hashes?

No, the frontend code should not do its own hash calculations. That's going to fail horribly. The has is not based on the JSON representation.

When does the frontend code need to build a DOM fragment from JSON? After an edit completes, I assume. II still think it would save us a ton of trouble to simply ask the backend to do that, instead of doing it in JS... oh well.

If we need that, hashes should be optional in the output of all API modules (and off per default).

tldr, the only reason the options exist was to maintain the output format exactly as it was before refactoring.

Yay for bug-by-bug compatibility :)

I have no idea how to implement this nicely for API modules. There should be an optional flag parameter to include hashes in the output of wbsetclaim and other modules, but the configuration of the serializers is buried in WikibaseRepo::getApiHelperFactory, which is called way before the API module is even instantiated, let alone has had a chance to parse the API parameters. So it seems to me that we can either change a lot of scaffolding to inject serializer factories instead of serializers, and only instantiate a serializer factory at the last possible moment, or we add a way to change the serializer’s behavior after it has been instantiated, which is risky because it might be shared and also tricky because we need to propagate that change through the tree of serializers.

To summarize an IRC discussion (in an unlogged channel :/ ) –

  • We certainly don’t need a mixture of having hashes for some snaks but not others. To this end, I’ve created WikibaseDataModelSerialization#233, and once version 2.5.0 of that library is released we should use SerializerFactory::OPTION_SERIALIZE_SNAKS_WITHOUT_HASH.
  • We definitely don’t want hashes in the database, and it’s debatable whether we want them in the JSON dumps. (Perhaps we want two kinds of JSON dumps – see T174029.) Elsewhere, we can probably include the hashes. It might not even need to be configurable. (Daniel: “I’m okay with always including the hashes in API output, and output of Special:EntityData.”)
  • This should be announced as, at least, a significant change. If we go for dumps without hashes, it’s a breaking change, because qualifier snaks used to have hashes.

Change 373859 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] Emit all snak hashes in API responses

https://gerrit.wikimedia.org/r/373859

It might not even need to be configurable. (Daniel: “I’m okay with always including the hashes in API output, and output of Special:EntityData.”)

Perhaps it should be configurable after all, because otherwise the tests are going to become ugly.

Change 373859 abandoned by Lucas Werkmeister (WMDE):
Emit all snak hashes in API responses

Reason:
Already done in Ia143563ed1. (Merging this into current master produces an empty diff. That’s also why the tests succeeded – Leszek went through the trouble of fixing them already :) )

https://gerrit.wikimedia.org/r/373859

thiemowmde moved this task from Review to Doing on the Wikidata-Former-Sprint-Board board.
thiemowmde removed a project: Patch-For-Review.
thiemowmde added subscribers: thiemowmde, WMDE-leszek.

Do we need them in the json dumps and Special:EntityData? If the answer is "unclear" then let's not do it until someone asks for it.

thiemowmde claimed this task.
thiemowmde moved this task from Doing to Proposed on the Wikidata-Former-Sprint-Board board.

Based on what Lucas, Lydia and Daniel wrote above, I think we should do exactly that: No hashes in JSON dumps for now (when somebody asks for it, we probably need to provide two dumps anyway, a compact and an expanded one). No hashes in EntityData (because it should mirror what the dumps do, in my opinion).

Just to clarify: the current situation in dumps and EntityData isn’t “no hashes”, it’s “qualifier hashes but not main snak hashes or reference hashes”. And we will lose the ability to make this distinction as soon as we switch to wikibase/data-model-serialization 3.0.0 (see pull request), and if we go for “no hashes” then (which is fine IMO), it will (strictly speaking) be a breaking change.

Just a followup from somebody coming to Wikidata dumps in 2021: it is really confusing that dumps do not include hashes, especially because EntityData seems to show them now for all snaks (main, qualifiers, references). So when one is debugging this, using EntityData as a reference throws you of.

I would vote for inclusion of hashes in dumps as well. I think this is backwards compatible change as it just adds to existing data in there. Having hashes in all snaks makes it easier to have a reference with which you can point back to a snak in your own system (or UI).