For consistency MediaInfo serialization should use "claims" as key, rather than "statements"
Open, High, Public

Description

This should be changed in order to stay consistent with the Item/ Property serialization which both use "claims".

This topic came up during the initial development of the Lexeme (De-)Serializers.

Event Timeline

Addshore added a subscriber: Addshore.

This is in MediaInfoSerializer::getSerialized

		$serialization['statements'] = $this->statementListSerializer->serialize(
			$mediaInfo->getStatements()
		);

This is already going to cause an issue: we now have entities in the Commons DB whose serialization includes a "statements" key, so we will need a compat layer.

Magnus ran into this and this is very confusing and inconsistent. The longer we wait with fixing this, the more effort it will cost. For example, https://commons.wikimedia.org/w/api.php?action=wbgetentities&ids=M27401711 uses "statements" as the key, but we expect "claims", as in https://commons.wikimedia.org/w/api.php?action=wbgetentities&ids=Q422341 . Using "claims" would also be consistent with API modules such as https://commons.wikimedia.org/w/api.php?action=help&modules=wbcreateclaim , https://commons.wikimedia.org/w/api.php?action=wbgetclaims&entity=M27401711 , https://commons.wikimedia.org/w/api.php?action=help&modules=wbremoveclaims and https://commons.wikimedia.org/w/api.php?action=help&modules=wbsetclaim .

By the way, I noticed another difference: The imageinfo output doesn't contain the line "datatype": "wikibase-item". Not sure if this is related or not?

Magnus ran into this and this is very confusing and inconsistent. The longer we wait with fixing this, the more effort it will cost.

Totally agree there, we have already put this off for years.
It's a pretty breaking change though.
But with the correct announcement and time that would all be fine, and even if people miss it, fixing their code should be trivial.

Thoughts @Lydia_Pintscher ?

Can't we do a smart trick with showing it twice (claims and statements) in the front and storing it only once in the back? A big bang is much more complicated and riskier. You would have a timeline like:

  • Switch api read and write functions to expose both claims and statements
  • Clients can start switching to "claims"
  • Switch the backend to read it as both statements and claims, but store it as "claims"
  • Do a null edit run to get rid of the "statements" and replace it with "claims"
  • Switch backend code to only use "claims"
  • Check if no code is using statements anymore in the api
  • Switch front end to only use "claims"
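
The timeline above could be sketched roughly as follows. This is an illustration only, written in Python rather than Wikibase's PHP; the function names (`deserialize_from_storage`, `serialize_for_storage`, `serialize_for_api`) and the sample IDs are hypothetical and are not part of Wikibase or WikibaseMediaInfo.

```python
"""Hypothetical sketch of a dual-key transition; not actual Wikibase code."""

OLD_KEY = "statements"
NEW_KEY = "claims"


def deserialize_from_storage(stored: dict) -> dict:
    """Read the statement map from a stored entity that may use either key."""
    if NEW_KEY in stored:
        return stored[NEW_KEY]
    return stored.get(OLD_KEY, {})


def serialize_for_storage(stored: dict) -> dict:
    """Re-serialize with only the new key (what a null edit would persist)."""
    out = {k: v for k, v in stored.items() if k not in (OLD_KEY, NEW_KEY)}
    out[NEW_KEY] = deserialize_from_storage(stored)
    return out


def serialize_for_api(stored: dict, transition: bool = True) -> dict:
    """Expose both keys while clients migrate, then only the new key."""
    out = serialize_for_storage(stored)
    if transition:
        out[OLD_KEY] = out[NEW_KEY]
    return out
```

With `transition=True` the response carries the same statement map under both keys, which is the response-size cost raised later in the thread; with `transition=False` only "claims" is exposed.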

Magnus ran into this and this is very confusing and inconsistent. The longer we wait with fixing this, the more effort it will cost.

Totally agree there, we have already put this off for years.
It's a pretty breaking change though.
But with the correct announcement and time that would all be fine, and even if people miss it, fixing their code should be trivial.

Thoughts @Lydia_Pintscher ?

From my side absolutely. I'm not sure why this is different in the first place.

FWIW, I have already changed my code to work with either claims or statements. Quick thoughts:

  • IMHO this change is too significant to do it "just because it's a nicer word". No one really cares what it's called, as long as you call it the same thing every time.
  • A big problem with this change is that it was not announced anywhere that I saw, and I'm pretty much subscribed to everything public. Give us poor volunteer devs some warning, at least
  • Also, it's inconsistent. Wikidata items and properties have claims, mediainfo items have statements. Is this going to change on Wikidata as well? Wikibase in general?
  • Other things have changed in the Commons/mediainfo implementation. datatype is missing, for one (tracked by some other issue here). Will it come back? Will it stay missing but only for mediainfo?

The problem is not that things are different; it is that they are different needlessly (AFAICT) and without announcement.
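
A client that works "with either claims or statements", as described above, might use a small helper like the one below. This is a minimal sketch and not the actual tool code; `get_statements` is a hypothetical name, and the entity dicts in the usage note are made-up sample data shaped like a `wbgetentities` entity.

```python
def get_statements(entity: dict) -> dict:
    """Return the property -> statement-list map from whichever key is present.

    "claims" is checked first (items and properties), then "statements"
    (mediainfo). Anything that is not a dict is treated as "no statements".
    """
    for key in ("claims", "statements"):
        value = entity.get(key)
        if isinstance(value, dict):
            return value
    return {}
```

For example, `get_statements({"id": "M1", "statements": {"P1": []}})` and `get_statements({"id": "Q1", "claims": {"P1": []}})` both return `{"P1": []}`, so the calling code does not need to know which entity type it is handling.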

Can't we do a smart trick with showing it twice (claims and statements) in the front and storing it only once in the back? A big bang is much more complicated and riskier.

We could do, but this would drastically increase the sizes of responses.
We could use a method similar to some MediaWiki core API versions and introduce a new temporary param, for example statementsnotclaims=1.
When it is set to 1, send statements; when it is not set, send claims. Then we can slowly monitor adoption of the new format and send out warnings with the old format, trying to chase down any user agents that still use the old one before actually fully switching over.
We could also just introduce a parameter for "serializationVersion" which could basically do the same thing.
This would all be easier if we had a nice versioned API already or versioned serialization :)
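
The temporary-parameter idea could look roughly like this. It is a sketch under assumptions: `build_response`, the response layout, and the warning text are hypothetical, and "claims" is treated as the old format, in line with the later clarification in this thread that the proposal concerns moving from "claims" to "statements". Only the parameter name `statementsnotclaims` comes from the comment above.

```python
from collections import Counter

# Hypothetical tally of user agents still receiving the old format,
# so that adoption can be monitored before fully switching over.
old_format_user_agents = Counter()


def build_response(entity: dict, params: dict, user_agent: str) -> dict:
    """Serialize one entity, picking the key from the temporary parameter."""
    statements = entity["statements"]  # internal representation (assumed)
    body = {"id": entity["id"]}
    if params.get("statementsnotclaims") == "1":
        body["statements"] = statements
        return {"entity": body}
    # Parameter not set: send the old key, warn, and record the user agent.
    body["claims"] = statements
    old_format_user_agents[user_agent] += 1
    warning = 'The "claims" key is deprecated; pass statementsnotclaims=1.'
    return {"entity": body, "warnings": [warning]}
```

A "serializationVersion" parameter could follow the same pattern, with the version number selecting the key instead of a boolean flag.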

T92961: [Story] Versioning in JSON output

We also have T221737: REST API Infrastructure in MediaWiki to look forward to, it might make sense to hold off on a change like this until we make a "big move" to a new API.

A big problem with this change is that it was not announced anywhere that I saw, and I'm pretty much subscribed to everything public. Give us poor volunteer devs some warning, at least

Hmm, you mean the fact that statements on mediainfo entities are serialized as "statements" not "claims"?
I guess it is not seen as a breaking change as media info entities are new, and they can do what they want with their serialization.

Also, it's inconsistent. Wikidata items and properties have claims, mediainfo items have statements. Is this going to change on Wikidata as well? Wikibase in general?

Everything should probably change to "statements" in the long run, as having "claims" appear in the serialization, and only in the serialization, is confusing. Per the definition of our data model it also doesn't make sense.

Other things have changed in the Commons/mediainfo implementation. datatype is missing, for one (tracked by some other issue here). Will it come back? Will it stay missing but only for mediainfo?

Which ticket number is this one?

Can't we do a smart trick with showing it twice (claims and statements) in the front and storing it only once in the back? A big bang is much more complicated and riskier.

We could do, but this would drastically increase the sizes of responses.

Drastically? We're talking Commons here. Before I ran a bot only 50.000 files even had claims. Nothing has references and qualifiers are only introduced this week. Everything is still tiny compared to Wikidata

We could use a method similar to some MediaWiki core API versions and introduce a new temporary param, for example statementsnotclaims=1.
When it is set to 1, send statements; when it is not set, send claims. Then we can slowly monitor adoption of the new format and send out warnings with the old format, trying to chase down any user agents that still use the old one before actually fully switching over.

Commons! Not Wikidata. What adoption? I think we currently have:

  • The stuff the SDOC wrote (front end, uploadwizard, etc.). That's easy to track and fix
  • Some stuff I wrote in which I don't have claims/statements yet
  • Some stuff Magnus wrote and he already updated his code
  • Anything else? I doubt it.

So why bother to make this more complicated than needed? Or am I missing something here?

Drastically? We're talking Commons here. Before I ran a bot only 50.000 files even had claims. Nothing has references and qualifiers are only introduced this week. Everything is still tiny compared to Wikidata

I'm talking about changing this from "claims" to "statements" on wikidata.
That would be the right thing to do, as they are statements, not claims.

No no no, that's not what this task is about. The scope of this task is only mediainfo on Commons and undoing the mistake of introducing statements instead of claims. Please focus on the issue at hand.

This is partly a question for the SDOC team: will SDOC ever have references as part of their data model?
Then there is also the question for wikidata and wikibase, which is, why are we using "claims" in the serialization?

The answer to the latter is legacy reasons, and to avoid breaking people's tools. If we could rename our API modules to talk about statements instead of claims, and easily change the serialization without annoying people, we would, and we will at some point.
Why are we currently putting statements under a claims key? That makes no sense.

And thus, if we want to make this change in wikibase / on wikidata with items and properties, then why change sdoc to use claims, when a little way down the road we will want to then change it back to statements to be consistent with wikidata and wikibase once again?

Jdforrester-WMF renamed this task from For consistency MediaInfo serialization should use "claims" as key, rather than "statements" to For consistency, Wikibase serialization should use "statements" as key, rather than "claims", like modern Wikibase code now does. Jun 24 2019, 9:07 PM

@Jdforrester-WMF Is that an official design decision (claims=>statements)? Where was this fundamentally breaking change announced to the public?

Personally I don't care what it's called, just that it's (a) consistent and (b) announced before it changes in production.

Why are we currently putting statements under a claims key? That makes no sense.

Originally, Statement was a subclass of Claim, so all statements were claims. We chose the more general term for the key, so it could accommodate all kinds of claims. Later, we found that we really always wanted Statements; we didn't find a use case that never needed references. So Claim was dropped as a base class of Statement.

Multichill renamed this task from For consistency, Wikibase serialization should use "statements" as key, rather than "claims", like modern Wikibase code now does to For consistency MediaInfo serialization should use "claims" as key, rather than "statements". Jun 25 2019, 9:11 PM

Changed back the topic. This is a huge scope change and derailing things. As far as I see everywhere in the api we use "claims", not "statements" (also in the functions). The only inconsistency right now is mediainfo, that should be fixed. If you want to change everything in the Wikibase API to use statements instead of claims (wbgetclaims -> wbgetstatements, etc.), file a new task so I can down vote that one as a huge waste of resources.

Changed back the topic. This is a huge scope change and derailing things. As far as I see everywhere in the api we use "claims", not "statements" (also in the functions). The only inconsistency right now is mediainfo, that should be fixed. If you want to change everything in the Wikibase API to use statements instead of claims (wbgetclaims -> wbgetstatements, etc.), file a new task so I can down vote that one as a huge waste of resources.

OK, then I can just Decline this task? As established above, when Wikimedia DE wrote WBMI in early 2016 they used "statements" because all new code should use that and not "claims", but haven't gone back to fix Wikidata to use the modern language.

"Established"? Where? When? Point me to the official announcement please! The one that gives everyone time to prepare, before it's released on, say, Commons. That one. Until then, this remains open.

To avoid that people confuse claims and statements in general, maybe the feature should use an entirely different name.

As established above, when Wikimedia DE wrote WBMI in early 2016 they used "statements" because all new code should use that and not "claims", but haven't gone back to fix Wikidata to use the modern language.

"Established"? Where? When? Point me to the official announcement please! The one that gives everyone time to prepare, before it's released on, say, Commons. That one.

I have no idea when exactly the Wikidata team made this decision, or whether or how it was communicated "explicitly", but for outside observers like me it's been obvious for years. For example, the Cirrus search modifier is "haswbstatement", not "haswbclaim".

Until then, this remains open.

That's not how Phabricator works.

The SDC team will follow the recommendations/decisions of the original authors (WMDE).

We do believe, as Addshore mentioned above, that there's a strong possibility that the Commons model will end up not using references in the way that Wikidata does. We'll defer to WMDE for how that should be reflected in the serialization code.

Changed back the topic. This is a huge scope change and derailing things. As far as I see everywhere in the api we use "claims", not "statements" (also in the functions). The only inconsistency right now is mediainfo, that should be fixed. If you want to change everything in the Wikibase API to use statements instead of claims (wbgetclaims -> wbgetstatements, etc.), file a new task so I can down vote that one as a huge waste of resources.

OK, then I can just Decline this task? As established above, when Wikimedia DE wrote WBMI in early 2016 they used "statements" because all new code should use that and not "claims", but haven't gone back to fix Wikidata to use the modern language.

"Established"? Where? When? Point me to the official announcement please! The one that gives everyone time to prepare, before it's released on, say, Commons. That one. Until then, this remains open.

So the data model is described at https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON . Notice the usage of "claims" instead of "statements". This is considered a stable data format, see https://www.wikidata.org/wiki/Wikidata:Stable_Interface_Policy#Stable_Data_Formats . The different API functions for claims are subject to https://www.wikidata.org/wiki/Wikidata:Stable_Interface_Policy#Stable_Public_APIs . Shall I continue? You want to break all these stable policies just because it looks better?

So, although not particularly clear on the https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON page, this is the JSON data model definition for items and properties.
And the stable interface policy on wikidata only currently applies to wikidata.

provided by Wikibase as deployed on www.wikidata.org.

So the SIP doesn't apply to commons at all right now. This is something that we need to discuss with the StructuredDataOnCommons team once the main initial development stages are all complete.

  • Maybe the "labels" represented in JSON will be changed to "captions"?
  • "descriptions" will also probably be removed from the JSON

Anyway, the SIP and the JSON data model docs probably need a bit of work, for example to cover Lexemes when linking to stable JSON representations on wikidata.org, rather than only showing the item / property JSON definition on Wikibase/DataModel/JSON.

In light of all of this we need to:

  • Discuss with the SDOC team the stable interface policy: whether they want one yet, where it should live, and what it should include for mediainfo vs the rest of wikidata at this stage.
  • Improve the "wikibase json datamodel" docs to make it clear what docs are for what entities etc, and which json data models are currently covered by the SIP.
  • Decide what to do regarding claims vs statements for both wikidata entities and also within media info

Regarding that last point, in the long run, we want to stop talking about "claims" everywhere, as the statements vs claims legacy only leads to more confusion.
In the long run, probably once we have a new iteration on our API, we will likely change this JSON serialization, but none of this is happening yet.
The reason that we even see claims anywhere is explained in T149410#5281355.
If we want to move towards "statements" in our JSON output, it is highly unlikely that we are going to change mediainfo to now have a "claims" key when in a year or so it will then be moved again back to "statements".

I feel like I have rambled on enough for now, though I am perfectly happy to keep discussing this. In terms of the serialization on wikidata.org and commons.wikimedia.org and the claims vs statements keys, I believe nothing is going to change in either place any time soon.

The SDC team will follow the recommendations/decisions of the original authors (WMDE).

We do believe, as Addshore mentioned above, that there's a strong possibility that the Commons model will end up not using references in the way that Wikidata does. We'll defer to WMDE for how that should be reflected in the serialization code.

Hi all (especially @Addshore and @Ramsey-WMF)

Is there any progress on adding references to Commons? Now that SDC is being widely adopted, it would be extremely helpful to be able to use references. I know that it is currently possible to add references, but they are hidden.

Would someone be able to explain in plain English why Structured Data on Commons shouldn't use references in the same way Wikidata does?

Also what would be the resources needed to make references work in a way that individuals and organisations could use in the normal interface?

Thanks

@John_Cummings - structured data on a commons File page is for describing the file. For example:

  • what an image depicts
  • the copyright licence associated with the file
  • who created the file

In this context I don't see a solid use case for references. We don't need a reference to say that an image depicts a fish, for example. An image might be a photo of a famous painting, but in that case we'd use the "digital representation of" property to point at the Wikidata item for the painting itself, and referenced information about the painting (rather than the image file) would be available in Wikidata.

Is there something I'm missing? Had you a specific use case in mind?

Hi @Cparle thanks for replying. I know @Fuzheado @Alicia_Fagerving_WMSE @Jopparn @Battleofalma etc will also have thoughts on this.

I'll give you my answer and let others expand on it. I'm basing this on 10 years of working as a Wikimedian in Residence for cultural institutions, UN agencies and parts of the EU. From my perspective, the main use case is any content created by external organisations, which runs to tens of millions of files on Commons. Many of these organisations share quite extensive metadata with their content, way beyond depicts, copyright and author. The main benefits I see are the same as for references on Wikipedia: verifiability and credit.

Wikipedia
Allowing users to know that the metadata comes from an organisation creates a level of trust in the information. I think SDC could be widely used and useful on Wikipedia but without references to provide verifiability it seems unlikely it will get used, in the same way Wikidata data without references are blocked on English Wikipedia infoboxes in a lot of situations. Another benefit for Wikipedia specifically is to make creating Wikipedia articles for things depicted on Commons (eg an object in a museum) easier because the references which are collated in SDC can most probably be reused on Wikipedia.

Organisations sharing content:
Many organisations adopt an open license specifically so they can share their content on Wikimedia projects; most of my job in the UN over the last 5 years has been helping orgs adopt open licenses. Generally speaking, organisations who share content on Commons want recognition, metrics around page views, and a clear delineation between their content and Wikimedia community contributions to avoid confusing readers. Having references in SDC will give the organisations credit for the metadata they share and reduce concerns about their content being confused with community contributions, which may be incorrect. It will also encourage them to start using Wikidata and SDC on their own websites, e.g. to provide multilingual labels. The CC0 license for SDC statements is an extra barrier to them adopting open licenses: organisations are generally willing to share content under CC BY or SA, but CC0 is difficult because by its nature it doesn't give them credit for their content. We get around this with Wikidata because we can say 'there will be references so people can see you added this data'. Generally speaking, 'please can you spend a significant amount of time to understand and change your license so you can share your content with us, we won't give you credit for any of it' is really not going to work.

Hope this helps

Thanks again

@Cparle @John_Cummings : Why are you discussing references in this task instead of in T230315 ? This task is about serialization of the data and the fact that we use two different keys (claims vs statements) for the same thing.

I'll copy my answer over to the other task, sorry, didn't realise they were different