[Epic] Signed statements
Open, Normal, Public

Description

We want to enable institutions and people to sign statements in order to say that they indeed state what is in the statement. This would for example be useful for data imported from large institutions.

One possible way to do this:

  • Create a normalized serialization of the Statement's main snak and qualifiers (excluding rank and references, with normalized field order, escaping, whitespace, etc)
  • Make a sha1 hash of this serialization (the claim hash)
  • Compose a human readable text (yaml?) of the following:
    • the claim hash
    • the subject's revision ID (plus possibly the language codes of labels and descriptions that were used to establish the subject's identity)
    • the object's revision ID (if the main snak points to an item); we may also need this for the objects of qualifiers
    • the signer's identity (as given in the signer's certificate)
    • the current date and time
  • sign this text with GPG
  • add the signed text as (part of) a reference to the original statement
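The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the task does not fix a serialization format, so sorted-key JSON stands in for the "normalized serialization", and the field names and the helpers `normalize_statement`, `claim_hash` and `compose_text_to_sign` are hypothetical. The GPG signing step itself is left out.

```python
import hashlib
import json
from datetime import datetime, timezone


def normalize_statement(statement: dict) -> str:
    """Serialize only the main snak and qualifiers, in a stable form.

    Rank and references are deliberately left out, so changing them
    does not change the resulting hash.
    """
    signed_part = {
        "mainsnak": statement["mainsnak"],
        "qualifiers": statement.get("qualifiers", {}),
    }
    # sort_keys and fixed separators give a stable field order and whitespace.
    return json.dumps(
        signed_part, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )


def claim_hash(statement: dict) -> str:
    """SHA-1 of the normalized serialization (the claim hash)."""
    return hashlib.sha1(normalize_statement(statement).encode("utf-8")).hexdigest()


def compose_text_to_sign(statement, subject_revision, object_revision, signer, when=None):
    """Build the human-readable text that would then be signed with GPG."""
    when = when or datetime.now(timezone.utc)
    lines = [
        f"claim-hash: {claim_hash(statement)}",
        f"subject-revision: {subject_revision}",
    ]
    if object_revision is not None:  # only present if the main snak points to an item
        lines.append(f"object-revision: {object_revision}")
    lines.append(f"signer: {signer}")
    lines.append(f"date: {when.isoformat()}")
    return "\n".join(lines) + "\n"
```

The resulting text is small and line-oriented, so it can be pasted into a GPG client for clear-signing.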

Things to keep in mind:

  • The hash of statements linking to items should include the revision ID of the linked item to allow checking against changing the identity of the linked item.
  • adding or changing references does not break the signature
  • adding or changing qualifiers does break the signature
  • changing the label and description of the subject (or the object) does not break the signature, but is detectable (by comparing against the revision ID included in the signed text). If the label or description was changed, a warning should be triggered, since changing the identity of the subject changes the meaning of the statement.
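The distinction between "breaks the signature" and "detectable, triggers a warning" could be sketched like this. The function name, the three status strings, and the idea of passing in the terms (labels and descriptions) as they were at the signed revision are assumptions for illustration, not part of the proposal.

```python
def check_signature_state(
    signed_claim_hash: str,
    current_claim_hash: str,
    terms_at_signed_revision: dict,
    current_terms: dict,
) -> str:
    """Classify a signed statement against the current state of the item.

    - "broken":  main snak or qualifiers changed, so the claim hash differs
    - "warning": the claim is unchanged, but labels/descriptions of the
                 subject (or object) differ from the signed revision
    - "valid":   nothing relevant changed (references and rank are not hashed)
    """
    if signed_claim_hash != current_claim_hash:
        return "broken"
    if terms_at_signed_revision != current_terms:
        return "warning"
    return "valid"
```

For example, relabelling the subject after signing would return "warning" here, while adding a reference would still return "valid".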

Story to sign:

  • enable "sign statement" gadget
  • click "sign this" icon next to a statement
  • see a popup with the following text fields:
    • the serialized claim
    • the text to sign
    • perhaps also the labels and descriptions of the subject and any objects, in the user's languages (as in the term box)
  • optionally, visually verify the serialized claim, and verify that it matches the sha1 hash provided for signing.
  • use copy & paste to sign the text with GPG.
  • copy the signed text back into the field that originally contained the text to sign.
  • click a button to save the signature
  • the server should verify the signature before saving it. Saving a broken signature is pointless.
  • we may have a whitelist of well-known signers, to avoid confusion. The whitelist may be enforced during saving, or just change how a signature is displayed.
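The server-side part of this story (verify before saving, optional whitelist) might look roughly like the sketch below. Everything here is hypothetical: `verify` stands in for real GPG verification (for instance a wrapper around `gpg --verify`) and is assumed to return the signer's key fingerprint or `None`, and `store` is just a list standing in for wherever the signature would be saved.

```python
from typing import Callable, List, Optional, Set


class BrokenSignature(Exception):
    """Raised when the submitted text does not carry a valid signature."""


def save_signature(
    signed_text: str,
    verify: Callable[[str], Optional[str]],
    store: List[dict],
    whitelist: Optional[Set[str]] = None,
    enforce_whitelist: bool = False,
) -> dict:
    """Verify a signed text and store it only if the signature is good."""
    fingerprint = verify(signed_text)
    if fingerprint is None:
        # Saving a broken signature is pointless, so refuse.
        raise BrokenSignature("signature does not verify; refusing to save")

    well_known = whitelist is not None and fingerprint in whitelist
    if enforce_whitelist and not well_known:
        raise PermissionError("signer is not on the whitelist")

    # If the whitelist is not enforced, it only affects how the
    # signature is displayed later (the well_known flag).
    record = {"signed_text": signed_text, "signer": fingerprint, "well_known": well_known}
    store.append(record)
    return record
```

Passing the verifier in as a callable keeps the sketch independent of any particular GPG binding.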

Story to verify:

  • click the "signature" icon next to a statement
  • see a popup with the same text fields as for signing, but with the signed text in a read-only field.
  • use the same steps as for signing to manually verify.

We can of course also verify on the server side, or with JS in the browser. But that can easily be faked. The only way for the user to be sure is to verify manually.

Restricted Application added subscribers: Zppix, Aklapper. Jun 26 2016, 11:15 AM
daniel updated the task description. Jun 28 2016, 8:15 PM
daniel added subscribers: JanZerebecki, Eloquence, Denny.
TomT0m added a subscriber: TomT0m. Jun 29 2016, 5:44 PM
Spinster added a subscriber: Spinster.
jayvdb added a subscriber: jayvdb. Jul 1 2016, 4:19 PM
daniel updated the task description. Jul 2 2016, 10:26 AM

Great idea. When you are at the point where this needs testing by external partners, I'll be happy to help approach some.

This is a great idea, also considering that many Wikipedia users argue that Wikidata isn't vandalism-proof enough. Some things:

  • Should really any user be able to sign statements? I think it would be good to define a new user group which holds the right to do this. In this way the community would be able to decide which institutions are reliable / matching our scope.
  • It would be great to have an indicator next to the signature which states if the statement has been changed since it was approved. Otherwise it's in my eyes very hard to track changes and to estimate the reliability of the signature.

Interesting, maybe this can lead to a distributed truthy bubble (see [:w:en:Filter bubble]), where the user can choose instead of someone else.

We might want to get Wikibase-Quality-External-Validation done first.

Should really any user be able to sign statements? I think it would be good to define a new user group which holds the right to do this. In this way the community would be able to decide which institutions are reliable / matching our scope.

Why not curate a list of reliable/matching signing identities instead?

The problem of doing useful crypto in the browser can be solved in the same way Debian or Tor solve software distribution. The major browsers are working on parts that are necessary for it, though AFAIK no browser is anywhere near Debian yet, and even less is standardized.

Why use a sha1 instead of inlining the normalized serialization in the text to sign?
Why add the current date and time?
Why add the signer's identity?
How do you revoke a signature?
How do you guard against being able to send the user only a selective part of the signatures?
How do you verify what a revision contains and that the revision wasn't changed?

daniel added a comment. Jul 4 2016, 6:43 PM

Why use a sha1 instead of inlining the normalized serialization in the text to sign?

Because that doubles the size of the serialization of a statement.

Why add the current date and time?

For completeness. It's nice to know when something was signed, I think.

Why add the signer's identity?

It should be visible *somewhere*, right?

How do you revoke a signature?

By removing the snak that contains the signature. Or by revoking the key.

How do you guard against being able to send the user only a selective part of the signatures?

Can you elaborate?

How do you verify what a revision contains and that the revision wasn't changed?

By signing parts of that revision. I don't currently have a solution for labels and descriptions, except copying them into the signed text.

Why omit the revision ID of the predicate/property?

Why use a sha1 instead of inlining the normalized serialization in the text to sign?

Because that doubles the size of the serialization of a statement.

Would that part need to be stored?
If the sha1 is used how do you reconstruct what it was composed of?

Why add the current date and time?

For completeness. It's nice to know when something was signed, I think.

Why is the one in the GPG signature not sufficient?

Why add the signer's identity?

It should be visible *somewhere*, right?

Why is the one in the GPG signature not sufficient?

How do you revoke a signature?

By removing the snak that contains the signature.

Then how do you counter replaying the signature?

Or by revoking the key.

What to do about all the other signatures that the signer wants to convey they think are still true?

How do you guard against being able to send the user only a selective part of the signatures?

Can you elaborate?

If you carefully select only some true statements will a conclusion based on them change in your intended way?

How do you verify what a revision contains and that the revision wasn't changed?

By signing parts of that revision. I don't currently have a solution for labels and descriptions, except copying them into the signed text.

Why only parts? How would that be done, by inlining or hashing and chaining or something else?

daniel added a comment. Jul 4 2016, 8:40 PM

Why omit the revision ID of the predicate/property?

Why use a sha1 instead of inlining the normalized serialization in the text to sign?

Because that doubles the size of the serialization of a statement.

Would that part need to be stored?

I would suggest storing it. It's also possible to reconstruct it.

If the sha1 is used how do you reconstruct what it was composed of?

By normalizing the serialization of the statement of the revision in which the signature was added.

Why add the current date and time?

For completeness. It's nice to know when something was signed, I think.

Why is the one in the GPG signature not sufficient?

It is sufficient I think.

Why add the signer's identity?

It should be visible *somewhere*, right?

Why is the one in the GPG signature not sufficient?

It might be sufficient, but it's typically an email address. We may want something more meaningful - maybe even an ItemId.

How do you revoke a signature?

By removing the snak that contains the signature.

Then how do you counter replaying the signature?

The revision ID in the signed text would mismatch the revision in which the signature was added.

However, maybe we don't want to counter this. If someone vandalizes the item by removing the statement, or even just the signature, we want to be able to restore it, I suppose.

Or by revoking the key.

What to do about all the other signatures that the signer wants to convey they think are still true?

I didn't think about revoking individual signatures when I wrote this proposal... Maybe it would be possible to include a token in the signed text, and the signer could maintain a (signed) list of valid or revoked tokens in a well-known location? Maybe the token is a URL that resolves to a (signed) "valid" or "invalid" response? How do other systems solve this?

How do you guard against being able to send the user only a selective part of the signatures?

Can you elaborate?

If you carefully select only some true statements will a conclusion based on them change in your intended way?

It's theoretically possible to construct a misleading data set from cherry-picked signed statements. I don't think this method could be used to make it appear that a signer asserted something they did not intend to assert. Do you have an idea how such a situation could be constructed?

How do you verify what a revision contains and that the revision wasn't changed?

By signing parts of that revision. I don't currently have a solution for labels and descriptions, except copying them into the signed text.

Why only parts? How would that be done, by inlining or hashing and chaining or something else?

Right, we should probably include the hash of the entire revision in the signed text. That's probably the easiest way, we even have that pre-computed in the database.

Why use a sha1 instead of inlining the normalized serialization in the text to sign?

Because that doubles the size of the serialization of a statement.

Would that part need to be stored?

I would suggest storing it. It's also possible to reconstruct it.

If the sha1 is used how do you reconstruct what it was composed of?

By normalizing the serialization of the statement of the revision in which the signature was added.

That makes it impossible to detach it, i.e. signatures need to be in the items. That limits the number of signatures that can easily be handled. Is that your intention?

Why add the signer's identity?

It should be visible *somewhere*, right?

Why is the one in the GPG signature not sufficient?

It might be sufficient, but it's typically an email address. We may want something more meaningful - maybe even an ItemId.

No, it is not an email address, see https://tools.ietf.org/html/rfc4880#section-5.2.3.5 . The SHOULD NOT in https://tools.ietf.org/html/rfc4880#section-3.3 needs consideration, though. I think that RFC gets it right that the signer's identity is not bound directly to anything besides crypto; binding to more info can be done by indirection through a signature.

How do you revoke a signature?

By removing the snak that contains the signature.

Then how do you counter replaying the signature?

The revision ID in the signed text would mismatch the revision in which the signature was added.

However, maybe we don't want to counter this. If someone vandalizes the item by removing the statement, or even just the signature, we want to be able to restore it, I suppose.

If replaying signatures is not countered, then revocation does not work cryptographically. If revocation is not allowed, then how will a signer correct errors?

Or by revoking the key.

What to do about all the other signatures that the signer wants to convey they think are still true?

I didn't think about revoking individual signatures when I wrote this proposal... Maybe it would be possible to include a token in the signed text, and the signer could maintain a (signed) list of valid or revoked tokens in a well-known location? Maybe the token is a URL that resolves to a (signed) "valid" or "invalid" response? How do other systems solve this?

GPG certifications (what is used to sign someone's key) can be revoked. Synchronization is done through keyservers, though I don't know how far the weaknesses in that weaken the rest of all this. Do you? See also https://en.wikipedia.org/wiki/Outside_Context_Problem , https://en.wikipedia.org/wiki/Eventual_consistency and https://en.wikipedia.org/wiki/Revocation_list#Problems_with_CRLs .

How do you guard against being able to send the user only a selective part of the signatures?

Can you elaborate?

If you carefully select only some true statements will a conclusion based on them change in your intended way?

It's theoretically possible to construct a misleading data set from cherry-picked signed statements. I don't think this method could be used to make it appear that a signer asserted something they did not intend to assert. Do you have an idea how such a situation could be constructed?

Theoretically? Is lying by omission only theory?

By normalizing the serialization of the statement of the revision in which the signature was added.

That makes it impossible to detach it, i.e. signatures need to be in the items. That limits the number of signatures that can easily be handled. Is that your intention?

I don't see why that would make it impossible to detach it.

Storing the signed text as a reference in the statement is indeed a scalability issue. It's a cost I would be willing to pay to keep implementation simple. As the proposal stands, it could even be implemented as a Gadget, with no backend support. The assumption was that the use case would be a handful of well-known authorities signing things in their domain. If we want to scale this to hundreds of signatures per statement, so people can build a "truthy bubble", a bit of dedicated server-side infrastructure would be needed. If the need arises, I'm all for it, but we should start small and simple.

It might be sufficient, but it's typically an email address. We may want something more meaningful - maybe even an ItemId.

No, it is not an email address, see https://tools.ietf.org/html/rfc4880#section-5.2.3.5 . The SHOULD NOT in https://tools.ietf.org/html/rfc4880#section-3.3 needs consideration, though. I think that RFC gets it right that the signer's identity is not bound directly to anything besides crypto; binding to more info can be done by indirection through a signature.

That indirection is what I had in mind. But if there are better mechanisms than the one in the original proposal, sure, let's use them.

However, maybe we don't want to counter this. If someone vandalizes the item by removing the statement, or even just the signature, we want to be able to restore it, I suppose.

If replaying signatures is not countered, then revocation does not work cryptographically. If revocation is not allowed, then how will a signer correct errors?

I was thinking about the issue while I was typing. I think revocable tokens (in the form of URLs) are the best way to achieve this, but that requires some infrastructure on the side of the signer. An alternative to tokens would be to use many separate subkeys that can be revoked without too much collateral damage. If the signer uses one subkey per 100 statements, they would have to re-sign 99 statements to revoke one.
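To make the two revocation options a bit more concrete, here is a small sketch. The `token` field in the signed text, the shape of the revocation list, and both function names are assumptions for illustration; fetching and verifying the signer's (signed) revocation list is not shown.

```python
def is_revoked(signed_fields: dict, revoked_tokens: set) -> bool:
    """Token option: a signature is revoked if its token is on the signer's list.

    signed_fields is the parsed signed text; "token" is a hypothetical field,
    for example a URL the signer controls.
    """
    return signed_fields.get("token") in revoked_tokens


def statements_to_resign(statements_per_subkey: int, revoked: int = 1) -> int:
    """Subkey option: revoking a subkey invalidates every signature made with it,
    so all statements that should stay signed must be signed again."""
    return max(statements_per_subkey - revoked, 0)
```

With one subkey per 100 statements, `statements_to_resign(100)` gives 99, matching the figure above; with tokens, only the one signature is affected.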

GPG certifications (what is used to sign someones key) can be revoked. Synchronization is done through keyservers, though I don't know how far the weaknesses in that weakens the rest of all this. Do you? See also https://en.wikipedia.org/wiki/Outside_Context_Problem , https://en.wikipedia.org/wiki/Eventual_consistency and https://en.wikipedia.org/wiki/Revocation_list#Problems_with_CRLs .

No, I have not looked at the details. As far as I know, the infrastructure exists, but doesn't work too well in practice.

It's theoretically possible to construct a misleading data set from cherry-picked signed statements. I don't think this method could be used to make it appear that a signer asserted something they did not intend to assert. Do you have an idea how such a situation could be constructed?

Theoretically? Is lying by omission only theory?

No, it's common practice. The question is if it's practical in this context.

We could support signing groups of statements, maybe even across items, to prevent this. That gets a lot more complicated, though. Keeping the spec for signed statements simple is important for getting it adopted. If we can keep it open for cross-statement signatures, great! One option would be to have sections in the signed text, one per statement ID (that is, ItemID + statement UUID). Then you can easily sign multiple statements together. But then there is no obvious place to store the signature for now.
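A signed text with one section per statement ID could be composed along these lines. The section layout (`[ItemID$uuid]` header followed by a `claim-hash:` line) and the function name are assumptions; the input is assumed to map each statement ID to its already-normalized serialization.

```python
import hashlib


def compose_multi_statement_text(claims: dict) -> str:
    """Build one text to sign that covers several statements.

    claims maps a statement ID (ItemID + "$" + statement UUID) to the
    normalized serialization of that statement. Sections are sorted by
    statement ID so the same set of statements always yields the same text.
    """
    sections = []
    for statement_id in sorted(claims):
        digest = hashlib.sha1(claims[statement_id].encode("utf-8")).hexdigest()
        sections.append(f"[{statement_id}]\nclaim-hash: {digest}\n")
    return "\n".join(sections)
```

Because the sections are sorted, the text does not depend on the order in which the statements were selected for signing.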

What is the goal here?

daniel added a comment. Jul 5 2016, 2:00 PM

What is the goal here?

a) enabling an authority to cryptographically assert that a given statement is derived from and consistent with their data.

b) enabling anyone to cryptographically assert that a given statement is derived from and consistent with some given source.

Both should allow for an elevated level of trust in the assumption that the statements we have are actually backed by the sources cited, without having to check manually.

Thinking about it, at least for (b), the source reference in question (or a hash thereof) needs to be part of the signed text.

What is the difference between a and b?

Lydia_Pintscher added a subscriber: johl.
daniel added a comment. Jul 6 2016, 4:35 PM

What is the difference between a and b?

On the technical side, for (a) we might want to restrict who can sign, using a whitelist or a permission. Or we say we only allow signing with keys we (the community?) trust.

The main difference is in the interpretation (verifying against a 3rd party vs a 3rd party authority asserting something) and scale (a few dozen authorities vs thousands of random people).

The cryptography part specifies a technical solution. What is the reason or goal that led you to specify it?

Scott_WUaS added a subscriber: Scott_WUaS.
abian added a subscriber: abian. Aug 21 2016, 10:48 AM

I find it better to sign references, not statements. The idea would be to say "I've checked this reference and I assert that it is real and consistent with these data", as Wikidata is a secondary database. We could continue signing references like "stated in" (P248). If the value or the reference changes, the signature should automatically disappear (but if vandalism is reverted, the signature should appear again).

We want to enable institutions and people to sign statements
in order to say that they indeed state what is in the statement.

Isn't that a recipe for Wikibase being used as a primary source repository?

Or is the goal to store verification that a statement correctly reflects what is in a reference/source, is it giving special status to the institution who wrote the source?
Who will decide which institutions are allowed to sign 'their' facts?
If it is an open system, usable for any author of facts, how will the system manage a paper with 100+ authors (with many different roles in authoring, which are rarely publicly disclosed)? And what happens when the paper is found to be bollocks? I hope you don't expect a uni to sign the statements from their faculty members. Or is academia excluded?

And once it is in, there will be statements that look very authoritative and appear to suggest that an institution is *the* source for a fact. Who is going to check that they didn't plagiarise someone else's work?

Or is this a system for Wikimedia-approved (state-funded) institutions to vote for a fact, with 250+ state-sponsored bodies signing the statement that Taiwan is part of China, and 10 signing a statement that it is a nation?

I9606 added a subscriber: I9606. Oct 5 2016, 4:50 PM

Just to chime in that this would be great, but that there needs to be a way to do this via the API. A user gadget is not sufficient. (Perhaps this is there already and I missed it in the long comment thread.) Our use case would involve large imports of data (hundreds of thousands of claims) from databases. Having this capacity to verify would make these data more trustworthy and the databases happier.

Abbe98 added a subscriber: Abbe98. Oct 5 2016, 5:34 PM
DarTar added a subscriber: DarTar. Oct 6 2016, 1:34 AM
Micru added a subscriber: Micru. Dec 1 2016, 4:12 PM
Glorian_Yapinus added a comment. Edited Dec 16 2016, 10:09 AM

To get a better picture of what should be developed in this task, I have created a mockup for this.

You can find the mockup below

Please note the presentation notes that I have put on some slides, which describe the corresponding mockup. Also, I have made most of the links and buttons clickable to make it easier to grasp the interaction between the mockups.

Tpt added a subscriber: Tpt. May 11 2017, 2:28 PM
Quoth added a subscriber: Quoth. Jul 14 2017, 10:46 AM
Jonas added a subscriber: Jonas. Mar 1 2018, 4:30 PM
AdityaJ added a subscriber: AdityaJ. Mar 2 2018, 6:28 PM

Hi All,
Can I know what the microtasks for this project are?

Hey @AdityaJ Great to hear you are interested in the project. I think as a first step it would be great if you install Wikibase and get it running.

This message is for students interested in working on this project for Google-Summer-of-Code (2018)

  • Student application deadline is March 27 16:00 UTC.
  • If you have questions about eligibility, please read the GSoC rules thoroughly here https://summerofcode.withgoogle.com/rules/. Wikimedia will not be responsible for verifying your eligibility and also not be able to make any decisions on this. For any clarifying questions, please email gsoc-support@google.com
  • Ensure that by now you have already discussed your implementation approach with your mentors, completed a few bugs/microtasks and made a plan to move forward with the proposal
  • I encourage you to start creating your proposals on Phabricator now to receive timely feedback on them from mentors. Do not wait until the last minute. Give your mentors at least a week's time to review your proposal, so that you could then incorporate any suggestions for changes. Learn how to submit a proposal in our participant's guide: https://www.mediawiki.org/wiki/Google_Summer_of_Code/Participants (Step 9)
  • Only proposals that contain links to patches successfully merged before the application period, and that are submitted on both Phabricator and the GSoC portal, will be considered for the review process. So, between now and the application deadline, you could consider working on this task.
  • If you would like to chat with me more about the process or have questions, come and talk to me in the Zulip chat: https://wikimedia.zulipchat.com/

I have gone through the "signed document mockup" shared above and I have the following picture in my mind. Please correct me where I am wrong:

  1. We are not concerned about data being altered during transmission, for example by a man in the middle. That means the receiver is supposed to hash the document again on its side and verify it against the supplied signature. This means it is still possible to alter the data and replace the signature accordingly during transmission.
  2. We want to supply the original digital signature with the article and let the receiver verify the integrity of the document. In the case of web pages, JavaScript may do that job.
  3. We will need to create some API endpoint to verify the claims. The signature can be passed to this API to view information about the signer.
  4. We also need to maintain records of signers to verify claims.

  1. We are not concerned about data being altered during transmission, for example by a man in the middle. That means the receiver is supposed to hash the document again on its side and verify it against the supplied signature. This means it is still possible to alter the data and replace the signature accordingly during transmission.

Yes, correct.

  2. We want to supply the original digital signature with the article and let the receiver verify the integrity of the document. In the case of web pages, JavaScript may do that job.

Yes. Though the document may be individual statements.

  3. We will need to create some API endpoint to verify the claims. The signature can be passed to this API to view information about the signer.

Possibly yes, or see if there is a standard way to do this in the browser.

  4. We also need to maintain records of signers to verify claims.

That should be done via the usual PGP web of trust.