[Epic] Signed statements
Open, NormalPublic

Description

We want to enable institutions and people to sign statements in order to say that they indeed state what is in the statement. This would for example be useful for data imported from large institutions.

One possible way to do this:

  • Create a normalized serialization of the Statement's main snak and qualifiers (excluding rank and references, with normalized field order, escaping, whitespace, etc)
  • Make a sha1 hash of this serialization (the claim hash)
  • Compose a human readable text (yaml?) of the following:
    • the claim hash
    • the subject's revision ID (plus possibly the language codes of labels and descriptions that were used to establish the subject's identity)
    • the object's revision ID (if the main snak points to an item); we may also need this for the objects of qualifiers
    • the signer's identity (as given in the signer's certificate)
    • the current date and time
  • sign this text with GPG
  • add the signed text as (part of) a reference to the original statement
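The steps above can be sketched in Python as follows. This is a minimal illustration, not a specification: the JSON-based normalization, the field names in the text to sign, and the helper names `normalize_claim`, `claim_hash` and `text_to_sign` are all assumptions, and the statement data is made up. The GPG step itself is left to an external tool.

```python
import hashlib
import json
from datetime import datetime, timezone


def normalize_claim(statement: dict) -> str:
    """Serialize main snak and qualifiers only, in a deterministic form.

    Rank and references are deliberately left out. Sorted keys and fixed
    separators stand in for the "normalized field order, escaping,
    whitespace" mentioned in the proposal (an assumption, not a spec).
    """
    signed_part = {
        "mainsnak": statement["mainsnak"],
        "qualifiers": statement.get("qualifiers", {}),
    }
    return json.dumps(signed_part, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=True)


def claim_hash(statement: dict) -> str:
    """sha1 hex digest of the normalized serialization (the claim hash)."""
    return hashlib.sha1(normalize_claim(statement).encode("utf-8")).hexdigest()


def text_to_sign(statement, subject_rev, object_rev, signer, when=None):
    """Compose the human-readable text that would be handed to GPG."""
    when = when or datetime.now(timezone.utc)
    lines = [
        f"claim-hash: {claim_hash(statement)}",
        f"subject-revision: {subject_rev}",
    ]
    if object_rev is not None:  # only present if the main snak points to an item
        lines.append(f"object-revision: {object_rev}")
    lines.append(f"signer: {signer}")
    lines.append(f"date: {when.isoformat()}")
    return "\n".join(lines) + "\n"


# Hypothetical statement; the IDs and revision numbers are illustrative only.
statement = {
    "mainsnak": {"property": "P19", "value": {"entity-type": "item", "id": "Q64"}},
    "qualifiers": {},
    "rank": "normal",
    "references": [],
}
print(text_to_sign(statement, 1001, 2002, "Example Institution"))
```

The printed text could then be signed outside the script, for example with `gpg --clearsign`, and the result stored as part of a reference.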

Things to keep in mind:

  • The hash of statements linking to items should include the revision ID of the linked item to allow checking against changing the identity of the linked item.
  • adding or changing references does not break the signature
  • adding or changing qualifiers does break the signature
  • changing the label and description of the subject (or the object) does not break the signature, but is detectable (by comparing against the revision ID included in the signed text). If the label or description was changed, a warning should be triggered, since changing the identity of the subject changes the meaning of the statement.
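The properties listed above can be checked mechanically. The sketch below assumes a JSON-based normalization and a hypothetical callback `terms_changed_since` that reports whether the subject's label or description changed between two revisions; neither is part of the proposal as written.

```python
import hashlib
import json


def claim_hash(statement: dict) -> str:
    # Only main snak and qualifiers are hashed (assumed JSON normalization).
    part = {"mainsnak": statement["mainsnak"],
            "qualifiers": statement.get("qualifiers", {})}
    data = json.dumps(part, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(data.encode("utf-8")).hexdigest()


def check(statement, signed_hash, signed_subject_rev, current_subject_rev,
          terms_changed_since):
    """Return (signature_intact, warnings).

    `terms_changed_since` is a hypothetical callback; a real system would
    compare labels and descriptions between the two revisions.
    """
    intact = claim_hash(statement) == signed_hash
    warnings = []
    if (current_subject_rev != signed_subject_rev
            and terms_changed_since(signed_subject_rev, current_subject_rev)):
        warnings.append("subject label/description changed since signing")
    return intact, warnings


# Illustrative data only.
original = {"mainsnak": {"property": "P19", "value": "Q64"},
            "qualifiers": {}, "references": []}
signed = claim_hash(original)

with_reference = dict(original, references=[{"P248": "Q1"}])
with_qualifier = dict(original, qualifiers={"P585": "2016-07-02"})

# Adding a reference leaves the hash intact; adding a qualifier does not.
print(check(with_reference, signed, 1001, 1001, lambda a, b: False))
print(check(with_qualifier, signed, 1001, 1001, lambda a, b: False))
# A changed label keeps the signature intact but yields a warning.
print(check(original, signed, 1001, 1005, lambda a, b: True))
```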

Story to sign:

  • enable "sign statement" gadget
  • click "sign this" icon next to a statement
  • see a popup with the following text fields:
    • the serialized claim
    • the text to sign
    • perhaps also the labels and descriptions of the subject and any objects, in the user's languages (as in the term box)
  • optionally, visually verify the serialized claim, and verify that it matches the sha1 hash provided for signing.
  • use copy & paste to sign the text with GPG.
  • copy the signed text back into the field that originally contained the text to sign.
  • click a button to save the signature
  • the server should verify the signature before saving it. Saving a broken signature is pointless.
  • we may have a whitelist of well-known signers, to avoid confusion. The whitelist may be enforced during saving, or may just change how a signature is displayed.
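The last two points of the story (verify before saving, optional whitelist) could look roughly like this on the server. The actual GPG check is injected as a callable, because how the server would invoke GPG is not specified here; `VerifyResult`, `save_signature` and `fake_verify` are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VerifyResult:
    valid: bool
    fingerprint: Optional[str] = None  # key fingerprint reported by the verifier


def save_signature(signed_text: str,
                   verify: Callable[[str], VerifyResult],
                   known_signers: set,
                   enforce_whitelist: bool = False) -> dict:
    """Refuse broken signatures; either enforce or merely record the whitelist."""
    result = verify(signed_text)
    if not result.valid:
        raise ValueError("refusing to save a broken signature")
    well_known = result.fingerprint in known_signers
    if enforce_whitelist and not well_known:
        raise PermissionError("signer is not on the whitelist")
    # `well_known` can later drive how the signature is displayed.
    return {"signed_text": signed_text,
            "fingerprint": result.fingerprint,
            "well_known": well_known}


def fake_verify(text: str) -> VerifyResult:
    # Stand-in for a real GPG check: it only looks for the clearsign header.
    if "-----BEGIN PGP SIGNED MESSAGE-----" in text:
        return VerifyResult(True, "HYPOTHETICAL-FINGERPRINT")
    return VerifyResult(False)


record = save_signature("-----BEGIN PGP SIGNED MESSAGE-----\n...",
                        fake_verify, {"HYPOTHETICAL-FINGERPRINT"})
print(record["well_known"])
```

Passing `enforce_whitelist=True` corresponds to enforcing the list during saving; the default only records whether the signer is well known.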

Story to verify:

  • click the "signature" icon next to a statement
  • see a popup with the same text fields as for signing, but with the signed text in a read-only field.
  • use the same steps as for signing to manually verify.

We can of course also verify on the server side, or with JS in the browser. But that can easily be faked. The only way for the user to be sure is to verify manually.

Event Timeline

jayvdb added a subscriber: jayvdb.Jul 1 2016, 4:19 PM
daniel updated the task description. Jul 2 2016, 10:26 AM

Great idea. When you are at the point where this needs testing by external partners, I'll be happy to help approach some.

This is a great idea, also considering that many Wikipedia users argue that Wikidata isn't vandalism-proof enough. Some things:

  • Should really any user be able to sign statements? I think it would be good to define a new user group which holds the right to do this. In this way the community would be able to decide which institutions are reliable / matching our scope.
  • It would be great to have an indicator next to the signature which states if the statement has been changed since it was approved. Otherwise it's in my eyes very hard to track changes and to estimate the reliability of the signature.

Interesting, maybe this can lead to a distributed truthy bubble (see [:w:en:Filter bubble]), where the user can choose instead of someone else.

We might want to get Wikibase-Quality-External-Validation done first.

Should really any user be able to sign statements? I think it would be good to define a new user group which holds the right to do this. In this way the community would be able to decide which institutions are reliable / matching our scope.

Why not curate a list of reliable/matching signing identities instead?

The useful crypto in the browser part can be solved in the same way Debian or Tor solve software distribution. The major browsers are working on parts that are necessary for it, though AFAIK no browser is anywhere near Debian yet, much less standardized.

Why use a sha1 instead of inlining the normalized serialization in the text to sign?
Why add the current date and time?
Why add the signer's identity?
How do you revoke a signature?
How do you guard against being able to send the user only a selective part of the signatures?
How do you verify what a revision contains and that the revision wasn't changed?

daniel added a comment.Jul 4 2016, 6:43 PM

Why use a sha1 instead of inlining the normalized serialization in the text to sign?

Because that doubles the size of the serialization of a statement.

Why add the current date and time?

For completeness. It's nice to know when something was signed, I think.

Why add the signer's identity?

It should be visible *somewhere*, right?

How do you revoke a signature?

By removing the snak that contains the signature. Or by revoking the key.

How do you guard against being able to send the user only a selective part of the signatures?

Can you elaborate?

How do you verify what a revision contains and that the revision wasn't changed?

By signing parts of that revision. I don't currently have a solution for labels and descriptions, except copying them into the signed text.

Why omit the revision ID of the predicate/property?

Why use a sha1 instead of inlining the normalized serialization in the text to sign?

Because that doubles the size of the serialization of a statement.

Would that part need to be stored?
If the sha1 is used how do you reconstruct what it was composed of?

Why add the current date and time?

For completeness. It's nice to know when something was signed, I think.

Why is the one in the GPG signature not sufficient?

Why add the signer's identity?

It should be visible *somewhere*, right?

Why is the one in the GPG signature not sufficient?

How do you revoke a signature?

By removing the snak that contains the signature.

Then how do you counter replaying the signature?

Or by revoking the key.

What to do about all the other signatures that the signer wants to convey they think are still true?

How do you guard against being able to send the user only a selective part of the signatures?

Can you elaborate?

If you carefully select only some true statements will a conclusion based on them change in your intended way?

How do you verify what a revision contains and that the revision wasn't changed?

By signing parts of that revision. I don't currently have a solution for labels and descriptions, except copying them into the signed text.

Why only parts? How would that be done, by inlining or hashing and chaining or something else?

daniel added a comment.Jul 4 2016, 8:40 PM

Why omit the revision ID of the predicate/property?

Why use a sha1 instead of inlining the normalized serialization in the text to sign?

Because that doubles the size of the serialization of a statement.

Would that part need to be stored?

I would suggest to store it. It's also possible to reconstruct it.

If the sha1 is used how do you reconstruct what it was composed of?

By normalizing the serialization of the statement of the revision in which the signature was added.

Why add the current date and time?

For completeness. It's nice to know when something was signed, I think.

Why is the one in the GPG signature not sufficient?

It is sufficient I think.

Why add the signer's identity?

It should be visible *somewhere*, right?

Why is the one in the GPG signature not sufficient?

It might be sufficient, but it's typically an email address. We may want something more meaningful - maybe even an ItemId.

How do you revoke a signature?

By removing the snak that contains the signature.

Then how do you counter replaying the signature?

The revision ID in the signed text would mismatch the revision in which the signature was added.

However, maybe we don't want to counter this. If someone vandalizes the item by removing the statement, or even just the signature, we want to be able to restore it, I suppose.

Or by revoking the key.

What to do about all the other signatures that the signer wants to convey they think are still true?

I didn't think about revoking individual signatures when I wrote this proposal... Maybe it would be possible to include a token in the signed text, and the signer could maintain a (signed) list of valid or revoked tokens in a well-known location? Maybe the token is a URL that resolves to a (signed) "valid" or "invalid" response? How do other systems solve this?

How do you guard against being able to send the user only a selective part of the signatures?

Can you elaborate?

If you carefully select only some true statements will a conclusion based on them change in your intended way?

It's theoretically possible to construct a misleading data set from cherry-picked signed statements. I don't think this method could be used to make it appear that a signer asserted something they did not intend to assert. Do you have an idea how such a situation could be constructed?

How do you verify what a revision contains and that the revision wasn't changed?

By signing parts of that revision. I don't currently have a solution for labels and descriptions, except copying them into the signed text.

Why only parts? How would that be done, by inlining or hashing and chaining or something else?

Right, we should probably include the hash of the entire revision in the signed text. That's probably the easiest way, we even have that pre-computed in the database.

Why use a sha1 instead of inlining the normalized serialization in the text to sign?

Because that doubles the size of the serialization of a statement.

Would that part need to be stored?

I would suggest to store it. It's also possible to reconstruct it.

If the sha1 is used how do you reconstruct what it was composed of?

By normalizing the serialization of the statement of the revision in which the signature was added.

That makes it impossible to detach it, i.e. signatures need to be in the items. That limits the amount of signatures that are easily handled, is that your intention?

Why add the signer's identity?

It should be visible *somewhere*, right?

Why is the one in the GPG signature not sufficient?

It might be sufficient, but it's typically an email address. We may want something more meaningful - maybe even an ItemId.

No, it is not an email address, see https://tools.ietf.org/html/rfc4880#section-5.2.3.5 . The SHOULD NOT in https://tools.ietf.org/html/rfc4880#section-3.3 needs consideration, though. I think that RFC gets it right that the signer's identity is not bound directly to anything besides crypto; binding to more info can be done by indirection through a signature.

How do you revoke a signature?

By removing the snak that contains the signature.

Then how do you counter replaying the signature?

The revision ID in the signed text would mismatch the revision in which the signature was added.
However, maybe we don't want to counter this. If someone vandalizes the item by removing the statement, or even just the signature, we want to be able to restore it, I suppose.

If replaying signatures is not countered then cryptographically revocation is not working. If revocation is not allowed, then how will a signer correct errors?

Or by revoking the key.

What to do about all the other signatures that the signer wants to convey they think are still true?

I didn't think about revoking individual signatures when I wrote this proposal... Maybe it would be possible to include a token in the signed text, and the signer could maintain a (signed) list of valid or revoked tokens in a well-known location? Maybe the token is a URL that resolves to a (signed) "valid" or "invalid" response? How do other systems solve this?

GPG certifications (what is used to sign someone's key) can be revoked. Synchronization is done through keyservers, though I don't know how far the weaknesses in that weaken the rest of all this. Do you? See also https://en.wikipedia.org/wiki/Outside_Context_Problem , https://en.wikipedia.org/wiki/Eventual_consistency and https://en.wikipedia.org/wiki/Revocation_list#Problems_with_CRLs .

How do you guard against being able to send the user only a selective part of the signatures?

Can you elaborate?

If you carefully select only some true statements will a conclusion based on them change in your intended way?

It's theoretically possible to construct a misleading data set from cherry-picked signed statements. I don't think this method could be used to make it appear that a signer asserted something they did not intend to assert. Do you have an idea how such a situation could be constructed?

Theoretically? Is lying by omission only theory?

By normalizing the serialization of the statement of the revision in which the signature was added.

That makes it impossible to detach it, i.e. signatures need to be in the items. That limits the amount of signatures that are easily handled, is that your intention?

I don't see why that would make it impossible to detach it.

Storing the signed text as a reference in the statement is indeed a scalability issue. It's a cost I would be willing to pay to keep implementation simple. As the proposal stands, it could even be implemented as a Gadget, with no backend support. The assumption was that the use case would be a handful of well-known authorities signing things in their domain. If we want to scale this to hundreds of signatures per statement, so people can build a "truthy bubble", a bit of dedicated server-side infrastructure would be needed. If the need arises, I'm all for it, but we should start small and simple.

It might be sufficient, but it's typically an email address. We may want something more meaningful - maybe even an ItemId.

No, it is not an email address, see https://tools.ietf.org/html/rfc4880#section-5.2.3.5 . The SHOULD NOT in https://tools.ietf.org/html/rfc4880#section-3.3 needs consideration, though. I think that RFC gets it right that the signer's identity is not bound directly to anything besides crypto; binding to more info can be done by indirection through a signature.

That indirection is what I had in mind. But if there are better mechanisms than the one in the original proposal, sure, let's use them.

However, maybe we don't want to counter this. If someone vandalizes the item by removing the statement, or even just the signature, we want to be able to restore it, I suppose.

If replaying signatures is not countered then cryptographically revocation is not working. If revocation is not allowed, then how will a signer correct errors?

I was thinking about the issue while I was typing. I think revocable tokens (in the form of URLs) are the best way to achieve this. But it requires some infrastructure on the side of the signer. An alternative to tokens would be to use many separate subkeys that can be revoked without too much collateral damage. If the signer uses one subkey per 100 statements, they would have to re-sign 99 statements to revoke one.
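A rough sketch of both ideas, under loud assumptions: the `revocation-token:` field name, the token URL and the in-memory revocation set are hypothetical (in practice the list would be fetched from the signer and its own signature checked), and `resign_count` only restates the arithmetic from the paragraph above.

```python
from typing import Optional


def extract_token(signed_text: str) -> Optional[str]:
    """Find the (hypothetical) 'revocation-token: <url>' line in the signed text."""
    for line in signed_text.splitlines():
        key, sep, value = line.partition(": ")
        if sep and key == "revocation-token":
            return value.strip()
    return None


def is_revoked(signed_text: str, revoked_tokens: set) -> bool:
    """True if the text carries a token that the signer has listed as revoked."""
    token = extract_token(signed_text)
    return token is not None and token in revoked_tokens


def resign_count(statements_per_subkey: int) -> int:
    """Statements to re-sign after revoking one subkey to withdraw one statement."""
    return max(statements_per_subkey - 1, 0)


# Illustrative values only.
text = "claim-hash: 0000\nrevocation-token: https://signer.example/tokens/17\n"
print(is_revoked(text, {"https://signer.example/tokens/17"}))
print(resign_count(100))
```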

GPG certifications (what is used to sign someones key) can be revoked. Synchronization is done through keyservers, though I don't know how far the weaknesses in that weakens the rest of all this. Do you? See also https://en.wikipedia.org/wiki/Outside_Context_Problem , https://en.wikipedia.org/wiki/Eventual_consistency and https://en.wikipedia.org/wiki/Revocation_list#Problems_with_CRLs .

No, I have not looked at the details. As far as I know, the infrastructure exists, but doesn't work too well in practice.

It's theoretically possible to construct a misleading data set from cherry-picked signed statements. I don't think this method could be used to make it appear that a signer asserted something they did not intend to assert. Do you have an idea how such a situation could be constructed?

Theoretically? Is lying by omission only theory?

No, it's common practice. The question is if it's practical in this context.

We could support signing groups of statements, maybe even across items, to prevent this. That gets a lot more complicated, though. Keeping the spec for signed statements simple is important for getting it adopted. If we can keep it open for cross-statement signatures, great! One option would be to have sections in the signed text, one per statement ID (that is, ItemID + statement uuid). Then you can easily sign multiple statements together. But then there is no obvious place to store the signature for now.
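One way to picture the "sections per statement ID" option is sketched below. The section syntax, the field names and the `grouped_text_to_sign` helper are assumptions made for illustration, and the statement IDs and hashes are made up.

```python
def grouped_text_to_sign(claim_hashes: dict, signer: str) -> str:
    """Build one text with a section per statement ID, in sorted order.

    `claim_hashes` maps a statement ID (ItemID + statement uuid) to the
    claim hash of that statement.
    """
    lines = [f"signer: {signer}"]
    for statement_id in sorted(claim_hashes):
        lines.append("")                      # blank line between sections
        lines.append(f"[{statement_id}]")     # section header (assumed syntax)
        lines.append(f"claim-hash: {claim_hashes[statement_id]}")
    return "\n".join(lines) + "\n"


# Illustrative IDs and hashes only.
hashes = {
    "Q64$00000000-0000-0000-0000-000000000002": "b" * 40,
    "Q42$00000000-0000-0000-0000-000000000001": "a" * 40,
}
print(grouped_text_to_sign(hashes, "Example Institution"))
```

Sorting the sections keeps the text deterministic, so the same group of statements always produces the same text to sign.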

What is the goal here?

daniel added a comment.Jul 5 2016, 2:00 PM

What is the goal here?

a) enabling an authority to cryptographically assert that a given statement is derived from and consistent with their data.

b) enabling anyone to cryptographically assert that a given statement is derived from and consistent with some given source.

Both should allow for an elevated level of trust in the assumption that the statements we have are actually backed by the sources cited, without having to check manually.

Thinking about it, at least for (b), the source reference in question (or a hash thereof) needs to be part of the signed text.

What is the difference between a and b?

Lydia_Pintscher added a subscriber: johl.
daniel added a comment.Jul 6 2016, 4:35 PM

What is the difference between a and b?

On the technical side, for (a) we might want to restrict who can sign, using a whitelist or a permission. Or we say we only allow signing with keys we (the community?) trust.

The main difference is in the interpretation (verifying against a 3rd party vs a 3rd party authority asserting something) and scale (a few dozen authorities vs thousands of random people).

The cryptography part specifies a technical solution. What is the reason or goal that led you to specify it?

Scott_WUaS added a subscriber: Scott_WUaS.
abian added a subscriber: abian.Aug 21 2016, 10:48 AM

I find it better to sign references, and not statements. The idea would be to say "I've checked out this reference and I ensure that this reference is real and consistent with these data", as Wikidata is a secondary database. We could continue signing references like "stated in" (P248). If the value or the reference changes, the signature should automatically disappear (but, in case of a reverted vandalism, the signature should also appear again).
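The "disappear on change, reappear on revert" behaviour falls out naturally if the signature is bound to a digest of the value plus the reference and visibility is recomputed from the current data. A minimal sketch, assuming a JSON-based normalization and hypothetical helper names:

```python
import hashlib
import json


def reference_digest(value, reference) -> str:
    """Digest over the statement value together with one reference."""
    data = json.dumps({"value": value, "reference": reference},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(data.encode("utf-8")).hexdigest()


def signature_visible(signed_digest: str, value, reference) -> bool:
    """Show the signature only while the current data matches what was signed."""
    return reference_digest(value, reference) == signed_digest


# Illustrative data only.
value = "Q64"
reference = {"P248": "Q1"}
signed = reference_digest(value, reference)

print(signature_visible(signed, value, reference))        # unchanged data
print(signature_visible(signed, "Q999", reference))       # vandalized value
print(signature_visible(signed, value, reference))        # after the revert
```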

We want to enable institutions and people to sign statements
in order to say that they indeed state what is in the statement.

Isn't that a recipe for wikibase being used as a primary source repository?

Or is the goal to store verification that a statement correctly reflects what is in a reference/source, is it giving special status to the institution who wrote the source?
Who will decide which institutions are allowed to sign 'their' facts?
If it is an open system, usable for any author of facts, how will the system manage a paper with 100+ authors (with many different roles in authoring, and rarely is this publicly disclosed)? And what happens when the paper is found to be bollocks? I hope you don't expect a uni to sign the statements from their faculty members. Or is academia excluded?

And once it is in, there will be statements that look very authoritative which appear to suggest that an institution is *the* source for a fact - who is going to check they didn't plagiarise someone else's work?

Or is this a system for Wikimedia-approved (state-funded) institutions to vote for a fact, with 250+ state-sponsored bodies signing the statement that Taiwan is part of China, and 10 signing a statement that it is a nation.

I9606 added a subscriber: I9606.Oct 5 2016, 4:50 PM

Just to chime in that this would be great, but that there needs to be a way to do this via the API. A user gadget is not sufficient. (Perhaps this is there already and I missed it in the long comment thread there). Our use case would involve large imports of data (100s of thousands of claims) from databases. Having this capacity to verify would make these data more trustworthy and the databases happier.

Abbe98 added a subscriber: Abbe98.Oct 5 2016, 5:34 PM
Glorian_Yapinus added a comment.EditedDec 16 2016, 10:09 AM

To get a better picture of what should be developed in this task, I have created a mockup for this.

You can find the mockup below

Please note the presentation notes that I have put on some slides, which describe the corresponding mockup. I have also made most of the links and buttons clickable, to make it easier to grasp the interaction between the mockups.

Tpt added a subscriber: Tpt.May 11 2017, 2:28 PM
Quoth added a subscriber: Quoth.Jul 14 2017, 10:46 AM
AdityaJ added a subscriber: AdityaJ.Mar 2 2018, 6:28 PM

Hi All,
May I know what the microtasks for this project are?

Hey @AdityaJ Great to hear you are interested in the project. I think as a first step it would be great if you install Wikibase and get it running.

This message is for students interested in working on this project for Google-Summer-of-Code (2018)

  • Student application deadline is March 27 16:00 UTC.
  • If you have questions about eligibility, please read the GSoC rules thoroughly here https://summerofcode.withgoogle.com/rules/. Wikimedia will not be responsible for verifying your eligibility and also not be able to make any decisions on this. For any clarifying questions, please email gsoc-support@google.com
  • Ensure that by now you have already discussed your implementation approach with your mentors, completed a few bugs/microtasks and made a plan to move forward with the proposal
  • I encourage you to start creating your proposals on Phabricator now to receive timely feedback on them from mentors. Do not wait until the last minute. Give your mentors at least a week's time to review your proposal, so that you could then incorporate any suggestions for changes. Learn how to submit a proposal in our participant's guide: https://www.mediawiki.org/wiki/Google_Summer_of_Code/Participants (Step 9)
  • Proposals that contain links to successfully merged patches before the application period and submitted on both Phabricator and GSoC portal will only be considered for the review process. So, between now and the application deadline, you could consider working on this task.
  • If you would like to chat with me more about the process or have questions, come and talk to me in the Zulip chat: https://wikimedia.zulipchat.com/

I have gone through the "signed document mockup" shared above and I have the following picture in my mind. Please correct me where I am wrong:

  1. We are not concerned about data getting altered during the transmission, by the man in the middle or something. That means the receiver is supposed to hash the document again at its side and verify it against the supplied signature. This means it is possible to alter data and put according to signature during the transmission
  2. We want to supply original digital signature with the article and let receiver verify the integrity of the document. In case of web pages, javascript may do that job.
  3. We will need to create some API endpoint to verify the claims. In this API signature can be passed to view the information of signer.
  4. We also need to maintain the records of signers to verify claims
  1. We are not concerned about data getting altered during the transmission, by the man in the middle or something. That means the receiver is supposed to hash the document again at its side and verify it against the supplied signature. This means it is possible to alter data and put according to signature during the transmission

Yes, correct.

  2. We want to supply original digital signature with the article and let receiver verify the integrity of the document. In case of web pages, javascript may do that job.

Yes. Though the document may be individual statements.

  3. We will need to create some API endpoint to verify the claims. In this API signature can be passed to view the information of signer.

Possibly yes or see if there is a standard way to do this in the browser.

  4. We also need to maintain the records of signers to verify claims

That should be done via the usual PGP web of trust.

Hjfocs added a subscriber: Hjfocs.Dec 14 2018, 10:40 AM
Mineo added a subscriber: Mineo.Jan 1 2019, 1:09 PM
AndrewSu removed a subscriber: AndrewSu.
AndrewSu added a subscriber: AndrewSu.
Cirdan added a subscriber: Cirdan.Jan 30 2019, 7:51 PM

This looks interesting to me. I would like to work on this during GSoC 19 period if possible. Any microtasks or approach for which someone can guide me?

This looks interesting to me. I would like to work on this during GSoC 19 period if possible. Any microtasks or approach for which someone can guide me?

Anyone?

I'm so sorry! Only saw your ping now. I'd love to have a chat with you. Can you send me an email at lydia.pintscher@wikimedia.de so we don't clutter up the ticket here?

Hello!!!
Please, I have been studying this 2019 GSoC project and I would like to get some microtasks or guidance for this project.

AdityaJ removed a subscriber: AdityaJ.Mar 7 2019, 4:07 AM

@Lydia_Pintscher Hello please can I have some guidance or micro-tasks about this project?

Unfortunately I can't find a mentor for this project this time around. I am very sorry :(

ok I see thanks for the reply.

jeblad added a subscriber: jeblad.EditedMar 12 2019, 11:21 AM

Note that "click 'sign this' icon next to a statement" implies a fundamentally insecure and broken process. You don't sign something after it is uploaded, you sign it before and while it is still on your own machine. The JSON code snippet should be signed, and then a provenance for the statement including that snippet should be provided.

The process proposed by the task's creator, with use of gpg, does not make sense unless it is expected that the data integrity would be compromised on the server. In that case the data should be signed by WMF, not by the uploader.

When something is signed by the uploader as described by the task's creator, it is nothing more than an additional check on the integrity of the uploaded data and an additional notice about the identity of the uploader, possibly referring to an additional federation of identity.

If something is signed _after_ it is uploaded, then we will be spanked by security researchers. :)

Note that "click 'sign this' icon next to a statement" implies a fundamentally insecure and broken process. You don't sign something after it is uploaded, you sign it before and while it is still on your own machine. The JSON code snippet should be signed, and then a provenance for the statement including that snippet should be provided.

You are completely right and entirely wrong about this :) Of course, you can only sign something that you have on your computer. But that doesn't mean there can't be a "sign this" button that sends me the JSON, lets me sign it locally, and then sends the signature back and attaches it to the statement. That's the idea here.

Sorry, but this is not the way you should do it. This assumes the uploader in fact reads and understands the schema (s)he is signing, but that never works. It is also insecure, as it opens the door to a man-in-the-middle attack.

If you want to do this, please use known secure processes!

jeblad added a comment.EditedMar 12 2019, 12:41 PM

Note that there are several options to do PGP/GPG signing and encryption in the browser. One example in JavaScript is OpenPGP.js, but it is probably better to use the Web Cryptography API if available. (It is almost universally available now, and should be used.)

During upload
Given I have edited a statement
And I have provided a private key (to my browser)
When I publish the edits
Then the usual arguments are wrapped in a container
And the wrapped container is signed

Key management is a problem, as you must use a private key to sign a doc, and keep the private key in an insecure environment (i.e. the browser). This is like begging for problems, as it is almost too easy to make an exploit.

Note also that I believe the existence of available keys is the only thing that matters, and if they exist then they should be used. That means no additional buttons: you provide the keys, then the interface will use those keys to sign the uploads.

On the server
Given an API request arrives
When it is wrapped in a signed container
Then check the signature
And unwrap the arguments
And create a faux request
And append the original signed container to the revision
(And add a reference to the revision as provenance for any changed statement)

The previous should in fact be the same no matter if it is statements on Wikidata or content on Wikipedia.
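The wrap, sign, check and unwrap flow from the two scenarios above can be sketched as follows. Important caveat: to stay within the Python standard library, this sketch uses an HMAC with a shared key as a stand-in for the signature, which is not what the comment proposes (that would be a public-key signature made with the uploader's private key). The container layout and the function names are likewise assumptions.

```python
import hashlib
import hmac
import json


def wrap_and_sign(arguments: dict, key: bytes) -> dict:
    """Client side: wrap the usual API arguments in a container and sign it.

    HMAC-SHA256 stands in for a real public-key signature here.
    """
    payload = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    tag = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": tag}


def handle_request(container: dict, key: bytes, revision_log: list) -> dict:
    """Server side: check the signature, unwrap, and keep the container."""
    expected = hmac.new(key, container["payload"].encode("utf-8"),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, container["signature"]):
        raise ValueError("signature check failed")
    arguments = json.loads(container["payload"])  # basis for the faux request
    revision_log.append(container)                # original container kept with the revision
    return arguments


# Illustrative key and arguments only.
key = b"hypothetical-shared-key"
log = []
container = wrap_and_sign({"action": "wbsetclaim", "claim": "{...}"}, key)
print(handle_request(container, key, log))
```

Because the signed payload is stored unchanged, a reader can later re-run the same check against the stored container without trusting any re-serialization by the server.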

During reading/verification
Given I read a statement
When I click "provenance"
Then I am shown a list of edits to this statement
And some of them have a notice "signed by …"
And a link to the actual revision
And the revision has the original wrapped container with the digital signature

Note that when you (or someone else) check the signed contribution, the complete container with the signature is available. There is no need to visually inspect anything. The wrapped container could even be verified on the client machine; either it is verified or it is not, and the result can be provided. It is not necessary to show the whole changeset.

(Another problem is showing which revision supports the current state of the statement, and whether a statement with added qualifiers can still be said to be signed by the same signer. I believe that a signature is assigned to the diff of a revision, and all changes that break with that diff invalidate the later version of the statement. Thus the signer makes no statement about the added qualifier, but if the provision follows the statement (aka also qualifiers), then adding more qualifiers will invalidate the signature. This could however be mitigated by showing what part of the statement a signature covers.)

This is quite simple to implement in various scripting languages, as it requires no additional requests to a remote server. It only requires a repackaging of the existing arguments.

Always sign or encode on your local machine before sending anything anywhere, don't sign or encrypt (!) anything someone claims to be the same, especially not if it is an opaque digest. Especially not if it is Unicode, but that is another (and quite funny) discussion.

(Further variations at a Google document.)

Ayack added a subscriber: Ayack.Mar 28 2019, 10:19 AM

It sounds like there are some differences being discussed above regarding signing of edits / revisions as they are made vs. signing of statements that are already saved.
The story described in this ticket is the latter, and can be done without the former. (Optionally they could both be done).