
[Epic] Signed statements
Open, MediumPublic

Assigned To
None
Authored By
Lydia_Pintscher
Jun 26 2016, 11:15 AM
Referenced Files
F5070845: Signed Statements Mockup.odp
Dec 16 2016, 10:09 AM

Description

We want to enable institutions and people to sign statements in order to say that they indeed state what is in the statement. This would for example be useful for data imported from large institutions.

One possible way to do this:

  • Create a normalized serialization of the Statement's main snak and qualifiers (excluding rank and references, with normalized field order, escaping, whitespace, etc)
  • Make a sha1 hash of this serialization (the claim hash)
  • Compose a human readable text (yaml?) of the following:
    • the claim hash
    • the subject's revision ID (plus possibly the language codes of labels and descriptions that were used to establish the subject's identity)
    • the object's revision ID (if the main snak points to an item). (We may also need this for the objects of qualifiers.)
    • the signer's identity (as given in the signer's certificate)
    • the current date and time
  • sign this text with GPG
  • add the signed text as (part of) a reference to the original statement
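The first steps of the list above could be sketched roughly as follows. This is a minimal illustration, not Wikibase code: the function names, the use of JSON with sorted keys as the "normalized serialization", and the field names in the text to sign are all assumptions. The GPG step is left to the signer's own tooling.

```python
import hashlib
import json
from datetime import datetime, timezone


def claim_hash(statement: dict, algorithm: str = "sha1") -> str:
    """Hash only the main snak and qualifiers, ignoring rank and references."""
    core = {
        "mainsnak": statement["mainsnak"],
        "qualifiers": statement.get("qualifiers", {}),
    }
    # Sorted keys and fixed separators give a stable byte sequence.
    canonical = json.dumps(
        core, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.new(algorithm, canonical.encode("utf-8")).hexdigest()


def text_to_sign(statement, subject_rev, object_rev, signer, when=None):
    """Compose the human-readable text that the signer passes to GPG."""
    when = when or datetime.now(timezone.utc)
    lines = [
        f"claim-hash: {claim_hash(statement)}",
        f"subject-revision: {subject_rev}",  # hypothetical field name
        f"object-revision: {object_rev}",    # hypothetical field name
        f"signer: {signer}",
        f"date: {when.isoformat()}",
    ]
    return "\n".join(lines) + "\n"
```

The `algorithm` parameter defaults to SHA-1 only because the description names it; any algorithm known to `hashlib` can be passed instead.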

Things to keep in mind:

  • The hash of statements linking to items should include the revision ID of the linked item to allow checking against changing the identity of the linked item.
  • adding or changing references does not break the signature
  • adding or changing qualifiers does break the signature
  • changing the label and description of the subject (or the object) does not break the signature, but is detectable (by comparing against the revision ID included in the signed text). If the label or description was changed, a warning should be triggered, since changing the identity of the subject changes the meaning of the statement.
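The revision check in the last point might look like the sketch below. It assumes the signed text uses simple `key: value` lines and that the field names (`subject-revision`, `object-revision`) are as shown; the ticket does not fix either, so both are hypothetical. A differing revision ID only shows that the entity was edited after signing, so a real check would go on to compare the labels and descriptions of the two revisions.

```python
def parse_signed_text(text: str) -> dict:
    """Parse 'key: value' lines into a dict (assumed format)."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


def identity_warnings(signed_text: str, current_revisions: dict) -> list:
    """Return a warning for each entity whose revision differs from the signed one.

    current_revisions maps a field name in the signed text to the
    entity's current revision ID.
    """
    signed = parse_signed_text(signed_text)
    warnings = []
    for field, current in current_revisions.items():
        if field in signed and signed[field] != str(current):
            warnings.append(
                f"{field} changed: signed {signed[field]}, now {current}"
            )
    return warnings
```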

Story to sign:

  • enable "sign statement" gadget
  • click "sign this" icon next to a statement
  • see a popup with the following text fields:
    • the serialized claim
    • the text to sign
    • perhaps also the labels and descriptions of the subject and any objects, in the user's languages (as in the term box)
  • optionally, visually verify the serialized claim, and verify that it matches the sha1 hash provided for signing.
  • use copy & paste to sign the text with GPG.
  • copy the signed text back into the field that originally contained the text to sign.
  • click a button to save the signature
  • the server should verify the signature before saving it. Saving a broken signature is pointless.
  • we may have a whitelist of well-known signers, to avoid confusion. The whitelist may be enforced during saving, or just change how a signature is displayed.

Story to verify:

  • click the "signature" icon next to a statement
  • see a popup with the same text fields as for signing, but with the signed text in a read-only field.
  • use the same steps as for signing to manually verify.

We can of course also verify on the server side, or with JS in the browser. But that can easily be faked. The only way for the user to be sure is to verify manually.

Event Timeline


I find it better to sign references, and not statements. The idea would be to say "I've checked this reference and I confirm that it is real and consistent with these data", as Wikidata is a secondary database. We could continue signing references like "stated in" (P248). If the value or the reference changes, the signature should automatically disappear (but if vandalism is reverted, the signature should appear again).

We want to enable institutions and people to sign statements
in order to say that they indeed state what is in the statement.

Isn't that a recipe for Wikibase being used as a primary source repository?

Or is the goal to store verification that a statement correctly reflects what is in a reference/source, is it giving special status to the institution who wrote the source?
Who will decide which institutions are allowed to sign 'their' facts?
If it is an open system, usable by any author of facts, how will the system manage a paper with 100+ authors (with many different roles in authoring, which are rarely publicly disclosed)? And what happens when the paper is found to be bollocks? I hope you don't expect a uni to sign the statements from their faculty members. Or is academia excluded?

And once it is in, there will be statements that look very authoritative and appear to suggest that an institution is *the* source for a fact. Who is going to check that they didn't plagiarise someone else's work?

Or is this a system for Wikimedia-approved (state-funded) institutions to vote for a fact, with 250+ state-sponsored bodies signing the statement that Taiwan is part of China, and 10 signing a statement that it is a nation.

Just to chime in that this would be great, but that there needs to be a way to do this via the API. A user gadget is not sufficient. (Perhaps this is there already and I missed it in the long comment thread.) Our use case would involve large imports of data (hundreds of thousands of claims) from databases. Having this capacity to verify would make these data more trustworthy and the databases happier.

To get a better picture of what should be developed in this task, I have created a mockup for this.

You can find the mockup below

Please note the presenter notes I have put on some slides, which describe the corresponding mockup. I have also made most of the links and buttons clickable, to make it easier to grasp the interaction between the mockups.

Hi All,
Can I know what are the microtasks for this project?

Hey @AdityaJ Great to hear you are interested in the project. I think as a first step it would be great if you install Wikibase and get it running.

This message is for students interested in working on this project for Google-Summer-of-Code (2018)

  • Student application deadline is March 27 16:00 UTC.
  • If you have questions about eligibility, please read the GSoC rules thoroughly here https://summerofcode.withgoogle.com/rules/. Wikimedia will not be responsible for verifying your eligibility and also not be able to make any decisions on this. For any clarifying questions, please email gsoc-support@google.com
  • Ensure that by now you have already discussed your implementation approach with your mentors, completed a few bugs/microtasks and made a plan to move forward with the proposal
  • I encourage you to start creating your proposals on Phabricator now to receive timely feedback on them from mentors. Do not wait until the last minute. Give your mentors at least a week's time to review your proposal, so that you could then incorporate any suggestions for changes. Learn how to submit a proposal in our participant's guide: https://www.mediawiki.org/wiki/Google_Summer_of_Code/Participants (Step 9)
  • Only proposals that contain links to patches successfully merged before the application period, and that are submitted on both Phabricator and the GSoC portal, will be considered for the review process. So, between now and the application deadline, you could consider working on this task.
  • If you would like to chat with me more about the process or have questions, come and talk to me in the Zulip chat: https://wikimedia.zulipchat.com/

I have gone through the "Signed Statements Mockup" shared above and I have the following picture in my mind. Please correct me where I am wrong:

  1. We are not concerned about data getting altered during the transmission, by the man in the middle or something. That means the receiver is supposed to hash the document again at its side and verify it against the supplied signature. This means it is possible to alter data and put according to signature during the transmission
  2. We want to supply original digital signature with the article and let receiver verify the integrity of the document. In case of web pages, javascript may do that job.
  3. We will need to create some API endpoint to verify the claims. In this API signature can be passed to view the information of signer.
  4. We also need to maintain the records of signers to verify claims

  1. We are not concerned about data getting altered during the transmission, by the man in the middle or something. That means the receiver is supposed to hash the document again at its side and verify it against the supplied signature. This means it is possible to alter data and put according to signature during the transmission

Yes, correct.

  2. We want to supply original digital signature with the article and let receiver verify the integrity of the document. In case of web pages, javascript may do that job.

Yes. Though the document may be individual statements.

  3. We will need to create some API endpoint to verify the claims. In this API signature can be passed to view the information of signer.

Possibly yes or see if there is a standard way to do this in the browser.

  4. We also need to maintain the records of signers to verify claims

That should be done via the usual PGP web of trust.


This looks interesting to me. I would like to work on this during GSoC 19 period if possible. Any microtasks or approach for which someone can guide me?

Anyone?

I'm so sorry! Only saw your ping now. I'd love to have a chat with you. Can you send me an email at lydia.pintscher@wikimedia.de so we don't clutter up the ticket here?

Hello!!!
I have been studying this 2019 GSoC project and I would like to get some microtasks or guidance for it.

@Lydia_Pintscher Hello, could I please have some guidance or microtasks for this project?

Unfortunately I can't find a mentor for this project this time around. I am very sorry :(

Note that "click 'sign this' icon next to a statement" implies a fundamentally insecure and broken process. You don't sign something after it is uploaded; you sign it before, while it is still on your own machine. The JSON code snippet should be signed, and then a provenance for the statement including that snippet should be provided.

The process proposed by the task's creator, with use of GPG, does not make sense unless it is expected that the data integrity would be compromised on the server. In that case the data should be signed by WMF, not by the uploader.

When something is signed by the uploader as described by the task's creator, it is nothing more than an additional check on the integrity of the uploaded data and an additional notice about the identity of the uploader, possibly referring to an additional federation of identity.

If something is signed _after_ it is uploaded, then we will be spanked by security researchers. :)

Note that "click 'sign this' icon next to a statement" implies a fundamentally insecure and broken process. You don't sign something after it is uploaded; you sign it before, while it is still on your own machine. The JSON code snippet should be signed, and then a provenance for the statement including that snippet should be provided.

You are completely right and entirely wrong about this :) Of course, you can only sign something that you have on your computer. But that doesn't mean there can't be a "sign this" button that sends me the JSON, lets me sign it locally, and then sends the signature back and attaches it to the statement. That's the idea here.

Sorry, but this is not the way you should do it. This assumes the uploader in fact reads and understands the schema (s)he is signing, but that never works. It is also insecure, as it opens the door to a man-in-the-middle attack.

If you want to do this, please use known secure processes!

Note that there are several options for doing PGP/GPG signing and encryption in the browser. One example in JavaScript is OpenPGP.js, but it is probably better to use the Web Cryptography API if available. (It is almost universally available now, and should be used.)

During upload
Given I have edited a statement
And I have provided a private key (to my browser)
When I publish the edits
Then the usual arguments are wrapped in a container
And the wrapped container is signed

Key management is a problem, as you must use a private key to sign a document, and keep the private key in an insecure environment (i.e. the browser). This is like begging for problems, as it is almost too easy to make an exploit.

Note also that I believe the existence of available keys is the only thing that matters, and if they exist then they should be used. That means no additional buttons: you provide the keys, then the interface will use those keys to sign the uploads.

On the server
Given an API request arrives
When it is wrapped in a signed container
Then check the signature
And unwrap the arguments
And create a faux request
And append the original signed container to the revision
(And add a reference to the revision as provenance for any changed statement)

The previous should in fact be the same no matter if it is statements on Wikidata or content on Wikipedia.
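The client and server steps described above might be sketched like this. Everything here is illustrative: the container layout (`payload` plus `signature`) and the function names are assumptions, and the demo signer uses HMAC with a shared secret purely as a stand-in so the sketch runs with the standard library alone. A real deployment would plug in a public-key signature (for example OpenPGP) through the same `sign` and `verify` callables.

```python
import hashlib
import hmac
import json


def wrap(arguments: dict, sign) -> dict:
    """Client side: serialize the API arguments and attach a signature."""
    payload = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    return {"payload": payload, "signature": sign(payload.encode("utf-8"))}


def unwrap(container: dict, verify) -> dict:
    """Server side: check the signature, then recover the original arguments."""
    payload = container["payload"]
    if not verify(payload.encode("utf-8"), container["signature"]):
        raise ValueError("signature check failed; request rejected")
    return json.loads(payload)


# Stand-in signer for demonstration only: HMAC with a shared secret.
# This is NOT a public-key signature and proves nothing about identity.
DEMO_KEY = b"demo-key-not-for-production"


def demo_sign(data: bytes) -> str:
    return hmac.new(DEMO_KEY, data, hashlib.sha256).hexdigest()


def demo_verify(data: bytes, signature: str) -> bool:
    return hmac.compare_digest(demo_sign(data), signature)


container = wrap({"action": "wbsetclaim", "claim": "{}"}, demo_sign)
arguments = unwrap(container, demo_verify)  # the "faux request" arguments
```

The server would keep `container` unchanged and attach it to the revision, so that a later reader can repeat the `unwrap` check independently.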

During reading/verification
Given I read a statement
When I click "provenance"
Then I am shown a list of edits to this statement
And some of them have a notice "signed by …"
And a link to the actual revision
And the revision has the original wrapped container with the digital signature

Note that when you (or someone else) check the signed contribution, the complete container with the signature is available. There is no need to visually inspect anything. The wrapped container could even be verified on the client machine: either it verifies or it does not, and the result can be shown. It is not necessary to show the whole changeset.

(Another problem is showing which revision supports the current state of the statement, and whether a statement with added qualifiers can still be said to be signed by the same signer. I believe that a signature is assigned to the diff of a revision, and all changes that break with that diff invalidate the later version of the statement. Thus the signer makes no statement about the added qualifier, but if the provision follows the statement (that is, also its qualifiers), then adding more qualifiers will invalidate the signature. This could however be mitigated by showing what part of the statement a signature covers.)

This is quite simple to implement in various scripting languages, as it requires no additional requests to a remote server. It only requires a repackaging of the existing arguments.

Always sign or encode on your local machine before sending anything anywhere, don't sign or encrypt (!) anything someone claims to be the same, especially not if it is an opaque digest. Especially not if it is Unicode, but that is another (and quite funny) discussion.

(Further variations at a Google document.)

It sounds like there are some differences being discussed above between signing edits/revisions as they are made and signing statements that are already saved.
The story described in this ticket is the latter, and can be done without the former. (Optionally, both could be done.)

In my opinion, this signing of a hash of an extract of an already uploaded statement seems extremely hackish. I still believe this should be reconsidered.

Anyhow, SHA-1 is completely broken. Schneier reported in 2005 that it was theoretically broken ("SHA-1 Broken"); in 2015 a full attack on SHA-1 was demonstrated ("The SHAppening"); and in 2017 came the first collision ("SHAttered"). Since then the complexity of a chosen-prefix collision (CP-collision) is down to 2⁶¹ – 2⁶³.

SHA-1 is a Shambles - First Chosen-Prefix Collision on SHA-1 and Application to the PGP Web of Trust

There are other problems too, and I wonder if this will create a false sense of quality.

In https://www.wikidata.org/wiki/Wikidata:Property_proposal/approval_of_subject, it is proposed that an agreement may be appended to a signature, for example when an explicit waiver of privacy is needed and explicit agreement to any third-party reuse is required.

Just curious, have there been any recent updates on this?

One use case where I believe this would be particularly useful is batch data donations from GLAM institutions, so that provided statements can be signed by the provider. To this end I believe a version of signed statements could already be achieved using wbsetclaim along with new reference properties to represent signature information. I was planning on proposing said properties but wanted to check in and make sure this approach made sense. :)

For illustration purposes, I put together a demo schema based on the original proposal by @Lydia_Pintscher.

And some example items using this schema:

Create a normalized serialization of the Statement's main snak and qualifiers (excluding rank and references, with normalized field order, escaping, whitespace, etc)

As a first draft for this, I would propose the following:

  • Remove irrelevant fields ("hash", "rank", etc.) as well as other references aside from the new signing statement
  • Sort JSON by field names alphabetically
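A rough sketch of these two steps, assuming the claim is available as a Python dict parsed from the Wikibase JSON. The set of fields to drop is an assumption (the list above only names "hash" and "rank" before "etc."), and dropping unrelated references is not handled here.

```python
import json

# Assumed list of irrelevant fields; extend as needed.
IRRELEVANT_FIELDS = frozenset({"hash", "rank"})


def strip_fields(value, drop=IRRELEVANT_FIELDS):
    """Recursively drop the named keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {k: strip_fields(v, drop) for k, v in value.items() if k not in drop}
    if isinstance(value, list):
        return [strip_fields(v, drop) for v in value]
    return value


def serialize_claim(claim: dict) -> str:
    """Return the claim as JSON with irrelevant fields removed and keys sorted."""
    return json.dumps(
        strip_fields(claim), sort_keys=True, indent=2, ensure_ascii=False
    )
```

Sorting happens at every nesting level because `json.dumps(sort_keys=True)` applies to nested objects as well; list order (for example within "qualifiers-order") is left untouched.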

For example, in order to sign and upload the claim "Q214700 is an instructor for the field of work wikipedia", the claim would be serialized as follows:

{
  "id": "Q214700$4ef81e12-4396-820f-cd8f-46498d66d380",
  "mainsnak": {
    "datatype": "wikibase-item",
    "datavalue": {
      "type": "wikibase-entityid",
      "value": {
        "entity-type": "item",
        "id": "Q40742",
        "numeric-id": 40742
      }
    },
    "property": "P204",
    "snaktype": "value"
  },
  "qualifiers": {
    "P128": [
      {
        "datatype": "wikibase-item",
        "datavalue": {
          "type": "wikibase-entityid",
          "value": {
            "entity-type": "item",
            "id": "Q11",
            "numeric-id": 11
          }
        },
        "property": "P128",
        "snaktype": "value"
      }
    ]
  },
  "qualifiers-order": [
    "P128"
  ],
  "references": [
    {
      "snaks": {
        "P95707": [
          {
            "datatype": "string",
            "datavalue": {
              "type": "string",
              "value": "538357"
            },
            "property": "P95707",
            "snaktype": "value"
          }
        ],
        "P95708": [
          {
            "datatype": "string",
            "datavalue": {
              "type": "string",
              "value": "Q40742/123060"
            },
            "property": "P95708",
            "snaktype": "value"
          },
          {
            "datatype": "string",
            "datavalue": {
              "type": "string",
              "value": "Q11/520357"
            },
            "property": "P95708",
            "snaktype": "value"
          }
        ],
        "P95711": [
          {
            "datatype": "wikibase-item",
            "datavalue": {
              "type": "wikibase-entityid",
              "value": {
                "entity-type": "item",
                "id": "Q214701",
                "numeric-id": 214701
              }
            },
            "property": "P95711",
            "snaktype": "value"
          }
        ],
        "P95712": [
          {
            "datatype": "url",
            "datavalue": {
              "type": "string",
              "value": "https://test-signed-statement.org"
            },
            "property": "P95712",
            "snaktype": "value"
          }
        ]
      },
      "snaks-order": [
        "P95707",
        "P95708",
        "P95711",
        "P95712",
        "P95705"
      ]
    }
  ],
  "type": "statement"
}

This serialized JSON would be used to generate the statement hash, which is then added under references.snaks like so:

"P95705": [
  {
    "datatype": "string",
    "datavalue": {
      "type": "string",
      "value": "this-is-also-a-fake-hash"
    },
    "property": "P95705",
    "snaktype": "value"
  }
]

Note: In order to avoid potential vandalism in the signing reference itself I believe all reference properties aside from the hash itself should be part of the serialization.
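Checking such a statement later might work as sketched below: remove the stored hash snak, recompute the hash over everything else, and compare. The property ID is taken from the example above; the function names and the use of SHA-256 over compact sorted-key JSON are assumptions, since the draft does not name a hash algorithm or an exact byte format.

```python
import copy
import hashlib
import json

HASH_PROPERTY = "P95705"  # statement-hash property from the example above


def statement_hash(claim: dict) -> str:
    """Hash the claim with any stored hash snaks removed from its references."""
    stripped = copy.deepcopy(claim)
    for reference in stripped.get("references", []):
        reference.get("snaks", {}).pop(HASH_PROPERTY, None)
    canonical = json.dumps(stripped, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stored_hashes(claim: dict) -> list:
    """Collect the hash values recorded under HASH_PROPERTY."""
    return [
        snak["datavalue"]["value"]
        for reference in claim.get("references", [])
        for snak in reference.get("snaks", {}).get(HASH_PROPERTY, [])
    ]


def is_intact(claim: dict) -> bool:
    """True if a stored hash matches the hash recomputed from the claim."""
    return statement_hash(claim) in stored_hashes(claim)
```

Because the other reference snaks stay inside the hashed data, editing any of them makes `is_intact` return False, which is the vandalism protection the note describes.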

This signed statement could then be uploaded via wbsetclaim (with a minor modification to merge with existing references).

Happy to hear any ideas/comments/suggestions to improve this approach - this is admittedly a rather rough draft. :)

In order to provide more context and consolidate discussion on this topic I put together the following RfC: Signed Statements (T138708)

Thank you so much, @ElanHR! As for the current status: It's been blocked internally due to a perceived lack of need by some people. Your support here and the RfC definitely help push it up the priority list.

I hope 'push[ing] it up the priority list' is not the case. The RfC would need to attract consensus, or even interest, first. What appear to be OWNer-ish facets of signed statements seem antithetical to WD's openness, at least, to me.

Related, since I think it could be easier for us to interact with GPG there than in the browser:
https://gitlab.gnome.org/World/Daty/-/issues/35