[Story] Add a new datatype for linking to creators of artwork and more (smart URI)
Open, Medium, Public

Description

For structured data support for Wikimedia Commons we need to enable people to identify for example the creator of an image. This creator can be among others:

  • a user on Commons
  • a user on Flickr
  • a person represented by an item on Wikidata

The person needs to be identified by a plain text name, to satisfy legal requirements. In addition, a URL (or URI) can be given to uniquely identify the person (or legal entity).

So we need a new Wikibase datatype that covers them all so statements can be made using the same property.

Our current thinking: This would be a new datatype representing a plain text name and an optional URI. The user interface should be "smart" about well known types of URLs - for example, it could have a selector that offers Commons user pages, Flickr user pages and a bunch of others as a drop-down.

Whether the URL would be part of the main data value, or whether it should be provided as a qualifier, is not yet decided. If we use qualifiers, a more complex widget is needed that allows the main value and the relevant qualifier to be edited together. It seems difficult to integrate this nicely with the generic display and editing UI for qualifiers.

Event Timeline

+1 sounds like a workable design

It would also require having different properties to express the same thing based on where it is which is something I want to avoid.

Change 281585 had a related patch set uploaded (by Eileen):
Fix source field on partial refunds

https://gerrit.wikimedia.org/r/281585

The above patch is not related to this task.

Change 281585 merged by jenkins-bot:
Fix source field on partial refunds

https://gerrit.wikimedia.org/r/281585

Whatever happens, just make sure to cover the case where a creator is covered by all three of these: they have a user account, but uploaded the file on Flickr, and have a Wikidata item about themselves.

Users may be renamed, so we need a special page that links to user pages from user IDs.

Izno removed a subscriber: Izno.

Has anyone considered the alternative of having a "creator type" property, qualified by one of the three kinds of ID discussed above?

Alternatively, have one type, which takes either an existing Wikidata item or one of the items "Flickr user" or "Wikimedia user", and the latter two have username qualifiers?

What if the creator is none of these? For example, a non notable individual who puts images on their own website, with a CC licence?

Has anyone considered the alternative of having a "creator type" property, qualified by one of the three kinds of ID discussed above?

Alternatively, have one type, which takes either an existing Wikidata item or one of the items "Flickr user" or "Wikimedia user", and the latter two have username qualifiers?

Yeah it comes up again and again in discussions but I don't believe this is a nice way to go about modeling this when we can do it much more cleanly and user friendly.

What if the creator is none of these? For example, a non notable individual who puts images on their own website, with a CC licence?

We need an "other" option at least that lets people enter a link to some other profile that identifies them.

Daniel, Thiemo: Can you list what other objections you had in the last meeting? I want to move forward with this and @Ladsgroup said he would be willing to work on it.

We looked at some examples and the result was that we need to have a way for people to also enter a link text where they can set the real name of a person for example. This will then be the link text when displaying the value.

The central idea is: We want to be able to use the same property for all the different ways we may want to use to identify a person.

During our last discussion it turned out that the primary information we need to include is a plain text name, to cover the legal requirements of some (most?) licenses. The URI/URL should be optional, but we want a UI that makes it easy to add one at least for the common cases (Wikimedia user, Wikidata item, ORCID, VIAF, etc), to provide a nice editing interface.

The user interface should make entry easy: after entering a name, the user should be able to search for that name in some well known places (Wikimedia user, Wikidata item, ORCID, VIAF, etc) to determine the URI, or enter the URI directly. When entering a URI directly, we may want the UI to detect well known prefixes of URIs for Wikimedia user, Wikidata item, ORCID, VIAF, etc.
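A minimal sketch of the prefix detection described above, assuming a hand-maintained table of well known URI prefixes. The table contents, the `classify_uri` name and the returned labels are illustrative assumptions; which services are actually supported is not decided in this task.

```python
from typing import Optional, Tuple

# Hypothetical prefix table; the real list of supported services is an open question.
KNOWN_PREFIXES = {
    "https://commons.wikimedia.org/wiki/User:": "Commons user",
    "https://www.wikidata.org/entity/": "Wikidata item",
    "https://www.flickr.com/people/": "Flickr user",
    "https://orcid.org/": "ORCID",
    "https://viaf.org/viaf/": "VIAF",
}


def classify_uri(uri: str) -> Optional[Tuple[str, str]]:
    """Return (kind, local identifier) for a well known URI, or None otherwise."""
    for prefix, kind in KNOWN_PREFIXES.items():
        if uri.startswith(prefix):
            # Everything after the prefix is treated as the local identifier.
            return kind, uri[len(prefix):].strip("/")
    return None


print(classify_uri("https://www.wikidata.org/entity/Q42"))  # ('Wikidata item', 'Q42')
print(classify_uri("https://example.org/~melissa"))         # None
```

A URI that matches no prefix would simply be stored as entered, which corresponds to the "other" option mentioned earlier in the discussion.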

The simplest approach would be to use a plain string value to represent the name, and a qualifier for the URI. We could even use different properties (with data type URL or external ID) for qualifiers that represent the different kinds of IDs. However, this has two major disadvantages:

  1. Clients will have to know about the "special" qualifiers in order to determine the URI for the person in question.
  2. To allow nice entry and editing of both the name and the URL/URI, the UI would need to closely integrate the editing of the main value with editing a qualifier. It's unclear to me how to do this in a way that does not interfere with the "generic" way to edit qualifiers, causing confusion.

The alternative is to introduce a new value type containing a plain text name and an optional URI. This adds complexity to the data model, but avoids the need for "arcane" knowledge in the UI and in clients. It also avoids special case magic in the UI for handling statements that have some special qualifiers.
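To make the new value type concrete, here is a rough sketch, assuming a value that carries a mandatory plain text name and an optional URI. The class name `PersonValue`, the `"person"` type string and the JSON layout are all hypothetical; nothing like this exists in Wikibase yet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PersonValue:
    """Hypothetical value type: mandatory plain text name, optional URI."""

    name: str
    uri: Optional[str] = None

    def __post_init__(self) -> None:
        # The name carries the legal attribution, so it must not be empty.
        if not self.name.strip():
            raise ValueError("a plain text name is required")

    def to_json(self) -> dict:
        # Assumed serialization; the key names are illustrative only.
        value = {"name": self.name}
        if self.uri is not None:
            value["uri"] = self.uri
        return {"type": "person", "value": value}


print(PersonValue("Melissa").to_json())
print(PersonValue("Melissa", "https://example.org/~melissa").to_json())
```

With this shape a client can read the URI directly from the value, without knowing about any special qualifier property.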

I'm currently undecided about which way is best. Using qualifiers is attractive because it keeps the data model simple, but it requires knowledge of special qualifier properties. Using a new value type is straightforward for the UI and clients, but it adds complexity to the data type, and needs a bit more code in the backend.

Here's a straw man idea for the approach using qualifiers:

Scenario:

  • we assume that we typically have the plain text name, and use it to find a URI, not the other way around
  • we have a "person" data type that uses plain string values
  • we have a single well known property for qualifying a "person"-type statement with a URL/URI that identifies the person.

This means:

  • The new data type can be added with minimal effort in the backend.
  • Clients that need to know a URI for a person need to know about the person-uri qualifier property
  • The UI needs to know the special qualifier property.

UI integration could work like this:

  • if the statement has the type "person" and does not yet have a person-uri qualifier, a button labeled "Find URI for Foo" is added next to the "add qualifier" button. "Foo" is the plain name from the main snak.
  • when that button is clicked, we search well known data sets (Wikimedia users, Wikidata items, ORCID, VIAF) for the name, and allow the user to choose one of the results (or provide an URI directly).
  • a person-uri qualifier is added containing the URI from the search.

When editing a "person" statement that already has a person-uri qualifier, nothing special happens.
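The straw man flow above could look roughly like the following sketch. It assumes a simplified statement dictionary (not the real Wikibase JSON) and a placeholder qualifier property ID `P_PERSON_URI`; the function names are hypothetical as well.

```python
from typing import Any, Dict

# Placeholder ID for the well known person-uri qualifier property (assumption).
PERSON_URI = "P_PERSON_URI"


def needs_find_uri_button(statement: Dict[str, Any]) -> bool:
    """Show the button only for "person" statements without a person-uri qualifier."""
    is_person = statement.get("datatype") == "person"
    has_uri = bool(statement.get("qualifiers", {}).get(PERSON_URI))
    return is_person and not has_uri


def button_label(statement: Dict[str, Any]) -> str:
    # The label uses the plain name from the main snak.
    return "Find URI for " + statement["value"]


def add_person_uri(statement: Dict[str, Any], uri: str) -> Dict[str, Any]:
    """Return a copy with the qualifier added; existing qualifiers are left alone."""
    if not needs_find_uri_button(statement):
        return statement
    qualifiers = dict(statement.get("qualifiers", {}))
    qualifiers[PERSON_URI] = [uri]
    return {**statement, "qualifiers": qualifiers}


statement = {"datatype": "person", "value": "Foo", "qualifiers": {}}
print(button_label(statement))
updated = add_person_uri(statement, "https://example.org/foo")
print(needs_find_uri_button(updated))
```

The search over well known data sets is left out here; the sketch only covers when the button appears and what is stored once the user has picked a URI.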

@Lydia_Pintscher, @Jan_Dittrich, what do you think? It seems to me that the decision is mostly a UI issue at this point.

I'm sitting here puzzled, because nothing in the rationale indicates to me that this can't already be done using existing datatypes (item) and properties (birth name, perhaps a new 'full name' parameter in case someone wants to be attributed as something other than their birth name). Are we trying to avoid introducing 'non-notable' individuals to the database? Is that all we're working around here?

Are we trying to avoid introducing 'non-notable' individuals to the database? Is that all we're working around here?

That is indeed one of the main issues, if not the only one. It's a question of maintenance overhead, but also of splitting data between two projects, having to watch multiple pages, and so on.

If a person is referenced a lot, it makes of course sense to create an item for them. In that case, it probably wouldn't be a problem to do so. But it seems better to allow "one-off" references to not require the "heavy guns".

daniel added a subscriber: Izno.

I'm watching the project, but thank you. :D

That is indeed one of the main issues

Hmm, okay. I'd personally be a little less leery of WD:N cross this functionality at this point in time because it provides for structural need. (The only reason the reference system, either citing a long work or otherwise, works, is because of structural need.)

but also of splitting data between two projects

What does this mean? Isn't the planned-for use of the same properties and data as already exist on Wikidata already "splitting data" between two projects?

having to watch multiple pages

What does this mean? Maybe this is a presupposition on my part, but my expectation is that, like with the current clients, this functionality would be on watch because it is linked to the main entity (the media file?). Is that a bad expectation?

and so on.

"And so on"s scare me. :D

If a person is referenced a lot, it makes of course sense to create an item for them. In that case, it probably wouldn't be a problem to do so. But it seems better to allow "one-off" references to not require the "heavy guns".

I guess this makes sense, but I would be concerned that we could end up with a lot of inconsistent data in that we'd get 30 Flickr links, 20 Wikidata item links, and 5 Facebook links, with this kind of scheme, when they're all meant to point to the same person/entity (corporate authors should be considered, many of which will or probably already do have items, I suppose). This isn't particularly fixed by having static text in the authorship--authority control is a thing for this reason. :D

Hey,
I talked with @Charlie_WMDE about possible designs. Turns out we still have some questions :-)

(answers edited in after talking to @thiemowmde, whose comments I added in italic)

  • Can we assume that an image (or whatever other thing was uploaded on commons) will be/match an item in Wikidata ?
    • Sort-of-yes: There will be a Wikibase repo for/on Commons and it will contain something like an item, which is called Media Info. Every image acts like there is a Media Info object, and it is magically created when needed. There is a connection to Wikidata, e.g. if you want to add a statement "depicts", which will suggest items from Wikidata.
  • There should be a name and an URI. What is the usecase for the URI?
    • Tracing back the origin of the image/Commons-thingy?
    • Creating the name more easily by looking it up via the URI?
    • Using your own webpage as source.
  • @daniel You said that we have the plain name but want to find the URL. From my experience of adding images, which I did not create, to commons, I usually have the URL but not the Name. Or is this not covered by the issue?
    • The name is essential, that's probably the case.
    • Usecases:
      • Find an URL based on name
      • Find a Name based on URL
      • Paste the image-URL in the author field and it automatically resolves to authorname, and author URI (but it needs magic for these services, so it is not a good core thing)
  • The name is talked about as (identifying a) "Person". I wonder why this is not "Creator/Author" or the like. (In copyright there seem to be many roles, and I wonder how we cover this – or will that be mapped on top?)
    • Creator ("Urheber" in German law) is essential.
  • @daniel You said that we have the plain name but want to find the URL. From my experience of adding images, which I did not create, to commons, I usually have the URL but not the Name. Or is this not covered by the issue?

You usually have the URL (or URI) of the author, but not the name? That seems strange to me, can you give examples? Or do you just have a source URL for the image?

  • Usecases:
    • Find an URL based on name

That's straight forward enough, for a handful of authority files.

  • Find a Name based on URL

If we have a URI from some authority file that identifies the author, they probably also have some API to get the author's name. Such APIs are probably all different, so we can only handle a handful of well known cases.

  • Paste the image-URL in the author field and it automatically resolves to authorname, and author URI (but it needs magic for these services, so it is not a good core thing)

If all we have is the source URL of the file, we can still hope that there is some kind of API to retrieve additional information, like the author. But this will be highly specific to the image repo. We might do something special for Flickr, but beyond that, this kind of work should probably be done by the import bot/script/gadget.

  • The name is talked about as (identifying a) "Person". I wonder why this is not "Creator/Author" or the like. (In copyright there seem to be many roles, and I wonder how we cover this – or will that be mapped on top?)

"Person" is the data type, the kind of thing, not the role. For the roles, we will have several properties using that type, e.g. rights holder, photographer, painter, person depicted, etc.

You usually have the URL (or URI) of the author, but not the name? That seems strange to me, can you give examples? Or do you just have a source URL for the image?

Yes, if I import images from flickr, I can find the URL of the uploader’s profile with one click, but not the legal name (which may or may not be there)

You usually have the URL (or URI) of the author, but not the name? That seems strange to me, can you give examples? Or do you just have a source URL for the image?

Yes, if I import images from flickr, I can find the URL of the uploader’s profile with one click, but not the legal name (which may or may not be there)

I think it is sufficient to use whatever name was used on Flickr; it doesn't have to be the legal name.

In any case, it seems to me that this use case needs separate handling. When importing from Flickr, you would be creating several statements anyway; you wouldn't be editing a single statement.

@daniel: Thanks!

I wrote a (from-the-users-view) scenario:

Melissa wants to upload an image to commons. She uploads the image. In the upload form, she is asked to provide the name of the image’s creator. She made the picture, so it is her name. Also, she can provide an URI of the creator. She types her name and also uses the URI for putting in her private website.

Does this make sense as a (possible) experience for the user? (@daniel, @thiemowmde)
I deliberately left out how we can smartly fill one of the fields using the other – that would be my next step.

@Jan_Dittrich I think one important thing to consider here is integration with the Upload Wizard. The UI and workflow for manual upload is potentially completely separate from the generic editing interface for MediaInfo.

On a related note: in the UploadWizard, it makes perfect sense to have specialized logic for filling in specific fields (that is, creating statements for a specific property). But for the generic editing interface, we will probably not want any specialized logic for specific properties, but bind all editing behavior to the data type. This allows the community to freely define and refine properties and their usage, without breaking the UI.

The difference is whether we bind UI behavior to a kind of data (more flexible), or to a role of the data (more specific).

I'm not sure why you'd want a URI to an external site in a field like this. Is it not possible to assume that every person referred to will have an item either on Wikidata or on Commons? Then this field only needs to link to one of those places, where links to Flickr profiles etc., can be collected. When somebody is uploading a new file, you'd need to ask for some kind of link like that that identifies the person and allows linking to the right Wikidata/Commons record, creating a new one if needed. External links also go bad sometimes when people delete their Flickr accounts etc. There will be other complexities, like multiple people sharing a single Flickr account (e.g., if the Flickr account is for an organization and they have images from various authors), or people who have multiple Flickr accounts.

@Ghouston there are currently no plans to have any items on commons, only media info entities on file description pages. All items are managed on wikidata. The current inclusion criteria for wikidata would not allow an item to be created for any random flickr user - and even if this would change, we would end up creating essentially empty items that are just stubs for the flickr user ID, especially when importing many files from flickr or a similar repository automatically.

The point is: we have to satisfy the legal requirement of naming the author. And we want to satisfy the technical requirement of identifying the author and linking to some resource describing the author. It seems sensible to link these two things. That's what this new data type achieves.

I had the idea from the demo system that such records would exist (http://structured-commons.wmflabs.org/wiki/Q2). Oh well. With a system of credit string + URI it will be problematic in some cases to find all the works by a particular author. Credit string could be all over the place (G Bush on one photo, G.W.Bush on another, George Bush on a third etc., not to mention having multiple people named G Bush) so it's not very helpful. With other people you may not be able to find a URI at all, e.g., non-notable 19th century photographers, or US government employees whose work is published on their department's Flickr account.

On Commons, we currently tend to confuse author and copyright holder. It's the copyright holder who can release a file with a free license, and in practice chooses the attribution. We have a lot of files like https://commons.wikimedia.org/wiki/File:Christmas_Island_(5774557225).jpg where the author (photographer) is unknown but the copyright holder is presumably the Australian government Department of Immigration and Citizenship. This is also a good example of a file with a bad URI, since the department has been reorganised and renamed and the Flickr account is gone.

Occasionally an organization will name the author, e.g., https://www.flickr.com/photos/117994717@N06/33296715392/ or https://www.flickr.com/photos/39955793@N07/22882333767/. Then you can potentially have separate fields for the author and copyright holders.

Maybe it's not even right to consider Department of Immigration and Citizenship to be the copyright holder in that example, and it's a file with Australian Government crown copyright. In that case, it's not clear what role DIAC is filling, if not the author or copyright holder.

A file needs a single attribution, regardless of how many authors and copyright holders may be involved.

I am new to this discussion and I have not read every word above, but the easiest solution to me seems to be to allow every author to have their own M-code item on structured-common. Then the new datatype would allow either a Q-code or an M-code as its value. We could also have M-code items on structured-common for templates to show, like personality warnings or trademarks, etc.

Entries starting with M### are only for the MediaInfo entry of an existing file, where ### is the page ID. They are not normal items.

I like @Jarekt's idea.
There could be X### items of a new type (like User/Person/Author...) which could be restricted to only have a name and a set of URLs/URIs (wikidata item, user page, Flickr/Facebook/Twitter/... account) that can identify the "person" (no statements, labels or descriptions)
Since this type of entity will hardly contain much information, it will be harder to abuse the functionality.
Also, it will be easy to search and easy to update the data.

Technically, I suspect this would be easiest if Wikidata was allowed to hold all the items that Commons needs to store information about. Maybe there could be a way to restrict particular Wikidata items to a limited subset of properties, so that URLs etc., could be stored without giving people a place to store arbitrary unverifiable information about themselves or others. A few bits of information are relevant for determining copyright expiration: whether an author is a person or a company (my previous comments neglected to mention the concept of "corporate authors", although they seem to be usually treated the same way as anonymous authors), and date of death.

Technically, I suspect this would be easiest if Wikidata was allowed to hold all the items that Commons needs to store information about.

That would mean allowing any user that ever uploaded his file to Commons to have an item on Wikidata. I do not think that would be a good idea as that is something Wikidata specifically prohibited.

There could be X### items of a new type (like User/Person/Author...) which could be restricted to only have a name and a set of URLs/URIs (wikidata item, user page, Flickr/Facebook/Twitter/... account) that can identify the "person" (no statements, labels or descriptions)

I like that. The X### items could live on Commons (or on Wikidata). We can debate restrictions on such items later, but they would store a limited range of data and be restricted to metadata about people, like Wikipedia users who do not meet the notability restrictions of regular Wikidata items.

I would argue for any new types of items like M### or X### to share property ids with regular Q### items, and not to create parallel systems of properties. Some properties might be only allowed for some types of items.

The same problem of notability will turn up if the subjects of images are going to be linked to Wikidata items, since there are plenty of images of things like buildings that may not be notable for Wikidata. At present, there's no problem having a category for the building. The X# solution does seem preferable to having Q# items hosted on both Commons and Wikidata, and you could prevent Q# items from linking to X# items.

I would argue against an X### solution, and support Daniel's initial solution: for the time being, we don't know how many authors we need to host and link to external sources. His solution will also buy us time to eventually think about an expansion of the approach, instead of immediately creating new namespaces in MediaInfo.

His solution will also buy us time to eventually think an expansion of the approach, instead of immediately creating new namespaces in MediaInfo.

Cannot really agree, because it seems that we might be stuck with this solution for quite a long time, since a property's data type (generally) cannot be changed and migration of old revisions is not possible (as far as I know).
So to add a new way of referencing authors instead of the old one seems to me like a huge pain that might bring a lot of confusion to users.

Other considerations

  1. Anonymous or unknown creator (legally can be different from: 'not known to us')
  2. Pseudonymous creator (worse, a pseudonym whose name from a certain date became known ?)
  3. Multi party copyright/creators
  4. hierarchies ? For instance JPL lab images are often credited as NASA/JPL/Project/Person
  1. Anonymous or unknown creator (legally can be different from: 'not known to us')

That's why we have somevalue and novalue snaks in Wikibase.
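As a rough illustration, the three snak types in Wikibase's JSON serialization are `value`, `somevalue` and `novalue`. The helper below is only a sketch: the property ID `P1` is a placeholder, and mapping "creator exists but is not identified" to `somevalue` and "there is no creator" to `novalue` is my reading of the comment above, not a settled convention.

```python
from typing import Any, Dict, Optional


def creator_snak(
    property_id: str, name: Optional[str] = None, creator_exists: bool = True
) -> Dict[str, Any]:
    """Build a simplified main snak for a creator statement."""
    if name is not None:
        # Known creator: a regular value snak holding the plain text name.
        return {
            "snaktype": "value",
            "property": property_id,
            "datavalue": {"value": name, "type": "string"},
        }
    # No name given: "somevalue" says a creator exists but is unidentified,
    # "novalue" says there is no creator at all.
    snaktype = "somevalue" if creator_exists else "novalue"
    return {"snaktype": snaktype, "property": property_id}


print(creator_snak("P1", name="Melissa"))
print(creator_snak("P1"))
print(creator_snak("P1", creator_exists=False))
```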

  2. Pseudonymous creator (worse, a pseudonym whose name from a certain date became known ?)

In the case of modern licenses like the CC family, the name will be exactly what the author used to credit themselves.

Resolution of pseudonyms can be done via the URIs associated with the person reference.

  3. Multi party copyright/creators

Just use multiple statements.

  4. hierarchies ? For instance JPL lab images are often credited as NASA/JPL/Project/Person

Can be treated as a pseudonym above for the human readable part. But to make the hierarchy machine readable... hm. Not sure. Maybe a qualifier, or a separate statement.