Address concerns about perceived legal uncertainty of Wikidata
Open, Needs Triage · Public

Description

To release content under a given license, one must be in a legal position to release the corresponding material under that license.

Currently, not all publicly available data is under a license and/or terms of use compatible with CC-0, which is so far the only license permitted for Wikidata. Yet some Wikidata contributors perform massive imports from external data banks, regardless of the corresponding terms of use.

So far, as far as the author of this ticket knows, Wikidata has not applied a strong process to enforce conformity of imported material with its policy of CC-0 as the single license for data stored in the relational model of its Wikibase instance.

This puts Wikidata, at best, under strong legal uncertainty, since some contributors, as already noted, perform massive imports from various data banks. Regardless of the "factuality" or "creativity" of each individual datum in these massive transfers, this raises concerns because of the existing monopoly granted to data bank creators, whether through specific rights on databases (e.g. in Europe) or through the extent of copyrightable material (e.g. in the USA).

This makes Wikidata's current state a concern not only for the whole Wikimedia ecosystem, but also for any downstream stakeholder who would be interested in using Wikidata but not in taking on the possible legal threats currently attached to it.

Event Timeline

Denny added a comment. May 14 2018, 6:01 PM

@Rspeer regarding the ontology: the ontology of Wikidata is genuinely unique and not copied from any Wikipedia project, or any other project. It has been created on Wikidata.

Regarding the translations: we are talking about the labels of things in different languages? They are not even necessarily translations, mind you - it is often "there was an article on the German Wikipedia, here's an article on the English Wikipedia, let's connect these two". In general, most titles were not translated. Also, in many of those cases it is hard to argue for any threshold of creativity - the fact that 'London' is called 'Londres' in French is rather un-creative, and merely stating a fact.

Also, technically, these are not translations in Wikidata, but labels of an entity in different languages.

I would be surprised if the multilingual labels of the Wikidata entities triggered copyright, but even if they did, there is nothing in the current Wikipedias that would allow for this copyright to take effect: i.e. if we look at an entry such as Q23780914 - highly visible entry, labels in 50+ languages - there never existed anything in the Wikipedias which might have been under copyright.

Or am I missing something here?

Denny added a comment. May 14 2018, 6:09 PM

@Nemo_bis : good point. I wouldn't know what a good example is, though, maybe someone else can come up with something.

On 14.05.18 at 17:23, Nemo_bis wrote:

But I'd argue that nobody would see such a dataset as problematic,
especially because it's so small (few hundreds data points).

But law is a matter of quality rather than size.

But quantitative indicators can be a proxy for quality. A dataset with just 200 statements has most likely been produced by dozens or hundreds of entities none of which can claim any meaningful exclusive right.

On 14.05.18 at 22:36, Nemo_bis wrote:

But law is a matter of quality rather than size.

But quantitative indicators can be a proxy for quality. A dataset with
just 200 statements has most likely been produced by dozens or hundreds
of entities none of which can claim any meaningful exclusive right.

That remains to be seen. Every bit of text that is imported from
CC-by-sa-protected Wikipedia articles has to be weighed and put to the
test. Also, the sum is more than just an addition of its parts. Think in
terms of quality. Numbers don't count, as we say, iudex non calculat.

the fact that 'London' is called 'Londres' in French is rather un-creative

@Denny: Where is this reductionism getting you? You can pick one simple example at a time and entirely miss the point. When you have, say, 20,000 English terms that are translated to 20,000 French terms, and not all of the mappings are as obvious or as one-to-one as the name of a major world city, that is a work that people created. It's not an authorless happenstance. It's not a "monkey selfie" as Tgh so insultingly put it.

The fact that it's expressed as one Wikipedia page being the same as another doesn't make it authorless. That data came from inter-language links that were originally created on Wikipedia, and translation templates that were originally created on Wiktionary, all of which were entered by specific authors, under the CC-By-SA license.

It seems that the entire point of this reductionist handwaving is to find an excuse to not follow the CC-By-SA license.

Consider the case of importing bulk data from someplace outside of Wikimedia, which the legal team already gave a very clear ruling on, linked repeatedly in this thread: you can't do it unless the data is totally free. The "but it's just facts" argument has already been deemed insufficient. Why would that argument work against Wikipedia or Wiktionary when it doesn't work against OpenStreetMap?

@Rspeer

Copyright has to be about some concrete expression.

Are you claiming that the interwiki links that used to be in Wikipedia articles until five years ago should have had copyright protection? Their concrete expression was [[en:London]] [[fr:Londres]] [[hr:London]] etc. These have been, in most wikis, written and maintained by bots anyway, without any reference to the originating author. So if that is the case, Wikipedia has never been compliant with that license.

But even ignoring that, Wikidata does *not* store the same expression anyway. So what exactly is the copyright asserted on?

On 18.05.18 at 00:05, Denny wrote:

Copyright has to be about some concrete expression.

I'm afraid this is not the case. According to German law copyright is
all about what we call Schöpfungshöhe, and it seems that other legal
systems also subscribe to this concept---provided the interwiki links
are right, that is... ;)

https://de.wikipedia.org/wiki/Sch%C3%B6pfungsh%C3%B6he
https://en.wikipedia.org/wiki/Threshold_of_originality

I was reading the article you linked to - https://de.wikipedia.org/wiki/Sch%C3%B6pfungsh%C3%B6he#Sch%C3%B6pfungsh%C3%B6he_seit_2013 - and nothing there leads me to believe that the list of interwiki links would have sufficient "Schöpfungshöhe".

@Denny Nobody's copyright is going to be invalidated by your personal beliefs.

And what do bots have to do with anything? Wiki bots are simple scripts operated by humans. I know how the translation bots worked in particular -- they relied on active approval by their human operator, who would make decisions about how to resolve mismatches and ambiguities, one major task in creating a translation dictionary.

All Wikipedia content is created using software. You do not lose your copyright when you use software.

[...] without any reference to the originating author. So if that is the case, Wikipedia has already never been compliant with that license.

Wikipedia's interpretation of attribution has always been that the page editing history is sufficient attribution. And, of course, the result is shared alike. Those are the two requirements: attribution, and share alike. Wikipedia follows Wikipedia's license, and therefore Wikipedians can use content from Wikipedia.

For the same reason, Wikidata may not use content from Wikipedia. That's the topic of this task.

Denny added a comment. Edited May 18 2018, 12:17 AM

@Rspeer Back before Wikidata, if I linked an article from the German Wikipedia to the English Wikipedia by adding an interwiki link on the German Wikipedia, and an interwiki bot then made this link reciprocal by adding the interwiki link on the English Wikipedia, there was no attribution to me on the English Wikipedia for my edit. The page editing history on the English Wikipedia would not have any attribution to me. There was no link at all to my edit on the German Wikipedia that would fulfill the attribution requirement.

But I agree with you that the page editing history is sufficient attribution.

And thus, since we never required the interwiki links to be attributed in the first place - as I just showed - we obviously do not seem to regard them as being copyrightable and covered by the CC-BY-SA license.

This comment was removed by Rspeer.

My previous comment probably crossed a line. I'm sorry.

But your convoluted argument has shown nothing and is irrelevant to Wikidata.

Wikipedia is fine with a very lax approach to attribution. It encourages external sites to attribute simply "Wikipedia, the free encyclopedia". As far as I know, everyone is fine with this. If you're not, have fun arguing that, but I can tell you don't seriously hold the position you profess about attribution; it just fits into the current theory you were concocting about why violating Share-Alike on Wikidata is okay.

Fundamentally, your position amounts to that old favorite of nonsense Internet IP arguments: "you didn't enforce your copyright once, so now you don't have a copyright anymore". Very popular among people who once heard something about trademarks and misremembered it, and people who just want something legalish-sounding about why it's okay for them to copy stuff.

Share-Alike is a real thing, and unfortunately I think that this Phabricator discussion is not going to get any closer to a serious discussion of how to change Wikidata's copyright status. I'd like to know, specifically, what it would take to get Wikidata to stop making the frequently-false claim on every page that its data is CC-0.

lisong added a subscriber: lisong. May 18 2018, 6:37 AM

how to change Wikidata's copyright status.

In which you assume it will change license(/waiver)... If you seek certainty, plenty of people have indicated their view on the situation here, but this discussion is never going to give you certainty: only a court can.

Nemo_bis added a comment. Edited May 18 2018, 7:22 AM

Since the license in the Wikipedia(s) is managed by the community,

Not really, the license is one of the non-negotiable aspects of Wikimedia projects.

and the community has the power to change the license, or make amendments to the license

Hardly anyone has! When we switched from GFDL to CC-BY-SA, a new GFDL version had to be released for us to be able to do it. The license cannot be changed retroactively.
https://meta.wikimedia.org/wiki/Licensing_update/Questions_and_Answers

In T193728#4212870, @Micru wrote:

would it be feasible to ask the several Wikipedia(s) communities to add a clause where it is stated that statements can be mined by the Wikidata community (exclusively or not) and re-released as CC0 on the Wikidata platform?

Leaving aside the fact that every individual contributor to Wikipedia would have to agree to that change in license: Should such an amendment prove necessary to "legalize" Wikidata, there are other data sources used by Wikidata which are much more problematic (e.g. the already mentioned OSM).

Cirdan added a comment. Edited May 18 2018, 7:57 AM

And thus, since we never required the interwiki links to be attributed in the first place - as I just showed - we obviously do not seem to regard them as being copyrightable and covered by the CC-BY-SA license.

That conclusion is not correct. Wikipedia as a whole is licensed under CC-BY-SA, and that also includes the way the content is organized. Just like not every single sentence is protected by copyright, not every single interwiki connection is, but I would be surprised to learn that the structure and organization of Wikipedia's content is not protected at all.

Given my current layman's understanding, that's quite similar to software: While most (if not all) source code constructs used e.g. in Wikibase are not copyright protected, the combination and clever organization of these generic constructs very well constitutes a copyrighted work and I have to follow whatever license it is published under. (Actually, there will probably be many parts in Wikibase which were "written" by some IDE code completion very much like an interwiki bot "wrote" an interwiki link.)

In T193728#4212948, @Micru wrote:

Not really, the license is one of the non-negotiable aspects of Wikimedia projects.

With enough support, everything is negotiable.

Aside from the fact that every single contributor would have to be asked to agree to the change of the license as it would allow the distribution of Wikipedia content under CC-0, if it is really a problem that Wikidata is reusing Wikipedia content the way it is currently doing, then this is a huge problem for Wikidata. Determining whether this is the case is precisely what this task is about.

The solution to such a potential massive license violation on Wikidata cannot be to retroactively change the license of Wikipedia, but Wikidata would need to be cleaned up. Especially, and I'm just repeating what has already been said numerous times in this thread, because there are many sources Wikidata uses which aren't CC-0 either.

In T193728#4213766, @Micru wrote:

Aside from the fact that every single contributor would have to be asked to agree to the change of the license

Not necessarily, a broad discussion with a majority agreeing on it can be enough.

I'm sorry, but your understanding of copyright is deeply flawed then. If I contribute a piece of text under CC-BY-SA to Wikipedia, Wikipedia has to ask me whether I'm willing to re-license that text under a different license. This can never be substituted by a majority decision, since I hold the rights to that text.

The solution to such a potential massive license violation on Wikidata cannot be to retroactively change the license of Wikipedia

Why not? It is definitely feasible.

It is most definitely not, and to me this is a deeply troubling approach to the problem of potential license violations on Wikidata.

there are many sources Wikidata uses which aren't CC-0 either

This should be examined on a case by case basis, there might not be that many issues.

Which is exactly what we are trying to do here. So far, there isn't even a clear picture as to whether imports from Wikipedia to Wikidata are legally problematic. It might very well be that they are fine.

Denny added a comment. May 18 2018, 3:22 PM

@Micru I agree with @Cirdan that this would be a rather worrying way to deal with the situation. Also, as @Nemo_bis points out, it really couldn't be just the communities doing so. In my understanding, it would need an update to the CC license itself, which would need to be done by CC, and then have the license be adopted by the Foundation together with the community, and this only works due to the or-later-clause. But as said, I would be rather troubled by such an approach.

My understanding is that the goal of this bug is to reduce the legal uncertainties as far as we can. As was pointed out by @EgonWillighagen, only a series of court decisions will be able to resolve that 100%, but I don't think it makes sense to keep that bug open until then (if that ever happens). My current goal to shepherd this bug to a closure is to agree with people who have a different point of view on a question or two to ask Gnom1, and then work on from his answer. If, for example, it turns out that the extraction that Wikidata has done from Wikipedia is deemed breaking the license of Wikipedia, then we should indeed purge Wikidata of any data taken from such a license breach or change Wikidata's license to CC-BY-SA.

I hope that seems viable to everyone involved.

I am personally convinced that no license breach has happened, because the facts that have been exported to Wikidata are not covered by copyright. But, as @Rspeer pointed out correctly, my personal beliefs are not exactly relevant here. So here is my suggestion for a question to Gnom1, and I would be happy for others to refine it:

"Can you comment on the legal status of 1) interwiki links and 2) other facts, such as places of birth, being extracted from Wikipedia and republished in Wikidata under a CC-0 license?"

@Psychoslave , @Rspeer , does this sound good to you?

My current goal to shepherd this bug to a closure is to agree with people who have a different point of view on a question or two to ask Gnom1, and then work on from his answer.

In case of serious doubt it is more appropriate to ask WMF’s legal department to organize a professional opinion regarding open questions (elaborated in house or via some hired external expert), regardless of User:Gnom’s unquestionable knowledge in this field. (Yes it is great that he adds expert opinion, but as far as I understand it is all unpaid legal advice he provides, right?)

If, for example, it turns out that the extraction that Wikidata has done from Wikipedia is deemed breaking the license of Wikipedia, then we should indeed purge Wikidata of any data taken from such a license breach or change Wikidata's license to CC-BY-SA.

If any of those happened (or had to happen), I’d be out of here, and I guess many other Wikidata editors would also discontinue their efforts. There is great support for CC0 in Wikidata, since anything else that required attribution would render it useless; large-scale purging would tear down so much content that we would basically have to start again from the beginning. We are 5.5 years into this project and many of us have put thousands of hours of effort into it, based on the assumption, unchallenged by WMF, that Wikipedia imports as we are doing them are legally fine.

Once again, it's silly to talk about this issue going to court. Wikimedia contributors are not taking other Wikimedia contributors to court over internal disagreements on how the CC-By-SA license should apply. But we're weakening the legitimacy of Wikimedia licenses by not resolving this.

@Denny Thanks for steering the discussion toward concrete questions. I think your question sounds about right. I have a slight quibble with the phrasing "other facts, such as places of birth", as I believe the problem is not with the individual facts, but with processes that imported them in bulk.

Denny added a comment. May 18 2018, 5:15 PM

@MisterSynergy yes, I agree, it would seriously weaken Wikidata. Nevertheless it is good to resolve legal uncertainties as far as reasonable.

Regarding Gnom1 - well, he did write the previous, official answer by Wikilegal, which is why I consider that a great offer. But I agree that we should also ask WMF Legal officially too, in particular before making grand sweeping changes. But, at least for me, if Gnom1 says "all is fine" or something similar, I won't push Wikilegal for an official answer, as I consider this done. It is still an option that others can pursue, obviously, but I won't use my volunteer time for that.

Denny added a comment. May 18 2018, 5:23 PM

@Rspeer

My previous suggestion to @Psychoslave was
P) "Can you comment on the practice of extracting data from Wikipedia articles, which are published under CC-BY-SA, and storing the results in Wikidata, where they are published under CC-0?"

I guess the phrasing in that would raise the same quibble, though.

So here's my new suggestion:

R2) "Can you comment on the practice of having processes that in bulk extract facts from Wikipedia articles, which are published under CC-BY-SA, and store the results in Wikidata, where they are published under CC-0?"

Does this sound right? I am not naming the interwiki links explicitly, as it seems they should be subsumed by that question (i.e. whatever is true for places of birth will also be true for interwiki links). If someone thinks that we should be explicitly mentioning interwiki links, I am happy to add them, for example in the following form:

R3) "Can you comment on the practice of having processes that in bulk extract facts from Wikipedia articles, which are published under CC-BY-SA, and store the results in Wikidata, where they are published under CC-0? We ask in particular about interwiki links, and additionally also refer to statements stored in Wikidata for e.g. the place of birth, family relationships, etc."

My preference right now is for R2, but I am happy to listen to other suggestions.

R2 sounds excellent. It covers the main legal issue that absolutely needs resolving.

Though I am also curious about OSM.

My R-OSM-1 would be

R-OSM-1: "What, if anything, may be imported from ODbL-licensed databases like OSM into Wikidata, published under CC-0?"

@Mateusz_Konieczny I like R-OSM-1 too. I would go now for these two questions.

I'd really like to have @Psychoslave chime in, as he was the one opening this bug and certainly being the most vocal on this topic, as far as I have seen, so I will leave this open for a few days to give him the opportunity to speak up.

R2 sounds like the right question. Thanks.

SimonPoole added a subscriber: SimonPoole. Edited May 21 2018, 6:39 AM

IMHO there are multiple, mainly communication-related, issues that continue to lead to confusion

Simon

In T193728#4213806, @Micru wrote:

since I hold the rights to that text

The concept of "rights" is quite flexible, as Wikipedia shows. The Wikipedias are based on texts that are under copyright, but they have been paraphrased so that the copyright no longer applies. The same goes for data mining: in a way, it is paraphrasing a text into a machine-readable format.

I'm again sorry to say that, but your comments show a deeply flawed understanding of copyright. Copyright of texts is by no means "flexible". It seems you are confusing plagiarism and copyright violation, which are completely separate categories (the former is a concept in the context of academic scholarship, the latter a concept in the context of law). What we are discussing here is whether data collections licensed under CC-BY-SA or other non-CC-0 licenses (like OSM) can be imported to Wikidata. The licenses of these collections do not simply vanish because one alters some words or uses a computer program to extract the information.

In T193728#4218415, @Micru wrote:

In my opinion, a CC license that would allow for data mining as CC0 would be most helpful, and not only for the Wikimedia movement.

There is already a license which allows data mining under CC-0: CC-0 itself. There cannot be any other license which allows re-use of content under CC-0 without being effectively identical to CC-0. If there are cases where copyright law permits the extraction of information from copyrighted texts, then this applies to CC-BY-SA licensed texts as well, so there is no need to change CC-BY-SA to extract information from Wikipedia.

But as said, I would be rather troubled by such an approach.

Do you care to explain why it bothers you to clarify the license?

It's not a "clarification", it would constitute a retroactive conversion of CC-BY-SA into a license which is effectively CC-0. As we have explained to you multiple times now, that is not possible without consent of every single contributor of a copyright protected text to Wikipedia and it is highly doubtful that a majority of Wikipedians (or the WMF) is interested in converting Wikipedia to CC-0.

(I can only urge you again to carefully read the explanations people in this discussion have given you and perhaps also look into copyright law (a Wikipedia article will do) and the CC-BY-SA and CC-0 license texts to understand the fundamental issues we are discussing here.)

It has been asserted here several times that OSM data has been imported wholesale into Wikidata - do we know that has happened? Wikidata has two properties related to OSM, one that relates Wikidata items to OSM tags like "lighthouse", and one that is essentially deprecated (see T145284), so I assume those are not the issue. According to https://www.wikidata.org/wiki/Wikidata:OpenStreetMap (text which has been there since at least last September) "it is not possible to import coordinates from OpenStreetMap to Wikidata". If the issue is coordinates imported via Wikipedia infoboxes that originated with OSM, I can see there might be an issue there, and maybe that should be added to Denny's suggested question in some fashion. But as far as actual importing of OSM data goes, the only specific cases that I noticed explicitly cited above are (A) a bot request that has been rejected, and (B) a discussion from 2013 where the copyright issue was explicitly raised right away.

Cirdan added a comment. Edited May 22 2018, 7:17 PM
In T193728#4221843, @Micru wrote:

Wikipedias rephrase the content of works under copyright and rebrand that content as CC-BY-SA; what label do you put on that practice?

That is not what Wikipedia is doing. Wikipedia is using information collected by third parties to create encyclopedic articles. If done properly, i.e. most notably not by directly copying texts, this is not a copyright issue, because facts cannot be copyrighted.

More generally, however, rephrasing is not sufficient. If I write an original book and you publish a rephrased version of that book, there is still a copyright issue there, because I hold the rights to the story and the precise way it is told. This is why a Wikipedia article as a whole can be copyright protected even though the information contained therein is not. There are certainly small Wikipedia articles which are just a collection of facts and can likely be considered without copyright protection because they do not reach the threshold required for an original work (at least under German jurisdiction that is the case), but longer articles where information had to be carefully selected and weighted are original works.

In which way is it different from doing the same with any content and rebranding it as CC0?

Take the example of the book I wrote again: While individual sentences and bits of information are not protected, the way a collection of facts is organized can be protected. This is why in certain jurisdictions data collections and databases have been or currently are subject to copyright laws, which in turn is why we are having this discussion here.

The very strict policy of OSM (see e.g. here and here) is a good example to understand this problem.

If there are cases where copyright law permits the extraction of information from copyrighted texts, then this applies to CC-BY-SA licensed texts as well, so there is no need to change CC-BY-SA to extract information from Wikipedia.

That would be ideal.

This task is about figuring out what is allowed and what is not, and the questions agreed on will hopefully shed some light onto this issue.

It's not a "clarification", it would constitute a retroactive conversion of CC-BY-SA into a license which is effectively CC-0. As we have explained to you multiple times now, that is not possible without consent of every single contributor of a copyright protected text to Wikipedia and it is highly doubtful that a majority of Wikipedians (or the WMF) is interested in converting Wikipedia to CC-0.

Well, you just said that there might be cases "where copyright law permits the extraction of information from copyrighted texts", so I believe that an explicit indication about those cases in the CC-BY-SA license could be helpful for other people to avoid repeating this conversation in other places.

CC licences are built within the framework of current copyright law. It is not necessary and in fact would probably be quite problematic to add terms to the license which simply reiterate parts of the law in a different wording.

Well, you just said that there might be cases "where copyright law
permits the extraction of information from copyrighted texts", so I believe
that an explicit indication about those cases in the CC-BY-SA license could
be helpful for other people to avoid repeating this conversation in other
places.

You can add anything to the license terms, but if the law does not give you
the power to act on that thing, the clause is illegal, hence ignored by
courts. In this case, the existence of the license depends on copyright
law. So the things you can act on are precisely the "rights" that copyright
defines, and nothing more. If you add an out-of-scope clause to a CC license,
it's not relevant.

Sorry, I've been busy with other activities lately; I'll catch up and give feedback as soon as I can.

The more I personally dig into these questions, the more issues are opened and the less clear it becomes that there is an actual issue and, if there is one, whether there is a legal risk. Or even whether there is a moral or ethical issue: I think we will all agree here that pure facts can be used by anyone (beyond those concerning private life).

Well, discussing this face to face with a professional lawyer specialised in free licenses didn't lead me toward the same conclusion of "it's clear there is no problem", unfortunately. Apparently there might be possibilities of "class action" on copyright infringement. She said she could provide more information later; I will forward that to confirm or refute the validity of that concern.

I think that, to move forward, we should avoid assertions about "use of pure facts by anyone"; the vagueness of this formulation doesn't help. We should first agree that the problem is really about "substantial transfer of data", which is already equivocal enough without us adding more ambiguity to the topic.

Psychoslave added a comment. Edited May 25 2018, 2:01 PM

@Psychoslave sorry to disagree on the questions, but are we in any disagreement on these three questions?

We should not allow the (significant) import of data from databases which are licensed under a license incompatible with CC-0.
We should enforce that.
We should document the licenses of imported databases.
We should remove data that has been imported from databases which are licensed under a license incompatible with CC-0.

That would be my answers. I don't expect you to disagree with that.

I partially agree.

First let's recall that removing this data is not the only possibility.

From what I understand, just adding a "license" field to sources and properly filling in both the provenance and its license would be perfectly fine. This was (informally) confirmed to me by the lawyer I referred to earlier. Note that with such a solution, it would still be possible to claim CC-0 for any item which was included in a compatible manner.
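To make the "license field" idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `Statement` record, the `CC0_COMPATIBLE` set, the status names and the sample values are all hypothetical and do not describe Wikibase's actual data model or any existing Wikidata property.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumption: identifiers of source licenses treated as compatible with CC-0.
CC0_COMPATIBLE = frozenset({"CC0-1.0"})


@dataclass(frozen=True)
class Statement:
    """Hypothetical statement record carrying provenance and source license."""
    item: str
    prop: str
    value: str
    source: Optional[str] = None          # where the value was imported from
    source_license: Optional[str] = None  # license of that source, if recorded


def classify(statement: Statement) -> str:
    """Return 'cc0', 'restricted' or 'unknown' for a single statement."""
    if statement.source_license is None:
        return "unknown"  # nothing recorded, needs manual review
    if statement.source_license in CC0_COMPATIBLE:
        return "cc0"
    return "restricted"


def group_by_status(statements: List[Statement]) -> Dict[str, List[Statement]]:
    """Group statements by whether CC-0 can still be claimed for them."""
    groups: Dict[str, List[Statement]] = {"cc0": [], "restricted": [], "unknown": []}
    for statement in statements:
        groups[classify(statement)].append(statement)
    return groups


# Purely illustrative sample data.
examples = [
    Statement("Q1", "P1", "a", source="hypothetical CC0 gazetteer", source_license="CC0-1.0"),
    Statement("Q2", "P1", "b", source="hypothetical ODbL database", source_license="ODbL-1.0"),
    Statement("Q3", "P1", "c"),
]
groups = group_by_status(examples)
```

With such a field in place, only the statements in the `cc0` group would be presented as CC-0, while the others would keep their own license or be queued for review.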

Another solution would be to migrate them to separate Wikibase instances, each with a compatible license. Of course these instances should offer the same accessibility within the Wikimedia environment.

That leads me to this question:

  • what are the benefits for the Wikimedia community of using exclusively CC-0 for its single Wikibase instance usable in the rest of its environment?

This certainly should be answered and seriously documented, along with the corresponding drawbacks of such a state of affairs. That said, this question is off topic for the current ticket, and can be safely ignored until a ticket on this specific topic is opened (if none already exists).

Can you comment on the practice of extracting data from Wikipedia articles, which are published under CC-BY-SA, and storing the results in Wikidata, where they are published under CC-0?

What do you think?

I'm fine with the spirit of this question. I would appreciate if we could come with a more "please provide us threshold, as clear as possible, that should not be crossed". We can come with concrete example, like

  1. import of interlinks (phase1),
  2. import of infobox data,
  3. import of category trees
  4. import of all statements extractable from prose, as simple predicates, for example through automated natural language processing

@Nemo_bis thanks, I agree with your point a lot.

But regarding your question - just because there is a database which happens to be reproducible should not trigger any rights issues.

To give an example: it is easy to imagine a company that sells the list of all countries and their capitals as a dataset that is easy to process and that has a guaranteed quality and support level, to other companies, under a proprietary license that does not allow the dataset to be reshared.

Now just because it happens that we can reproduce that list from Wikidata with a SPARQL query should not mean that we have to act in any way.
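For illustration only, here is a minimal sketch of what reproducing such a list from Wikidata could look like. It assumes the public Wikidata Query Service endpoint and the identifiers P31 (instance of), Q6256 (country) and P36 (capital); the helper name `build_request_url` is hypothetical, and the code only builds the request URL without sending anything.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint (assumed here for illustration).
ENDPOINT = "https://query.wikidata.org/sparql"

# SPARQL query listing countries together with their capitals.
QUERY = """
SELECT ?country ?countryLabel ?capital ?capitalLabel WHERE {
  ?country wdt:P31 wd:Q6256 ;   # instance of: country
           wdt:P36 ?capital .   # capital
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def build_request_url(query: str = QUERY) -> str:
    """Return a GET URL asking the endpoint for JSON results (no request is sent)."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    print(build_request_url())
```

The point of the sketch is only that the result is derived from statements already in Wikidata, not from the hypothetical proprietary dataset.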

On the other side, if we had used that list to import the data - then that would in my understanding be a breach of the rights of that company. But if we did not and the result happens to be the same - well, that's how it is.

If the example is too simple, it could be easily extended to be larger and more substantial. My argument would still be the same.

What do you think?

I think this is a point where we all seem to agree: the problem is not equality of result, but provenance of data. Data that were collected from many sources into Wikidata without using a significant part of any single source don't seem to be a source of concern.

We should first agree that the problem is really about "substantial transfer of data"

It's not.

Psychoslave added a comment.EditedMay 25 2018, 2:32 PM

@Rspeer regarding the ontology: the ontology of Wikidata is genuinely unique and not copied from any Wikipedia project, or any other project. It has been created on Wikidata.

Well, I think it's fine that we keep concentrating on a single topic at a time, as several have suggested, including you. So for now, we might concentrate on USA copyright and how "original selection" might raise concerns or not. The structure of the database is a different topic.

Regarding the translations: we are talking about the labels of things in different languages? They are not even necessarily translations, mind you - it is often "there was an article on the German Wikipedia, here's an article on the English Wikipedia, let's connect these two". In general, most titles were not translated. Also, in many of those cases it is hard to argue for any threshold of creativity - the fact that 'London' is called 'Londres' in French is rather un-creative, and merely stating a fact.

Well, sure, some are closer to mere well-known convention, some require more reflection. Obviously no language pair will have a bijective relationship between terms, nor is there an existing convention to bridge all of them.

Also, technically, these are not translations in Wikidata, but labels of an entity in different languages.

I would be surprised if the multilingual labels of the Wikidata entities would trigger copyright, but even if it did, there is nothing in the current Wikipedias that would allow for this copyright to take effect: i.e. if we look at an entry such as Q23780914 - highly visible entry, labels in 50+ languages - there never existed anything in the Wikipedias which might have been under copyright.

Or am I missing something here?

Well, entries that were not created thanks to massive import from Wikipedia obviously don't raise any concern of infringement of <strike>Wikipedia</strike>Wikidata community copyright. It doesn't say more about those that were indeed imported in such a way, does it?

Well, entries that were not created thanks to massive import from Wikipedia obviously don't raise any concern of infringement of Wikipedia community copyright. It doesn't say more about those that were indeed imported in such a way, does it?

As far as I understand copyright, if by pure chance I happen to write exactly one page of Harry Potter, I can’t publish it as easily as I would want. The material is protected whatever the way it was created. So it does not matter whether we actually imported data or happen to have substantially the same result by other means.

@Rspeer
But even ignoring that, Wikidata does *not* store the same expression anyway. So what exactly is the copyright asserted on?

For now I propose we discuss the "original selection" criteria.

Even sticking with the single example of London, the French Wikipedia does have an article with this specific title, whose English interlanguage link points to "London (disambiguation)", and not to "London". So even with such a basic example, there are clearly choices that were made.

That said, the "original selection" pertains to the whole set, rather than to hand-picked little selections. Actually I've just been told that the large number of contributors would argue all the more in favour of originality (which must not be confused with novelty).

If any of those happened (or had to happen), I’d be out here and I guess many other Wikidata editors would also discontinue their efforts. There is great support for CC0 in Wikidata, since anything else that required attribution would render it useless; large-scale purging would tear down so much content that we would basically have to start again from the beginning. We are 5.5 years into this project and many of us have spent thousands of hours of effort into it, based on the unchallenged assumption (by WMF) that Wikipedia imports as we are doing them are legally fine.

How would it render it useless? Or more useless than today? For some actors like OSM it is useless because they don't trust Wikidata's claim that this data can legally be released under CC-0. For those who don't care about acting legally, any license will have the same effect.

Is there any official statement by the WMF saying that this kind of import is legally fine? Before Wikidata was launched, @Denny used to agree that Wikipedia imports would not be possible, including for legal reasons; that was at the time he was officially working for WMDE. If any official statement has clearly said otherwise in the meantime, it would be nice to highlight this information. Not being challenged by some entity doesn't mean that an action is legal; like any legal infraction, it doesn't become more legal because no one bothers to challenge it.

Once again, among the obvious solutions, indicating the license of sources would require no deletion. That would allow those who care about license issues to filter data by this criterion, and those who don't care to ignore it. What would be the argument against adding such a field?
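To make the proposal concrete, here is a minimal sketch of what filtering on such a field could look like. Everything in it is hypothetical: the `Statement` record, the `source_license` field, the license identifiers and the sample data are illustrations only, not Wikibase's actual data model, and which licenses a re-user accepts remains their own decision.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Statement:
    """A simplified statement with provenance (hypothetical model)."""
    subject: str
    prop: str
    value: str
    source: Optional[str] = None          # where the statement was imported from
    source_license: Optional[str] = None  # hypothetical "license" field on the source


def filter_by_license(statements: Iterable[Statement],
                      accepted: Set[str]) -> List[Statement]:
    """Keep only statements whose source license is in the accepted set.

    Statements without a documented license are excluded, since a re-user
    who cares about licensing cannot verify them.
    """
    return [s for s in statements if s.source_license in accepted]


# Made-up sample data for illustration.
SAMPLE = [
    Statement("Q84", "P17", "Q145", source="hypothetical source A",
              source_license="CC0-1.0"),
    Statement("Q84", "P17", "Q145", source="hypothetical source B",
              source_license="ODbL-1.0"),
    Statement("Q84", "P17", "Q145"),  # no provenance recorded
]

if __name__ == "__main__":
    for statement in filter_by_license(SAMPLE, {"CC0-1.0"}):
        print(statement)
```

Those who don't care about the field simply never call the filter; those who do choose their own `accepted` set.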

In T193728#4228561, @Micru wrote:

In a way Wikipedia already has a "contribute-alike" agreement, it is just not explicit, but tacit. Users come to the site, access it as they wish, and they are asked to make a donation every once in a while. It is not a contribution in the same way as writing an article, but it is still a contribution which is also necessary to keep the project alive.

From what I understand, you are describing the "same condition" which is expressed by the SA in the CC-BY-SA covering Wikipedia, but I might be misinterpreting your text. If not, I would recommend that you read the license. The best asset of our movement is not monetary donations, but its community, which gives time and effort through miscellaneous contributions.

As for the original purpose of the ticket it seems that we will have to wait for legal advice. Or if we want to have it more clear then the WMF could sue WMDE for copyright infringement and see what the court decision would be :-)

No, it's not possible, because the WMF doesn't hold copyright on Wikipedia content, and is actually the entity publishing the possibly infringing copyrighted material.

And even if it were possible, I personally wouldn't like to see such a move. I agree that it might be interesting to the extent that it would give a clear indication of what is possible or not:

  1. but I'm not sure how such a move could be framed upstream by a contract designed not to harm any party, whatever decision an official judgement makes downstream,
  2. I'm not sure this would be really great in terms of image.

If any of those happened (or had to happen), I’d be out here and I guess many other Wikidata editors would also discontinue their efforts. There is great support for CC0 in Wikidata, since anything else that required attribution would render it useless; large-scale purging would tear down so much content that we would basically have to start again from the beginning. We are 5.5 years into this project and many of us have spent thousands of hours of effort into it, based on the unchallenged assumption (by WMF) that Wikipedia imports as we are doing them are legally fine.

How would it render it useless? Or more useless than today? For some actors like OSM it is useless because they don't trust Wikidata's claim that this data can legally be released under CC-0. For those who don't care about acting legally, any license will have the same effect.

Users of Wikidata can compile datasets of any form and content with the query service, and re-use it according to the CC0 license (i.e.: do whatever they want to, particularly without attribution). If there was a convolution of individual licenses per fact involved, it would be practically impossible to get this sorted so that use of data was in line with all the licenses involved; even if one would be able to manage this, one could easily end up in a situation where one has to display thousands of sources in some way. Databases with more restrictive licenses than CC0 are useless for re-users, and Wikidata just aims to be useful for re-users.

I agree that we should ensure to put only compatible data into Wikidata, yet I am still not convinced that there is a systematic problem. From my point of view, there is no concern about the validity of Wikidata’s declaration that all content (in main and property namespace) is available under the CC0 license.

Is there any official statement by the WMF saying that this kind of import is legally fine? Before Wikidata was launched, @Denny used to agree that Wikipedia imports would not be possible, including for legal reasons; that was at the time he was officially working for WMDE. If any official statement has clearly said otherwise in the meantime, it would be nice to highlight this information. Not being challenged by some entity doesn't mean that an action is legal; like any legal infraction, it doesn't become more legal because no one bothers to challenge it.

We have used imports from Wikidata for years now, and this is not a hidden activity that one could have missed. WMF has definitely known about this for years. There were occasionally some dissenting opinions (not by WMF, AFAIK), but I cannot remember that anyone was able to raise concerns strong enough to reconsider the import practice. Until now, this has not changed in this conversation either.

Some references on why CC0 is essential for a free public database:
https://wiki.creativecommons.org/wiki/CC0_use_for_data
"Databases may contain facts that, in and of themselves, are not protected by copyright law. However, the copyright laws of many jurisdictions cover creatively selected or arranged compilations of facts and creative database design and structure, and some jurisdictions like those in the European Union have enacted additional sui generis laws that restrict uses of databases without regard for applicable copyright law. CC0 is intended to cover all copyright and database rights, so that however data and databases are restricted (under copyright or otherwise), those rights are all surrendered"

https://www.nature.com/nature/journal/v461/n7261/full/461171a.html
"Although it is usual practice for major public databases to make data freely available to access and use, any restrictions on use should be strongly resisted and we endorse explicit encouragement of open sharing, for example under the newly available CC0 public domain waiver of Creative Commons."

https://blog.datadryad.org/2011/10/05/why-does-dryad-use-cc0/
"Dryad’s policy ultimately follows the recommendations of Science Commons, which discourage researchers from presuming copyright and using licenses that include “attribution” and “share-alike” conditions for scientific data.

Both of these conditions can put legitimate users in awkward positions. First, specifying how “attribution” must be carried out may put a user at odds with accepted citation practice:

“when you federate a query from 50,000 databases (not now, perhaps, but definitely within the 70-year duration of copyright!) will you be liable to a lawsuit if you don’t formally attribute all 50,000 owners?” (Science Commons Database Protocol FAQ)

While “share-alike” conditions create their own unnecessary legal tangle:

“ ‘share-alike’ licenses typically impose the condition that some or all derivative products be identically licensed. Such conditions have been known to create significant “license compatibility” problems under existing license schemes that employ them. In the context of data, license compatibility problems will likely create significant barriers for data integration and reuse for both providers and users of data.” (Science Commons Database Protocol FAQ)

Thus,

“… given the potential for significantly negative unintended consequences of using copyright, the size of the public domain, and the power of norms inside science, we believe that copyright licenses and contractual restrictions are simply the wrong tool [for data], even if those licenses and contracts are used with the best of intentions.” (Science Commons Database Protocol FAQ)"

https://pietercolpaert.be/open%20data/2017/02/23/cc0.html
"Requiring that you mention the source of the dataset in each application that reuses my data, still complies to the Open Definition. There is no need to argue with anyone that uses for example the CC BY license: you will only have the annoying obligation that you have to mention the name in a user interface. This is useful for datasets which are closely tied to their document or database: when for example reusing and republishing a spreadsheet, I can understand you will want that someone attributes you for created that spreadsheet. However, for data on the Web, the borders between data silos are fading and queries are evaluated over plenty of databases. Then requiring that each dataset is mentioned in the user interface is just annoying end-users."
"The share alike requirement, as the name implies, requires that when reusing a document, you share the resulting document under the same license. I like the idea for “viral” licenses and the fact that all results from this document will now also become open data. However, what does it mean exactly for an answer that is generated on the basis of 2 or more datasets? And what if one of these datasets would be a private dataset (e.g., a user profile)? It thus would make it even more unnecessarily complex to reuse data, while the goal was to maximize the reuse of our dataset."

Aschmidt removed a subscriber: Aschmidt.May 25 2018, 5:59 PM

"what are the benefit for the Wikimedia community of using exclusively CC-0 for its single Wikibase instance usable in the rest of its environment?"

This question is, I think, less suitable for a lawyer. I think this is a very interesting question, but I'd rather focus now on answering the question that is directly pertinent to this ticket.

So, given the discussion as it has been going, I hope that the following questions sound good to everyone:

  1. Can you comment on the practice of having processes that in bulk extract facts from Wikipedia articles, which are published under CC-BY-SA, and store the results in Wikidata, where they are published under CC-0?
  2. Particular sets of facts we are interested in to consider would be: a) interwiki links, b) facts extracted from infobox templates, c) facts extracted from prose through natural language processing.
  3. What, if anything, may be imported from ODBL licensed databases like OSM into Wikidata, and republished under CC-0?

If I don't hear back by the middle of next week, I'm going to raise these as the questions we would kindly ask to be answered. I find the questions are already getting quite heavyweight - any ways to shorten them would be appreciated.

  1. … the practice of having processes that in bulk extract facts from Wikipedia articles …

You probably need to describe what these processes look like, otherwise this question would be impossible to answer properly. To my knowledge there are on average ~0.66 imports per Wikipedia article (based on the fact that we have ~42M “imported from” references and ~64M sitelinks in Wikidata). The CC-BY-SA license applies to the works of individual Wikipedia articles, so a proper definition of “bulk extract” in the context of those numbers would be very important.
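As a quick check of that average, the arithmetic can be sketched as follows. The two figures are the approximate ones quoted above; treating each sitelink as one Wikipedia article is the same simplification the estimate itself makes, and the variable names are only illustrative.

```python
# Approximate figures quoted in the comment above.
imported_from_references = 42_000_000  # ~42M "imported from" references
sitelinks = 64_000_000                 # ~64M sitelinks, used as a proxy for articles

# Average number of "imported from" references per article.
imports_per_article = imported_from_references / sitelinks

print(f"{imports_per_article:.2f} imports per article on average")
```

So the ~0.66 figure is simply 42/64, which means it depends entirely on how well "imported from" references and sitelinks stand in for imports and articles.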

based on the fact that we have ~42M “imported from” references and ~64M sitelinks in Wikidata

Hmm, I've added likely over 1000 of those "imported from" items myself by hand, for example for organization "official website" entries. So I would say "imported from" gives us an over-count of "bot" work, if that's the main issue here. Or is thousands of individuals adding these entries by hand also a concern?

Mateusz_Konieczny added a comment.EditedMay 26 2018, 8:26 PM

Or is thousands of individuals adding these entries by hand also a concern?

It does not matter at all whether things were copied by hand or by a script. Repainting a copyrighted image pixel by pixel would not change its legal status either. If it was legal to copy, then it is OK to copy it. If it was illegal to copy, it makes no difference whether it was copied by a script, by 100 dedicated editors, or by 10 000 people each copying one entry.

Users of Wikidata can compile datasets of any form and content with the query service, and re-use it according to the CC0 license (i.e.: do whatever they want to, particularly without attribution).

Well, with proper traceability of the chain of sources and their corresponding licenses, users of Wikidata would still be able to compile datasets of any form and content with the query service, and re-use them according to the licenses of this data. Or, if they prefer to ignore the licenses, not take them into account. Actually, as long as you don't aim at republishing this data, you will probably face no legal problem. But if someone wants to use the data, they have to conform to the terms of the last publisher in the source chain that published it through legal means. If this re-user doesn't want to follow these conditions, then indeed they cannot use this data legally from this source.

If there was a convolution of individual licenses per fact involved, it would be practically impossible to get this sorted so that use of data was in line with all the licenses involved; even if one would be able to manage this, one could easily end up in a situation where one has to display thousands of sources in some way. Databases with more restrictive licenses than CC0 are useless for re-users, and Wikidata just aims to be useful for re-users.

This argument doesn't hold. Either you are able to prove that your data set was obtained by legal means, in which case you need this traceability of sources and their corresponding legal terms of use, or it is practically impossible to reuse this data legally.

For re-users who really care about legality, this kind of legal traceability is a requirement. Maybe you personally don't like other licenses, but that doesn't make projects using different licenses useless.

So please provide concrete real-world examples where CC-0 made something possible that would have been impossible with proper license traceability. Because there are real practical cases like OSM where this lack of traceability makes Wikidata useless.

We have used imports from Wikidata for years now, and this is not a hidden activity that one could have missed. WMF has definitely known about this for years. There were occasionally some dissenting opinions (not by WMF, AFAIK), but I cannot remember that anyone was able to raise concerns strong enough to reconsider the import practice. Until now, this has not changed in this conversation either.

Where is the official statement of the WMF about significant transfer from copyrighted data banks that are not released under CC-0 into a CC-0 data bank? What are the thresholds not to exceed, if any? Until there are official statements in this regard, each individual is solely responsible for their own inferences, which only reflect their own personal beliefs.

For the moment, the closest we have to an official statement is this conclusion of the Wikimedia Foundation’s preliminary perspective on data base legal issue:

Whenever possible, the best course is to use only content that is made available by the author under an open license. In particular, for EU databases, the license should include a license or express waiver of the sui generis database right. In the absence of a license, copying all or a substantial part of a protected database should be avoided. Extraction and use of data should be kept to a minimum and limited to unprotected material, such as uncopyrightable facts and short phrases, rather than extensive text. For EU databases, bots or other automated ways of extracting data should also be avoided because of the Directive’s prohibition on “repeated and systematic extraction” of even insubstantial amounts of data.

How could this be interpreted otherwise than "Wikidata should avoid importing data under incompatible licenses, either by not importing it, or by conforming to the licenses of its sources"?

Databases with more restrictive licenses than CC0 are useless for re-users

This is clearly false, see OpenStreetMap as an example.

Some references on why CC0 is essential for a free public database:

Essential, no. Interesting, certainly.

Now the point is not whether CC-0 offers well-balanced convenience for factual databases, or better long-term consequences than any other license. The point is: can Wikidata really claim that what it publishes is under CC-0 when its sources are themselves published under incompatible terms of use and licenses?

Apart from this, which is the topic we currently focus on in this ticket, here are a few off-topic background reflections:

It's one thing to aim for any project financed by any government to be in the public domain, or in an equivalent legal status such as CC-0. It happens that many scientific studies are financed by public funds. In these cases, there are arguments to support that it's fair to release them under this kind of condition, as the public paid for them and scientists were remunerated for their work of data collection. The only nuance we could bring to that would be about data traceability, which arises from very different concerns than the ones that led to enacting copyright & sister information monopolies, and which is a rising concern in a world where fake news is on everyone's lips.

Now, in the Wikimedia movement, this is not how it is. Most contributors are volunteers, and no mandatory annual tax guarantees that next year there will still be plenty of money to support our infrastructure, whatever services other actors provide. If we consider it fine to transfer the whole collection of statements of every Wikipedia into Wikidata, then you can expect other actors to generate encyclopaedic articles using natural language generation from Wikidata derivative databases augmented with statements and means we don't have access to, under whichever terms of use they like – most probably under licenses with CC-0. As stated in the previous link the main requirement for implementing NLG is the ownership and access to a structured dataset. We can add that it also requires an adapted infrastructure and skilled people to glue the whole together. A copyleft license does bring some sustainability to the fate of volunteer communities and their common work that a non-copyleft license fails to provide. With that in mind, how is putting no clear limits on the transfer of data from all Wikimedia projects into a single CC-0 project different in spirit from re-licensing all these projects under CC-0?

So the question is whether we are more concerned about making the Wikimedia community grow in a sustainable way or about maximizing immediate re-use of the works it has generated so far, and what the best technical and legal tools are to achieve whichever we are aiming for.

https://pietercolpaert.be/open%20data/2017/02/23/cc0.html
"However, for data on the Web, the borders between data silos are fading and queries are evaluated over plenty of databases. Then requiring that each dataset is mentioned in the user interface is just annoying end-users."

Well, law is often annoying for end-users in immediate situations, but lack of law can be even more detrimental on a large scale. So it's not really a convincing argument for promoting the waiver of the corresponding rights, is it? Yes, requiring people to recognize each other's significance in their own actions is an additional constraint compared to requesting them to waive any form of recognition, but it certainly cannot be reduced to a useless, annoying demand, as it comes hand in hand with respect for individual dignity.

Even if we are favourable to the public domain for works resulting from public funding, we don't have to approve this kind of argument, which, stated in such a form, basically promotes a regression in the recognition of everyone's dignity.

"The share alike requirement, as the name implies, requires that when reusing a document, you share the resulting document under the same license. I like the idea for “viral” licenses and the fact that all results from this document will now also become open data. However, what does it mean exactly for an answer that is generated on the basis of 2 or more datasets? And what if one of these datasets would be a private dataset (e.g., a user profile)? It thus would make it even more unnecessarily complex to reuse data, while the goal was to maximize the reuse of our dataset."

The word "viral" is itself terminology designed to convey noxious innuendo. Talking about inheritance as a biological analogy is both more neutral and more relevant, as only derivative works are concerned and they foster the original work beyond itself.

What it legally implies to use two datasets depends on the two datasets. If these are two datasets of a few items, it probably implies nothing legally because they don't generate any monopoly in the first place. If each dataset is under a legal information monopoly, then it implies that you can only mix them if you were granted the appropriate rights. Merging these two databases into a single one and claiming it's released under a license incompatible with at least one upstream database is most probably a legal infringement.

Maybe the goal of some actors is to maximize immediate re-use of datasets, but this is not what is claimed as the goal of the Wikimedia strategic direction. Here is an extract:

We need social and technical systems that avoid perpetuating structural inequalities. We need hospitable communities that lead to sustainability and equal representation. We need to challenge inequalities of access and contribution, whether their cause is social, political, or technical. As a social movement, we need knowledge equity.

Improving sustainable accessibility to trustworthy knowledge for everyone seems compatible with these claimed goals. Maximizing immediate re-use at any price, even to the detriment of the sustainability of our community, doesn't seem to uphold our goals.

Folks, can we please not start the discussion about whether Wikidata should be CC-0 or not again? We've had it. It is. Let's please concentrate on the question of which imports are ok and which are not. Because that does need clarification.

As far as I understand copyright, if by pure chance I happen to write exactly one page of Harry Potter, I can’t publish it as easily as I would want. The material is protected whatever the way it was created. So it does not matter whether we actually imported data or happen to have substantially the same result by other means.

My understanding is that we are in a rather different case with Wikidata. It would be more as if we considered it fine to aim at extracting every single predicate possible out of the Harry Potter books and releasing them under CC-0. That means that you could potentially indeed rebuild the very same set of books with a prose automaton, but also (and most likely) many other prose variations. And if you included predicates extracted from fan fiction from different sources, and unsourced predicates around the Harry Potter universe, including potentially completely novel ones, that would probably be an even closer analogy with the current state of Wikidata.

Psychoslave added a comment.EditedMay 27 2018, 9:24 AM

"what are the benefit for the Wikimedia community of using exclusively CC-0 for its single Wikibase instance usable in the rest of its environment?"

This question is, I think, less suitable for a lawyer. I think this is a very interesting question, but I'd rather focus now on answering the question that is directly pertinent to this ticket.

I agree, this part was off topic and intended to give more background on my thoughts.

So, given the discussion as it has been going, I hope that the following questions sound good to everyone:

  1. Can you comment on the practice of having processes that in bulk extract facts from Wikipedia articles, which are published under CC-BY-SA, and store the results in Wikidata, where they are published under CC-0?
  2. Particular sets of facts we are interested in to consider would be: a) interwiki links, b) facts extracted from infobox templates, c) facts extracted from prose through natural language processing.
  3. What, if anything, may be imported from ODBL licensed databases like OSM into Wikidata, and republished under CC-0?

If I don't hear back by the middle of next week, I'm going to raise these as the questions we would kindly ask to be answered. I find the questions are already getting quite heavyweight - any ways to shorten them would be appreciated.

I would add to 2. at least "category trees". Also "graph of internal links" and "structure of Project: namespaces" come to my mind.

Otherwise I totally agree with this set of questions, thank you for coming up with them.

I would also find it interesting to extend them to cover lexicographical works. Wiktionary is the most obvious concern regarding Wikimedia projects, but it would be nice to know if we can also use other sources such as Google n-gram, or even some parts of non-free copyrighted material. So can we import

  • the complete list of covered words, including to which language they pertain, and grammatical categories
  • all inflection relations
  • all phonetic relations
  • etymology descriptions, that is prose describing them through sentences
  • etymological relations, that is the set of etymons from which they directly derived as well as the supposed ancestor trees of this etymons themselves, all that exported as a graph
  • the number of definitions for each lexical entry
  • the whole set of complete definitions
  • the complete list of relations established by internal links in each definition
  • translation relations, including to which precise definition they pertain
  • set of multimedia files used
  • complete list of synonyms, antonyms, hyponyms, and so on
  • sets of terms grouped by theme, for example all terms related to love, as in the French Wiktionary thesaurus
  • number of occurrences in a given corpus
  • grammatical features, like whether a verb is transitive, ergative, defective, and so on

@Denny, what do you think?

Psychoslave added a comment.

In T193728#4231659 https://phabricator.wikimedia.org/T193728#4231659,
@TomT0m https://phabricator.wikimedia.org/p/TomT0m/ wrote:

That means that you could potentially indeed rebuild the very same set of
books with a prose automaton, but also (and most likely) many other prose
variations. And if you included predicates extracted from fanfictions
from different sources and unsourced predicates around the Harry Potter
universe, including potentially completely novel ones, that would probably
be an even closer analogy with the current state of
Wikidata.

This is rather unrealistic at this point and unrelated to my point. The
question raised is rather: Wikipedia has an infobox. Wikidata has the same
information as that infobox, but it was not imported but rebuilt
from scratch by a user.

Question: is the fact that it was rebuilt relevant to copyright, or does
only the fact that the information is the same matter?
I'd tend to think that the equality of information is the key, which would
greatly reduce the relevance of endlessly discussing whether or not it
was copied from a Wikipedia.

Said differently: the fact that the information is in Wikipedia would
prevent having it in Wikidata or any other database.

based on the fact that we have ~42M “imported from” references and ~64M sitelinks in Wikidata

Hmm, I've added likely over 1000 of those "imported from" items myself by hand, for example for organization "official website" entries. So I would say "imported from" gives us an over-count of "bot" work, if that's the main issue here. Or is thousands of individuals adding these entries by hand also a concern?

Yes, it doesn't matter whether the transfer is done by bots or by manual work; the point is whether the transfer is significant, whatever the means.

Said differently: the fact that the information is in Wikipedia would
prevent having it in Wikidata or any other database.

No, I don't see on what legal basis someone could claim anything even approaching such an extensive monopoly on factual data.

Hi - my most recent response was following MisterSynergy's comment on Denny's proposed questions, and specifically the meaning of "processes that in bulk extract facts from Wikipedia articles," - it sounds like from subsequent discussion that we are not talking solely of automated "processes", so I think I echo MisterSynergy's comment that the question needs to be better defined to "describe how these processes look like". On the one hand there's overall averages, with less than one "fact" per wikipedia article; on the other hand the distribution is probably quite wide, with some articles having dozens of "facts" extracted from them. Since CC-BY-SA applies to each article individually, does extraction of too much factual data from one article potentially violate its copyright?

Here's a specific question that might be detailed enough in description: suppose we have a collection of facts (say the names, countries, inception dates, and official websites for a collection of organizations) that has been extracted from multiple sources, including various language wikipedias, a CC-0 data source (for example https://grid.ac/) and a non-CC-0 non-wikipedia data source - these sources would be indicated in wikidata by the reference/source section on each statement. This extraction has been done by users either manually or running bots with the understanding that they are adding facts to a CC-0 database (wikidata). Reconciling the facts - for example merging duplicates with slightly different names, dates, or URL's - has been done by users manually or semi-automatically, again with the understanding they are contributing to a CC-0 database. Are there any copyright or other rights constraints that apply to this collection, or can it be fully considered to legally be CC-0?

Hi @Denny, any news regarding this? What do you think about my additional, more precise questions on lexical data?

Etalab (which runs the open data portal of the French government) has released a statement (in French) concerning the attribution requirement of its "licence ouverte", confirming that it only applies to the first re-user.
https://github.com/etalab/wiki-data-gouv#point-juridique

Therefore it is possible for a license to require attribution and still be fully compatible with Wikidata. So we should really stop asserting that we can only import CC0 data in Wikidata, I think.

Psychoslave added a comment.EditedAug 6 2018, 6:22 AM

Hi @Pintoch if a license is compatible with CC0 requirements, then yes it can be imported into any dataset covered by CC0, including Wikidata.

The link you are providing points to the minutes of a workshop that I attended. I think it's great that Etalab took such a clear position, all the more since I personally insisted on having such a clarification. I don't know how official that makes the statement, but at least it's there to be shown. Now I must say that, having re-read the official text of the licence ouverte, I don't agree with this interpretation. The license explicitly states:

This licence has been designed to be compatible with any free licence that requires at least attribution, and in particular with the previous version of this licence as well as with the "Open Government Licence" (OGL) of the United Kingdom, "Creative Commons Attribution" (CC-BY) from Creative Commons, and "Open Data Commons Attribution" (ODC-BY) from the Open Knowledge Foundation.

That is, as actually worded, the license aims at compatibility with licenses that include an attribution clause. The license explicitly states that derived works should come with a reference to the source and the date at which it was retrieved. On the other hand, I don't see any mention of "this duty only holds for the first user" in the text of the license. So maybe the Etalab interpretation is based on legal considerations outside the license itself, but as far as I understand, if I want to import a dataset covered by the licence ouverte into another one, I must provide at least a link to the source and the date at which I performed the retrieval, and any derivative work should do the same, as for a work released under a CC-BY license. To my mind the former is technically manageable within Wikidata, but the statement that Wikidata is released under CC0 breaks the latter.

Alsee added a subscriber: Alsee.Sep 9 2018, 12:43 AM

There is a very serious error pervading this discussion. Everyone is working on the presumption that Wikidata is importing pure facts. This is false. Wikidata often imports creative works of authorship.

I went to Wikidata and clicked "Random item"; it took me a matter of seconds to find an example of Wikidata copying creative work licensed under CC-BY-SA, specifically from English Wikipedia. An item popped up for a geographical feature, listing coordinates. When I checked it on a map, I saw that the contributor had selected an arbitrary and highly unusual coordinate to represent the geographical feature. In particular, the coordinates consisted of 16 digits. Approximately 6 digits of that information were factual (within the geographical feature), and approximately 10 of the 16 digits were a creative work of authorship of the contributor.

I could very easily make a series of edits, adding valid coordinates to articles, and embed a watermark in my creative selection of precise location. A bot would soon come by and import my work into Wikidata. I could then prove my creative authorship in court by explaining how to read the watermark. The watermark spread across those contributions could contain the hidden text "Copyright by Alsee@Wikipedia.org licensed under Creative_Commons_Attribution-ShareAlike_3.0_Unported_License https://creativecommons.org/licenses/by-sa/3.0/ ".
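To make the watermarking idea above concrete, here is a minimal Python sketch. It is purely illustrative and not something known to be in use on Wikipedia or Wikidata: the function names, the choice of 4 "factual" decimal places, and the 3-digits-per-byte encoding are all assumptions made for this example.

```python
FACT_PLACES = 4   # assumed: decimal places treated as factual (roughly 11 m)
BYTE_DIGITS = 3   # assumed: each hidden byte is stored as three digits (000-255)


def embed(coord: str, byte: int) -> str:
    """Append one hidden byte after the factual decimal places of a coordinate."""
    whole, _, frac = coord.partition(".")
    # Pad or truncate the fractional part to exactly FACT_PLACES digits.
    frac = (frac + "0" * FACT_PLACES)[:FACT_PLACES]
    return f"{whole}.{frac}{byte:0{BYTE_DIGITS}d}"


def extract(coord: str) -> int:
    """Read back the hidden byte stored after the factual decimal places."""
    frac = coord.partition(".")[2]
    return int(frac[FACT_PLACES:FACT_PLACES + BYTE_DIGITS])


def watermark(coords: list[str], message: str) -> list[str]:
    """Spread an ASCII message over a list of coordinates, one byte each."""
    data = message.encode("ascii")
    if len(data) > len(coords):
        raise ValueError("not enough coordinates to carry the message")
    marked = [embed(c, b) for c, b in zip(coords, data)]
    return marked + coords[len(data):]


def read_watermark(coords: list[str], length: int) -> str:
    """Recover the first `length` hidden bytes as ASCII text."""
    return bytes(extract(c) for c in coords[:length]).decode("ascii")


coords = ["51.5074", "-0.1278", "48.8566", "2.3522"]
marked = watermark(coords, "CC")
```

Each marked coordinate still points to the same place at the assumed factual precision; only the trailing digits differ, and anyone who knows the scheme can read the message back with `read_watermark(marked, 2)`.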

Furthermore many text fields contain creative work. (In case anyone has difficulty with the concept of authorship of a string of digits, or of watermarks.)

A more accurate question for the lawyers is this:
Can you comment on the practice of bulk extraction from Wikipedia articles, of factual information AS WELL AS individually-short works of creative authorship licensed under CC-BY-SA, and publishing it in Wikidata under a claim of CC-0?

In addition, I'd like to specifically ask:
Could the assertion of CC-0 constitute slander of title?
Slander of title might be more serious than the direct concerns of license infringement.

Nemo_bis renamed this task from Solve legal uncertainty of Wikidata to Address concerns about perceived legal uncertainty of Wikidata.Oct 5 2018, 8:09 AM

On the copyright of maps OpenStreetMap suggests:

Instead, courts now look closely for evidence of originality in either the aesthetic choices made in rendering the map or in the selection of aspects included. Note, however, that mere color choices or basic styling of components of the map are not themselves significant enough to warrant protection. [...] In instances when contributions come in the form of raw GPS paths, they are unlikely to be deemed independently copyrightable given that they are simply a set of GPS coordinates.

I don't see why the extra 6 digits of information should be regarded as aesthetic choices or selection of aspects of a map. They are more like the simple set of GPS coordinates.

@ChristianKl please do not infer that a third party legal opinion that wasn't even commissioned by the OSMF is a statement of "OSM" or the OSMF.

I'm linking to a page on OpenStreetMap.

You are linking to a document that is -stored- on the OSM wiki. I'm sure the WMF would be just as happy if I posted a couple of million links to documents from wikimedia commons and claimed these are statements of the WMF and reflect their opinions.

A related thread, "Do we want to bot-copy descriptions to captions?", was launched on Commons following the introduction of structured data integration on the project.