
Include copyright metadata based on Wikidata P6216
Open, Needs Triage, Public


We're currently adding CC-BY-SA and GFDL to every book's metadata. This is incorrect for Public Domain works, and likely to be wrong for other works too. With the exception of the About page, the content of the exported books is not covered by the wiki's default license.

Many books have copyright templates on their front pages (e.g. the templates discussed in T274452). This means that the correct license is at least identified, but it's not very machine-readable, and it conflicts with what's in the metadata of the EPUBs. The metadata currently looks like this:

<dc:rights xml:lang="en">Creative Commons BY-SA 3.0</dc:rights>
<link rel="cc:license" href="" />
<dc:rights xml:lang="en">GNU Free Documentation License</dc:rights>
<link rel="cc:license" href="" />

This should be changed to:

  • look up the correct status from the copyright status property (P6216) on the edition's item, or failing that on the work's item;
    • the status could instead be determined via the text content of the template or via categories (such as Category:PD-old (Q19754287)), but these aren't very machine-readable at the moment;
  • add this to the metadata; and
  • show an error message for books that have no copyright status.
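
The steps above could be sketched roughly as follows. This is only an illustration in Python, not the export tool's actual code: it assumes the edition and work items have already been fetched as Wikidata entity JSON and decoded into dicts, and the names status_ids, resolve_status, rights_elements and MissingCopyrightStatus are made up for this example. Turning a status item ID into a human-readable label is left out.

```python
from xml.sax.saxutils import escape

COPYRIGHT_STATUS = "P6216"  # Wikidata property: copyright status


class MissingCopyrightStatus(Exception):
    """Raised when neither item carries a copyright status statement."""


def status_ids(entity):
    """Return the item IDs of all P6216 values on a Wikidata entity dict."""
    ids = []
    for claim in entity.get("claims", {}).get(COPYRIGHT_STATUS, []):
        snak = claim.get("mainsnak", {})
        # Skip "unknown value" / "no value" statements, which carry no item ID.
        if snak.get("snaktype") == "value":
            ids.append(snak["datavalue"]["value"]["id"])
    return ids


def resolve_status(edition, work=None):
    """Prefer the edition's status; fall back to the work's only if needed.

    Returns (source, ids) so the caller can tell when a fallback happened.
    """
    for source, entity in (("edition", edition), ("work", work)):
        if entity is not None:
            ids = status_ids(entity)
            if ids:
                return source, ids
    raise MissingCopyrightStatus("no copyright status (P6216) found")


def rights_elements(labels, lang="en"):
    """Build one dc:rights element per human-readable status label."""
    return [
        f'<dc:rights xml:lang="{lang}">{escape(label)}</dc:rights>'
        for label in labels
    ]
```

Returning the source alongside the IDs lets the caller warn, or refuse, when the status came from the work item rather than the edition. The exception is where the proposed error message for books without a copyright status would be raised.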

Event Timeline


I'm not sure what the best way is to pull license info out of the HTML. There are templates, such as {{pd-old}}, which put things in categories, and those categories are linked to Wikidata. That gives us something that works across wikis, but I don't know if there's anything linking the categories to the actual license items (e.g. public domain (Q19652)).

I am probably only stating the obvious, but license templates are supposed to contain machine-readable classes as defined on since about a decade ago. Several wikis have implemented that in a half-hearted way or not at all, but I don't think it is unreasonable to ask folks to work on that. It should also be less work than, for example, adding a copyright status to the Wikidata items of all pages.

I also like the idea of taking the copyright status from the edition items, but please be careful about using a copyright status from a work item, as it will often be different from the status of the edition (e.g. if the edition is a translation).

Hmm. Does it actually need to be machine-readable? I would have thought what was wanted was a way to just identify the license template output so that it could be rendered in the appropriate place, but otherwise just use the on-wiki rendered template. Structured data is nice for all sorts of other reasons, but for this purpose I would think a simple CSS class would be sufficient; or possibly an ID in order to ensure there is only one container for license information.
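
To make the CSS-class idea concrete, here is a minimal sketch using only Python's standard html.parser. The class name ws-license is purely hypothetical (no wiki is known to use it), and license_blocks is a name invented for this example. It returns only the text of each marked container, whereas a real exporter would more likely keep the rendered HTML.

```python
from html.parser import HTMLParser

LICENSE_CLASS = "ws-license"  # hypothetical class name, not an existing convention
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "wbr"}


class LicenseExtractor(HTMLParser):
    """Collect the text inside every element carrying LICENSE_CLASS."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside the current license container
        self.blocks = []  # raw text of each license container found

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        if self.depth:
            self.depth += 1
        elif LICENSE_CLASS in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.blocks.append("")

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.blocks[-1] += data


def license_blocks(html):
    """Return the whitespace-normalised text of each license container."""
    parser = LicenseExtractor()
    parser.feed(html)
    parser.close()
    return [" ".join(block.split()) for block in parser.blocks]
```

If the result holds more than one entry, the page has several license containers, which is the case the suggestion of an ID rather than a class is meant to rule out.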

Structured data will quickly become complicated for things like "translation license" or "multiple license", and when it has to express all the subtleties of cases that are not simply "copyright has expired". For example, enWS permits works that are PD in the US, but other projects may need PD in a different jurisdiction or in multiple jurisdictions (à la Commons). There may also be subtle differences due to different interpretations: enWS has an expanded definition for PD-EdictGov (since ca. 2019, due to a SCOTUS opinion) that may or may not be applicable on other projects.
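
As a rough illustration of the jurisdiction problem, the sketch below groups an item's copyright status (P6216) statements by their "applies to jurisdiction" (P1001) qualifier, so that each project could ask for the status in the jurisdiction it cares about (for enWS, the United States, Q30). It assumes statements are actually qualified this way, which will not hold for every item, and the function names are invented for this example. It does nothing about translation licenses or differing interpretations.

```python
STATUS = "P6216"        # copyright status
JURISDICTION = "P1001"  # applies to jurisdiction (used here as a qualifier)


def statuses_by_jurisdiction(entity):
    """Map jurisdiction item ID (None if unqualified) to status item IDs."""
    result = {}
    for claim in entity.get("claims", {}).get(STATUS, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue
        status = snak["datavalue"]["value"]["id"]
        qualifiers = claim.get("qualifiers", {}).get(JURISDICTION, [])
        places = [
            q["datavalue"]["value"]["id"]
            for q in qualifiers
            if q.get("snaktype") == "value"
        ] or [None]  # an unqualified statement is filed under None
        for place in places:
            result.setdefault(place, []).append(status)
    return result


def status_for(entity, jurisdiction):
    """Statuses for one jurisdiction, else any unqualified ones."""
    by_place = statuses_by_jurisdiction(entity)
    return by_place.get(jurisdiction) or by_place.get(None, [])
```

Falling back to unqualified statements is a design choice of this sketch. A stricter exporter might treat a missing jurisdiction-specific status as an error instead.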

License templates also often express an end point rather than the inputs: "after checking place of publication, existence of previous publications, possibility of concurrent publication within 30 days, identity of author, additional contributors (illustrators etc.), date of publication, etc. it was determined that this work's copyright status would best be described by this license template". But for a structured data approach you really want access to the inputs so you could translate it to a different license template with compatible but slightly different semantics.