Add schema.org structured data to images on Commons and Wikipedia to meet Google's requirements
Open, Needs Triage, Public

Description

We need to add schema.org structured data to all images on Commons and images from Commons on Wikipedia in order to meet Google's requirements (see also here https://developers.google.com/search/docs/data-types/image-license-metadata) and continue to have the images appear in Google searches using the "Usage rights" filter.

We are looking at an approximate deadline of November 2020 to complete this work.

When the web team added a schema.org script to article pages, they set up an A/B test to ensure SEO was positively affected, and they confirmed that the added page weight was within acceptable limits. Adding the license data is not optional for us, but we should still determine if our work impacts page weight/performance so we can mitigate this if possible and communicate the change to those who may be concerned about it.

COVID-19 Deployment Criteria

  • Can you roll back this change without lasting impact?
    1. A recovery plan is required as this will help identify our capacity for recovering from the failure
    2. THIS IS A KEY QUESTION, if you can’t answer it, you shouldn’t deploy
  • Is specialized knowledge required to support this change in production? If so, are there multiple people with this knowledge?
  • Is there a way to increase confidence about the correctness of this change?
    1. Reviews (Design, Code, etc)
    2. Testing coverage (unit tests, integration tests)
    3. Manual testing (e.g. Beta, vagrant, docker)

Event Timeline

There are 2 ways we can add license metadata:

  1. Add structured data using schema.org vocabulary
  2. Add IPTC photo metadata to images. We'd need to add this metadata to existing images and automate the addition of it to new uploads. I'm assuming we don't want to do this.

If we go with structured data, there are a couple of different ways to include it:

  1. The preferred way is to add a JSON-LD script to the <head> of the page that contains structured data for each image on the page.
  2. The other option is to include microdata, which are HTML5 attributes included directly in the markup.

The structured data we need are:

  1. license: A link to a page that describes the license for that image
  2. acquireLicensePage: A link to a page that describes where the user can find information on how to license that image

Example of a script for a single image:

<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "ImageObject",
  "url": "https://example.com/photos/1x1/black-labrador-puppy.jpg",
  "license": "https://example.com/license",
  "acquireLicensePage": "https://example.com/how-to-use-my-images"
}
</script>

Some questions/topics for discussion:

  1. Do we definitely want to rule out IPTC photo metadata? The advantage with this is that we only have to include the data once instead of on all pages where the image exists.
  2. Adding a script to the <head> of a Commons filepage containing structured data about a single image sounds simple enough (famous last words). Adding a script to a Wikipedia page with multiple images starts to sound more complicated. I'd like to identify the challenges this implementation would present. Note that we can add the JSON-LD script on the back-end or dynamically on the front-end and it will still be read by Google.
  3. What concerns do members of the community have about adding this data?
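To make the second question a bit more concrete, here is a minimal Python sketch of what a single script covering several images on one page could look like. It is only an illustration: `build_image_jsonld_script` and the shape of its input are hypothetical, not existing MediaWiki code, and it assumes the caller already knows each image's URL, license link and acquireLicensePage link.

```python
import json


def build_image_jsonld_script(images):
    """Return one <script> element describing every image on a page.

    `images` is a list of dicts with the keys "url", "license" and
    "acquireLicensePage" (a hypothetical input shape for this sketch).
    """
    objects = [
        {
            "@context": "https://schema.org/",
            "@type": "ImageObject",
            "url": image["url"],
            "license": image["license"],
            "acquireLicensePage": image["acquireLicensePage"],
        }
        for image in images
    ]
    # Escape "<" so a value containing "</script>" cannot close the element early.
    payload = json.dumps(objects, indent=2).replace("<", "\\u003c")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"
```

The script body is a JSON array with one ImageObject per image, so page weight grows with the number of images described.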

> What concerns do members of the community have about adding this data?

@Keegan, do you have any thoughts on this by any chance?

> Do we definitely want to rule out IPTC photo metadata? The advantage with this is that we only have to include the data once instead of on all pages where the image exists.

Yes, we definitely want to rule out adding metadata to the images themselves 😃 There are all sorts of issues with adding metadata to images (including the hash-checking used to identify duplicates).

I'll touch base with folks on this tomorrow.

Cparle updated the task description.

I did some work on making Wikidata output W3C-compliant (JSON-LD), so I might be able to offer some advice.

@Niedzielski and @ovasileva, is this something we could help the structured data team with, given our schema.org experience?

Here's the task tree for adding "sameAs" schema data to all wikis and all projects. Some of the subtasks have useful detail. @cscott (and many others) helped make this effort successful. As far as I know, there haven't been any complaints. I'm unfortunately quite busy with other responsibilities but happy to answer questions as able.

Changing uploaded files is scary (data loss if anything goes wrong, changing what data needs to be added is a major PITA, duplicate checking breaks during train rollout of changes to the image modification algorithm...). Adding metadata to thumbnails is fine (to some extent we do it already), although thumbnails are preserved forever so purging all existing thumbnails would take a while. The task for it is T5361: Embed image author, description, and copyright data in file metadata fields. Note though that sometimes we display the original file, not a thumbnail. It would be nice to get rid of that for a number of reasons (T67383: Generate optimised thumbnail even when dimensions match original) but it involves some gnarly areas of legacy code so I wouldn't expect it to be easy. Although that task implies that it's already being done for Wikimedia production at least... maybe @Gilles can clarify.

The more fundamental problem is that license metadata changes over time, and leaving outdated versions around is a legal liability. Do we really want to purge all thumbnails any time the file description page is edited? (We probably don't want to purge the articles including that image either, but then there doesn't seem to be any reason to - if the file page provides schema.org metadata and the article doesn't, Google can surely figure out that the file page is the one that needs to be associated with the image.)

Also, the thumbnailer would have to fetch license metadata from MediaWiki, or maybe MediaWiki would have to push it into Swift as some custom header... either way it would be slightly awkward.

Out of curiosity, what does it mean that Google requires us to do this? I imagine they won't skip indexing an image just because it does not have schema.org data associated with it...
Also, where do you plan taking the license metadata from? CommonsMetadata or do you expect the data model discussions to be wrapped up by then?

Google staff have told us pretty plainly that Google Image searches using the "Usage rights" filter will be heavily affected by the change they'll enforce in the future. Wikipedia and Commons results will probably stop showing up at all. We currently rank very high in such searches.

> Out of curiosity, what does it mean that Google requires us to do this? I imagine they won't skip indexing an image just because it does not have schema.org data associated with it...
> Also, where do you plan taking the license metadata from? CommonsMetadata or do you expect the data model discussions to be wrapped up by then?

There might be a connection to T118517: [RFC] Use <figure> for media/T251641: RFC: HTML element for inline media from wikitext. Parsoid uses RDFa semantic markup for its HTML, and it probably wouldn't be too hard to add the appropriate attributes (and CSS rules). As a rough sketch, we currently emit:

<figure(-inline) typeof="mw:Image"> <!-- or mw:Image/Thumb, mw:Image/Frame etc -->
 <a or span><img resource="..." src="..."></a or span>
 <figcaption (absent when inline)>...</figcaption>
</figure(-inline)>

and the following would need to be added (note: I don't actually recommend doing this, JSON-LD version is better, see below):

<figure typeof="mw:Image schema:ImageObject">
<a href="..."><img resource="...." property="url" ... /></a>
<figcaption property="name">...</figcaption>
<div property="license">https://example.com/license</div>
</figure>

ie, an additional clause in typeof, a few property attributes, and an extra <div> for the license link, which would be hidden by CSS.

But this raises the question: is the semantic markup expected to apply to the *thumbnail* or to the *original image*? This is a big weakness in all of the semantic markup specs, and one I brought up at iAnnotate a few years ago, and I don't think it's been answered. If you add markup such as the above, you are talking about licensing/indexing the *thumbnail image*, not the actual full-res image. And in general Commons image pages also contain a *thumbnail* (perhaps a larger one), not the original image itself. Of course this is also an issue with the proposals to embed license metadata directly in *thumbnail* images: the markup ends up licensing/describing the thumbnail, not the full-resolution original.

In order to properly apply the semantic markup, the JSON-LD in <head> approach used by @Niedzielski in T209306 is probably the only correct solution. The JSON-LD would describe an image URL *which appears nowhere on the page* (i.e., the full-resolution original), and then use the thumbnail property of schema:ImageObject to link to the URL which actually appears on the page. This would be appropriate markup for pages in the File: namespace; I don't think any really appropriate markup exists for article pages because of this weird mismatch. What's wanted is a relation in the opposite direction, where you'd mark up that the images on article pages had some not-yet-existent isThumbnailOf or isDescribedBy property to link them to either the full-size original or to the file description page. However, there is the representativeOfPage boolean; that *might* be useful in describing the lead image on an article. But again, you'd describe the full-size image and mark it as representativeOfPage and then record that the URL actually appearing in the article HTML is a thumbnail of that image.

Concretely, for https://en.wikipedia.org/wiki/Douglas_Adams as an example:

<html>
<head>
<script type="application/ld+json">
    [{
      "@context": "https://schema.org/",
      "@type": "ImageObject",
      "url": "https://upload.wikimedia.org/wikipedia/commons/c/c0/Douglas_adams_portrait_cropped.jpg",
      "name": "Portrait of Douglas Adams",
      "license": "https://creativecommons.org/licenses/by-sa/2.0",
      "representativeOfPage": true,
      "thumbnail": {
          "@type": "ImageObject",
          "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c0/Douglas_adams_portrait_cropped.jpg/220px-Douglas_adams_portrait_cropped.jpg"
      }
    }]
</script>
</head>
<body>
<table class="infobox vcard" style="width:22em"><tbody><tr><th colspan="2" style="text-align:center;font-size:125%;font-weight:bold"><div style="display:inline;" class="fn">Douglas Adams</div></th></tr><tr><td colspan="2" style="text-align:center"><a href="/wiki/File:Douglas_adams_portrait_cropped.jpg" class="image"><img alt="Douglas adams portrait cropped.jpg" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/c0/Douglas_adams_portrait_cropped.jpg/220px-Douglas_adams_portrait_cropped.jpg" decoding="async" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/c0/Douglas_adams_portrait_cropped.jpg/330px-Douglas_adams_portrait_cropped.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/c/c0/Douglas_adams_portrait_cropped.jpg 2x" data-file-width="333" data-file-height="386" width="220" height="255"></a></td></tr><tr><th scope="row" style="padding-top:0.225em;line-height:1.1em;padding-right:0.65em;">Born</th><td style="line-height:1.4em;">Douglas Noel Adams<br><span style="display:none">(<span class="bday">1952-03-11</span>)</span>11 March 1952<br><a href="/wiki/Cambridge" title="Cambridge">Cambridge</a>, <a href="/wiki/Cambridgeshire" title="Cambridgeshire">Cambridgeshire</a>, England</td></tr><tr><th scope="row" style="padding-top:0.225em;line-height:1.1em;padding-right:0.65em;">Died</th><td style="line-height:1.4em;">11 May 2001<span style="display:none">(2001-05-11)</span> (aged&nbsp;49)<br><a href="/wiki/Montecito,_California" title="Montecito, California">Montecito, California</a>, US</td></tr><tr><th scope="row" style="padding-top:0.225em;line-height:1.1em;padding-right:0.65em;">Resting place</th><td style="line-height:1.4em;"><a href="/wiki/Highgate_Cemetery" title="Highgate Cemetery">Highgate Cemetery</a>, London, England</td></tr><tr><th scope="row" style="padding-top:0.225em;line-height:1.1em;padding-right:0.65em;">Occupation</th><td class="role" style="line-height:1.4em;">Writer</td></tr><tr><th scope="row" 
style="padding-top:0.225em;line-height:1.1em;padding-right:0.65em;">Alma&nbsp;mater</th><td style="line-height:1.4em;"><a href="/wiki/St_John%27s_College,_Cambridge" title="St John's College, Cambridge">St John's College, Cambridge</a></td></tr><tr><th scope="row" style="padding-top:0.225em;line-height:1.1em;padding-right:0.65em;">Genre</th><td class="category" style="line-height:1.4em;"><a href="/wiki/Science_fiction" title="Science fiction">Science fiction</a>, <a href="/wiki/Comedy" title="Comedy">comedy</a>, <a href="/wiki/Satire" title="Satire">satire</a></td></tr><tr><th scope="row" style="padding-top:0.225em;line-height:1.1em;padding-right:0.65em;">Notable work</th><td style="line-height:1.4em;"><i><a href="/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy" title="The Hitchhiker's Guide to the Galaxy">The Hitchhiker's Guide to the Galaxy</a></i></td></tr><tr><td colspan="2" style="text-align:center;line-height:1.4em;"><hr></td></tr><tr><th scope="row" style="padding-top:0.225em;line-height:1.1em;padding-right:0.65em;">Signature</th><td style="line-height:1.4em;"><a href="/wiki/File:Douglas_Adams_Unterschrift_(cropped).jpg" class="image"><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Douglas_Adams_Unterschrift_%28cropped%29.jpg/160px-Douglas_Adams_Unterschrift_%28cropped%29.jpg" decoding="async" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Douglas_Adams_Unterschrift_%28cropped%29.jpg/240px-Douglas_Adams_Unterschrift_%28cropped%29.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Douglas_Adams_Unterschrift_%28cropped%29.jpg/320px-Douglas_Adams_Unterschrift_%28cropped%29.jpg 2x" data-file-width="1480" data-file-height="788" width="160" height="85"></a></td></tr><tr><th colspan="2" style="text-align:center">Website</th></tr><tr><td colspan="2" style="text-align:center;line-height:1.4em;"><span class="url"><a rel="nofollow" class="external text" 
href="http://douglasadams.com">douglasadams<wbr>.com</a></span></td></tr></tbody></table>
</body>
</html>

Note: 1) the URL is the URL of the image, *not* of the File: page describing the image; 2) I don't think the acquireLicensePage property is actually helpful or meaningful to us, and it's not required by Google; 3) I've copied the Commons description into the 'name' property of the ImageObject; there might be other properties you could populate, but some of the schema.org labels don't map well to the wiki process.

You can copy the HTML above into Google's Structured Data Testing Tool to validate it.

I'd recommend emitting this markup only for (a) the representative image on article pages, and (b) pages in the File: namespace. That's not perfect, but it avoids bloating our HTML with a lot of redundant image information meant only for machines.
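As a rough sketch of that rule in Python (not actual MediaWiki code): `images_to_describe` and the "representative" flag on each image are made up for illustration, while 0 and 6 are MediaWiki's standard IDs for the main and File: namespaces.

```python
NS_MAIN = 0  # MediaWiki's ID for the main (article) namespace
NS_FILE = 6  # MediaWiki's ID for the File: namespace


def images_to_describe(namespace_id, images):
    """Pick the images on a page that should get an ImageObject entry.

    `images` is a list of dicts; the boolean "representative" key is a
    hypothetical flag marking the lead image of an article.
    """
    if namespace_id == NS_FILE:
        # A File: page is about its image(s), so describe all of them.
        return list(images)
    if namespace_id == NS_MAIN:
        # On articles, describe only the representative image.
        return [image for image in images if image.get("representative")]
    # Every other namespace gets no image markup in this sketch.
    return []
```

How the representative image gets chosen is left open here; the sketch only assumes something upstream has already flagged it.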

It's possible to include the JSON-LD for a page by reference, instead of inline; in the future it might be worth incorporating the license information that way to avoid bloat.

> I'd recommend emitting this markup only for (a) the representative image on article pages

Again, do we really want to purge article pages when the file page is edited? That seems like a bad place to end up in, especially if something like "no free image for this person" placeholder ends up being the representative image for a million articles.

> Also, where do you plan taking the license metadata from? CommonsMetadata or do you expect the data model discussions to be wrapped up by then?

If this is a blocker, is there a page we can link to that has the current licensing information? According to Google (Schema.org link and emphasis mine):

  1. A URL to a page that describes the license governing an image’s use. Specify this information with the Schema.org license property or the IPTC Web Statement of Rights field.
  2. A URL to a page that describes where the user can find information on how to license that image. Specify this information with the Schema.org acquireLicensePage property or the IPTC Licensor URL (of a Licensor) field.

More detail further down the page:

A URL to a page where the user can find information on how to license that image. Here are some examples:

  • A check-out page for that image where the user can select specific resolutions or usage rights
  • A general page that explains how to contact you

What I'm asking is: does this have to be an actual license string like "CC-BY-SA-3.0" or an exact link like "https://creativecommons.org/licenses/by-sa/3.0"? Can it instead be an indirect link to a license? For example, maybe the File page itself, which may then transclude Template:Cc-by-sa-3.0, which may contain structured data.


> We need to add schema.org structured data to all images on Commons and images from Commons on Wikipedia

I'm trying to understand what the difference is between this task and "sameAs". From the source, I think the differences are:

  1. "sameAs" only adds schemas to pages in the main namespace.
  2. "sameAs" only adds schemas to pages linked in Wikidata.
  3. "sameAs" only sources data from Wikidata and the page it represents.
  4. "sameAs" does link to a thumbnail and full resolution image when present.
  5. "sameAs" does not directly include licensing.
  6. "sameAs"'s @type is an Article.

Whereas this task:

  1. Only adds schemas to pages in the File namespace? It sounded like @cscott is suggesting main and File namespace changes. Are both needed? If not, can one be split out into a new task? If this task is only about the File namespace, are the purging concerns @Tgr raised different from those for "sameAs"?
  2. Adds schemas to pages whether they're available in Wikidata or not? If this is true, is it ok?
  3. Is this the same?
  4. Is this the same?
  5. Must include more explicit licensing.
  6. This @type is an ImageObject, possibly also an Article?

I'm listing them because I don't understand, not because I'm suggesting an approach. I also am trying to understand the scope of the task given the timeline and possible fundraising concerns.

> It sounded like @cscott is suggesting main and File namespace changes. Are both needed? If not, can one be split out into a new task?

File namespace changes are split out into T254039

Can sameAs or something similar be used for images? The best way to handle thumbnails (and articles, if image metadata makes sense for those at all) is to just point back from them to the canonical location of the image. No purging needed, does not bloat the size much, search engine should be just as happy. The only non-trivial part would be dealing with image redirects left after moving files, but that doesn't seem too bad either.

(decision being whether we want to provide formal help here)

If I understand correctly, @cscott is proposing adding an ImageObject that contains a thumbnail (also an ImageObject):

{
  "@context": "https://schema.org/",
  "@type": "ImageObject",
  "url": "https://upload.wikimedia.org/wikipedia/commons/c/c0/Douglas_adams_portrait_cropped.jpg",
  "name": "Portrait of Douglas Adams",
  "license": "https://creativecommons.org/licenses/by-sa/2.0",
  "representativeOfPage": true,
  "thumbnail": {
      "@type": "ImageObject",
      "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c0/Douglas_adams_portrait_cropped.jpg/220px-Douglas_adams_portrait_cropped.jpg"
  }
}

@Tgr wants to flip the relationship?

{
  "@context": "https://schema.org/",
  "@type": "ImageObject",
  "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c0/Douglas_adams_portrait_cropped.jpg/220px-Douglas_adams_portrait_cropped.jpg",
  "name": "Portrait of Douglas Adams",
  "license": "https://creativecommons.org/licenses/by-sa/2.0",
  "representativeOfPage": true,
  ...?
}

I don't know how to do that or if sameAs is appropriate. I speculate that:

  • sameAs seems vaguer than thumbnail.
  • Maybe the image property is more appropriate? However, an ImageObject that contains a thumbnail seems like a clearer relationship than the inverse.
  • I'm unsure whether direct image references (not thumbnails) are necessary. For Article types specifically, these guidelines suggest that referencing higher resolution images is preferable ("Images should be at least 696 pixels wide").

Change 604143 had a related patch set uploaded (by Anne Tomasevich; owner: Anne Tomasevich):
[mediawiki/extensions/WikibaseMediaInfo@master] [WIP] Add JSON-LD script with schema.org license info to filepages

https://gerrit.wikimedia.org/r/604143

Thanks very much to everyone for weighing in on this. We've gotten some clarification from Google on a few of these issues:

License link

The license attribute is a link to the license itself. I'm planning on using CommonsMetadata to extract the licenseUrl and we're considering looking at P275 statements as well (unclear whether we need to look at structured data if all images should have license info available in the template). We're still working out what to do about images in the public domain since there is no license; we may link to an existing page on Commons with info about the public domain or create a new page specifically for this purpose.
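For illustration only, the lookup could be sketched in Python roughly as follows. It assumes the metadata arrives in the shape the imageinfo API's extmetadata field uses (a dict of fields, each with a "value"), and that public-domain files are reported with a License value of "pd"; both are assumptions to verify. `license_url_for_schema` is a hypothetical helper, and `PUBLIC_DOMAIN_INFO_URL` is a placeholder, since the public-domain page has not been chosen yet.

```python
# Placeholder only: the real public-domain info page is still undecided.
PUBLIC_DOMAIN_INFO_URL = "https://example.org/public-domain-info"


def _field(extmetadata, key):
    """Return the stripped string value of one extmetadata field, or ""."""
    entry = extmetadata.get(key) or {}
    return str(entry.get("value") or "").strip()


def license_url_for_schema(extmetadata):
    """Choose the URL for the schema.org `license` property, or None."""
    license_url = _field(extmetadata, "LicenseUrl")
    if license_url:
        return license_url
    # Assumption: public-domain files carry License == "pd" and no LicenseUrl.
    if _field(extmetadata, "License").lower() == "pd":
        return PUBLIC_DOMAIN_INFO_URL
    # No usable license information: better to emit nothing than to guess.
    return None
```

Returning None when nothing is found leaves room for a later fallback, such as the P275 statements mentioned above.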

acquireLicensePage

This attribute has 2 purposes: to serve as a layperson's explanation of a license (as opposed to the potentially legalese-y license itself) and to tell users how they can license an image, e.g. checking out on a stock photo website. Obviously the latter doesn't apply to us but it seems including a link to the file page, which offers a summary of the license, would be appropriate and helpful.

thumbnail vs sameAs

The only thing Google needs for the licensable badge is the main ImageObject url, which should be the URL indexed by Google (which appears to be the original file). A reference to the thumbnail that actually appears on the page isn't needed, although we could include one as an organization-wide standard if we want to.
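Putting these three clarifications together, a minimal Python sketch of the resulting object might look like this. The helper names are hypothetical, and the thumbnail-to-original conversion simply assumes the URL layout seen in the Douglas Adams example earlier in this thread (a "/thumb/" path segment plus a trailing size-prefixed file name).

```python
def original_url_from_thumbnail(thumb_url):
    """Derive the original file URL from a thumbnail URL.

    Assumes the layout .../commons/thumb/c/c0/Name.jpg/220px-Name.jpg
    used in the examples in this thread.
    """
    marker = "/thumb/"
    if marker not in thumb_url:
        return thumb_url  # Already looks like an original file URL.
    prefix, rest = thumb_url.split(marker, 1)
    # Drop the final "220px-Name.jpg" segment to get "c/c0/Name.jpg".
    return prefix + "/" + rest.rsplit("/", 1)[0]


def licensable_image_object(image_url, license_url, file_page_url):
    """Build the ImageObject: original URL, license, and file page."""
    return {
        "@context": "https://schema.org/",
        "@type": "ImageObject",
        "url": original_url_from_thumbnail(image_url),
        "license": license_url,
        "acquireLicensePage": file_page_url,
    }
```

The thumbnail property is left out on purpose, since it isn't needed for the badge; it could be added back if we want it as a standard.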

AnneT removed AnneT as the assignee of this task. (Sep 9 2020, 10:36 PM)
AnneT subscribed.

@AnneT, related to this: Cory Doctorow, the former European director of Creative Commons, got targeted by a copyleft troll; see his post "A Bug in Early Creative Commons Licenses Has Enabled a New Breed of Superpredator". He now recommends using CC-BY-SA-4.0 or higher. To reliably get rid of such trolls, two actions should be taken, in my opinion: first, require CC 4.0+; second, embed the link to the license and/or to the source with the license in the image's metadata, INTO the file, so that if somebody downloads and uses it, the correct metadata is there. @Tgr, I saw your comment T250317#6178173 about this being scary, on T5361, but this one single link to the source should suffice to drive out the trolls. Everything else can be fixed via the license or via viewers.