PageImages should never return non-free images
Closed, Resolved (Public)

Description

Related: T95378: Add support to PageImages for finding a freely licensed page image
Related: T131105: Pageimages should return both free and non-free images with 'free-ness' denoted as a property (subtask of tentative Q2 FY 2016-2017 (October - December 2016) WMF Reading goal)

To quote @Finnusertop on mediawiki.org:

"According to Wikimedia Foundation's resolution on licensing policy (https://wikimediafoundation.org/wiki/Resolution:Licensing_policy), all content on Wikimedia projects should be free, with the exception of those non-free files used under a fair-use doctrine (or similar). According to the resolution, such uses are conditioned by the projects' Exemption Doctrine Policies (EDP). The resolution goes on to say that exceptions granted by the EDPs should be minimal. The English Wikipedia's EDP is "Wikipedia:Non-free content" (https://en.wikipedia.org/wiki/Wikipedia:Non-free_content). It stipulates that a non-free file may only be used in those articles that a specific, relevant and valid non-free use rationale is written for."

Various project policies do not allow non-free image use https://en.wikipedia.org/wiki/Wikipedia:Non-free_content#Non-free_image_use_in_galleries

As a result, projects such as mobile apps, RelatedArticles, Gather, and mobile web search violate local policy.

https://www.mediawiki.org/wiki/Talk:Gather#Gather_violates_English_Wikipedia_policies_in_an_undetectable_manner

PageImages, the extension responsible for this choice of image, should have a configuration option to avoid picking images where the license on that wiki doesn't permit it.

Event Timeline

Heads up @GWicke, @mobrovac, @Pchelolo - we may need to make an update to the PageImages call that the summary endpoint uses once we have the additional URL parameter available in the api.php endpoint or consider having a separate RESTbase endpoint that in turn adds the additional URL parameter. Do you want me to open a new task to reflect this?

Thanks for the heads-up, @dr0ptp4kt . If I'm reading the discussion here correctly, the new param would just indicate which image licences are to be allowed in the output, so I don't think this is worth a new end point (free-summary? :P), we can simply amend the existing one. We do need to decide if the content stored thus far for it should be dropped upon change or not.

@mobrovac, yeah, if all of us agree on the characteristics of the response type w.r.t. the "hero image" (different platforms may be open to different treatments to avail the licensing in-context). With a dash of luck, though, we can all just use the same summary endpoint, but we'll see how the data from T125977: Calculate non-free image license density for Related Pages on large enwiki mobile web representative sample of articles inform us.

Are there minutes from the meeting ?

I don't think so, unless someone who was online had an Etherpad? Anyone got a link for an Etherpad? I was writing on the whiteboard, and can't remember if anyone created an Etherpad. Anyway, here's a summarization that I think captures most of the salient points.

  • Let's try to solve this.
  • Yes, let's have the API have an extra parameter to invoke it to restrict the candidate(s) for hero image eligible in the response, for consumers that want to invoke it that way (e.g., we will likely want to just take this simpler path on mobile web until some point the policy might be updated, as opposed to trying to jam in an affordance, which would be sort of hard).
  • Trying to instrument so that we could have Event Logging tell the server the in-practice impact in the A/B test would be possible, but more difficult.
  • Therefore, before we go all out with instrumenting stuff or trying to implement alternative treatments across the different platforms, let's run a script (sort of like what @Tgr mentioned, but with a few more bells and whistles) to get a sense of the landscape in practice. Task created - T125977: Calculate non-free image license density for Related Pages on large enwiki mobile web representative sample of articles.
  • Page props will need to be updated such that we can ascertain free versus non-free hero image for navigational contexts when the hero image would be different for those two cases for a given page. But it should be okay to let page props be updated over time once the new API param is available, rather than issuing an expensive full reparse / memcached purge (people can always purge individual pages if they see pages that could be updated).
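
As a rough illustration of the "extra parameter" idea in the summary above, a consumer could opt in to the restriction when building its query. This is only a sketch: the parameter name `pilicense` and its values `free` / `any` are assumptions made for illustration and were not settled in this discussion. The rest follows the usual `action=query&prop=pageimages` request shape. The function only builds the URL and sends nothing.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_pageimages_url(title: str, free_only: bool = True) -> str:
    """Build (but do not send) a PageImages query URL for one page title."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "pageimages",
        "titles": title,
        "piprop": "thumbnail|name",
        # Assumed parameter name for the proposed restriction.
        "pilicense": "free" if free_only else "any",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


# A consumer that wants only free hero images (e.g. mobile web):
restricted_url = build_pageimages_url("House (TV series)")
# A consumer that wants the unrestricted behaviour:
unrestricted_url = build_pageimages_url("House (TV series)", free_only=False)
```

Making the restriction a request parameter keeps the decision with each consumer, which is the "simpler path" the summary describes for mobile web.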

To be estimated, hinging on Gather discussions and script checks.

@TheDJ I took notes. Will clean them up and share shortly.

OK, judging from what I read, it seems like people are nudging towards picking an "alternate" image ?

I would actually suggest we simply don't pick an alternate, but simply show the placeholder.. Design/UX wise, it might be confusing for users to have two different images for the same page. But more important, how many pages would actually have a 2nd option available ? And how many of those selected 2nd options, would be actually relevant to the topic and of sufficient quality ?
There is one big reason why there is a non-free content image for these pages, namely "there is no free alternative". So we might as well not spend the time looking for it within that page, nor potentially confuse the user with it, for just that small number of pages that would even have a secondary candidate.

But let's do some measuring first indeed, and then decide which strategies to test/implement.

Change 265415 abandoned by Jdlrobson:
Support a blacklist category

Reason:
(as I understand it Gergo is working on an alternative solution)

https://gerrit.wikimedia.org/r/265415

OK, judging from what I read, it seems like people are nudging towards picking an "alternate" image ?

I would actually suggest we simply don't pick an alternate, but simply show the placeholder.. Design/UX wise, it might be confusing for users to have two different images for the same page. But more important, how many pages would actually have a 2nd option available ? And how many of those selected 2nd options, would be actually relevant to the topic and of sufficient quality ?
There is one big reason why there is a non-free content image for these pages, namely "there is no free alternative". So we might as well not spend the time looking for it within that page, nor potentially confuse the user with it, for just that small number of pages that would even have a secondary candidate.

To be honest the more I think about it the more I settle on this option. We could simply pass a boolean indicating whether the image is free or not, and then the client can decide what to do with it (show it or not).
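
A minimal sketch of how a client might act on such a boolean, falling back to a placeholder rather than an alternate image. The `is_free` field, the `PageImage` type and the placeholder file name are all hypothetical; nothing here reflects an actual response format.

```python
from dataclasses import dataclass
from typing import Optional

PLACEHOLDER = "placeholder.svg"  # hypothetical placeholder asset


@dataclass
class PageImage:
    source: str
    is_free: bool  # hypothetical flag returned alongside the image


def image_to_display(image: Optional[PageImage], allow_nonfree: bool) -> str:
    """Return the image to render, or the placeholder if none is usable."""
    if image is None:
        return PLACEHOLDER
    if image.is_free or allow_nonfree:
        return image.source
    # Non-free image and this client may not show it: no alternate is picked.
    return PLACEHOLDER
```

With this shape, a client bound by a strict local policy would pass `allow_nonfree=False`, while another client could pass `True` and keep the original image.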

But let's do some measuring first indeed, and then decide which strategies to test/implement.

@TheDJ I took notes. Will clean them up and share shortly.

Page Images is used in:
Apps (Lead images, Search, Nearby, Share a Fact)
They are valuable here - data showing impact of UX changes (unsure whether explicitly the image or the layout)
Portal: A/B test of search results
Web: search, Gather, Related Pages, Nearby

Considerations

  • Doing nothing is not an option.
  • API used outside Foundation so we shouldn't impact 3rd party users over English Wikipedia policies
  • We don't believe this is a legal issue but law varies depending on where it is used.
  • Would like to attribute images better if that was an option. Would rather not show images disappearing.
  • We need to understand what % of images are non-free to get a sense of the impact. If the impact is small we don't care; if lots of pages are impacted we do care.
  • We should measure before any change
    • Can use categories or commons meta data
    • Post change can measure tap through rates on certain features e.g. RelatedArticle

Next steps

  1. Measure (T125977)
  2. Implement one of the following solutions:
    • Use meta data of images OR categories to determine free status
    • Double page image storage and store a non-free and free image with each article OR provide one image as defined by project's policy OR provide a flag detailing status

API used outside Foundation so we shouldn't impact 3rd party users over English Wikipedia policies

Note that this is an argument in favour of the removal of unfree files, as third parties may not legally use unfree files.

@Nemo_bis, would you please clarify? Is this with respect to site policy or something else? I believe third parties are not barred from knowing the pageimage data, although I couldn't speak to the presentation of the images themselves.

Does the question make sense?

This comment was removed by Ruud_Koot.

Clearly, merely knowing the pageimage data is fine. A more realistic scenario would be that the 3rd party re-user does not realize that the returned page image may be non-free (Wikipedia is marketed as being free...) and then displays that image in such a way that it no longer passes the (legal) fair use test.

This applies to internal WMF use as well:

  • Assume the page image data returns a non-free image from the lead of the article and is displayed together with a (snippet of) the lead. This will likely be fine legally. It just runs into site policy issues.
  • Assume the page image data returns a non-free image from further down the body of the article and is displayed together with a (snippet of) the lead. The connection between the image and the snippet may no longer exist. So this may no longer pass the (legal) fair use test.

Perfect 10 v. Amazon demonstrates that even displaying thumbnails can be legally risky. The District Court found that Google's use of thumbnails did not pass the fair use test. The Ninth Circuit only said that the very specific manner in which Google used and displayed thumbnails of Perfect 10's images was sufficiently transformative to count as fair use. It did not give a carte blanche for displaying thumbnails of any copyrighted image in any circumstance.

Is this with respect to site policy or something else?

With respect to copyright. Our wikis' policies are meant to defend reusers. English Wikipedia's policies a bit less, as they allow "fair use" files which can be very hard for others to reuse. The site policies which are best for our reusers are clearly Wikimedia Commons'.

Anyway, a rather common example is a file authorised by the copyright holder for use only on Wikipedia. That's not even "fair use", it's a classic copyright license, but doesn't give any right whatsoever to anyone else.

Ricordisamoa renamed this task from PageImages should never show non-free images to PageImages should never return non-free images. Feb 15 2016, 4:20 PM

Renamed to clarify that PageImages does nothing except expose an API.

Thanks @Ruud_Koot, @Nemo_bis, @Ricordisamoa for the clarifications. Much appreciated. I imagine we can factor this into the API design in addition to the empirical data (e.g., with a boolean output flag or some other simple thing).

Per T125977 3.2% of the top 10K Wikipedia pages would return a different page image after a "don't return non-free" policy; and 0.3% would stop returning any image. That seems tiny, so can we agree to just keep things simple and filter out nonfree images from the PageImages results?

I provided an update on the read of the numbers, which was against an evenly distributed pseudorandom sample of the top 0.5% most popular pages on mobile web enwiki (the top 0.5% most popular pages constitute about 60% of pageviews on mobile web enwiki). See T125977#2062487.

@Jdlrobson, what is the rough level of effort of applying the algorithm with the fallback heuristic for the hero image use case? I think in terms of scheduling the work to do this, it seems like the best sprint to start would be sprint 68, which begins 14-March-2016.

For those of you following this task, I'm going to be out for a few business days, but hope to rejoin the conversation when I'm back.

@dr0ptp4kt I would say high effort and imperfect, as we would have to replicate the existing page images algorithm in node.
At this point given the low numbers I would suggest that we can:

  1. easily store a second free pageimages as a page property when the image is non-free
  2. simply filter out the nonfree images as @Tgr says.

Someone just needs to make a decision on which path to pursue.

I can decide. :) 2 is the way.

Okay, let's go with Option 2. I'll put this in sprint 68.

Change 266196 merged by jenkins-bot:
Weigh images by copyright status

https://gerrit.wikimedia.org/r/266196

Thanks @MaxSem and @Tgr
This will roll out on the train and will be live on the 10th. After which we should monitor the situation on apps/web engagement.

@dr0ptp4kt please send a mail to the reading web team if you need us to organise a SWAT deploy or do anything else with respect to this.

@Jdlrobson, no need to SWAT.

+ @Tnegrin, @JKatzWMF

Riffing on T125977#2062487

  • 16.95% (1695/10,000) of the pseudorandom 60th percentile pages already don't have an image.
  • 3.71% (371/10,000) contain a fair use / non-free image, and I consider this acceptably low.

It's not ideal, but in the interests of keeping things simple, the proposed (or, well, implemented) solution seems acceptable to me.

For clarity, noting here that this change is now live. However, non-free page images may remain in the cache for a little while. They will be refreshed with free page images when pages are reparsed, which will happen naturally over time or sooner in some circumstances (e.g. when individual pages are edited).

I should note that I keep seeing complaints crop up about this change. Users generally complain that the images that are chosen are "not right" or "have changed and are no longer as good". Here's some examples:

Those don't seem to be complaints about this change at all. The one at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Certain_articles_on_the_iOS_Wikipedia_app_show_wrong_images is regarding https://en.wikipedia.org/wiki/House_(TV_series), for which the infobox image is a public-domain logo. That is free content, so that could be shown. The other one regards showing sexually explicit images as the default, which may well be free content and probably are. I don't see how either one relates to this bug.

Indeed, House is probably discounted because of PageImages' aspect-ratio preferences, and porn etc. is not something we can immediately do something about, though per the principle of least astonishment it is probably something that we should start thinking about. I mean, Google even disables their autosearch when the suggestions contain sensitive results. But it's a difficult problem to solve, especially after Jimmy once went on a deletion spree... :D

I do note that one comment in the mediawiki page discussion notes: "The hovercard popup and mobile search image for w:10 Cloverfield Lane shows a picture of John Goodman, who isn't even in the lead role, instead of the identifying artwork. ", which probably is related to this change.

The one at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Certain_articles_on_the_iOS_Wikipedia_app_show_wrong_images is regarding https://en.wikipedia.org/wiki/House_(TV_series), for which the infobox image is a public-domain logo. That is free content, so that could be shown.

True. Sorry about that.

The other one regards showing sexually explicit images as the default, which may well be free content and probably are. I don't see how either one relates to this bug.

For the original part of the discussion, yes, it's unrelated. But if you read a bit deeper, you'll see a user (@hahnchen) complaining specifically about this change. That's definitely related.

I do note that one comment in the mediawiki page discussion notes: "The hovercard popup and mobile search image for w:10 Cloverfield Lane shows a picture of John Goodman, who isn't even in the lead role, instead of the identifying artwork. ", which probably is related to this change.

Indeed, yes. I also read somewhere (I can't remember where now, sorry) that there was an article about a person where the first image was fair use, so instead now PageImages was choosing a photo of the man's ex-wife. -_-

I've also seen a report where an image of the wrong person was displayed for a biography (there was no image of the person in question, but an image of another person was extracted from one of the navboxes.) Can't find the link right now.

At the talk page of WP:NFC someone gave the example of where, instead of a film poster, an image of one of the filming locations was displayed. This was experienced as confusing.

The common problem in all these examples seems to be that images outside of Section 0 often have no immediately obvious connection to the topic without an accompanying image caption. Not picking non-free images makes this more apparent.

Related to this: T87336: PageImages shouldn't return images that are not in the lead section

The common problem in all these examples seems to be that images outside of Section 0 often have no immediately obvious connection to the topic without an accompanying image caption. Not picking non-free images makes this more apparent.

That's T87336.

Another related thing I've been noticing: PageImages seems to strongly prefer images that are square-ish or portrait-sized. This makes a lot of sense for thumbnails in search, but not so much in Hovercards. Often suitable wide images, like logos, are skipped for this reason (even if they're PD-ineligible). But Hovercards can handle wide images just fine.

At the RfC at WP:NFC a consensus seems to be forming that non-free images are fine for Hovercards if they're "visually identifying" for the topic (in most cases, if they're from Section 0 or the infobox). But if different extensions also prefer differently sized images, then it may make more sense to have PageImages pick an image per downstream user (taking size restrictions, place in the article, and licensing status into account) instead of just free/non-free.
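
To make the "pick an image per downstream user" idea concrete, here is a hypothetical sketch in which each consumer declares its own constraints on shape, position in the article, and licensing status. The profile fields, the thresholds and the file names are invented for illustration. The real PageImages scoring algorithm is not shown here and is not described in this thread beyond its preference for square-ish images.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    width: int
    height: int
    in_lead: bool  # appears in Section 0 / the infobox
    free: bool


@dataclass
class ConsumerProfile:
    max_aspect_ratio: float  # widest width/height ratio the consumer can show
    lead_only: bool
    allow_nonfree: bool


def pick_for(profile: ConsumerProfile, candidates: List[Candidate]) -> Optional[str]:
    """Return the first candidate (in article order) that fits the profile."""
    for c in candidates:
        if profile.lead_only and not c.in_lead:
            continue
        if not profile.allow_nonfree and not c.free:
            continue
        if c.width / c.height > profile.max_aspect_ratio:
            continue
        return c.name
    return None


# Hypothetical profiles: search thumbnails want square-ish free images,
# Hovercards tolerate wide images but only from the lead.
search = ConsumerProfile(max_aspect_ratio=2.0, lead_only=False, allow_nonfree=False)
hovercards = ConsumerProfile(max_aspect_ratio=5.0, lead_only=True, allow_nonfree=True)

candidates = [
    Candidate("Logo.svg", 600, 150, in_lead=True, free=True),   # wide logo
    Candidate("Cast.jpg", 400, 500, in_lead=False, free=True),  # body image
]
```

Under these made-up profiles, search skips the wide logo and falls through to the body image, while Hovercards keep the logo from the lead.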

I just found out that a significant part of nonfree images are not marked as such. Earlier, I went through all copyright tags containing license metadata and added the nonfree flag where needed, but apparently lots of copyright tags did not have metadata at all and I missed them. I apologize, that was sloppy :(

I'll make sure everything in Category:Wikipedia non-free file copyright tags is marked as nonfree. I made a quick query to see how much this affects the impact evaluation; at the time of the evaluation, 367602 pages used copyright tags properly marked as nonfree, while 184953 pages used a nonfree copyright tag that was not flagged as such (more details in P2808). So there are about 50% more nonfree images than expected - not sure how well that translates to pageviews.
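
The "about 50% more" figure can be checked directly from the two page counts quoted above. As noted, these are page counts and do not say how the gap translates to pageviews.

```python
flagged_nonfree_pages = 367_602    # pages using tags properly marked as nonfree
unflagged_nonfree_pages = 184_953  # pages using nonfree tags missing the flag

# Pages missed by the flag, relative to the pages the evaluation counted.
undercount_ratio = unflagged_nonfree_pages / flagged_nonfree_pages
print(f"{undercount_ratio:.1%}")  # prints 50.3%
```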

Only images that are explicitly tagged as free should be selected, not all the ones that aren't tagged as non-free.
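
The difference between the two selection rules shows up as soon as a file has no licence metadata at all. The sketch below uses a hypothetical three-state tag (`True` for free, `False` for non-free, `None` for untagged) and invented file names. It is not how PageImages stores licence information.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    # True = tagged free, False = tagged non-free, None = no licence metadata
    tagged_free: Optional[bool]


def pick_not_tagged_nonfree(candidates: List[Candidate]) -> Optional[str]:
    """Deny-list rule: skip only files tagged non-free (untagged files pass)."""
    return next((c.name for c in candidates if c.tagged_free is not False), None)


def pick_tagged_free(candidates: List[Candidate]) -> Optional[str]:
    """Allow-list rule: accept only files explicitly tagged free."""
    return next((c.name for c in candidates if c.tagged_free is True), None)


candidates = [
    Candidate("Poster.jpg", None),  # non-free in reality, but untagged
    Candidate("Actor.jpg", True),
]
```

With the deny-list rule the untagged poster is returned. With the allow-list rule it is skipped, at the cost of also skipping any free files that lack metadata.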

I should note that I keep seeing complaints crop up about this change.

Yes, and I complained about this change before it was implemented only to be ignored. https://phabricator.wikimedia.org/T124225#1984824

This change has made hovercards worse, it has made search worse. Doing nothing would have been a better solution. Was that solution even considered? It should have been argued that PageImages falls under the non-free policy exemptions granted to special pages (such as search), detailed on the English Wikipedia at https://en.wikipedia.org/wiki/Wikipedia:Non-free_content#Exemptions Other Wikimedia projects must have similar exemptions.

How long was PageImages returning non-free images? A year? 18 months? Longer? This was established behaviour.

Only images that are explicitly tagged as free should be selected, not all the ones that aren't tagged as non-free.

Yes, that would be better. Compare T75130.

only to be ignored

You were not ignored, you were pointed to the correct process to achieve your goal. If you need more guidance on how to propose a policy change, let me know on https://meta.wikimedia.org/wiki/User_talk:Nemo_bis

This was established behaviour.

The fact that an unauthorized activity was carried on for a long time doesn't make it more acceptable.

@hahnchen I agree with you that the user experience is noticeably diminished for some of our most popular pages: books, albums, movies, television.

It is also true that the change to the API does not allow for discrimination by the feature that is requesting the image. Apparently, image search in the visual editor is equally impacted, and this is not a navigational element; the same goes for the app 'lead image'.

I see 3 paths forward to alleviate the issue:

  1. try and revisit policy or ask for a re-interpretation of existing policy (complete 'fix')
  2. change API to flag image and let project/feature decide usage. This actually alleviates concerns around re-use. (partial fix)
  3. prevent non-lead image sections from appearing in results or let users override the image. This is captured here: T87336 and here: T91683. (least complete fix--users lose out on valuable visual cues provided by the non-free images and see inconsistency between preview image and page)

Here is an example showing how big the user experience impact can be:

[Image: Screenshot 2016-03-25 14.35.46.png]

[Image: IMG_7773.PNG]

The fact that an unauthorized activity was carried on for a long time doesn't make it more acceptable.

This rhetoric is unhelpful. Please comment constructively.

I see 3 paths forward to alleviate the issue:

  1. try and revisit policy or ask for a re-interpretation of existing policy (complete 'fix')
  2. change API to flag image and let project/feature decide usage. This actually alleviates concerns around re-use. (partial fix)
  3. prevent non-lead image sections from appearing in results or let users override the image. This is captured here: T87336 and here: T91683. (least complete fix--users lose out on valuable visual cues provided by the non-free images and see inconsistency between preview image and page)

In my view, attempting to enforce policies by coding logic into the API is poor software design. Policies are living documents which describe current best practices and standards, and which are ever-evolving; trying to keep code up-to-date with them is a time-consuming, uphill battle, and API clients may have different content standards. I prefer option 2 for this very reason: as different clients may have different standards, the API should give them the ability to choose between free and non-free images based on their needs and restrictions.

I also support the idea of changing the policy to be far more permissive. Getting the policy changed is no small undertaking, but demonstrating the impact is doable as @JKatzWMF did with the above screenshot. However, I view this as a matter that is orthogonal to the design of the PageImages API for the reasons stated above.

You were not ignored, you were pointed to the correct process to achieve your goal. If you need more guidance on how to propose a policy change, let me know on https://meta.wikimedia.org/wiki/User_talk:Nemo_bis

The fact that an unauthorized activity was carried on for a long time doesn't make it more acceptable.

There is no need to change policy. The activity was not unauthorised. Every project must have exemptions in place for non-free content appearing in special pages. For example, the English Wikipedia has https://en.wikipedia.org/wiki/Wikipedia:Non-free_content#Exemptions If a project has no non-free content, then PageImages would not return non-free images anyway, as it depends on article content.

The argument that search falls under the exemption should have been raised before resources (read: donated money) were committed to making the user experience worse. I question the decision making that went into developing and deploying this change.

T124225 would serve Wikimedia better if it were just reverted. Now.

I appreciate the care taken over licensing policies, but this change broke my project and probably many others. Why don't you leave third parties to decide for themselves if the licensing policy of each image is compatible with their use case? This would be much better than simply removing non-free images.

Many images are now excluded from the PageImages response because they are fair-use. Movie posters, for instance, are all fair-use, and the way they're used on third party websites qualifies most of the time as fair-use too, just like on Wikipedia. Now, none of those images are returned.

Please consider reverting this, and instead include the licensing policy in the response. Thank you.

instead include the licensing policy

Fair use is not a license and as such it's impossible to codify standard responses. It would only be possible to return the file description, which you would then have to parse yourself. Better, then, to parse the HTML yourself altogether.

Thank you for your quick answer, @Nemo_bis. Parsing the html is of course always a last resort, but the API is so much better. I understand that fair use is not a license, but then maybe there could be a simple indicator of whether the image is free or fair use. Or maybe some extra parameter to the API that would include the fair use images in the response too. Any solution that would bring back the possibility to retrieve via the API the page image(s) even if they are fair use would be great!

@Voidronl:
Official WMF Resolution:Licensing policy opens with the following notice:

This policy is approved by the Wikimedia Foundation Board of Trustees.
It may not be circumvented, eroded, or ignored by Wikimedia Foundation officers or staff nor local policies of any Wikimedia project.

Please do not ignore, erode, or attempt to circumvent the limitations on use of non-free content. The fact that some wikis permit non-free content at all is itself conditional upon those limitations.