Implement custom settings for image licenses used for PDF generation (currently skips images marked as "fair use")
Open, Medium, Public

Description

When creating PDF files, many files are missing. See https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_%28technical%29&oldid=563383020#Infobox_images_missing_from_PDFs


Version: unspecified
Severity: enhancement

Details

Reference
bz50948

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:58 AM
bzimport added a project: Collection.
bzimport set Reference to bz50948.
bzimport added a subscriber: Unknown Object (MLST).

From that VPT section:
I think you'll find it has left out the "fair use" images. That's probably a deliberate design choice, though I can't immediately find it documented anywhere. -- John of Reading (talk) 15:50, 7 July 2013 (UTC)

Betacommand: is that the case? Do you only see this happening when the images are used on WP under a fair use claim? My guess is that the vast majority (99%) of images in the category referenced (Marvel Comics related) are used under a fair use claim. Can you reproduce this on another article that doesn't have any fair use images?

Betacommand tells me on IRC that yes, this only occurs when the images are marked as Fair Use. Retitling bug as such. This may be a WONTFIX issue. I'll let the extension authors/those interested weigh in.

volker.haas wrote:

Yes, fair use images are not included in the PDFs on purpose.

There needs to be a way to override the removal of those files.

volker.haas wrote:

It is not possible to include fair use images in PDFs due to potential copyright issues. If you want to generate a PDF for personal use, you could install the PDF rendering software (mwlib and mwlib.rl) on your local machine and disable filtering of images.

How is it any more of a copyright issue than serving the existing article? If a user decides they want to include the files, there should be an option to override the filtering. It wouldn't affect the default process but would enable better offline access.

I'm going to pre-emptively nip this copyright conversation in the bud, at least the aspect of having it in the bug tracker.

!! Please do not debate the relative merits of any interpretation of copyright law in the bug tracker. !!

The local wiki is the correct place to have this discussion as that is where the various lists of excluded categories/templates live (per wiki).

This "Feature" isn't controlled at the local level. This was a <s>feature</s> Bug that was created by developers, implemented by developers, without ever asking local users.

What would be great would be a parameter that could be passed to this process that overrode the image filters.

Volker: can you comment on the history of this decision (to add that functionality)? Maybe pointing to any public discussion?

(In reply to comment #9)

Volker: can you comment on the history of this decision (to add that functionality)? Maybe pointing to any public discussion?

Greg: have you tried looking up the relevant code? It would be helpful if someone could paste the code here, perhaps with the accompanying SVN revision or Git commit. :-)

volker.haas wrote:

@Betacommand:
There is no bug; the software works as intended. It is just configured in a way that does not suit your current need. At the moment it is not possible to pass any user configuration for specific PDFs/collections to the rendering software. Therefore it is not possible to include fair use images for specific collections.

@greg:
I can't point you to any public discussion regarding that issue. There has been lots of talk regarding image copyrights related to the Collection Extension, but I don't remember anything specific about fair use images.

@MZMcBride:
The code is not really the problem; it's the configuration that explicitly removes fair use images.

All this happens in mwlib's licensechecker https://github.com/pediapress/mwlib/blob/master/mwlib/writer/licensechecker.py
The licensechecker supports three modes:

  • nofilter (include all images, adding license info where available)
  • blacklist (exclude all images marked as nonfree)
  • whitelist (include only images that are marked as free, thus removing images with an unknown license)
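The three modes can be pictured with a minimal sketch. This is not mwlib's actual code: the function name `include_image` and the convention of passing `None` for an unknown license are assumptions made purely for illustration.

```python
# Illustrative only: names and structure are assumptions, not mwlib's API.
NOFILTER, BLACKLIST, WHITELIST = "nofilter", "blacklist", "whitelist"


def include_image(license_type, mode):
    """Decide whether an image is kept in the PDF.

    license_type is "free", "nonfree", or None when the license is unknown.
    """
    if mode == NOFILTER:
        # Keep everything, whatever the license.
        return True
    if mode == BLACKLIST:
        # Drop only images explicitly marked as nonfree.
        return license_type != "nonfree"
    if mode == WHITELIST:
        # Keep only images explicitly marked as free.
        return license_type == "free"
    raise ValueError(f"unknown filter mode: {mode!r}")
```

Under this sketch, an image with an unknown license survives blacklisting but not whitelisting, which is the difference the modes above describe.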

The license information is imported from a CSV file ( https://github.com/pediapress/mwlib/blob/master/mwlib/writer/wplicenses.csv ). This file contains the following entry for fair use images:

"Fairuse",,"nonfree",,"- Copyrighted content that may be used as ''fair-use'', but since the commons does not accept ''fair use'' content this image will need to be deleted. [[Commons:Licensing#Material under the fair use clause is not allowed on the Commons|See here for details why]]."
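For anyone who wants to inspect such a row, here is a small sketch using Python's standard `csv` module. The column meanings are an assumption read off this single row (first column a template name, third column the license type, fifth column a description); the two empty columns are left uninterpreted, and this is not how mwlib itself loads the file.

```python
import csv
import io

# The row quoted above, copied verbatim.
ROW = '''"Fairuse",,"nonfree",,"- Copyrighted content that may be used as ''fair-use'', but since the commons does not accept ''fair use'' content this image will need to be deleted. [[Commons:Licensing#Material under the fair use clause is not allowed on the Commons|See here for details why]]."'''

# csv.reader handles the quoted fields, including the commas inside the description.
fields = next(csv.reader(io.StringIO(ROW)))

# Assumed column meanings, inferred from this one row only.
name, license_type, description = fields[0], fields[2], fields[4]

print(name, license_type)  # prints: Fairuse nonfree
```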

The PDF writer is currently configured to use blacklisting for all Wikipedia projects except the German Wikipedia. For the German Wikipedia we use whitelisting, following a community uproar about including images with an unknown license in the PDFs.
( https://github.com/pediapress/mwlib.rl/blob/master/mwlib/rl/rlwriter.py#L190 )
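As a rough picture of that configuration (not the code at the link above), the per-wiki choice amounts to a default with one override. The host name key and the function name `filter_mode_for` are hypothetical.

```python
# Hypothetical mapping for illustration; not taken from rlwriter.py.
DEFAULT_MODE = "blacklist"
MODE_OVERRIDES = {"de.wikipedia.org": "whitelist"}


def filter_mode_for(wiki_host):
    """Return the license filter mode used for the given wiki."""
    return MODE_OVERRIDES.get(wiki_host, DEFAULT_MODE)
```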

So, that's pretty much all I can say about that topic.

It should be trivial to create a method for setting nofilter mode upon command and per PDF generation. Thus allowing all images if a user wants them.

volker.haas wrote:

(In reply to comment #12)

It should be trivial to create a method for setting nofilter mode upon command and per PDF generation. Thus allowing all images if a user wants them.

Unfortunately it is not trivial:

  • The UI of the Collection Extension would need to be altered to allow custom settings
  • The rendering software would need an interface to interpret these settings and apply them to the PDFs
  • The render servers' caching mechanism would need to be updated to avoid delivering PDFs with the wrong settings
  • All this would need to be tested thoroughly

And there are probably more things that I forgot.

But of course patches are always welcome!
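To illustrate the caching point in the list above: if per-collection settings existed, the cache key would have to depend on them, otherwise a PDF rendered with one filter mode could be served for a request that asked for another. The sketch below is one possible approach under that assumption. The function `cache_key`, its parameters, and the `image_filter` setting name are all hypothetical and do not describe the render servers' real caching code.

```python
import hashlib
import json


def cache_key(collection_id, settings):
    """Build a cache key that changes whenever the render settings change."""
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(
        {"collection": collection_id, "settings": settings},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# The same collection with different filter modes gets different keys.
key_default = cache_key("example-collection", {"image_filter": "blacklist"})
key_nofilter = cache_key("example-collection", {"image_filter": "nofilter"})
```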

In general, I'd highly recommend never starting sentences with "It should be trivial" in bug reports if you don't know the codebase by heart. ;)