As a product owner, I want to know the non-free image license density for Related Pages (aka Read More) results, so that I can determine whether I should apply a new treatment that makes the licensing clearer (requires lots of instrumentation and UX redesign), apply a new image algorithm without much additional instrumentation (easier), or apply a new image algorithm but instrument a lot and have a very carefully controlled A/B test (more work).
We don't actually know what to expect for a //representative// sample (as opposed to a random sample), so this is one of those cases of "we'll know it when we see it". But a //representative// sample is what's called for in this case.
##Acceptance criteria / steps
[ ] Update templates (CC @tgr) so that pertinent file pages on enwiki reflect rough known EDP status even more consistently. Re-parses need to have time to run.
[ ] Manually check with the Commons metadata API endpoint that a handful of images have the correct licensing disposition now that the templates are in better shape.
[ ] Using the technique in T120504#1900287, identify 10,000 articles from the top 0.5% of pages on enwiki mobile, using the month of January 2016 as the basis (bucketing in Hive will need to be adjusted as appropriate; please don't use the dr0ptp4kt Hive namespace; use your own).
[ ] For each article, identify the Related Pages result set. For each Related Pages result identify the images. For each image, confirm with the Commons metadata API endpoint the free / non-free licensing disposition (if no image then null out the image name and image disposition in the result set). If the image is not free, identify whether there is an alternative image in the page that is free (prop=images yields all files on page).
[ ] At the end we should have a table with the following columns: `title|recommendation_title|recommendation_image|is_image_free|alternative_free_image_name` - it should be 30,000 rows long (10,000 articles @ 3 Related Pages results per article). The alternative_free_image_name value should only have a value if the primary image was not free and a free alternative was found, otherwise it should be null.
[ ] If you're being clever, you may be able to batch requests and use generators, avoid repeating requests whose results were already fetched, and so on, but there's no need to be clever; what counts is that the script runs in a relatively brief period so that it's more of a snapshot in time.
[ ] As the `morelike` query backing Related Pages is known to be somewhat computationally expensive, don't knock down the application servers. As always, follow [[ https://www.mediawiki.org/wiki/API:Etiquette | API etiquette ]], including [[ https://www.mediawiki.org/wiki/API:Main_page#Identifying_your_client | clear identification of a custom User-Agent ]].
[ ] The output data should be provided on this task as a TSV/CSV.
[ ] In this task identify the following: (1) percentage of articles for which at least one image in the recommendations produced by Related Articles is non-free licensed under the current implementation, (2) percentage of all recommendations (i.e., across all 30,000 rows) which have an image that is non-free licensed under the current implementation, (3) percentage of all recommendations that have at least one free image.
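
For the three percentages in the last step, here is a minimal sketch of how they could be computed from the output table. It rests on several assumptions that are not specified in this task: the TSV has no header row; `is_image_free` is serialized as `true`, `false`, or empty (empty meaning the recommendation has no image); and metric (3) counts a recommendation as having a free image if either its primary image is free or `alternative_free_image_name` is populated. The function names `parse_rows` and `summarize` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Column order taken from the table described in the acceptance criteria.
COLUMNS = [
    "title",
    "recommendation_title",
    "recommendation_image",
    "is_image_free",
    "alternative_free_image_name",
]


def parse_rows(tsv_text):
    """Parse headerless TSV text into dicts (assumed encoding: true/false/empty)."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        fieldnames=COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # titles may contain literal quote characters
    )
    rows = []
    for rec in reader:
        flag = (rec["is_image_free"] or "").strip().lower()
        # None means "no image for this recommendation".
        rec["is_image_free"] = {"true": True, "false": False}.get(flag)
        alt = (rec["alternative_free_image_name"] or "").strip()
        rec["alternative_free_image_name"] = alt or None
        rows.append(rec)
    return rows


def summarize(rows):
    """Return the three percentages requested in the task, as floats."""
    if not rows:
        raise ValueError("no rows to summarize")

    article_has_non_free = defaultdict(bool)
    non_free_rows = 0
    free_image_rows = 0

    for row in rows:
        is_non_free = row["is_image_free"] is False
        article_has_non_free[row["title"]] |= is_non_free
        non_free_rows += is_non_free
        # Assumed reading of metric (3): primary is free, or a free alternative exists.
        free_image_rows += (
            row["is_image_free"] is True
            or row["alternative_free_image_name"] is not None
        )

    total_rows = len(rows)
    total_articles = len(article_has_non_free)
    return {
        "pct_articles_with_non_free": 100.0 * sum(article_has_non_free.values()) / total_articles,
        "pct_recommendations_non_free": 100.0 * non_free_rows / total_rows,
        "pct_recommendations_with_free_image": 100.0 * free_image_rows / total_rows,
    }
```

Calling `summarize(parse_rows(open("output.tsv", encoding="utf-8").read()))` (with `output.tsv` as a placeholder filename) would yield the three figures in one pass. If metric (3) is meant to count only primary images, drop the `alternative_free_image_name` condition.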