
Calculate non-free image license density for Related Pages on large enwiki mobile web representative sample of articles
Closed, Resolved · Public · 3 Story Points

Description

As a product owner, I want to know the non-free image license density for Related Pages (aka Read More) results, so that I can determine whether I should apply a new treatment that makes the licensing clearer (requires lots of instrumentation and UX redesign), apply a new image algorithm without much additional instrumentation (easier), or apply a new image algorithm but instrument a lot and have a very carefully controlled A/B test (more work).

We don't actually know what to expect for a representative sample (as opposed to a random sample), so this is one of those cases of "we'll know it when we see it". But a representative sample is what's called for in this case.

Acceptance criteria / steps

  • Update templates (CC @Tgr) so that pertinent file pages on enwiki reflect rough known EDP status even more consistently. Re-parses need to have time to run.

[x] Manually check with the Commons metadata API endpoint that a handful of images have the correct licensing disposition now that the templates are better in place (see the API sketch after this list).

  • Using the technique in T120504#1900287, identify 10,000 articles from the top 0.5% of pages on enwiki mobile, using the month of January 2016 as the basis (bucketing in Hive will need to be adjusted as appropriate; please don't use the dr0ptp4kt Hive namespace; use your own).
  • For each article, identify the Related Pages result set. For each Related Pages result, identify the images. For each image, confirm with the Commons metadata API endpoint the free / non-free licensing disposition (if there is no image, null out the image name and image disposition in the result set). If the image is not free, identify whether there is an alternative image in the page that is free (prop=images yields all files on the page).

JR: after talking to Adam we agreed this is unnecessary and simply running on the sample of 10,000 is enough.

  • At the end we should have a table with the following columns: title|recommendation_title|recommendation_image|is_image_free|alternative_free_image_name - it should be 30,000 rows long (10,000 articles @ 3 Related Pages results per article). The alternative_free_image_name value should only have a value if the primary image was not free and a free alternative was found, otherwise it should be null.
  • If you're being clever, you may be able to batch requests and use generators, avoid repeats where the result was already fetched earlier, and so on, but there's no need to be clever; what counts is that the script runs in a relatively brief period so that it's more of a snapshot in time.
  • As the morelike query backing Related Pages is known to be somewhat computationally expensive, don't knock down the application servers. As always, follow API etiquette, including clear identification of a custom User-Agent. Given the number of API requests needed, we avoid calling morelike entirely for this reason.
  • The output data should be provided on this task as a TSV/CSV.
  • In this task identify the following: (1) percentage of articles for which at least one image in the recommendations produced by Related Articles is non-free licensed under the current implementation, (2) percentage of all recommendations (i.e., across all 30,000 rows) which have an image that is non-free licensed under the current implementation, (3) percentage of all recommendations that have at least one free image.
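For the manual spot-check criterion above, a minimal sketch of the call could look like the following (Python purely for illustration; LicenseShortName and NonFree are the extmetadata fields exposed by the CommonsMetadata extension, and the file title is just a placeholder):

# Hypothetical spot-check: fetch extmetadata for one file page and print its
# license name plus whether the non-free flag is set.
import requests

API = 'https://en.wikipedia.org/w/api.php'
HEADERS = {'User-Agent': 'license-spot-check/0.1 (example; add real contact info)'}

resp = requests.get(API, params={
    'action': 'query',
    'titles': 'File:Example.jpg',   # placeholder; substitute a file page to check
    'prop': 'imageinfo',
    'iiprop': 'extmetadata',
    'format': 'json',
    'formatversion': 2,
}, headers=HEADERS)

for page in resp.json()['query']['pages']:
    meta = (page.get('imageinfo') or [{}])[0].get('extmetadata', {})
    license_name = meta.get('LicenseShortName', {}).get('value')
    non_free = bool(meta.get('NonFree', {}).get('value'))
    print(page['title'], license_name, 'non-free' if non_free else 'free/unknown')

Running this against a handful of files with known dispositions (one free, one fair-use) is enough to confirm the template updates have propagated.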

Event Timeline

dr0ptp4kt created this task.Feb 5 2016, 2:07 PM
dr0ptp4kt raised the priority of this task from to Needs Triage.
dr0ptp4kt updated the task description. (Show Details)
dr0ptp4kt added a subscriber: dr0ptp4kt.
Restricted Application added subscribers: StudiesWorld, Aklapper. Feb 5 2016, 2:07 PM
dr0ptp4kt renamed this task from Calculate non-free image license density on large enwiki mobile web representative sample of articles to Calculate non-free image license density for Related Articles on large enwiki mobile web representative sample of articles.Feb 5 2016, 2:49 PM
dr0ptp4kt updated the task description. (Show Details)
dr0ptp4kt set Security to None.
dr0ptp4kt added a subscriber: Tgr.

@Tgr, do you have a sense roughly how hard it is to update the templates and when you might be able to do that?

dr0ptp4kt triaged this task as High priority.Feb 5 2016, 3:01 PM
dr0ptp4kt added a project: PageImages.
dr0ptp4kt added a project: RelatedArticles.

I wrote a Node script that polled the recent changes feed for Wikipedia edits to get a sense of where non-free images are being served as page images. The sample of titles taken represents what our editors care about editing, which you'd expect to be somewhat related to what our readers care about reading.

I started with a sample size of 40, which is quick to collect, to check that the script was working.

I then upped the sample size to 1000 to get a more accurate idea. It took a bit longer to run (as this script's execution time depends on how quickly enwiki is being edited :-))
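For illustration, the sampling step amounts to something like the sketch below (Python for brevity; the actual script was written in Node and isn't attached to this task). The license check for each title's page image would then use the same extmetadata call sketched in the description above.

# Hypothetical sketch of the sampling step: collect distinct article titles from
# the recent-changes feed until the desired sample size is reached.
import requests

API = 'https://en.wikipedia.org/w/api.php'
HEADERS = {'User-Agent': 'rc-pageimage-sample/0.1 (example)'}

def sample_titles(n):
    titles, cont = [], None
    while len(titles) < n:
        params = {'action': 'query', 'list': 'recentchanges',
                  'rcnamespace': 0, 'rctype': 'edit', 'rclimit': 'max',
                  'format': 'json', 'formatversion': 2}
        if cont:
            params['rccontinue'] = cont
        data = requests.get(API, params=params, headers=HEADERS).json()
        for rc in data['query']['recentchanges']:
            if rc['title'] not in titles:   # keep the sampled titles distinct
                titles.append(rc['title'])
        cont = data.get('continue', {}).get('rccontinue')
    return titles[:n]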

The results:
  • 38% of titles have no page image.
  • 62% of titles have a page image.
  • 18% of those page images are non-free images.

This gives an idea of the impact of restricting page images to only provide free ones per the wiki policy - about 1 in 5 of our page images are non-free.

Note: for those pages that use non-free images I didn't check whether their articles have other images, so please don't read this as meaning 18% of page images would suddenly disappear. The script could theoretically be extended to at least get a sense of what % of those articles with non-free images have a free image inside the article.

phuedx added a subscriber: phuedx.Feb 8 2016, 5:26 PM

This is blocked on actually having the representative sample.

bd808 edited a custom field.Feb 8 2016, 5:31 PM

To be clarified:

  • whether to look for fallback image
dr0ptp4kt updated the task description. (Show Details)Feb 8 2016, 8:56 PM

I added verbiage about looking for the fallback image. T120504#1900287 contains information on how to derive a representative sample of titles for an engineer with Hive access.

Tgr added a comment.Feb 8 2016, 11:24 PM

@Tgr, do you have a sense roughly how hard it is to update the templates and when you might be able to do that?

It's easy, maybe half an hour of template editing, but my hands are full at the moment. My initial impression was that there are other blockers, plus this will be a pretty slow project (several weeks) anyway. If you think I'm the bottleneck, please poke me and I'll find the time.

Tgr added a comment.Feb 8 2016, 11:45 PM

This is blocked on actually having the representative sample.

Which should be easy as it's (or it should be!) a common task.
T126290: Improve joining mechanism between webrequest data and edit data for i.e. sampling pageviews

@Tgr, we were indeed hoping to get that 30 minutes of your template editing time at the start of the upcoming sprint, which starts 15-February-2016, or shortly before in case time avails itself. Do you think that would be possible?

Tgr added a comment.Feb 9 2016, 1:51 AM

If the SessionManager deployment this week goes better than the last two attempts then definitely.

dr0ptp4kt renamed this task from Calculate non-free image license density for Related Articles on large enwiki mobile web representative sample of articles to Calculate non-free image license density for Related Pages on large enwiki mobile web representative sample of articles.Feb 15 2016, 6:28 PM
dr0ptp4kt updated the task description. (Show Details)
Jdlrobson changed the task status from Open to Stalled.Feb 17 2016, 6:40 PM

To make this possible we need two things:

  • A list of articles representing the sample
  • A way to ask the API in one request for all images inside an article and whether they are non-free (blocked on @Tgr?); one possible request shape is sketched below

^ @dr0ptp4kt can you help unblock this?
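For the second point, one possible request shape (a sketch only, not necessarily what was eventually built): generator=images combined with prop=imageinfo and iiprop=extmetadata returns the CommonsMetadata licensing fields for every file used on a page in a single response, modulo continuation for very image-heavy articles.

# Hypothetical single-request check: list every file on an article and whether
# CommonsMetadata flags it as non-free.
import requests

API = 'https://en.wikipedia.org/w/api.php'
HEADERS = {'User-Agent': 'article-image-license-check/0.1 (example)'}

resp = requests.get(API, params={
    'action': 'query',
    'titles': 'Example article',   # placeholder article title
    'generator': 'images',
    'gimlimit': 'max',
    'prop': 'imageinfo',
    'iiprop': 'extmetadata',
    'format': 'json',
    'formatversion': 2,
}, headers=HEADERS)

for page in resp.json().get('query', {}).get('pages', []):
    meta = (page.get('imageinfo') or [{}])[0].get('extmetadata', {})
    print(page['title'], 'non-free' if meta.get('NonFree', {}).get('value') else 'free/unknown')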

@Tgr, @bmansurov, requests for you:

For the template updates, @Tgr, do you think you could find a little time in the next few business days to help with the following acceptance criteria?

  • Update templates (CC @Tgr) so that pertinent file pages on enwiki reflect rough known EDP status even more consistently. Re-parses need to have time to run.

For the list of articles, @bmansurov, do you have Hive access that would make it possible for you to work on the following acceptance criteria? (This is independent of @Tgr's work, but is needed for the scripting, of course.)

  • Using the technique in T120504#1900287, identify 10,000 articles from the top 0.5% of pages on enwiki mobile, using the month of January 2016 as the basis (bucketing in Hive will need to be adjusted as appropriate; please don't use the dr0ptp4kt Hive namespace; use your own).

One note: in the part where T120504#1900287 divides by 30 you would divide by 10000 instead. I imagine for people coming back to this task later on it would be helpful if you provide the sequence of queries along with the result set (result set as an attachment).

bmansurov added a comment.EditedFeb 18 2016, 2:28 PM

@dr0ptp4kt, yes I have Hive access. I'll post my queries and results here.

Top 0.5% articles

FROM (
    SELECT page_title, SUM(view_count) ct
    FROM wmf.pageview_hourly
    WHERE year = 2016 and month = 1
        AND access_method = 'mobile web'
        AND agent_type = 'user'
        AND project = 'en.wikipedia'
        AND page_title <> '-' AND page_title <> 'Special:Search' AND page_title <> 'Special:MobileMenu'
    GROUP BY page_title
) t
SELECT PERCENTILE(CAST(t.ct AS BIGINT), 0.995);

Result: 8673.644999999553

Page views for pages with counts in the top 0.5% articles

FROM (
    SELECT page_title, SUM(view_count) ct
    FROM wmf.pageview_hourly
    WHERE year = 2016 and month = 1
        AND access_method = 'mobile web'
        AND agent_type = 'user'
        AND project = 'en.wikipedia'
        AND page_title <> '-' AND page_title <> 'Special:Search' AND page_title <> 'Special:MobileMenu'
    GROUP BY page_title
) t
SELECT SUM(t.ct) WHERE t.ct > 8673;

Result: 2151494243

Sum of all page views

SELECT SUM(view_count) ct
FROM wmf.pageview_hourly
WHERE year = 2016 and month = 1
    AND access_method = 'mobile web'
    AND agent_type = 'user'
    AND project = 'en.wikipedia'
    AND page_title <> '-' AND page_title <> 'Special:Search' AND page_title <> 'Special:MobileMenu';

Result: 3577475540, which means the page views of the top 0.5% of pages account for about 60% (2151494243 / 3577475540) of all page views.

How many rows are there in that top 0.5%?

FROM (
    SELECT page_title, SUM(view_count) ct
    FROM wmf.pageview_hourly
    WHERE year = 2016 and month = 1
        AND access_method = 'mobile web'
        AND agent_type = 'user'
        AND project = 'en.wikipedia'
        AND page_title <> '-' AND page_title <> 'Special:Search' AND page_title <> 'Special:MobileMenu'
    GROUP BY page_title
) t
SELECT COUNT(1) WHERE t.ct > 8673;

Result: 64155

Number of buckets: 64155 / 10,000 = 6.4155, rounded down to 6 buckets of roughly 10,000 titles each

Create a table with 6 buckets, so that we can sample from it

USE bmansurov;
DROP TABLE IF EXISTS toppages;
CREATE EXTERNAL TABLE toppages(`title` STRING, `ct` BIGINT)
CLUSTERED BY(title) INTO 6 BUCKETS
LOCATION '/user/bmansurov/data/';

INSERT OVERWRITE TABLE toppages
SELECT page_title, ct FROM (
    SELECT page_title, SUM(view_count) ct
    FROM wmf.pageview_hourly
    WHERE year = 2016 and month = 1
        AND access_method = 'mobile web'
        AND agent_type = 'user'
        AND project = 'en.wikipedia'
        AND page_title <> '-' AND page_title <> 'Special:Search' AND page_title <> 'Special:MobileMenu'
    GROUP BY page_title
) t
WHERE t.ct > 8673 ORDER BY ct DESC LIMIT 1000000;

Get the top 10,000

SELECT * FROM toppages LIMIT 10000;

Result:

Get 10,000 from bucket 1

SELECT * FROM toppages TABLESAMPLE (BUCKET 1 OUT OF 6 ON RAND()) LIMIT 10000;

Result:

(The result above is a pseudorandom sample.)

Note: Feel free to change the bucket number in the last query to any other integer up to 6 inclusive to get a different list of sampled articles.

Just waiting on @Tgr then (to be fair he didn't commit to anything) so I leave this with you @dr0ptp4kt to work out if we can do this during this sprint.

Tgr added a comment.Feb 19 2016, 11:28 PM

Just waiting on @Tgr then (to be fair he didn't commit to anything) so I leave this with you @dr0ptp4kt to work out if we can do this during this sprint.

Done (test). The actual file pages are going to be updated from the job queue, which might take a while depending on how many pages these templates are transcluded on.

Tgr updated the task description. (Show Details)Feb 19 2016, 11:28 PM
bmansurov updated the task description. (Show Details)Feb 22 2016, 6:35 PM
Jdlrobson changed the task status from Stalled to Open.Feb 22 2016, 10:32 PM
Jdlrobson claimed this task.
Jdlrobson moved this task from To Do to Doing on the Reading-Web-Sprint-66-Harry is Tired board.

I've got a script set up which can run on Baha's sample.

Will take about 30-40 mins to run. Will have an answer tomorrow morning.

@Jdlrobson, forgive me if I missed it, but I don't see anything about related articles in your script.

@bmansurov I'm not sure why this would need to explicitly look at related pages, since this impacts other products too (and I'd say, given the performance issues with that feature, it would not be wise to look into). The script I have written looks at page images for the titles you gave me and reports on those page images.

Surely this is enough, @dr0ptp4kt? If not, it smells like over-engineering to me.

For the first 1000 pages in that sample:

  • 8% of titles have no page image
  • 9% of the titles in the sample that have a page image use a non-free image
  • 90% of the pages in the sample with a non-free page image have a free alternative

This is considerably lower than the results I got for taking a random sample of 1000 from the recent changes stream. Whether those alternative images represent the article is another question (for instance https://en.m.wikipedia.org/wiki/The_Wolf_of_Wall_Street_(2013_film)#/media/File%3ALeonardo_DiCaprio_2014.jpg would be the new page image of https://en.m.wikipedia.org/wiki/The_Wolf_of_Wall_Street_(2013_film)).

(And yes, to be explicit, I've deviated from the task description, which to be honest seems like overkill.)

Note: for every 20 titles, generating this report needs 1 API request to calculate the current page image, 1 API request to check the non-free status of all of those page images, and 1 API request for every page image which is non-free (so worst case 20 API requests). So anywhere between 2 and 20 requests per batch of 20 titles.

So for a sample of 1000 titles, we would be looking at (1000/20)*2 = 100 API requests best case, worst case 1000 (50*20).

Using the morelike API would bloat this by roughly a factor of 3, plus 1 additional query per title to the morelike API.
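As a hedged illustration of that accounting (Python rather than the actual Node script; the batch size and the NonFree extmetadata flag are the same assumptions as in the earlier sketches):

# Hypothetical batched version: one pageimages request per batch of titles and
# one extmetadata request to classify that batch's page images. The third kind
# of request (one per non-free page image, not shown) would list the article's
# other files via prop=images and re-check those for a free alternative.
import requests

API = 'https://en.wikipedia.org/w/api.php'
HEADERS = {'User-Agent': 'pageimage-license-report/0.1 (example)'}
BATCH = 20  # later raised to 50, the usual per-request title limit

def api_get(params):
    params = dict(params, format='json', formatversion=2)
    return requests.get(API, params=params, headers=HEADERS).json()

def page_images(titles):
    # Request 1: the page image (if any) for each title in the batch.
    data = api_get({'action': 'query', 'titles': '|'.join(titles),
                    'prop': 'pageimages', 'piprop': 'name', 'pilimit': 'max'})
    return {p['title']: 'File:' + p['pageimage']
            for p in data['query']['pages'] if 'pageimage' in p}

def non_free_files(files):
    # Request 2: classify the batch's page images via CommonsMetadata.
    data = api_get({'action': 'query', 'titles': '|'.join(files),
                    'prop': 'imageinfo', 'iiprop': 'extmetadata'})
    return {p['title'] for p in data['query']['pages']
            if (p.get('imageinfo') or [{}])[0]
                .get('extmetadata', {}).get('NonFree', {}).get('value')}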

Tgr added a comment.Feb 23 2016, 2:20 AM

So that means something like 1% of pages would lose their images because of this?

Jdlrobson updated the task description. (Show Details)Feb 23 2016, 7:13 PM

So that means something like 1% of pages would lose their images because of this?

For the first 1000 of those 10,000 articles, yes.
I chatted with Adam and he agreed that we shouldn't need to run on morelike, as running my script on 10,000 pages with morelike would be far too expensive.

I've tweaked my script: rather than 20, it can now query 50 titles in one API request.

I ran it for the 10,000 articles and these are the results:
  • 16% of titles have no page image.
  • 4% of the titles in the sample that have a page image use a non-free image.
  • 92% of the pages in the sample with a non-free page image have a free alternative.

CSV for your browsing pleasure:

My read of the data was as follows:

  • 16.95% (1695/10,000) without an image ("|N/A|")
  • 3.71% (371/10,000) with a fair use / non-free image ("|false|"). 32.6% (121/371) of those had File:Wikiquote-logo.svg as their fallback, which should be ruled out as a good fallback image. 7.5% (28/371) of them didn't have a free fallback ("|false|false"). 59.8% (222/371) of them did have a free fallback and those in general looked reasonable.

Stated differently, when there would otherwise normally be a hero image, around 1.79% of the time ((121+28)/(10,000-1,695)) there would be a less valuable outcome with a simple fallback algorithm. This is sufficiently low and I think we can move forward.

Jdlrobson closed this task as Resolved.Feb 27 2016, 12:36 AM

My read of the data was as follows:

  • 16.95% (1695/10,000) without an image ("|N/A|")
  • 3.71% (371/10,000) with a fair use / non-free image ("|false|"). 32.6% (121/371) of those had File:Wikiquote-logo.svg as their fallback, which should be ruled out as a good fallback image.

Whoops. I could probably edit the script to see if I can rule those out, but this seems like high effort.
Per https://phabricator.wikimedia.org/T124225#2068423 I think we should close this task.