Page MenuHomePhabricator

Gather data to estimate Zim file downloads
Closed, ResolvedPublic

Description

In order to provide estimates of the number of downloads for Zim files, we should assemble the following data:

For each proposed language, gather the follow metrics for the last month:

  1. What is the % of total page views for this language within the Android app? (Captured in page view logging)
  2. What is the number of devices (and % of total devices) set to this language when the android app launches (Captured in event logging)

What will be done with these numbers?
With these numbers we can estimate the maximum total number of users for each language. This should give us an upper bound for the number of downloads that are possible for any given language. From there we can estimate actual use of the feature and adjust the numbers down accordingly.

For example (very contrived):
We find that the 4% of devices are set to "es" on launch.
In Beta we have 100,000 users.
So we can estimate the maximum number of beta users to download ZIm files to be 4000.
If we decide to have 4 Zim files available for Spanish, then we can estimate the maximum number of Zim downloads to be 16000 in Spanish

@JMinor @Tbayer can you verify that the above is possible and useful for the described purpose?

Event Timeline

What is the % of total page views for this language within the Android app? (Captured in page view logging)

@Fjalapeno take a look at pivot: http://pivot.wikimedia.org and you can see this data: language is not captured per-se but project is the proxy for that, example (kind of fast) pageviews per projects for the newest app versions (android and iOS) http://bit.ly/2uyr3sP

Screen Shot 2017-07-26 at 11.53.36 AM.png (1×2 px, 348 KB)

I understand from @JMinor that we would like to have separate numbers for beta and production in each case, correct? And what are the languages of interest?

Regarding #1: As Nuria pointed out, this can be glanced fairly conveniently from Pivot (remove the time split to get the sum over the entire timespan), at least if one is comfortable restricting it to a few of the most recent versions in each case (beta/production), which should be fine for such an estimate. But let me know if that doesn't work out, I could run a query later.

Regarding #2: We would need to add new instrumentation or extend existing EL schemas (e.g. MobileWikiAppDailyStats) to record the app's set language. I would propose to instead use the number of unique installs (beta vs production) that access a given language version per day or month, which is fairly easy to get via a modification of the queries underlying our existing app uniques data. I should be able to do that by Friday if we decide to go with that option.

I understand from @JMinor that we would like to have separate numbers for beta and production in each case, correct?

@Tbayer We only need production numbers. We will use them to extrapolate beta usage.

And what are the languages of interest?

For languages of interest, that is up to @JMinor - I assume it will be dictated by the list on the spreadsheet in this ticket: T169905: Determine the list of content packs to make available in V1

@JMinor can you confirm that you want him to measure all the proposed languages in that sheet?

Regarding #1: As Nuria pointed out, this can be glanced fairly conveniently from Pivot (remove the time split to get the sum over the entire timespan), at least if one is comfortable restricting it to a few of the most recent versions in each case (beta/production), which should be fine for such an estimate. But let me know if that doesn't work out, I could run a query later.

Thanks (and thanks @Nuria) checking this out!

Regarding #2: We would need to add new instrumentation or extend existing EL schemas (e.g. MobileWikiAppDailyStats) to record the app's set language. I would propose to instead use the number of unique installs (beta vs production) that access a given language version per day or month, which is fairly easy to get via a modification of the queries underlying our existing app uniques data. I should be able to do that by Friday if we decide to go with that option.

I think we want to go with this to get better estimates… your proposal seems like a good option to work with the data we have. I think this is how @JMinor wants to proceed as well

For #1 I think its sufficient for a first order magnatude, and I think all languages we care about right now are there. I didn't manage my time well this week and won't be able to do this today, and am on vacation tomorrow, though. Tilman or Corey can y'all pull these numbers from pivot?

BTW. the beta request came from me. I thought if it was easy it would be better to provide both numbers for potential "slow roll out" planning. But, yeah, its maybe too much extra complication and can be estimated from prod data.

For #2 I think this is a better way to go and will yeild more accurate estimate of potential user base. Please run those numbers as you've proposed.

Regarding #2, here is the number of app installs accessing each Wikipedia language listed in the spreadsheet (daily average for July 21-27). The overall daily average is about 1 million currently.

languageavg_daily_uniques
en499538
de178302
it58570
fr55939
es53676
ru50888
pt32344
zh16083
ar9910
fa8557
hi3792
id3631
tr1777
bn979
ta615
mr542
tl264
my201
or35
yo2
ig1

Data from

-- adapted from https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mobile_apps/uniques/daily/generate_uniques_daily.hql :

WITH mobile_apps_uuids_lastweek AS
(
    SELECT
        year,
        month,
        day,
        normalized_host.project AS language,
        COALESCE(x_analytics_map['wmfuuid'],
                 parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) AS uuid
    FROM wmf.webrequest
    WHERE user_agent LIKE('WikipediaApp%')
        AND NOT (user_agent LIKE '%iOS%' OR user_agent LIKE '%iPhone%') -- i.e. Android
        AND ((parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'action') = 'mobileview' AND uri_path == '/w/api.php')
            OR (uri_path LIKE '/api/rest_v1%' AND uri_query == ''))
        AND COALESCE(x_analytics_map['wmfuuid'],
                     parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
        AND webrequest_source IN ('text')
        AND year=2017
        AND month=7
        AND day >= 21 AND day <= 27
)

SELECT 
    language,
    INT(AVG(unique_count)) AS avg_daily_uniques
FROM (
SELECT
    year,
    month,
    day,
    language,
    COUNT(DISTINCT uuid) AS unique_count
FROM mobile_apps_uuids_lastweek
WHERE language IN ('pt', 'en', 'es', 'ar', 'ig', 'yo', 'hi', 'my', 'mr', 'ta', 'bn', 'id', 'tr', 'ru', 'tl', 'de', 'it', 'fr', 'zh' ,'or', 'fa')
GROUP BY
    year,
    month,
    day,
    language ) AS dailylist
GROUP BY language
ORDER BY avg_daily_uniques DESC
-- Limit enforced by hive strict mapreduce setting.
-- 1000000000 == NO LIMIT !
LIMIT 1000000000
;

PS: I recalled that the task asked for monthly numbers, which also, on reflection, makes more sense than daily for the present purposes. So here is the equivalent for June 2017. The global number (all languages) was 5.7 million that month.

languageunique_count
en3040133
de966893
es404970
fr380379
ru331181
it306486
pt233415
zh198074
ar96321
fa63788
hi43347
id32670
tr21861
bn10817
ta7444
tl6276
mr5219
my2469
or497
yo44
ig13

Data from

-- adapted from https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mobile_apps/uniques/monthly/generate_uniques_monthly.hql :
WITH mobile_apps_uuids_June AS
(
    SELECT
        normalized_host.project AS language,
        COALESCE(x_analytics_map['wmfuuid'],
                 parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) AS uuid
    FROM wmf.webrequest
    WHERE user_agent LIKE('WikipediaApp%')
        AND NOT (user_agent LIKE '%iOS%' OR user_agent LIKE '%iPhone%') -- i.e. Android
        AND ((parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'action') = 'mobileview' AND uri_path == '/w/api.php')
            OR (uri_path LIKE '/api/rest_v1%' AND uri_query == ''))
        AND COALESCE(x_analytics_map['wmfuuid'],
                     parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
        AND webrequest_source IN ('text')
        AND year=2017
        AND month=6 )
SELECT
    language,
    COUNT(DISTINCT uuid) AS unique_count
FROM mobile_apps_uuids_June
WHERE language IN ('pt', 'en', 'es', 'ar', 'ig', 'yo', 'hi', 'my', 'mr', 'ta', 'bn', 'id', 'tr', 'ru', 'tl', 'de', 'it', 'fr', 'zh' ,'or', 'fa')
GROUP BY language
ORDER BY unique_count DESC
-- Limit enforced by hive strict mapreduce setting.
-- 1000000000 == NO LIMIT !
LIMIT 1000000000
;
...
21 rows selected (4410.989 seconds)

@Tbayer thanks for the data!

@JMinor and I will be using this to get at some projections this week

Cool - and regarding #1, just in case it didn't work out with Pivot, here are the pageview numbers from the app for these languages in June 2017:

projectviews
en.wikipedia70743751
de.wikipedia25442444
it.wikipedia8683845
ru.wikipedia8327161
pt.wikipedia7527393
fr.wikipedia6619835
es.wikipedia6557854
zh.wikipedia3093771
ar.wikipedia1812594
fa.wikipedia1509618
id.wikipedia408436
hi.wikipedia300619
tr.wikipedia272477
bn.wikipedia103332
ta.wikipedia57539
mr.wikipedia40158
tl.wikipedia28660
my.wikipedia22271
or.wikipedia4200
yo.wikipedia76
ig.wikipedia20

Data from

SELECT project, SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE agent_type = 'user' AND year = 2017 AND month = 6
AND user_agent_map['os_family'] = 'Android' AND access_method = 'mobile app'
AND REGEXP_REPLACE(project, '.wikipedia', '') IN ('pt', 'en', 'es', 'ar', 'ig', 'yo', 'hi', 'my', 'mr', 'ta', 'bn', 'id', 'tr', 'ru', 'tl', 'de', 'it', 'fr', 'zh' ,'or', 'fa')
GROUP BY project
ORDER BY views DESC LIMIT 10000;

...

BTW. the beta request came from me. I thought if it was easy it would be better to provide both numbers for potential "slow roll out" planning. But, yeah, its maybe too much extra complication and can be estimated from prod data.

Seems we went a bit back on forth on this.... I have now broken out the monthly uniques from above into beta vs. production: T172137#3487852 (it wasn't a lot of work, just took a little while to run).

Tbayer claimed this task.

PS: It looks like we are done here, but absolutely feel free to reopen the task in case more detail is needed.