Page MenuHomePhabricator

Estimate both beta and production downloads of Zim files
Closed, ResolvedPublic

Description

Use the data from the parent ticket to provide an upper bound (worse case scenario) and an estimate based on expected usage.

Numbers to come up with:

  • Total number of Zims
  • Total size of Zims
  • Distribution of Zim sizes (may need some work for this one)
  • Expected downloads per day (or hour / minute/ second - this is used to figure out number of connections)

Event Timeline

To clarify just in case it's useful: We can actually break out the stats from the parent task by beta vs. production. Here for the monthly Android app uniques in each of the specified languages ( T171764#3483505 , June 2017 ):

languageproductionbetaall_uniques
en29171071230273040134
de94755319340966893
es38965015320404970
fr36953510844380379
ru31925011931331181
it2970069480306486
pt19594737468233415
zh1940943980198074
ar91613470896321
fa60776301263788
hi39829351843347
id31565110532670
tr2111274921861
bn986295510817
ta69824627444
tl61201566276
mr48174025219
my23291402469
or44750497
yo39544
ig11213

Query (assumes for simplicity that everything that is not beta is production, ignoring e.g. alpha releases; we can pull more fine-grained stats if needed):

-- adapted from https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mobile_apps/uniques/monthly/generate_uniques_monthly.hql 
WITH mobile_apps_uuids_June AS
(
    SELECT
        normalized_host.project AS language,
        user_agent LIKE '%beta%' AS is_beta,
        COALESCE(x_analytics_map['wmfuuid'],
                 parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) AS uuid
    FROM wmf.webrequest
    WHERE user_agent LIKE('WikipediaApp%')
        AND NOT (user_agent LIKE '%iOS%' OR user_agent LIKE '%iPhone%') -- i.e. Android
        AND ((parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'action') = 'mobileview' AND uri_path == '/w/api.php')
            OR (uri_path LIKE '/api/rest_v1%' AND uri_query == ''))
        AND COALESCE(x_analytics_map['wmfuuid'],
                     parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
        AND webrequest_source IN ('text')
        AND year=2017
        AND month=6)
SELECT language,
SUM(IF(NOT is_beta, unique_count, 0)) AS production,
SUM(IF(is_beta, unique_count, 0)) AS beta,
SUM(unique_count) AS all_uniques
FROM (
    SELECT
        language,
        is_beta,
        COUNT(DISTINCT uuid) AS unique_count
    FROM mobile_apps_uuids_June
    WHERE language IN ('pt', 'en', 'es', 'ar', 'ig', 'yo', 'hi', 'my', 'mr', 'ta', 'bn', 'id', 'tr', 'ru', 'tl', 'de', 'it', 'fr', 'zh' ,'or', 'fa')
    GROUP BY language, is_beta ) AS bylangbeta
GROUP BY language
ORDER BY all_uniques DESC
-- Limit enforced by hive strict mapreduce setting.
-- 1000000000 == NO LIMIT !
LIMIT 1000000000
;
..
21 rows selected (4925.481 seconds)

Okay, I can get you to projected downloads per day in production. But, given the shakiness of some of these assumptions, I'm giving a midpoint in the middle of a range as big as an order of magnitude in either direction. I think we should proceed with the idea that these numbers are very preliminary until we have some beta data.

Feel free to poke at the model and try different download rates, etc if you want to see impact on total downloads. I've also got beta and monthly prod numbers the sheet:
https://docs.google.com/spreadsheets/d/1sOS6LCZSAFvf2-yM_lqTkruKR0OvvvngR0nQdXf98BE/edit#gid=2098347965

LanguageDaily DevicesProjected Downloads
en49953829972
de17830210698
it585703514
fr559393356
es536763221
ru508883053
pt323441941
zh16083965
ar9910595
fa8557513
hi3792228
id3631218
tr1777107
bn97959
ta61537
mr54233
tl26416
my20112
or352
yo20
ig10

Lets talk assumptions:

  • This model assumes between 1% and 10% of users in a given time will download something, the number above are for 5%
  • 10% is taken from the Reading Lists feature usage (in May), which has a different target audience, but is the closet analog we have
  • It assumes each user who uses this feature will download 1.2 packs. This is based on the assumption that most users will download one pack (2 of the packs overlap, and one is highly targeted at medical users), but some will download 2, and a very few will download 3 or more.
  • This is a feature aimed at attracting NEW readers, while these numbers are based on the existing user base. I would expect some languages here will have disproportionately higher use than others (for example i could see Arabic or Hindi downloads being at a much higher rate than German, depending on marketing and distribution). However for purposes of this capacity estimation I assumed a static world of languages and even interest across the languages.

Thanks for the numbers @JMinor!

For beta numbers should I just use a ratio of beta users to production users?

I think so. Tilman gave daily beta numbers, so that gives a rough ratio. It looks to be about 5% average across these languages. (though some languages like pt are outliers with a beta almost 1/5 the size of its prod audience-- weird)

I'd maybe bump it a bit to be safe, and because we've seen Android's beta users tend to engage at a higher rate with features than its general user base (cf link previews and wikidata edits). Maybe assume beta will be 10% of production?

I think so. Tilman gave daily beta numbers, so that gives a rough ratio.

To clarify just in case, the query above gave monthly numbers, not daily ones. (As noted earlier on the other task, we went a bit back and forth between daily and monthly - I can certainly run the beta vs. prod query for daily numbers too if that's useful.)

@Fjalapeno Do you need any additional data from us here?