Page MenuHomePhabricator

Productionize per-country daily & monthly active app user stats
Closed, DuplicatePublic

Description

Background

We have the overall stats https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/mobile_apps_uniqueswmf.mobile_apps_uniques_daily (DAU) and wmf.mobile_apps_uniques_monthly (MAU) – and @Tbayer manually maintains a tbayer.mobile_apps_uniques_by_country – which uses a per-country DAU query (see Appendix) based on Analytics Engineering's global DAU query – but we have no productionized per-country MAU or DAU (which would be used to calculate the two apps' "stickiness %" in different markets).

While I can run a query that calculates per-country DAU for the past 2 months (using 60 days of request data that we have) and use Tilman's data as part of my work on T184089, it would be tremendously helpful to have that data automated and available in wmf database so that we could track longer term trends.

End goal

  • wmf.mobile_apps_uniques_daily_by_country
  • wmf.mobile_apps_uniques_monthly_by_country

Please and thank you!

Optional

Something to think about and maybe have additional tickets for:

  • If this can be made available in Druid, it would be great if in addition to country names there were also regions, continents, and maybe primary-language-in-country data so that any dashboards we make in Superset could use a meaningful grouping of countries. I'm sure @Charlotte @JMinor and @JKatzWMF would appreciate having that :)
  • Some mobile app makers calculate MAU on a daily basis by using a rolling 30-day window. It might be too computationally intensive for us, though.
  • Maybe we should also have a DOD retention metric (global & by country)? From https://www.kumulos.com/2016/11/22/mau-vs-dau-measure-mobile-app-retention/:

What is day-over-day (DOD) retention for measuring mobile app performance?
DOD retention is much like the MAU/DAU ratio, but provides a detailed snapshot metric. With DOD retention analysis of mobile app user behavior, you simply establish the ratio of the app users who are active today and yesterday, relative to the number of app users who were only active yesterday. This retention measurement ratio symbolizes the proportion of users who are liable to use your client’s app regularly in a short timeframe. In order to calculate this ratio, you establish the number of users who were active yesterday and today, and divide by the total number of users who were only active yesterday.

(Although there is MobileWikiAppDailyStats EL schema that we can use to estimate this.)

Appendix

Tilman uses the following bash script to maintain the per-country DAU table (adapted from the existing global query)::

#! /bin/bash
DESCRIPTION="update mobile apps uniques by country"
QUERY="
SET mapred.job.queue.name=nice;
-- Set compression codec to gzip to provide asked format
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

CREATE EXTERNAL TABLE IF NOT EXISTS tbayer.mobile_apps_uniques_by_country
(
    \`year\`                 int     COMMENT 'Unpadded year of request',
    \`month\`                int     COMMENT 'Unpadded month of request',
    \`day\`                  int     COMMENT 'Unpadded day of request',
    \`platform\`             string  COMMENT 'Mobile platform from user agent parsing',
    \`country\`              string  COMMENT 'Country name of the accessing agents (computed using the MaxMind GeoIP database)',
    \`country_code\`         string  COMMENT '2-letter country code (based on ISO 3166)',
    \`unique_count\`         bigint  COMMENT 'Distinct uuid count'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION '/user/tbayer/mobile_apps_uniques_by_country'
;

WITH mobile_apps_uuids_new AS
(
    SELECT
        year,
        month,
        day,
        CASE WHEN (user_agent LIKE '%iOS%' OR user_agent LIKE '%iPhone%') THEN 'iOS' ELSE 'Android' END AS platform,
        geocoded_data['country'] AS country,
        geocoded_data['country_code'] AS country_code,
        COALESCE(x_analytics_map['wmfuuid'],
                 parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) AS uuid
    FROM wmf.webrequest
    WHERE user_agent LIKE('WikipediaApp%')
        AND ((parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'action') = 'mobileview' AND uri_path == '/w/api.php')
            OR (uri_path LIKE '/api/rest_v1%' AND uri_query == ''))
        AND COALESCE(x_analytics_map['wmfuuid'],
                     parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
        AND webrequest_source IN ('text')
        AND year=2018
        AND month=1
)
INSERT INTO TABLE tbayer.mobile_apps_uniques_by_country
SELECT
    year,
    month,
    day,
    platform,
    country,
    country_code,
    unique_count
FROM
    (
        SELECT
            year,
            month,
            day,
            platform,
            country,
            country_code,
            COUNT(DISTINCT uuid) AS unique_count
        FROM
            mobile_apps_uuids_new
        GROUP BY
            year,
            month,
            day,
            platform,
            country, country_code
    ) old_union_new_uniques_daily
ORDER BY
    year,
    month,
    day,
    platform,
    country, country_code
-- Limit enforced by hive strict mapreduce setting.
-- 1000000000 == NO LIMIT !
LIMIT 1000000000
;
"
THISSCRIPTFILE=`basename "$0"`
RESULTSFILE=~/${THISSCRIPTFILE%.*}_result.txt
{ date ; echo = ; TZ='America/Los_Angeles' date ; echo generated by $THISSCRIPTFILE on $HOSTNAME ; beeline --verbose=true -e "$QUERY" ; } &> $RESULTSFILE
cat $RESULTSFILE | mailx -s "$DESCRIPTION (Hive result from $HOSTNAME)" tbayer@wikimedia.org

Event Timeline

mpopov created this task.Feb 8 2018, 9:00 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 8 2018, 9:00 PM
mpopov updated the task description. (Show Details)Feb 8 2018, 9:12 PM

Having the per-country unique app stats in production would be very helpful for future work, given our annual planning put more focus on geographies and languages.

Milimetric moved this task from Incoming to Dashiki on the Analytics board.Feb 12 2018, 4:29 PM
Milimetric moved this task from Dashiki to Incoming on the Analytics board.Apr 2 2018, 3:32 PM
Milimetric moved this task from Dashiki to Incoming on the Analytics board.
Nuria added a comment.Apr 5 2018, 5:22 PM

Our preferred path here is that the data analysts team initiates this work and we support them. Given queries are done it just requires oozie.

Nuria triaged this task as Low priority.Apr 5 2018, 5:22 PM
Nuria moved this task from Incoming to Backlog (Later) on the Analytics board.
mpopov moved this task from Triage to Backlog on the Product-Analytics board.Apr 23 2018, 10:59 PM

Our preferred path here is that the data analysts team initiates this work and we support them. Given queries are done it just requires oozie.

I agree that there are many arguments for moving more data engineering work into the Audiences department; however, so far the practice has been that Analytics Engineering takes care of implementing Oozie jobs such as this. Compare also https://github.com/wikimedia/analytics-refinery/commits/master/oozie

In the meantime, I have launched the last update for tbayer.mobile_apps_uniques_by_country manually again.

chelsyx moved this task from Backlog to Doing on the Product-Analytics board.

Change 451566 had a related patch set uploaded (by Chelsyx; owner: Chelsyx):
[analytics/refinery@master] Add mobile_apps_uniques_by_country_daily and mobile_apps_uniques_by_country_monthly jobs in oozie

https://gerrit.wikimedia.org/r/451566

Hi @Nuria, could you review the change https://gerrit.wikimedia.org/r/451566 when you have a chance? I've tested the job under my account, see oozie job rMW0048633ac4ca-180705103628398-oozie-oozi-C and 0049964-180705103628398-oozie-oozi-C

Also, after you review the patch, can you help move the data from chelsyx.mobile_apps_uniques_monthly_by_country to wmf.mobile_apps_uniques_monthly_by_country? chelsyx.mobile_apps_uniques_monthly_by_country contains data back to 2017, which is from tbayer.mobile_apps_uniques_by_country.

Thanks!

Nuria added a comment.Aug 9 2018, 4:38 PM

@chelsyx thanks for working on this. Will review as is but let me outline an approach that would have survived the test of time better: we could have created a "tag" (see: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest/Tagging) for when a pageview is a mobile pageview, that way all this code:

AND ((parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'action') = 'mobileview' AND uri_path == '/w/api.php')
       OR (uri_path LIKE '/api/rest_v1%' AND uri_query == ''))
   AND COALESCE(x_analytics_map['wmfuuid'],
                parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
   AND webrequest_source IN ('text')

could be abstracted to a tag so 1) that where clause could be changed to "where tags include ' mobile-pageview' ". This is what we do to , for example, identify wqds requests, we tag them when we refine and subsequent jobs that use that data do not need to do costly regexes. The wdqs tag as an example:

https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/webrequest/tag/WDQSTagger.java

See a similar tag for portal pageviews:

https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/webrequest/tag/PortalTagger.java

@chelsyx thanks for working on this. Will review as is but let me outline an approach that would have survived the test of time better: we could have created a "tag" (see: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest/Tagging) for when a pageview is a mobile pageview, that way all this code:

AND ((parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'action') = 'mobileview' AND uri_path == '/w/api.php')
       OR (uri_path LIKE '/api/rest_v1%' AND uri_query == ''))
   AND COALESCE(x_analytics_map['wmfuuid'],
                parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
   AND webrequest_source IN ('text')

For context, the code in question comes from the Analytics Engineering team's query to calculate the global version of this data, and I can see strong arguments for keeping this new per-country query consistent with that.
So I understand your remarks are about how the Analytics Engineering team could have approached this differently back in 2015 if the tagging infrastructure had been around already.

could be abstracted to a tag so 1) that where clause could be changed to "where tags include ' mobile-pageview' ".

Its usefulness as a general tag would be limited though, considering that it only captures app views where the user has opted in/not opted of data collection, as opposed to the general access_method = 'mobile app'.

This is what we do to , for example, identify wqds requests, we tag them when we refine and subsequent jobs that use that data do not need to do costly regexes. The wdqs tag as an example:

https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/webrequest/tag/WDQSTagger.java

See a similar tag for portal pageviews:

https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/webrequest/tag/PortalTagger.java

I seem to recall we already moved away from the tagging system for other purposes (in particular that its use for measuring page previews was initially proposed as an example use case, but later discouraged).

@Nuria Thanks for the suggestion and @Tbayer thanks for the insights.

we could have created a "tag" (see: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest/Tagging) for when a pageview is a mobile pageview, that way all this code:

AND ((parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'action') = 'mobileview' AND uri_path == '/w/api.php')
       OR (uri_path LIKE '/api/rest_v1%' AND uri_query == ''))
   AND COALESCE(x_analytics_map['wmfuuid'],
                parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
   AND webrequest_source IN ('text')

could be abstracted to a tag

I want to point out that the above code doesn't filter mobile pageviews only. URI path like /api/rest_v1% could be used to fetch feed or reading lists updates (see T193917#4237597 for more info). So even if a user didn't open the app that day, the app could still fetch feed and reading lists updates while in the background and be counted as a device in our table.

I'm wondering the reason why we want to use mobile app pageviews only when counting the mobile apps uniques. It seems to make more sense to filter using access_method = 'mobile app' only instead of the above code.

chelsyx added a comment.EditedAug 15 2018, 9:30 PM

@mpopov and @Tbayer , I have a question for you two: What is the proper definition of "mobile apps unique devices"?

There seems to be 4 possibilities (maybe there are more...):

1, Any devices that send requests and have a uuid, regardless of intentional requests or unintentional request (e.g. when the app is in the background, request could be send to fetch the feed and the reading list)

WHERE access_method = 'mobile app'
AND COALESCE(x_analytics_map['wmfuuid'],
          parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
AND webrequest_source IN ('text')

2, Any devices that open the app (start a session), with or without pageview. It should be easy to get this number from eventlogging, but I haven't figure out how to filter it from webreqeust.

3, Any devices that has at least one pageview

WHERE access_method = 'mobile app'
AND is_pageview
AND COALESCE(x_analytics_map['wmfuuid'],
          parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
AND webrequest_source IN ('text')

4, The current query of https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/mobile_apps_uniques . The value of this defective query is for historical comparison.

WHERE user_agent LIKE('WikipediaApp%')
AND ((parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'action') = 'mobileview' AND uri_path == '/w/api.php')
          OR (uri_path LIKE '/api/rest_v1%' AND uri_query == ''))
AND COALESCE(x_analytics_map['wmfuuid'],
          parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
AND webrequest_source IN ('text')

I prefer the first one, but I agree that we need to compare with historical data sometimes. What do you think if we keep both 1 and 4 in two tables?

chelsyx moved this task from Doing to Epics on the Product-Analytics board.Aug 16 2018, 8:00 PM

1, Any devices that send requests and have a uuid, regardless of intentional requests or unintentional request (e.g. when the app is in the background, request could be send to fetch the feed and the reading list)

WHERE access_method = 'mobile app'
AND COALESCE(x_analytics_map['wmfuuid'],
          parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
AND webrequest_source IN ('text')

I prefer the first one, but I agree that we need to compare with historical data sometimes. What do you think if we keep both 1 and 4 in two tables?

I'm learning towards this method too. My reasoning: as far as I know we're primarily interested in how many unique app users we have in each country, so whether the request we see from someone is intentional or not (in case of a background fetch/sync) is irrelevant.

This may be controversial but I don't think we should have a second table. Having a second table full of faulty data just so we can compare it against faulty historical data doesn't seem right to me, and imo isn't worth the effort (setup + maintenance) nor the added confusion of "legacy counts". The old counts can stay in Tilman's personal database in Hive but we shouldn't replicate them in this productionized capacity.

@Nuria: if we go with the first method (which is the one @chelsyx and I have a preference for), what do you think should happen to the global query currently in production?

@Nuria: if we go with the first method (which is the one @chelsyx and I have a preference for), what do you think should happen to the global query currently in production?

It should get changed to the one above (which has a better rationale for how numbers are gathered), right? Let's first change these per-country calculations, document rationale of metric and after change the other jobs. Does this sound good?

For posterity:
In an ideal world the app will send and event every time a session is started (regardless of feature in display : feed, reading list or whatever article...) and this start-session event will be used to calculate uniques. That way: 1) unique users is tied to the notion of intentional views, 2) the app is in charge of deciding when a user should be counted and 3) using a method similar to unique-devices-last-access we could count all users without having to use app-install-ids which will give you precise numbers without needing an opt-in (maybe worth considering)

1, Any devices that send requests and have a uuid, regardless of intentional requests or unintentional request (e.g. when the app is in the background, request could be send to fetch the feed and the reading list)

WHERE access_method = 'mobile app'
AND COALESCE(x_analytics_map['wmfuuid'],
          parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
AND webrequest_source IN ('text')

I prefer the first one, but I agree that we need to compare with historical data sometimes. What do you think if we keep both 1 and 4 in two tables?

I'm learning towards this method too. My reasoning: as far as I know we're primarily interested in how many unique app users we have in each country, so whether the request we see from someone is intentional or not (in case of a background fetch/sync) is irrelevant.

It is irrelevant only if one doesn't care about the definition of "unique app user" ;) I think the industry standard here is rather to only count active users in the sense of people intentionally using the app, as opposed to it running in the background while they use their device for something else, or merely being installed on the device. (The last one would be more like Google's "installs on active devices".) Ultimately that seems like a question for the product owners, @Charlotte and @JMinor.

That said, I think there might be a pragmatic argument for 1. and 4. over 3. (i.e. counting non-pageview requests as activity too) if the difference isn't too large - has anyone tried to calculate it (beyond the related observations for iOS at T193917#4237597) ? To be sure, I think @chelsyx is right to call out the effect of these background requests for reading lists and feed, both features that didn't exist when the currently used query (4.) was first implemented back in 2015. I wonder if they have the same impact on both platforms - regarding Android, perhaps @Dbrant knows more about how often background requests are likely to happen on days where the user hasn't opened the app.

Tbayer updated the task description. (Show Details)Aug 21 2018, 2:28 PM

[...]

I prefer the first one, but I agree that we need to compare with historical data sometimes. What do you think if we keep both 1 and 4 in two tables?

[...]

This may be controversial but I don't think we should have a second table. Having a second table full of faulty data just so we can compare it against faulty historical data doesn't seem right to me, and imo isn't worth the effort (setup + maintenance) nor the added confusion of "legacy counts". The old counts can stay in Tilman's personal database in Hive but we shouldn't replicate them in this productionized capacity.

I'm not quite sure what this has to do with my personal table per se - to be extra clear just in case, the global numbers in wmf.mobile_apps_uniques_daily_by_country and wmf.mobile_apps_uniques_monthly_by_country have been calculated with method 4 in production since 2015. tbayer.mobile_apps_uniques_by_country is just a by-country version of the former.

Right now we (in fact, you ;)) are still reporting the Android number and year-over-year changes from wmf.mobile_apps_uniques_monthly_by_country every month at https://www.mediawiki.org/wiki/Wikimedia_Audiences#Readers . If we regard method 4 faulty to the point of the data being unusable, we should probably note that there. However, considering the value of knowing about longer-term trends in the app's usage, it might be worth taking a closer look at the difference between 1. and 4.

Focusing just on the device detection part for now:

  1. uses access_method = 'mobile app'
  1. uses user_agent LIKE('WikipediaApp%')

From a quick glance at the webrequest refinery code, access_method = 'mobile app' is based on this following regex: "Wikipedia(App|/5.0.)"

The difference here (for Wikipedia/5.0) is just a special-casing fix for an old version of the iOS app from 2016 that should not affect current numbers (cf. T131824#2185551 and T131824#2192267 - although it should be noted that despite T131824#2192267 there seem to be a small number of current non-pageview requests that still use e.g. Wikipedia/5.8.2 ... instead of WikipediaApp/5.8.2 ..., perhaps @Mhurd knows more about this).

It is irrelevant only if one doesn't care about the definition of "unique app user" ;) I think the industry standard here is rather to only count active users in the sense of people intentionally using the app, as opposed to it running in the background while they use their device for something else, or merely being installed on the device. (The last one would be more like Google's "installs on active devices".) Ultimately that seems like a question for the product owners, @Charlotte and @JMinor.

That said, I think there might be a pragmatic argument for 1. and 4. over 3. (i.e. counting non-pageview requests as activity too) if the difference isn't too large - has anyone tried to calculate it (beyond the related observations for iOS at T193917#4237597) ? To be sure, I think @chelsyx is right to call out the effect of these background requests for reading lists and feed, both features that didn't exist when the currently used query (4.) was first implemented back in 2015. I wonder if they have the same impact on both platforms - regarding Android, perhaps @Dbrant knows more about how often background requests are likely to happen on days where the user hasn't opened the app.

@Tbayer I have to point out that none of the solution we have currently is perfect -- we are merely looking for something that's workable and relatively reasonable:

  • For option 2, which is the count of users who intentionally used the app, the closest we can get is to count the unique ID in MobileWikiAppDailyStats. But for Android, the sampling rate is 1:100. Also remember this event is sent every 24h based on local timezone, not our system timezone (UTC), so the real count we get would be off if comparing to webrequest table.
  • For option 3, which is the count of any devices that has at least one pageview, I don't understand why we want to exclude those who open the app without reading an article. It doesn't seem right to me...
  • For option 4, the query doesn't make sense any more given that app could fetch info in the background, plus those parsing and string comparisons are computationally expensive. The only value I can see is to compare with historical data.
  • Also need to mention that for any of the 4 options, we are counting opt-in users only. Given that the opt-in rate on iOS seems very small, the unique device count is compromised anyway...
  • As to Google's "installs on active devices", which is the count of device having the app installed, iTunes doesn't report this number, but there is a way to estimate it: https://www.linkedin.com/pulse/how-estimate-active-installs-ios-sharif-aboulnaga/

Therefore, I think the first option is the most reasonable one here.

Focusing just on the device detection part for now:

  1. uses access_method = 'mobile app'
  1. uses user_agent LIKE('WikipediaApp%')

From a quick glance at the webrequest refinery code, access_method = 'mobile app' is based on this following regex: "Wikipedia(App|/5.0.)"

The difference here (for Wikipedia/5.0) is just a special-casing fix for an old version of the iOS app from 2016 that should not affect current numbers (cf. T131824#2185551 and T131824#2192267 - although it should be noted that despite T131824#2192267 there seem to be a small number of current non-pageview requests that still use e.g. Wikipedia/5.8.2 ... instead of WikipediaApp/5.8.2 ..., perhaps @Mhurd knows more about this).

We can fix the regex in webrequest refinery code so that access_method = 'mobile app' can detect all the apps correctly.


@Tbayer @mpopov @Nuria What do you think about this solution:

  • For the new tables wmf.mobile_apps_uniques_daily_by_country and wmf.mobile_apps_uniques_monthly_by_country, which are the ones this ticket is aiming to create, let's use option 1. If anyone is interested in historical uniques by country back to 5/1/2017, or want to see the counts under definition 4 in order to compare with wmf.mobile_apps_uniques_daily and wmf.mobile_apps_uniques_monthly, they can check chelsyx.mobile_apps_uniques_daily_by_country and chelsyx.mobile_apps_uniques_monthly_by_country -- I copy the data from @Tbayer 's personal table and schedule an oozie job running under my account.
  • For the old tables wmf.mobile_apps_uniques_daily and wmf.mobile_apps_uniques_monthly, which are the ones using query in option 4, let's keep them as they are for now. I will create chelsyx.mobile_apps_uniques_daily and chelsyx.mobile_apps_uniques_monthly using query in option 1, and we can compare the numbers with old queries over time. If the differences are not huge, maybe we can change the query of wmf.mobile_apps_uniques_daily and wmf.mobile_apps_uniques_monthly eventually.
  • Of course, we need to document the definition differences on https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/mobile_apps_uniques
Nuria added a comment.Aug 23 2018, 8:44 PM

For the old tables wmf.mobile_apps_uniques_daily and wmf.mobile_apps_uniques_monthly, which are the ones using query in option 4, let's keep them as they are for now. I will create chelsyx.mobile_apps_uniques_daily and >chelsyx.mobile_apps_uniques_monthly using query in option 1, and we can compare the numbers with old queries over time. If the differences are not huge, maybe we can change the query of wmf.mobile_apps_uniques_daily and >wmf.mobile_apps_uniques_monthly eventually.

I think we can compare counts for the past couple of months and use the number to quantify the changes without having to do it long term on your own tables. maintaining your own tables seems quite a bit of work that I am not sure is necessary. We correct metrics and queries all the time and it is quite understandable that a query from 2015 needs updating.

Given that the opt-in rate on iOS seems very small, the unique device count is compromised anyway...

I have made a ticket on how to calculate these numbers, precisely, for all users w/o having to use appinstallids. It requires app instrumentation but code should be real simple, we do not have to do this now but we can bring this idea up to the attention of the mobile PMS, I think more precise numbers are well worth the instrumentation cost. https://phabricator.wikimedia.org/T202664

A comment on something that struck me today: for unique-devices computation using last-access-cookie, we store tables per country and we assume additivity of the metric accross countries, i.e. we do as if unique-devices don't change country. If we do the same asumption for mobile-uniques-devices, it means we can go with 2 jobs only instead of 4: we'd keep the old global values for historical purposes, and store only per-country daily and monthly moving forward. This would mean less jobs, less tables to maintain, less redundancy etc.

I ran the two queries of option 1 ("new") and 4 ("old") in T186828#4505784 for both iOS and Android (see https://people.wikimedia.org/~chelsyx/Apps%20Unique%20Devices%20-%20Query%20Comparison.html for more details):

Apparently, the discrepancies are larger on Android than on iOS. On average, the new query is counting 39949 more devices on Android, which is about 4.2% more than the old query; while on iOS, the new query is counting 296 more devices which is about 0.21% more than the old query.

Suggestions:

1, @mpopov and I talk to @Charlotte and @JMinor about the proper definition of "mobile apps unique devices" -- with or without background request, with or without pageviews
2, After we have an agreement on 1, @mpopov and I review the current query and modify it
3, Since the query is only counting the opt-in users on iOS, which is a small proportion, I'm going to chat with the iOS team about the proposal in T202664.

@Tbayer @mpopov @Nuria Thoughts?

mpopov changed the task status from Open to Stalled.Aug 28 2018, 6:52 PM

We'll discuss with Josh and Charlotte (once she's back from vacation)

Dbrant added a comment.Sep 7 2018, 5:25 PM

Just to answer @Tbayer's question about background requests made from the app, here is a rough breakdown of the circumstances in which this can happen:

  • In the general case, the app does not make network requests unless it's in the foreground. However:
  • If the user has a particularly large reading list that has just been synced, it will start a background task that downloads all the articles in the list for offline reading. In this case, the background task may continue for a long time after the app is closed.
  • We will soon release Echo notifications in the app, where the app will poll the Echo API at regular intervals in the background, even if the app is not open (and doesn't need to be opened again).

Hello all! @JMinor @Charlotte @mpopov and I met 2 weeks ago to discuss about the definition of mobile app uniques: (sorry I should have posted this earlier :( )

1, We agree that we should only count users who have the app in the foreground as active users -- but pageview is not necessary.
2, Since both app would send background requests (although the frequencies are different), we'd like to tell them apart from foreground requests by adding a tag to the http header (see T204783). This tag will be very helpful not only for counting the uniques but also for debugging.
3, We agree that T202664 is a good solution for counting the uniques. After 2 is done, we will discuss with the Android and iOS app teams separately to prioritize.

As to the task of this ticket -- productionize per-country daily & monthly active app user stats, I suggest we leave this ticket as "Stalled" and come back to modify the query when T186828#4596238 is done. As pointed out by @JAllemandou in T186828#4530420, we only need to store per-country daily and monthly moving forward.

We also need to decide whether to keep the defective mobile_apps_uniques_daily and mobile_apps_uniques_monthly historical comparison by then. For now, if anyone is interested in the per-country daily and monthly that are consistent with mobile_apps_uniques_daily and mobile_apps_uniques_monthly, feel free to check out chelsyx.mobile_apps_uniques_daily_by_country and chelsyx.mobile_apps_uniques_monthly_by_country.

@mpopov to meet with Josh & Charlotte and figure out engineering steps to unblock this task - part of larger task to figure out better ways to track unique mobile app devices

Nuria added a comment.Feb 14 2019, 8:08 PM

If we can track them using a variation of method described here: https://phabricator.wikimedia.org/T202664 our life (i think) will be easier and our numbers more precise

The work to identify unique devices is captured under Better Use of Data work for FY 2019-20: T225237

kzimmerman removed chelsyx as the assignee of this task.Jul 9 2019, 9:23 PM

Chelsy had an oozie job updating 2 tables daily/monthly: mobile_apps-uniques-by_country-daily-coord and mobile_apps-uniques-by_country-monthly-coord (the monthly job is suspended currently). We plan to archive this job when Chelsy leaves (since it will be replaced by Better Use of Data work in T225237).

@mpopov any objections to archiving the oozie job and the data?