
Bug in user sampling for MobileWikiAppSessions
Closed, ResolvedPublic

Description

Observation

In the mobile apps uniques dataset, we have 1,080,318 unique apps on 2017-12-03:

SELECT *
FROM wmf.mobile_apps_uniques_daily
WHERE year=2017 AND month=12 AND day=3
AND platform='Android';

Using webrequest data, we counted 1,101,078 unique apps on 2017-12-03, close to the number above:

SELECT COUNT(DISTINCT IF(x_analytics_map['wmfuuid'] IS NOT NULL,
  x_analytics_map['wmfuuid'],
  PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'appInstallID')))
FROM wmf.webrequest
WHERE year=2017 AND month=12 AND day=3
AND http_status IN ('200', '304')
AND user_agent_map['os_family'] = 'Android'
AND access_method = 'mobile app'
AND IF(x_analytics_map['wmfuuid'] IS NOT NULL,
  x_analytics_map['wmfuuid'],
  PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'appInstallID')) IS NOT NULL;

But on the same day in MobileWikiAppSessions, there are 56,061 production apps and 51,708 beta apps. Since production is sampled at 1:100 and beta at 1:1, extrapolating gives production*100 + beta = 56,061*100 + 51,708 = 5,657,808. That is about 5 times the webrequest count, which seems far larger than expected.

SELECT
IF(useragent.wmf_app_version LIKE '%-r-%', 'prod', 'beta') AS app,
COUNT(DISTINCT event.appInstallID) AS n_users
FROM event.mobilewikiappsessions
WHERE useragent.os_family = 'Android'
AND year=2017 AND month=12 AND day=3
GROUP BY IF(useragent.wmf_app_version LIKE '%-r-%', 'prod', 'beta');
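The extrapolation above can be checked with a few lines of Python (figures copied from this task; the scale factors are the nominal sampling rates):

```python
# Figures copied from the task description (2017-12-03).
el_prod_uniques = 56061    # production installs seen in EL, sampled 1:100
el_beta_uniques = 51708    # beta installs seen in EL, sampled 1:1
webrequest_uniques = 1101078

# Scale each EL count by the inverse of its nominal sampling rate.
extrapolated = el_prod_uniques * 100 + el_beta_uniques * 1
ratio = extrapolated / webrequest_uniques

print(extrapolated)      # 5657808
print(round(ratio, 1))   # 5.1, i.e. roughly 5x the webrequest count
```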

Problem

There may be a bug in the sampling process which results in sampling more data than expected. This bug may affect other schemas that use the same sampling method as well.

Event Timeline

Restricted Application added a subscriber: Aklapper. Feb 7 2018, 4:47 AM
Charlotte triaged this task as Medium priority. Feb 7 2018, 5:25 PM
Charlotte moved this task from Needs Triage to Tracking on the Wikipedia-Android-App-Backlog board.
Nuria added a comment (edited). Feb 7 2018, 7:50 PM

One possible cross-check we can do here is to make sure our data represents (percentage-wise) our Android users, for metrics that are available in all versions such as session length. So if we look at Android data per OS and say we have 20% of users on Android 5, our data collection on the EL end for sessions (regardless of sampling) should also have 20% of users on Android 5. If sampling however is heavily biased towards some app versions/OS, we might have found our issue.
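A minimal sketch of this cross-check in Python, using made-up counts purely for illustration (the real comparison is done against webrequest and EventLogging tables):

```python
# Made-up counts (not real data). If EL sampling is unbiased, each OS should
# hold a similar share of users in both data sources.
webrequest = {"Android_5": 200000, "Android_6": 500000, "Android_7": 300000}
eventlogging = {"Android_5": 2100, "Android_6": 4900, "Android_7": 3000}

def proportions(counts):
    """Convert raw per-OS counts into shares of the total."""
    total = sum(counts.values())
    return {os: n / total for os, n in counts.items()}

wr, el = proportions(webrequest), proportions(eventlogging)
for os in wr:
    # A large gap for any OS would point at biased sampling.
    print(os, round(wr[os], 3), round(el[os], 3), round(el[os] - wr[os], 3))
```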

chelsyx added a comment (edited). Feb 7 2018, 11:58 PM

One possible cross-check we can do here is to make sure our data represents (percentage-wise) our Android users for metrics such as session length. So if we look at Android data per OS and say we have 20% of users on Android 5, our data collection on the EL end for sessions (regardless of sampling) should also have 20% of users on Android 5. If sampling however is heavily biased towards some app versions/OS, we might have found our issue.

Thanks @Nuria !

On 2018-02-01, in MobileWikiAppSessions, there are 48,729 production apps and 45,433 beta apps. In webrequest, there are 900,565 production apps and 50,913 beta apps. The beta sampling rate is close to 1:1, but the production sampling rate is around 1:18.5.

I broke down the counts by version type (beta vs production) and OS. The proportions below don't seem very different. (Since there are too many OS versions, I only show the top 15 here.)

| Version type | OS | n_users (webrequest) | proportion (webrequest) | n_users (EL) | proportion (EL) |
| --- | --- | --- | --- | --- | --- |
| beta | Android_7_0 | 14375 | 0.2823444 | 14690 | 0.3233333 |
| beta | Android_6_0 | 9540 | 0.1873785 | 10714 | 0.2358198 |
| beta | Android_5_1 | 8712 | 0.1711154 | 4774 | 0.1050778 |
| beta | Android_4_4 | 5605 | 0.1100898 | 3561 | 0.0783792 |
| beta | Android_7_1 | 3938 | 0.0773476 | 4183 | 0.0920696 |
| beta | Android_5_0 | 1858 | 0.0364936 | 2203 | 0.0484890 |
| beta | Android_8_0 | 1785 | 0.0350598 | 1997 | 0.0439548 |
| beta | Android_8_1 | 1728 | 0.0339403 | 2036 | 0.0448132 |
| beta | Android_4_2 | 1203 | 0.0236285 | 590 | 0.0129862 |
| beta | Android_4_0 | 696 | 0.0136704 | 39 | 0.0008584 |
| beta | Android_2_3 | 546 | 0.0107242 | 0 | 0 |
| beta | Android_4_1 | 459 | 0.0090154 | 378 | 0.0083199 |
| beta | Android_4_3 | 443 | 0.0087012 | 256 | 0.0056347 |
| beta | Android_6_1 | 10 | 0.0001964 | 3 | 0.0000660 |
| beta | Android_3_2 | 5 | 0.0000982 | 0 | 0 |

For production, we saw large discrepancies in Android 7.0 and 6.0 (the table only shows the top 15):

| Version type | OS | n_users (webrequest) | proportion (webrequest) | n_users (EL) | proportion (EL) |
| --- | --- | --- | --- | --- | --- |
| prod | Android_7_0 | 401414 | 0.4457357 | 10604 | 0.2176117 |
| prod | Android_6_0 | 184850 | 0.2052600 | 17347 | 0.3559892 |
| prod | Android_7_1 | 85523 | 0.0949659 | 2467 | 0.0506269 |
| prod | Android_5_1 | 57810 | 0.0641930 | 4662 | 0.0956720 |
| prod | Android_4_4 | 48725 | 0.0541049 | 5522 | 0.1133206 |
| prod | Android_8_0 | 36494 | 0.0405234 | 947 | 0.0194340 |
| prod | Android_5_0 | 34771 | 0.0386102 | 4378 | 0.0898438 |
| prod | Android_8_1 | 27716 | 0.0307762 | 781 | 0.0160274 |
| prod | Android_4_2 | 9720 | 0.0107932 | 954 | 0.0195777 |
| prod | Android_4_1 | 5819 | 0.0064615 | 512 | 0.0105071 |
| prod | Android_4_3 | 4157 | 0.0046160 | 513 | 0.0105276 |
| prod | Android_2_3 | 1989 | 0.0022086 | 0 | 0 |
| prod | Android_4_0 | 1453 | 0.0016134 | 36 | 0.0007388 |
| prod | Android_3_2 | 51 | 0.0000566 | 0 | 0 |
| prod | Android_-_- | 38 | 0.0000422 | 0 | 0 |

Then I broke down the counts by app version. The tables only show the top 15 versions. We saw large discrepancies among beta versions, but not among production versions:

| Version type | App version | n_users (webrequest) | proportion (webrequest) | n_users (EL) | proportion (EL) |
| --- | --- | --- | --- | --- | --- |
| beta | 2.7.224-beta-2018-01-06 | 23632 | 0.4640003 | 35339 | 0.7776555 |
| beta | 25.0.25-alpha-2018-01-18 | 4733 | 0.0929296 | 0 | 0 |
| beta | 2.7.222-amazon-2017-12-15 | 3644 | 0.0715478 | 327 | 0.0071958 |
| beta | 2.0-dcg-2014-11-21 | 1617 | 0.0317488 | 0 | 0 |
| beta | 2.6.203-beta-2017-08-28 | 1378 | 0.0270562 | 1355 | 0.0298176 |
| beta | 2.1.141-dtac-2016-02-10 | 1250 | 0.0245430 | 127 | 0.0027947 |
| beta | 2.0-releasesprod-2015-03-23 | 1241 | 0.0243663 | 0 | 0 |
| beta | 2.6.206-beta-2017-10-30 | 1099 | 0.0215782 | 2175 | 0.0478622 |
| beta | 2.7.222-beta-2017-12-15 | 949 | 0.0186331 | 1598 | 0.0351649 |
| beta | 2.5.194-alpha-2017-05-30 | 742 | 0.0145687 | 0 | 0 |
| beta | 2.0-beta-2014-12-19 | 548 | 0.0107597 | 0 | 0 |
| beta | 2.6.203-amazon-2017-08-28 | 486 | 0.0095423 | 49 | 0.0010783 |
| beta | 2.6.206-amazon-2017-10-30 | 451 | 0.0088551 | 49 | 0.0010783 |
| beta | 2.0-beta-2014-11-03 | 390 | 0.0076574 | 0 | 0 |
| beta | 2.6.198-beta-2017-06-09 | 383 | 0.0075200 | 797 | 0.0175385 |
| Version type | App version | n_users (webrequest) | proportion (webrequest) | n_users (EL) | proportion (EL) |
| --- | --- | --- | --- | --- | --- |
| prod | 2.7.224-r-2018-01-06 | 738490 | 0.8197420 | 38864 | 0.7973738 |
| prod | 2.6.206-r-2017-10-30 | 28771 | 0.0319365 | 2626 | 0.0538777 |
| prod | 2.6.203-r-2017-08-28 | 28548 | 0.0316890 | 1800 | 0.0369307 |
| prod | 2.7.222-r-2017-12-15 | 26574 | 0.0294978 | 1699 | 0.0348584 |
| prod | 2.6.198-r-2017-06-09 | 13880 | 0.0154071 | 882 | 0.0180960 |
| prod | 2.1.141-r-2016-02-10 | 8255 | 0.0091633 | 784 | 0.0160854 |
| prod | 2.5.195-r-2017-04-21 | 5789 | 0.0064259 | 355 | 0.0072835 |
| prod | 2.0-r-2014-08-13 | 5655 | 0.0062772 | 0 | 0 |
| prod | 2.7.221-r-2017-12-08 | 5384 | 0.0059764 | 355 | 0.0072835 |
| prod | 2.4.160-r-2016-10-14 | 4461 | 0.0049518 | 184 | 0.0037751 |
| prod | 2.5.191-r-2017-03-31 | 3083 | 0.0034222 | 184 | 0.0037751 |
| prod | 2.1.144-r-2016-05-09 | 2876 | 0.0031924 | 219 | 0.0044932 |
| prod | 2.5.190-r-2017-02-24 | 2685 | 0.0029804 | 165 | 0.0033853 |
| prod | 2.0-r-2015-01-15 | 2525 | 0.0028028 | 0 | 0 |
| prod | 2.4.184-r-2016-12-14 | 2133 | 0.0023677 | 109 | 0.0022364 |

Query:

SELECT 
IF(user_agent_map['wmf_app_version'] LIKE '%-r-%', 'prod', 'beta') AS app_version,
CONCAT(user_agent_map['os_family'], '_', user_agent_map['os_major'], '_', user_agent_map['os_minor']) AS os,
COUNT(DISTINCT IF(x_analytics_map['wmfuuid'] IS NOT NULL, x_analytics_map['wmfuuid'], PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'appInstallID'))) AS n_users
FROM wmf.webrequest
WHERE year=2018 AND month=2 AND day=1
AND http_status IN('200', '304')
AND user_agent_map['os_family'] = 'Android'
AND access_method = 'mobile app'
AND IF(x_analytics_map['wmfuuid'] IS NOT NULL, x_analytics_map['wmfuuid'], PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
GROUP BY IF(user_agent_map['wmf_app_version'] LIKE '%-r-%', 'prod', 'beta'), user_agent_map['os_family'], user_agent_map['os_major'], user_agent_map['os_minor'];

SELECT
IF(useragent.wmf_app_version LIKE '%-r-%', 'prod', 'beta') AS app_version,
CONCAT(useragent.os_family, '_', useragent.os_major, '_', useragent.os_minor) AS os,
COUNT(DISTINCT event.appInstallID) AS n_users
FROM event.mobilewikiappsessions
WHERE useragent.os_family = 'Android'
AND year=2018 AND month=2 AND day=1
GROUP BY IF(useragent.wmf_app_version LIKE '%-r-%', 'prod', 'beta'), useragent.os_family, useragent.os_major, useragent.os_minor;

@Dbrant Does this breakdown ring any bell?

chelsyx updated the task description. Feb 8 2018, 12:13 AM
chelsyx added a comment (edited). Feb 8 2018, 12:30 AM

@Charlotte, according to the breakdown tables above, some Android OS versions and app versions are over-represented and some are under-represented in our EventLogging tables -- MobileWikiAppSessions and possibly others as well. This means the data we collected is not representative of the whole population of our users.

I suggest we set the priority of this ticket to High and move it to the bug backlog, since this bug and T186768 may make the analysis in T184098 (and its sub-tickets) unreliable.

Dbrant claimed this task. Feb 8 2018, 10:25 PM
Nuria added a comment (edited). Feb 8 2018, 11:37 PM

Loads of work went into this, @chelsyx
Nice work.

If time intervals are the same on webrequest and eventlogging, this data (2018-02-01) indeed does not match what we would expect to see for features like sessions that are going to be sampled across the user base. Things that come to mind (thinking only about prod versions):

The 1/100 sampling does not seem to apply in any case. The ratio of EL(uniques)/webrequest(uniques), 56,061/1,101,078 ~ 5%, is constant across OS and app versions if you total your 15 records. So we are sampling about 5 times what we think we are, right?

While not the best situation, this might not be a problem if, like you said at the beginning, data is sampled evenly across the user base; but that does not seem to be the case for OS.

Proportions seem OK when you look at app versions by themselves: each version has a similar proportion of data in EL and webrequest for the day in which you ran your calculations, so the percentage of uniques per app version matches across the two data sources. I just do not understand how, with equal numbers per version, we can get such different numbers per OS.

Dbrant added a comment. Feb 9 2018, 7:32 PM

Well, this is going to be the biggest facepalm in history:

Upon closer scrutiny of our code: while the intention was for the sampling to apply on a per-user basis, the actual effect turns out to be different. In fact, the sampling effectively applies to each instance of launching the app, meaning that a single user might be in the sampling bucket during one usage of the app, but might no longer be in the bucket the next time the app is launched.

I'm guessing this would invalidate any attempts at measuring unique users from this particular schema. Sorry for all the confusion! :(

We'll be sure to correct this inconsistency in our next update.
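The difference between the two behaviors can be sketched in Python. This is a hypothetical model, not the actual app code; hashing the install ID is one common way to get stable per-user buckets, and the real Gerrit patch may implement it differently:

```python
import hashlib
import random

SAMPLING_RATE = 100  # 1:100, as for the production schema

def sampled_per_launch():
    # Buggy behavior: a fresh random draw on every app launch, so the same
    # install drifts in and out of the sample between launches.
    return random.randrange(SAMPLING_RATE) == 0

def sampled_per_user(app_install_id):
    # Intended behavior: a deterministic bucket derived from the install ID,
    # so a given install is either always sampled or never sampled.
    digest = hashlib.md5(app_install_id.encode()).hexdigest()
    return int(digest, 16) % SAMPLING_RATE == 0

# The same install always gets the same answer under per-user sampling.
uid = "0f3a2b1c-example-install-id"  # made-up ID
assert all(sampled_per_user(uid) == sampled_per_user(uid) for _ in range(10))
```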

Thank you very much @Dbrant !

Given that the sampling applies to each instance of launching the app: suppose a user launches the app and gets sampled to be tracked by EL, then closes the app for 20 minutes (less than the 30 minutes of inactivity in our definition of a session), and then launches the app again but doesn't get sampled this time. In this situation, only the interactions with the app before the 20-minute break get logged by MobileWikiAppSessions, is that correct?

Dbrant added a comment. Feb 9 2018, 9:22 PM

In the case you describe, there wouldn't be anything logged at all, since the session is still continuing -- not enough time has elapsed for one session to expire and the next one to begin. (this is independent of actually logging the session to EL)

If the user launches the app, and the current instance of the app happens to be in the sampling bucket, and the previous session has expired, then the previous session will be logged.

Change 409450 had a related patch set uploaded (by Dbrant; owner: Dbrant):
[apps/android/wikipedia@master] WIP: Sample eventlogging schemas based on appInstallID, not current instance.

https://gerrit.wikimedia.org/r/409450

Nuria added a comment. Feb 9 2018, 10:30 PM

In fact, the sampling instead effectively applies to each instance of launching the app

Ok, at least we know now!
This means that users who use the app most are sampled more frequently, which means that the data is not random on the user/phone dimension, and that we cannot use it to determine the number of unique users (as you mentioned).
Now, sampling per session is not too bad; this is how data is sampled on our desktop site for the most part.

This still does not explain the low number of android 7 so we might have another problem besides this one.

Given these findings, I don't think the OS breakdown data can be trusted at this point, either. The way the current (incorrect) code works, the app can be placed into a sampling bucket every time it's started. However, in the world of Android, "starting" the app is not a predictable concept. Even if all of the app's windows are closed, the app's instance can still be kept running in the background by the system.

And of course, some versions of Android might be more greedy about garbage-collecting inactive apps, whereas others might allow app instances to persist for a long time. It's entirely possible that Android 7.0 is less greedy about terminating inactive apps, which means that there's less churn through re-bucketing of app instances, which might lead to the reduced EL numbers that we see.

I think we should wait and re-run the OS numbers after we deploy the update that fixes the per-user nature of the session schema.

Nuria added a comment. Feb 12 2018, 7:12 PM

I see, this means that we are going to have to throw away a couple years of data unless we can resample it somehow (random sampling from what we have? cc @chelsey)

Going forward let's please make a practice for developers to do basic vetting of metrics. Example: notice that in this case to see the sampling oddities it was enough to add the uniques in both sources (a simple addition, no stats). Let's not fire and forget metrics but rather follow through a bit to make sure things add up.

Going forward let's please make a practice for developers to do basic vetting of metrics. Example: notice that in this case to see the sampling oddities it was enough to add the uniques in both sources (a simple addition, no stats). Let's not fire and forget metrics but rather follow through a bit to make sure things add up.

I agree in general (people should double check after deployment as an essential practice) but do want to bring up that the modern app teams do do that in my experience. The buggy code is from almost 3 years ago (from https://phabricator.wikimedia.org/rAPAW8c39a9a96874ba537713cd8325ef5c9af2bd915f I think) and in that time people have learned and improved, so it's kind of unfair to @Dbrant et al. to point at this discovery of a nearly 3 year old bug as if it's a recent mistake and we should be better going forward.

Nuria added a comment. Feb 12 2018, 7:40 PM

No finger pointing on my end, at all. I am focused on looking for solutions and better process going forward, not guilty parties. If vetting is happening already, sounds excellent.

Change 409450 merged by jenkins-bot:
[apps/android/wikipedia@master] Sample eventlogging schemas based on appInstallID, not current instance.

https://gerrit.wikimedia.org/r/409450

I see, this means that we are going to have to throw away a couple years of data unless we can resample it somehow (random sampling from what we have? cc @chelsey)

Hi @Nuria, sorry I'm late. Random re-sampling may or may not be helpful, depending on whether the metric of interest is measured at the user level. For example, if we want to compute daily average pageviews per user using MobileWikiAppSessions, we can't get that from the old data, because it's very unlikely that every session of any particular user in a day was selected to be logged. However, if I want to compute how much of the page users scroll through on average using MobileWikiAppPageScroll, which uses the same sampling method as MobileWikiAppSessions, I can re-sample from the current data to avoid certain OS versions being over-represented in the dataset.
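A minimal sketch of the stratified re-sampling idea, with made-up events and target shares (not the real data): down-sample each OS stratum of the logged events so the per-OS shares match an external baseline such as webrequest.

```python
import random

random.seed(42)

# Made-up logged events, one record per install; Android_7_0 is
# over-represented relative to a hypothetical webrequest baseline.
events = [(f"u{i}", "Android_7_0") for i in range(600)] + \
         [(f"u{i}", "Android_6_0") for i in range(600, 1000)]

# Target per-OS shares, e.g. taken from webrequest.
target = {"Android_7_0": 0.4, "Android_6_0": 0.6}

def stratified_resample(events, target, size):
    """Down-sample each OS stratum so its share matches the target."""
    by_os = {}
    for ev in events:
        by_os.setdefault(ev[1], []).append(ev)
    sample = []
    for os_name, share in target.items():
        k = min(round(size * share), len(by_os.get(os_name, [])))
        sample.extend(random.sample(by_os[os_name], k))
    return sample

resampled = stratified_resample(events, target, size=500)
shares = {os_name: sum(e[1] == os_name for e in resampled) / len(resampled)
          for os_name in target}
# shares now matches the target proportions rather than the logged ones
```

Note the trade-off mentioned above: this fixes representativeness for event-level metrics, but cannot recover user-level metrics that need every session of a user.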

And of course, some versions of Android might be more greedy about garbage-collecting inactive apps, whereas others might allow app instances to persist for a long time. It's entirely possible that Android 7.0 is less greedy about terminating inactive apps, which means that there's less churn through re-bucketing of app instances, which might lead to the reduced EL numbers that we see.

@Dbrant Does this mean that aggregating session length across OS is not meaningful (e.g. daily average session length), because session length is correlated with OS (e.g. Android 7.0 users are more likely to have longer sessions)? Maybe we should compare session lengths within OS groups?

@chelsyx Agreed, +1

Does this mean that aggregating session length across OS is not meaningful (e.g. daily average session length) because session length is correlated with OS (e.g. Android 7.0 are more likely to have a longer session)

Also, a better Android version means a newer device and likely a better connection, and better connections are always associated with longer sessions on mobile (not just for Wikipedia but in general).

@Dbrant: which release of the app will have the fix?

Restricted Application added a project: Product-Analytics. Apr 19 2018, 12:21 AM
mpopov closed this task as Resolved. Apr 23 2018, 10:59 PM

Thanks, everyone!