
Bug in user sampling for MobileWikiAppSessions
Closed, ResolvedPublic

Description

Observation

In the mobile apps uniques dataset, we have 1,080,318 unique apps on 2017-12-03:

SELECT *
FROM wmf.mobile_apps_uniques_daily
WHERE year=2017 AND month=12 AND day=3
AND platform='Android';

Using webrequest data, we counted 1,101,078 unique apps on 2017-12-03, close to the number above:

SELECT COUNT(DISTINCT IF(x_analytics_map['wmfuuid'] IS NOT NULL,
  x_analytics_map['wmfuuid'],
  PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'appInstallID')))
FROM wmf.webrequest
WHERE year=2017 AND month=12 AND day=3
AND http_status IN ('200', '304')
AND user_agent_map['os_family'] = 'Android'
AND access_method = 'mobile app'
AND IF(x_analytics_map['wmfuuid'] IS NOT NULL,
  x_analytics_map['wmfuuid'],
  PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'appInstallID')) IS NOT NULL;

But on the same day in MobileWikiAppSessions, there are 56,061 production apps and 51,708 beta apps. Since production is sampled at 1:100 and beta at 1:1, extrapolating gives production*100 + beta = 56,061*100 + 51,708 = 5,657,808. That is about 5 times the webrequest count, which seems far larger than expected.

SELECT
IF(useragent.wmf_app_version LIKE '%-r-%', 'prod', 'beta') AS app,
COUNT(DISTINCT event.appInstallID) AS n_users
FROM event.mobilewikiappsessions
WHERE useragent.os_family = 'Android'
AND year=2017 AND month=12 AND day=3
GROUP BY IF(useragent.wmf_app_version LIKE '%-r-%', 'prod', 'beta');
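The extrapolation above can be checked with a few lines of Python (figures copied from this task; the scale factors are the nominal sampling rates):

```python
# Figures copied from the task description (2017-12-03).
el_prod_uniques = 56061    # production installs seen in EL, sampled 1:100
el_beta_uniques = 51708    # beta installs seen in EL, sampled 1:1
webrequest_uniques = 1101078

# Scale each EL count by the inverse of its nominal sampling rate.
extrapolated = el_prod_uniques * 100 + el_beta_uniques * 1
ratio = extrapolated / webrequest_uniques

print(extrapolated)      # 5657808
print(round(ratio, 1))   # 5.1, i.e. roughly 5x the webrequest count
```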

Problem

There may be a bug in the sampling process which results in sampling more data than expected. This bug may affect other schemas that use the same sampling method as well.

Event Timeline

Restricted Application added a subscriber: Aklapper. Feb 7 2018, 4:47 AM
Charlotte triaged this task as Medium priority. Feb 7 2018, 5:25 PM
Charlotte moved this task from Needs Triage to Tracking on the Wikipedia-Android-App-Backlog board.
Nuria added a comment (edited). Feb 7 2018, 7:50 PM

One possible cross-check we can do here is to make sure our data represents (percentage-wise) our Android users, for metrics that are available in all versions such as session length. So if we look at Android data per OS and say we have 20% of users on Android 5, our data collection on the EL end for sessions (regardless of sampling) should also have 20% of users on Android 5. If sampling however is heavily biased towards some app versions/OS, we might have found our issue.
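A minimal sketch of this cross-check in Python, using made-up counts purely for illustration (the real comparison is done against webrequest and EventLogging tables):

```python
# Made-up counts (not real data). If EL sampling is unbiased, each OS should
# hold a similar share of users in both data sources.
webrequest = {"Android_5": 200000, "Android_6": 500000, "Android_7": 300000}
eventlogging = {"Android_5": 2100, "Android_6": 4900, "Android_7": 3000}

def proportions(counts):
    """Convert raw per-OS counts into shares of the total."""
    total = sum(counts.values())
    return {os: n / total for os, n in counts.items()}

wr, el = proportions(webrequest), proportions(eventlogging)
for os in wr:
    # A large gap for any OS would point at biased sampling.
    print(os, round(wr[os], 3), round(el[os], 3), round(el[os] - wr[os], 3))
```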

chelsyx added a comment (edited). Feb 7 2018, 11:58 PM

One possible cross-check we can do here is to make sure our data represents (percentage-wise) our Android users for metrics such as session length. So if we look at Android data per OS and say we have 20% of users on Android 5, our data collection on the EL end for sessions (regardless of sampling) should also have 20% of users on Android 5. If sampling however is heavily biased towards some app versions/OS, we might have found our issue.

Thanks @Nuria !

On 2018-02-01, in MobileWikiAppSessions, there are 48,729 production apps and 45,433 beta apps. In webrequest, there are 900,565 production apps and 50,913 beta apps. The beta sampling rate is close to 1:1, but the production sampling rate is around 1:18.5.

I broke down the counts by version type (beta vs production) and OS. The proportions below don't seem very different. (Since there are too many OS versions, I only show the top 15 here.)

| Version type | OS | n_users (webrequest) | proportion (webrequest) | n_users (EL) | proportion (EL) |
| --- | --- | --- | --- | --- | --- |
| beta | Android_7_0 | 14375 | 0.2823444 | 14690 | 0.3233333 |
| beta | Android_6_0 | 9540 | 0.1873785 | 10714 | 0.2358198 |
| beta | Android_5_1 | 8712 | 0.1711154 | 4774 | 0.1050778 |
| beta | Android_4_4 | 5605 | 0.1100898 | 3561 | 0.0783792 |
| beta | Android_7_1 | 3938 | 0.0773476 | 4183 | 0.0920696 |
| beta | Android_5_0 | 1858 | 0.0364936 | 2203 | 0.0484890 |
| beta | Android_8_0 | 1785 | 0.0350598 | 1997 | 0.0439548 |
| beta | Android_8_1 | 1728 | 0.0339403 | 2036 | 0.0448132 |
| beta | Android_4_2 | 1203 | 0.0236285 | 590 | 0.0129862 |
| beta | Android_4_0 | 696 | 0.0136704 | 39 | 0.0008584 |
| beta | Android_2_3 | 546 | 0.0107242 | 0 | 0 |
| beta | Android_4_1 | 459 | 0.0090154 | 378 | 0.0083199 |
| beta | Android_4_3 | 443 | 0.0087012 | 256 | 0.0056347 |
| beta | Android_6_1 | 10 | 0.0001964 | 3 | 0.0000660 |
| beta | Android_3_2 | 5 | 0.0000982 | 0 | 0 |

For production, we saw large discrepancies in Android 7.0 and 6.0 (the table only shows the top 15):

| Version type | OS | n_users (webrequest) | proportion (webrequest) | n_users (EL) | proportion (EL) |
| --- | --- | --- | --- | --- | --- |
| prod | Android_7_0 | 401414 | 0.4457357 | 10604 | 0.2176117 |
| prod | Android_6_0 | 184850 | 0.2052600 | 17347 | 0.3559892 |
| prod | Android_7_1 | 85523 | 0.0949659 | 2467 | 0.0506269 |
| prod | Android_5_1 | 57810 | 0.0641930 | 4662 | 0.0956720 |
| prod | Android_4_4 | 48725 | 0.0541049 | 5522 | 0.1133206 |
| prod | Android_8_0 | 36494 | 0.0405234 | 947 | 0.0194340 |
| prod | Android_5_0 | 34771 | 0.0386102 | 4378 | 0.0898438 |
| prod | Android_8_1 | 27716 | 0.0307762 | 781 | 0.0160274 |
| prod | Android_4_2 | 9720 | 0.0107932 | 954 | 0.0195777 |
| prod | Android_4_1 | 5819 | 0.0064615 | 512 | 0.0105071 |
| prod | Android_4_3 | 4157 | 0.0046160 | 513 | 0.0105276 |
| prod | Android_2_3 | 1989 | 0.0022086 | 0 | 0 |
| prod | Android_4_0 | 1453 | 0.0016134 | 36 | 0.0007388 |
| prod | Android_3_2 | 51 | 0.0000566 | 0 | 0 |
| prod | Android_-_- | 38 | 0.0000422 | 0 | 0 |

Then I broke down the counts by app version. The tables only show the top 15 versions. We saw large discrepancies among beta versions, but not among production versions:

| Version type | App version | n_users (webrequest) | proportion (webrequest) | n_users (EL) | proportion (EL) |
| --- | --- | --- | --- | --- | --- |
| beta | 2.7.224-beta-2018-01-06 | 23632 | 0.4640003 | 35339 | 0.7776555 |
| beta | 25.0.25-alpha-2018-01-18 | 4733 | 0.0929296 | 0 | 0 |
| beta | 2.7.222-amazon-2017-12-15 | 3644 | 0.0715478 | 327 | 0.0071958 |
| beta | 2.0-dcg-2014-11-21 | 1617 | 0.0317488 | 0 | 0 |
| beta | 2.6.203-beta-2017-08-28 | 1378 | 0.0270562 | 1355 | 0.0298176 |
| beta | 2.1.141-dtac-2016-02-10 | 1250 | 0.0245430 | 127 | 0.0027947 |
| beta | 2.0-releasesprod-2015-03-23 | 1241 | 0.0243663 | 0 | 0 |
| beta | 2.6.206-beta-2017-10-30 | 1099 | 0.0215782 | 2175 | 0.0478622 |
| beta | 2.7.222-beta-2017-12-15 | 949 | 0.0186331 | 1598 | 0.0351649 |
| beta | 2.5.194-alpha-2017-05-30 | 742 | 0.0145687 | 0 | 0 |
| beta | 2.0-beta-2014-12-19 | 548 | 0.0107597 | 0 | 0 |
| beta | 2.6.203-amazon-2017-08-28 | 486 | 0.0095423 | 49 | 0.0010783 |
| beta | 2.6.206-amazon-2017-10-30 | 451 | 0.0088551 | 49 | 0.0010783 |
| beta | 2.0-beta-2014-11-03 | 390 | 0.0076574 | 0 | 0 |
| beta | 2.6.198-beta-2017-06-09 | 383 | 0.0075200 | 797 | 0.0175385 |
| Version type | App version | n_users (webrequest) | proportion (webrequest) | n_users (EL) | proportion (EL) |
| --- | --- | --- | --- | --- | --- |
| prod | 2.7.224-r-2018-01-06 | 738490 | 0.8197420 | 38864 | 0.7973738 |
| prod | 2.6.206-r-2017-10-30 | 28771 | 0.0319365 | 2626 | 0.0538777 |
| prod | 2.6.203-r-2017-08-28 | 28548 | 0.0316890 | 1800 | 0.0369307 |
| prod | 2.7.222-r-2017-12-15 | 26574 | 0.0294978 | 1699 | 0.0348584 |
| prod | 2.6.198-r-2017-06-09 | 13880 | 0.0154071 | 882 | 0.0180960 |
| prod | 2.1.141-r-2016-02-10 | 8255 | 0.0091633 | 784 | 0.0160854 |
| prod | 2.5.195-r-2017-04-21 | 5789 | 0.0064259 | 355 | 0.0072835 |
| prod | 2.0-r-2014-08-13 | 5655 | 0.0062772 | 0 | 0 |
| prod | 2.7.221-r-2017-12-08 | 5384 | 0.0059764 | 355 | 0.0072835 |
| prod | 2.4.160-r-2016-10-14 | 4461 | 0.0049518 | 184 | 0.0037751 |
| prod | 2.5.191-r-2017-03-31 | 3083 | 0.0034222 | 184 | 0.0037751 |
| prod | 2.1.144-r-2016-05-09 | 2876 | 0.0031924 | 219 | 0.0044932 |
| prod | 2.5.190-r-2017-02-24 | 2685 | 0.0029804 | 165 | 0.0033853 |
| prod | 2.0-r-2015-01-15 | 2525 | 0.0028028 | 0 | 0 |
| prod | 2.4.184-r-2016-12-14 | 2133 | 0.0023677 | 109 | 0.0022364 |

Query:

SELECT 
IF(user_agent_map['wmf_app_version'] LIKE '%-r-%', 'prod', 'beta') AS app_version,
CONCAT(user_agent_map['os_family'], '_', user_agent_map['os_major'], '_', user_agent_map['os_minor']) AS os,
COUNT(DISTINCT IF(x_analytics_map['wmfuuid'] IS NOT NULL, x_analytics_map['wmfuuid'], PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'appInstallID'))) AS n_users
FROM wmf.webrequest
WHERE year=2018 AND month=2 AND day=1
AND http_status IN('200', '304')
AND user_agent_map['os_family'] = 'Android'
AND access_method = 'mobile app'
AND IF(x_analytics_map['wmfuuid'] IS NOT NULL, x_analytics_map['wmfuuid'], PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
GROUP BY IF(user_agent_map['wmf_app_version'] LIKE '%-r-%', 'prod', 'beta'), user_agent_map['os_family'], user_agent_map['os_major'], user_agent_map['os_minor'];

SELECT
IF(useragent.wmf_app_version LIKE '%-r-%', 'prod', 'beta') AS app_version,
CONCAT(useragent.os_family, '_', useragent.os_major, '_', useragent.os_minor) AS os,
COUNT(DISTINCT event.appInstallID) AS n_users
FROM event.mobilewikiappsessions
WHERE useragent.os_family = 'Android'
AND year=2018 AND month=2 AND day=1
GROUP BY IF(useragent.wmf_app_version LIKE '%-r-%', 'prod', 'beta'), useragent.os_family, useragent.os_major, useragent.os_minor;

@Dbrant Does this breakdown ring any bell?

chelsyx updated the task description. Feb 8 2018, 12:13 AM
chelsyx added a comment (edited). Feb 8 2018, 12:30 AM

@Charlotte, according to the breakdown tables above, some Android OS versions and app versions are over-represented and some are under-represented in our EventLogging tables -- MobileWikiAppSessions and possibly others as well. This means the data we collected is not representative of the whole population of our users.

I suggest we set the priority of this ticket to High and move it to the bug backlog, since this bug and T186768 may make the analysis in T184098 (and its sub-tickets) unreliable.

Dbrant claimed this task. Feb 8 2018, 10:25 PM
Nuria added a comment (edited). Feb 8 2018, 11:37 PM

Loads of work went into this, @chelsyx
Nice work.

If time intervals are the same on webrequest and eventlogging, this data (2018-02-01) indeed does not match what we would expect to see for features like sessions that are going to be sampled across the user base. Things that come to mind (thinking only about prod versions):

The 1/100 sampling does not seem to apply in any case. The ratio of EL(uniques)/webrequest(uniques), 56,061/1,101,078 ~ 5%, is constant across OS and app versions if you total your 15 records. So we are sampling about 5 times what we think we are, right?

While not the best situation, this might not be a problem if, like you said at the beginning, data is sampled evenly across the user base; but that does not seem to be the case for OS.

Proportions seem OK when you look at app versions by themselves: each version has a similar proportion of data in EL and webrequest for the day in which you ran your calculations, so the percentage of uniques per app version matches across the two data sources. I just do not understand how, with equal numbers per version, we can get such different numbers per OS.

Dbrant added a comment. Feb 9 2018, 7:32 PM

Well, this is going to be the biggest facepalm in history:

Upon closer scrutiny of our code: while the intention was for the sampling to apply on a per-user basis, the actual effect turns out to be different. In fact, the sampling effectively applies to each instance of launching the app, meaning that a single user might be in the sampling bucket during one usage of the app, but might no longer be in the bucket the next time the app is launched.

I'm guessing this would invalidate any attempts at measuring unique users from this particular schema. Sorry for all the confusion! :(

We'll be sure to correct this inconsistency in our next update.
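The difference between the two behaviors can be sketched in Python. This is a hypothetical model, not the actual app code; hashing the install ID is one common way to get stable per-user buckets, and the real Gerrit patch may implement it differently:

```python
import hashlib
import random

SAMPLING_RATE = 100  # 1:100, as for the production schema

def sampled_per_launch():
    # Buggy behavior: a fresh random draw on every app launch, so the same
    # install drifts in and out of the sample between launches.
    return random.randrange(SAMPLING_RATE) == 0

def sampled_per_user(app_install_id):
    # Intended behavior: a deterministic bucket derived from the install ID,
    # so a given install is either always sampled or never sampled.
    digest = hashlib.md5(app_install_id.encode()).hexdigest()
    return int(digest, 16) % SAMPLING_RATE == 0

# The same install always gets the same answer under per-user sampling.
uid = "0f3a2b1c-example-install-id"  # made-up ID
assert all(sampled_per_user(uid) == sampled_per_user(uid) for _ in range(10))
```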

Thank you very much @Dbrant !

Given that the sampling applies to each instance of launching the app: suppose a user launches the app and gets sampled to be tracked by EL, then closes the app for 20 minutes (less than the 30 minutes of inactivity in our definition of a session), and then launches the app again but doesn't get sampled this time. In this situation, only the interactions with the app before the 20-minute break get logged by MobileWikiAppSessions, is that correct?

Dbrant added a comment. Feb 9 2018, 9:22 PM

In the case you describe, there wouldn't be anything logged at all, since the session is still continuing -- not enough time has elapsed for one session to expire and the next one to begin. (this is independent of actually logging the session to EL)

If the user launches the app, and the current instance of the app happens to be in the sampling bucket, and the previous session has expired, then the previous session will be logged.

Change 409450 had a related patch set uploaded (by Dbrant; owner: Dbrant):
[apps/android/wikipedia@master] WIP: Sample eventlogging schemas based on appInstallID, not current instance.

https://gerrit.wikimedia.org/r/409450

Nuria added a comment. Feb 9 2018, 10:30 PM

In fact, the sampling instead effectively applies to each instance of launching the app

Ok, at least we know now!
This means that users who use the app most are sampled more frequently, which means that the data is not random on the user/phone dimension, and that we cannot use it to determine the number of unique users (as you mentioned).
Now, sampling per session is not too bad; this is how data is sampled on our desktop site for the most part.

This still does not explain the low number of android 7 so we might have another problem besides this one.

Given these findings, I don't think the OS breakdown data can be trusted at this point, either. The way the current (incorrect) code works, the app can be placed into a sampling bucket every time it's started. However, in the world of Android, "starting" the app is not a predictable concept. Even if all of the app's windows are closed, the app's instance can still be kept running in the background by the system.

And of course, some versions of Android might be more greedy about garbage-collecting inactive apps, whereas others might allow app instances to persist for a long time. It's entirely possible that Android 7.0 is less greedy about terminating inactive apps, which means that there's less churn through re-bucketing of app instances, which might lead to the reduced EL numbers that we see.

I think we should wait and re-run the OS numbers after we deploy the update that fixes the per-user nature of the session schema.

Nuria added a comment. Feb 12 2018, 7:12 PM

I see, this means that we are going to have to throw away a couple years of data unless we can resample it somehow (random sampling from what we have? cc @chelsey)

Going forward let's please make a practice for developers to do basic vetting of metrics. Example: notice that in this case to see the sampling oddities it was enough to add the uniques in both sources (a simple addition, no stats). Let's not fire and forget metrics but rather follow through a bit to make sure things add up.

Going forward let's please make a practice for developers to do basic vetting of metrics. Example: notice that in this case to see the sampling oddities it was enough to add the uniques in both sources (a simple addition, no stats). Let's not fire and forget metrics but rather follow through a bit to make sure things add up.

I agree in general (people should double check after deployment as an essential practice) but do want to bring up that the modern app teams do do that in my experience. The buggy code is from almost 3 years ago (from https://phabricator.wikimedia.org/rAPAW8c39a9a96874ba537713cd8325ef5c9af2bd915f I think) and in that time people have learned and improved, so it's kind of unfair to @Dbrant et al. to point at this discovery of a nearly 3 year old bug as if it's a recent mistake and we should be better going forward.

Nuria added a comment. Feb 12 2018, 7:40 PM

No finger pointing on my end, at all. I am focused on looking for solutions and better process going forward, not guilty parties. If vetting is happening already, sounds excellent.

Change 409450 merged by jenkins-bot:
[apps/android/wikipedia@master] Sample eventlogging schemas based on appInstallID, not current instance.

https://gerrit.wikimedia.org/r/409450

I see, this means that we are going to have to throw away a couple years of data unless we can resample it somehow (random sampling from what we have? cc @chelsey)

Hi @Nuria, sorry I'm late. Random re-sampling may or may not be helpful, depending on whether the metric of interest is measured at the user level. For example, if we want to compute daily average pageviews per user using MobileWikiAppSessions, we can't get that from the old data, because it's very unlikely that every session of any particular user in a day was selected to be logged. However, if I want to compute how much of the page users scroll through on average using MobileWikiAppPageScroll, which uses the same sampling method as MobileWikiAppSessions, I can re-sample from the current data to avoid certain OS versions being over-represented in the dataset.
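A minimal sketch of the stratified re-sampling idea, with made-up events and target shares (not the real data): down-sample each OS stratum of the logged events so the per-OS shares match an external baseline such as webrequest.

```python
import random

random.seed(42)

# Made-up logged events, one record per install; Android_7_0 is
# over-represented relative to a hypothetical webrequest baseline.
events = [(f"u{i}", "Android_7_0") for i in range(600)] + \
         [(f"u{i}", "Android_6_0") for i in range(600, 1000)]

# Target per-OS shares, e.g. taken from webrequest.
target = {"Android_7_0": 0.4, "Android_6_0": 0.6}

def stratified_resample(events, target, size):
    """Down-sample each OS stratum so its share matches the target."""
    by_os = {}
    for ev in events:
        by_os.setdefault(ev[1], []).append(ev)
    sample = []
    for os_name, share in target.items():
        k = min(round(size * share), len(by_os.get(os_name, [])))
        sample.extend(random.sample(by_os[os_name], k))
    return sample

resampled = stratified_resample(events, target, size=500)
shares = {os_name: sum(e[1] == os_name for e in resampled) / len(resampled)
          for os_name in target}
# shares now matches the target proportions rather than the logged ones
```

Note the trade-off mentioned above: this fixes representativeness for event-level metrics, but cannot recover user-level metrics that need every session of a user.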

And of course, some versions of Android might be more greedy about garbage-collecting inactive apps, whereas others might allow app instances to persist for a long time. It's entirely possible that Android 7.0 is less greedy about terminating inactive apps, which means that there's less churn through re-bucketing of app instances, which might lead to the reduced EL numbers that we see.

@Dbrant Does this mean that aggregating session length across OS is not meaningful (e.g. daily average session length), because session length is correlated with OS (e.g. Android 7.0 users are more likely to have longer sessions)? Maybe we should compare session lengths within OS groups?

@chelsyx Agreed, +1

Does this mean that aggregating session length across OS is not meaningful (e.g. daily average session length) because session length is correlated with OS (e.g. Android 7.0 are more likely to have a longer session)

Also, a better Android version means a newer device and likely a better connection, and better connections are always associated with longer sessions on mobile (not just for Wikipedia but in general).

@Dbrant: which release of the app will have the fix?

Restricted Application added a project: Product-Analytics. Apr 19 2018, 12:21 AM
mpopov closed this task as Resolved. Apr 23 2018, 10:59 PM

Thanks, everyone!