### Observation
[mobile apps uniques](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/mobile_apps_uniques) reports 1,080,318 unique Android app installs on 2017-12-03:
```lang=sql, lines=5
SELECT *
FROM wmf.mobile_apps_uniques_daily
WHERE year=2017 AND month=12 AND day =3
AND platform='Android'
```
Using webrequest data, we counted 1,101,078 unique Android app installs on 2017-12-03, which is close to the number above:
```lang=sql, lines=5
-- Prefer the wmfuuid header value; fall back to appInstallID from the query string.
SELECT COUNT(DISTINCT COALESCE(x_analytics_map['wmfuuid'], PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'appInstallID')))
FROM wmf.webrequest
WHERE year = 2017 AND month = 12 AND day = 3
AND http_status IN ('200', '304')
AND user_agent_map['os_family'] = 'Android'
AND access_method = 'mobile app'
-- Drop requests that carry neither identifier.
AND COALESCE(x_analytics_map['wmfuuid'], PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
```
But on the same day, [MobileWikiAppSessions](https://meta.wikimedia.org/wiki/Schema:MobileWikiAppSessions) recorded 56,061 unique production app installs and 51,708 unique beta app installs. Production is sampled 1:100 and beta is sampled 1:1, so if this calculation makes sense, the implied total is production × 100 + beta = 56,061 × 100 + 51,708 = 5,657,808. That is about 5 times the counts above, a far larger difference than expected. The EventLogging counts come from this query:
```lang=sql, lines=5
SELECT
IF(useragent.wmf_app_version LIKE '%-r-%', 'prod', 'beta') AS app,
COUNT(DISTINCT event.appInstallID) AS n_users
FROM event.mobilewikiappsessions
WHERE useragent.os_family = 'Android'
AND year=2017 AND month=12 AND day =3
GROUP BY IF(useragent.wmf_app_version LIKE '%-r-%', 'prod', 'beta')
```
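To make the size of the gap explicit, here is a minimal Python sketch of the arithmetic using only the figures quoted in this task. The function names are hypothetical. The last calculation rests on an assumption that is not confirmed here: that the uniques table covers both production and beta installs, and that beta really is unsampled. Under that assumption it backs out the production sampling rate that the data would imply.

```python
# Figures quoted in this task for Android on 2017-12-03.
UNIQUES_TABLE = 1_080_318       # wmf.mobile_apps_uniques_daily
UNIQUES_WEBREQUEST = 1_101_078  # counted from wmf.webrequest
PROD_SAMPLED = 56_061           # distinct appInstallID, production app
BETA_SAMPLED = 51_708           # distinct appInstallID, beta app

PROD_RATE = 100  # documented sampling: 1 in 100
BETA_RATE = 1    # documented sampling: 1 in 1


def scaled_estimate(prod, beta, prod_rate=PROD_RATE, beta_rate=BETA_RATE):
    """Scale sampled counts up to an estimated population size."""
    return prod * prod_rate + beta * beta_rate


def implied_prod_rate(total_uniques, prod, beta):
    """Back out the '1 in N' production rate implied by a reference total.

    Assumption (not confirmed in this task): total_uniques includes both
    production and beta installs, and beta is truly sampled 1:1.
    """
    return (total_uniques - beta) / prod


estimate = scaled_estimate(PROD_SAMPLED, BETA_SAMPLED)
print(f"scaled estimate:            {estimate:,}")
print(f"ratio vs uniques table:     {estimate / UNIQUES_TABLE:.2f}")
print(f"ratio vs webrequest count:  {estimate / UNIQUES_WEBREQUEST:.2f}")
print(f"implied production rate:    1 in "
      f"{implied_prod_rate(UNIQUES_TABLE, PROD_SAMPLED, BETA_SAMPLED):.1f}")
```

If the assumption holds, the production app would be reporting at roughly 1 in 18 rather than 1 in 100, which is one way to quantify the suspected over-sampling.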
### Problem
There may be a bug in the sampling process that causes the production app to send data from more users than the 1:100 rate should allow. If so, the bug may also affect other schemas that use [the same sampling method](https://github.com/wikimedia/apps-android-wikipedia/blob/d64e76039d541aba6205b630833021e4d030b1d4/app/src/main/java/org/wikipedia/analytics/Funnel.java#L129).
It probably did not affect the quality of the data we collected, so it should not have a big impact on our ongoing analysis (T184098), unless others think this is more than a sampling problem and may point to a bigger issue.