Observation
In wmf.mobile_apps_uniques_daily, there are 1,080,318 unique Android app users on 2017-12-03:
SELECT * FROM wmf.mobile_apps_uniques_daily WHERE year = 2017 AND month = 12 AND day = 3 AND platform = 'Android'
Using webrequest data, we count 1,101,078 unique Android app users on 2017-12-03, which is close to the number above:
SELECT COUNT(DISTINCT COALESCE(x_analytics_map['wmfuuid'], PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'appInstallID')))
FROM wmf.webrequest
WHERE year = 2017 AND month = 12 AND day = 3
  AND http_status IN ('200', '304')
  AND user_agent_map['os_family'] = 'Android'
  AND access_method = 'mobile app'
  AND COALESCE(x_analytics_map['wmfuuid'], PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'appInstallID')) IS NOT NULL
But on the same day, MobileWikiAppSessions contains 56,061 unique production app users and 51,708 unique beta app users. Production is sampled at 1:100 and beta at 1:1, so if this calculation makes sense, the estimated total is production * 100 + beta = 56,061 * 100 + 51,708 = 5,657,808. That is about 5 times the counts above, a much larger difference than expected.
SELECT IF(useragent.wmf_app_version LIKE '%-r-%', 'prod', 'beta') AS app,
  COUNT(DISTINCT event.appInstallID) AS n_users
FROM event.mobilewikiappsessions
WHERE useragent.os_family = 'Android'
  AND year = 2017 AND month = 12 AND day = 3
GROUP BY IF(useragent.wmf_app_version LIKE '%-r-%', 'prod', 'beta')
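To make the comparison easy to re-check, here is a minimal Python sketch of the arithmetic. It only reuses the counts reported above. The variable and function names are made up for illustration, and the scale-up assumes that sampling is applied per app install (so that multiplying distinct users by the sampling rate is meaningful), which may not be how the app actually samples.

```python
# Counts reported above for Android on 2017-12-03.
uniques_daily = 1_080_318       # wmf.mobile_apps_uniques_daily
webrequest_uniques = 1_101_078  # distinct install IDs in wmf.webrequest
prod_sampled = 56_061           # MobileWikiAppSessions, production app
beta_sampled = 51_708           # MobileWikiAppSessions, beta app

# Stated sampling rates: production 1:100, beta 1:1.
PROD_ONE_IN = 100
BETA_ONE_IN = 1


def scale_up(sampled_users: int, one_in: int) -> int:
    """Estimate the full population from a 1:one_in sample.

    Assumption: sampling is per app install, not per session.
    """
    return sampled_users * one_in


estimated_total = scale_up(prod_sampled, PROD_ONE_IN) + scale_up(beta_sampled, BETA_ONE_IN)

# How many times larger the estimate is than each reference count.
ratio_vs_uniques = estimated_total / uniques_daily
ratio_vs_webrequest = estimated_total / webrequest_uniques

print(f"estimated total:      {estimated_total:,}")
print(f"vs. uniques daily:    {ratio_vs_uniques:.2f}x")
print(f"vs. webrequest count: {ratio_vs_webrequest:.2f}x")
```

Both ratios come out a little above 5, which matches the "about 5 times" figure quoted above.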
Problem
There may be a bug in the sampling process that causes more data to be sampled than expected. This bug may also affect other schemas that use the same sampling method.