Page MenuHomePhabricator

Estimate the daily number of mobile devices who already visited on the preceding day
Closed, ResolvedPublic

Description

For the mobile web version of Wikipedia, based on last-access cookies, possibly split by OS

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 30 2017, 9:54 PM

Didn't yet get to wrap this up today and do a fuller writeup, but the answer for the English Wikipedia (en.m.wikipedia.org) is about 8-10 million per day. For Spanish, 1-2 million. Compare to e.g. the 1.1 million daily active users of the Android app.
(Starting next week or so we should also be able to get a global number - across all language Wikipedia - from the new last-access cookie.)

Thanks @Tbayer Any additional big languages are helpful.

Per our meeting, for this same presentation I'm also looking for:

  • Pageviews by platform (back to begin of 2016)
  • D7 Retention by platform (back as far as the data exists)
  • Wikidata edits and revert rates (since launch)

Thanks @Tbayer Any additional big languages are helpful.

Below is the same metric (number of mobile devices already seen yesterday) for the 20 most (re)-visited mobile web versions of Wikipedia, averaged over the four weeks from June 3-30:

wikipedia_version	avg_visited_yesterday
en.m.wikipedia.org	8859519
ja.m.wikipedia.org	1924954
es.m.wikipedia.org	1542810
de.m.wikipedia.org	1093970
ru.m.wikipedia.org	884864
it.m.wikipedia.org	849191
fr.m.wikipedia.org	808798
zh.m.wikipedia.org	391060
pt.m.wikipedia.org	372229
pl.m.wikipedia.org	254385
ar.m.wikipedia.org	245965
fa.m.wikipedia.org	193206
nl.m.wikipedia.org	176551
sv.m.wikipedia.org	144317
id.m.wikipedia.org	123807
fi.m.wikipedia.org	76474
th.m.wikipedia.org	75938
he.m.wikipedia.org	72021
ko.m.wikipedia.org	60220
vi.m.wikipedia.org	51116
20 rows selected (310.205 seconds)

SELECT CONCAT(project,'.m.wikipedia.org') AS wikipedia_version,
INT(AVG(visited_yesterday)) AS avg_visited_yesterday FROM (
SELECT year, month, day, CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')) AS date,
project, 
SUM(IF((unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') - unix_timestamp(wmf_last_access, 'dd-MMM-yyyy')) = 86400,1,0) * view_count) AS visited_yesterday
FROM tbayer.webrequest_extract_bak
WHERE unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') IS NOT NULL 
AND year = 2017 AND month = 6 AND day >= 3 -- 4x7 days, to account for weekly seasonality
AND access_method = 'mobile web'
AND project_class = 'wikipedia'
GROUP BY year, month, day, project) AS byday
GROUP BY project
ORDER BY avg_visited_yesterday DESC LIMIT 20;

Per our meeting, for this same presentation I'm also looking for:

  • Pageviews by platform (back to begin of 2016)
  • D7 Retention by platform (back as far as the data exists)
  • Wikidata edits and revert rates (since launch)

Done (with only the overall revert rate, I think we don't need all the available detail)

And here is the same data (daily average of been-here-yesterdays for June 3-30), with iOS and Android clients broken out:

wikipedia_versionall_yesterdaysios_yesterdaysandroid_yesterdays
en.m.wikipedia.org885951947946813859229
ja.m.wikipedia.org19249541107981804330
es.m.wikipedia.org15428103239531185449
de.m.wikipedia.org1093970479643591720
ru.m.wikipedia.org884864292457566299
it.m.wikipedia.org849191302864515582
fr.m.wikipedia.org808798371527411050
zh.m.wikipedia.org391060185535204657
pt.m.wikipedia.org37222977633282021
pl.m.wikipedia.org25438547918194028
ar.m.wikipedia.org24596592152147949
fa.m.wikipedia.org19320632721158611
nl.m.wikipedia.org1765519811176033
sv.m.wikipedia.org1443179060852857
id.m.wikipedia.org1238071028197996
fi.m.wikipedia.org764742309549188
th.m.wikipedia.org759382460950933
he.m.wikipedia.org720212167250114
ko.m.wikipedia.org602201358446574
vi.m.wikipedia.org511161793729189
Data source is this (slightly clumsy) query:

SELECT CONCAT(all_avgs.project,'.m.wikipedia.org') AS wikipedia_version, 
all_avgs.avg_visited_yesterday AS all_yesterdays,
ios_avgs.avg_visited_yesterday_ios AS ios_yesterdays,
android_avgs.avg_visited_yesterday_android AS android_yesterdays
FROM

(
SELECT project, os_family,
INT(AVG(visited_yesterday)) AS avg_visited_yesterday_ios FROM (
  SELECT year, month, day, CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')) AS date,
  project, 
  SUM(IF((unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') - unix_timestamp(wmf_last_access, 'dd-MMM-yyyy')) = 86400,1,0) * view_count) AS visited_yesterday,
  user_agent_map['os_family'] AS os_family
  FROM tbayer.webrequest_extract_bak
  WHERE unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') IS NOT NULL 
  AND year = 2017 AND month = 6 AND day >= 3 -- 4x7 days, to account for weekly seasonality
  AND access_method = 'mobile web'
  AND project_class = 'wikipedia'
  GROUP BY year, month, day, project, user_agent_map['os_family'] 
  HAVING user_agent_map['os_family'] = 'iOS') AS iosbyday
GROUP BY project, os_family
) AS ios_avgs

JOIN 

(
SELECT project, os_family,
INT(AVG(visited_yesterday)) AS avg_visited_yesterday_android FROM (
  SELECT year, month, day, CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')) AS date,
  project, 
  SUM(IF((unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') - unix_timestamp(wmf_last_access, 'dd-MMM-yyyy')) = 86400,1,0) * view_count) AS visited_yesterday,
  user_agent_map['os_family'] AS os_family
  FROM tbayer.webrequest_extract_bak
  WHERE unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') IS NOT NULL 
  AND year = 2017 AND month = 6 AND day >= 3 -- 4x7 days, to account for weekly seasonality
  AND access_method = 'mobile web'
  AND project_class = 'wikipedia'
  GROUP BY year, month, day, project, user_agent_map['os_family'] 
  HAVING user_agent_map['os_family'] = 'Android') AS androidbyday
GROUP BY project, os_family
) AS android_avgs
ON ios_avgs.project = android_avgs.project

JOIN 
(
SELECT project,
INT(AVG(visited_yesterday)) AS avg_visited_yesterday FROM (
  SELECT year, month, day, CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')) AS date,
  project, 
  SUM(IF((unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') - unix_timestamp(wmf_last_access, 'dd-MMM-yyyy')) = 86400,1,0) * view_count) AS visited_yesterday
  FROM tbayer.webrequest_extract_bak
  WHERE unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') IS NOT NULL 
  AND year = 2017 AND month = 6 AND day >= 3 -- 4x7 days, to account for weekly seasonality
  AND access_method = 'mobile web'
  AND project_class = 'wikipedia'
  GROUP BY year, month, day, project) AS allbyday
GROUP BY project
) AS all_avgs
ON android_avgs.project = all_avgs.project

ORDER BY all_yesterdays DESC LIMIT 20;

As discussed, I also started to look into generating a global version based on the new global Last-access cookie, but it seems this needs some extra care to make sure the interpretation is valid (and also, to take a similar average over several weeks, a potentially more time-consuming webrequest query that could easily take a day to run), so this won't be ready today, sorry.

Tbayer closed this task as Resolved.Sep 9 2017, 2:16 AM

Closing this now, as the main requests were all done back in July for the occasion at which they were needed, I haven't had time to tackle the bonus task (global numbers yet) since; it would be more complicated and probably justify a separate task,