Page MenuHomePhabricator

Generate various histograms and time series for exploring the new reader retention metric
Closed, ResolvedPublic

Description

Extending the work from T200111:

For the purpose of exploring and vetting the planned new reader retention metric(s), we want to be able to quickly compare many different graphs of the following kind:

  • Daily histograms of return time for several (e.g. +/-3) days around a date where the average next return metric shows spikes or other anomalies. Also (like in T200111#4539752 ) stacked by the following dimensions:
    • project
    • browser_family
    • os_family
    • whether the counted (return) request was a main page view or not (a frequent form of anomalous requests)
  • Time series of the average next return time (within 7 days and 31 days) for a variety of countries and projects, using all the available data (from December 2016 to now)

Since we don't currently have a tool for exploring the underlying data (tbayer.webrequest_extract_bak) interactively, @MNeisler plans to come up with some R code allowing to quickly generate a sufficient amount of examples of these kind of graphs.

Event Timeline

I finished creating time-series graphs looking at the users avg next return time (within 31 days and 7 days) for a variety of countries and projects from December 2016 through July 2018.

I’m currently reviewing the graphs and their breakdowns to identify trends. Below are some initial observations:

  • Many of the countries have noticeable spikes and other anomalies that occur across various months and the majority of these spikes occur on desktop.
  • On English Wikipedia, there were a number of sudden drops on desktop between May and July 2017, where the avg return time within 31 days decreased from around 5.5 to 1.0 or 2.0 days. Similar drops during this timeframe were also seen in for Wikimedia and Wikisource projects and from US, Japan, France, and Russia countries. I’ll investigate further by looking through the raw dataset and using daily histograms of return time around those dates.

  • There is an increase average return time in December for Germany, Spain and U.S. indicating impacts from Christmas on avg user return time.
  • The 31 day returns graphs show more noticeable time series trends compared to the 7-day graphs.

31- and 7-day time series graphs for all Wikipedia projects from Germany (Country code = 'DE') and from Spain (Country code = 'ES') below as examples:


Next, I’m working on creating daily histograms of return time for several (e.g. +/-3) days around the dates where the average next return metric shows spikes or other anomalies.

See examples for Germany below where here was a spike in avg user returns within 31 days on 2018-03-01 (rose from an avg of 5 to 8.58).


See Codebase for additional graphs.

I finished creating time-series graphs looking at the users avg next return time (within 31 days and 7 days) for a variety of countries and projects from December 2016 through July 2018.
I’m currently reviewing the graphs and their breakdowns to identify trends. Below are some initial observations:

....

  • On English Wikipedia, there were a number of sudden drops on desktop between May and July 2017, where the avg return time within 31 days decreased from around 5.5 to 1.0 or 2.0 days. Similar drops during this timeframe were also seen in for Wikimedia and Wikisource projects and from US, Japan, France, and Russia countries. I’ll investigate further by looking through the raw dataset and using daily histograms of return time around those dates.

Interesting! So it seems that the average may have been integer-valued on these drop days in F26025897 ? That would point to a data artifact.

  • On English Wikipedia, there were a number of sudden drops on desktop between May and July 2017, where the avg return time within 31 days decreased from around 5.5 to 1.0 or 2.0 days. Similar drops during this timeframe were also seen in for Wikimedia and Wikisource projects and from US, Japan, France, and Russia countries. I’ll investigate further by looking through the raw dataset and using daily histograms of return time around those dates.

The sudden integer-value drops in avg user return time between May and July 2017 to 1.00 or 2.00 appears to be due to malformed data in the WMF-Last-Access field around that time, e.g.:

26-Jul-2017, WMF-Last-Access=27-Jul-2017, GeoIP=US:VA:Ashburn:39.05:-77.47:v4, WMF-Last-Access-Global=27-Jul-2017
(everything after the first 26-Jul-2017 should not be in there)

I added a length check condition to the query to filters out all these malformed examples and added results through the end of August 2018. See update time series graph for avg user returns within 31 days from the United States to all Wikis.

Data via:

SELECT country_code as country, access_method, wmf_last_access as last_seen_date, 
-- average days until next access
SUM(((unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') - unix_timestamp(wmf_last_access, 'dd-MMM-yyyy'))/86400) * view_count)/ SUM(view_count) AS avg_days_till_next_access 
FROM tbayer.webrequest_extract_bak 
WHERE unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') IS NOT NULL 
AND length(wmf_last_access) < 15 -- length check to remove malformed data around that time. 
-- accessed between December 15, 2016 and August 27, 2018
AND unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') >= 1481760000 -- 12/15/2016 @ 12:00am (UTC)
AND unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') < 1535328000 -- 8/27/2018 12:00:00 AM (UTC)
AND ( (year = 2016) OR (year = 2017) OR (year = 2018 and month <=9) )
AND (access_method = 'desktop' OR access_method = 'mobile web') 
AND project_class = 'wikipedia' 
-- avg returns within 31 days
AND unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') < (unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') + 2764800) -- 2764800 seconds = 32 days
AND unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') >= (unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') + 86400) -- 86400 seconds = 1 day
GROUP BY country_code, wmf_last_access, access_method;

The updated time series charts also showed some recent spikes in August for several countries and project classes. See avg returns for MediaWiki and from Bangladesh below. I'll further investigate these recent spikes by looking at whether the counted (return) requests around these dates were a main page view or not and frequently viewed pages by joining tbayer.webrequest_extract_bak with wmf.webrequest.

I further investigated recent spikes in August 2018 identified in (return) requests from Spain, Japan, France, and India. For example, there was spike in average user return time seen in Japan on desktop around 2018-08-10 to all Wikipedia projects.

In addition to breakdowns by os, project, and browser_family, I looked at the number of return requests to main pages (isolating to project main pages viewed by the country reviewed) and pages where the title was not extracted using data recorded in mneisler.last_access_main_page_views_by_country extracted from the webrequest table. I looked at 3 days before and after the spike date for each countrry. A few example daily histograms show below for Japan, Spain and France:

The counted (return) requests for dates around these spikes did not show a large number of views to main pages or to pages where the title was not extracted. This indicates that the increase in average return time around these dates was not due to an increase in anomalous requests.

Histograms with main page breakdowns below:

Data for Japan August 2018 desktop returns via:


start_date <- 1533600000 ## 08/07/2018 @ 12:00am (UTC)
end_date <- 1534118400 ## 08/13/2018 @ 12:00am (UTC)

jp_return_frequency_Aug18 <- do.call(rbind, lapply(seq(start_date, end_date, by=86400), function(date) {
  cat("Fetching webrequest data from ", as.character(date), "\n")
  
  query <- paste("
                 SELECT '",date,"' AS date, user_agent_map['os_family'] as os_family, user_agent_map['browser_family'] as browser_family,
                 is_main_page, (unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') 
                 - unix_timestamp(wmf_last_access, 'dd-MMM-yyyy'))/86400 AS days_till_next_access, 
                 COUNT (*) AS returns_each_day
                 FROM mneisler.last_access_main_page_views_by_country
                 WHERE unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') = ", date, "
                 AND year = 2018
                 AND access_method = 'desktop'
                 AND country_code = 'JP'
                 AND length(wmf_last_access) < 15 -- length check to remove malformed data around that time
                 AND unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') < (unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') + 2764800) -- 2764800 seconds = 32 days
                 AND unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') >= (unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') + 86400) -- 86400 seconds = 1 day
                 GROUP BY (unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') - unix_timestamp(wmf_last_access, 'dd-MMM-yyyy'))/86400, 
                 user_agent_map['os_family'], user_agent_map['browser_family'], is_main_page
                 ;")
  
  cat(query)
  results <- wmf::query_hive(query)
  return(results)
}))

readr::write_rds(jp_return_frequency_Aug18, "jp_return_frequency_Aug18.rds", "gz")

See Codebase for additional plots.

Quick summary of work progress since my last post:

Further review of daily histograms of return time:

  • Created gifs of a series of histograms to help view changes in distribution over a few dates around identified spikes.
  • The histograms show valleys or decreases in the distributions around holidays and weekends. This reflects a decrease of active users returning during those times.
  • See histograms for United States and France around Christmas below:

Impact of External Events on Average User Return Time

  • A decrease in active users or spike in new or infrequent users from an external event may result in an increase in the average return time a day or two prior to the event. To investigate, I reviewed the average returns and pageviews around the dates of a few external events from the past year (page preview rollouts and WPO shutdown).
  • English Wikipedia page preview rollout on April 17, 2018: A/B tests indicated that this feature led to an expected decrease in pageviews by 3-5% per session but there were are no noticeable spikes in the average returns within 31 days around that date or any significant changes in the daily frequency of returns.
  • Wikipedia Zero program (WPO) shutdown on June 29, 2018 in Angola: The time series charts below shows an increase in the average user returns from within 31 days on mobile web from Angola starting on June 26, 2018, a few days prior to the shutdown. The Mobile web pageviews in Angola decreased significantly on the shutdown date, indicating that the increase in average return time is due to a decrease in active users.

  • The histograms of daily returns 5 days before and after the Wikipedia Zero Shutdown show a drop in users returns on the date of the shutdown. Prior to the shutdown on June 28, the percentage of users returning in 1 day ranges from 30 to 40%. After the shutdown, only between 20 to 30% return in 1 day.

I'll work on wrapping up a report summarizing results obtained from this investigation and present results to the Product Analytics team at an upcoming team share.

Updated codebase

MNeisler closed this task as Resolved.Apr 17 2019, 3:49 PM

Data exploration results summarized and posted on meta. Marking as done but let me know if you have any questions!