Page MenuHomePhabricator

Investigate the spikes in average user return time in Indonesia and Bangladesh on Wikipedia
Closed, ResolvedPublic

Description

In T184677#4432372, I (@MNeisler) identified several spikes in average user return time within 31 days for Indonesia and Bangladesh on Wikipedia, based on the last access data recorded in tbayer.webrequest_extract_bak. We'd like to investigate these spikes to learn more about the behavior of this new metric and what caused the average to go up.

Potential direction for investigation:

  • Plot histograms showing the frequency of next return data, separately for a few days around the identified spike.
  • Break out by project, os and browser type
  • Break out views to main page

Event Timeline

MNeisler triaged this task as Low priority.Jul 20 2018, 10:17 PM
MNeisler created this task.
MNeisler moved this task from Triage to Next Up on the Product-Analytics board.
Tbayer updated the task description. (Show Details)Aug 1 2018, 2:05 PM
MNeisler added a comment.EditedAug 28 2018, 7:21 PM

I plotted histograms showing the frequency of next return data (frequency among unique devices seen on a particular date and returning within 31 days) on all Wikipedia projects for a few days around identified spikes in Indonesia and Bangladesh.

In Indonesia, all of the identified avg return time spikes occurred on desktop. I investigated the spike on June 22, 2017 (9.9 days on desktop), on December 22, 2017 (8.5 days on desktop), and on February 9, 2018 (8.6 days on desktop). For dates leading up to the spike, the histograms show an inconsistent decline after day 1. For example, on June 22nd, the numbers of unique devices returning after 11 days increases compared to those returning after 10 days. Starting on June 24th, the histogram shows a consistent decline in the frequency of returns from day 1 to day 31.

See histograms for June 22, 2017 (peak date) and June 24, 2017 below as :


In Bangladesh, I investigated the spikes on January 27, 2017 (13 days on desktop), December 30, 2017 (7.7 days on mobile, and April 10, 2018 (7.6 days on mobile). Compared to Indonesia, histograms for days around each of these peaks showed a higher return rate on day 1 and a consistent decline in returns on the following days (2 to 31). See histograms for April 10, 2018 (peak date) and April 12, 2018 below as an example.


I also broke down each histogram by project (specifically looking at non-wikipedia projects, os, and browser type). The breakdowns appear equally distributed indicating that the increases in average return time during these times are a result of real user behavior and not an artifact. Histograms for June 22, 2017 in Indonesia below:



Data for Indonesia June 2017 desktop returns via

start_date <- 1497916800  #2017-06-20
end_date <-  1498262400 #2017-06-24

id_return_frequency_june17<- do.call(rbind, lapply(seq(start_date, end_date, by=86400), function(date) {
  cat("Fetching data from ", as.character(date), "\n")
  
  query <- paste("
    SELECT '",date,"' AS date, project_class AS project, user_agent_map['os_family'] as os_family, user_agent_map['browser_family'] as browser_family,
    (unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') 
    - unix_timestamp(wmf_last_access, 'dd-MMM-yyyy'))/86400 AS days_till_next_access, 
    COUNT (*) AS returns_each_day
    FROM tbayer.webrequest_extract_bak
    WHERE unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') = ", date, "
    AND year = 2017
    AND access_method = 'desktop'
    AND country_code = 'ID'
    AND unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') < (unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') + 2764800) -- 2764800 seconds = 32 days
    AND unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') >= (unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') + 86400) -- 86400 seconds = 1 day
    GROUP BY (unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') - unix_timestamp(wmf_last_access, 'dd-MMM-yyyy'))/86400, project_class, user_agent_map['os_family'], user_agent_map['browser_family']
                 ;") 
  cat(query)
  results <- wmf::query_hive(query)
  return(results)
}))

readr::write_rds(id_return_frequency_june17, "id_return_frequency_june17.rds", "gz")
MNeisler updated the task description. (Show Details)Aug 28 2018, 7:22 PM
Aklapper changed the edit policy from "Custom Policy" to "All Users".Sep 17 2018, 5:58 PM
Aklapper changed Risk Rating from N/A to default.
MNeisler added a comment.EditedOct 24 2018, 9:04 PM

I updated the time series graphs of the average next return time within 31 days, using all the available data (from December 2016 through August 2018). See T184677#4432372 for original time series graphs.

I further investigated some of the spikes in avg user return time seen in August 2018 for both Indonesia and Bangladesh by plotting daily histograms of return time for several (e.g. +/-3) days around the spike dates

In addition to breakdowns by os, project, and browser_family, I looked at the number of return requests to main pages (localized to wiki project main pages viewed by the country reviewed) and pages where the title was not extracted using data recorded in mneisler.last_access_main_page_views_by_country extracted from the webrequest table.

The counted (return) requests for dates around these spikes did not show a large number of views to main pages or to pages where the title was not extracted. This indicates that the increase in average return time around these dates was not due to an increase in the number of anomalous requests.

Histograms with main page breakdowns below:

Returns from Indonesia on Desktop on last access date of 2018-08-07

Returns from Indonesia on Mobile Web on last access date of 2018-08-07

Returns from Bangladesh on Desktop on last access date of 2018-08-16

Returns from Bangladesh on Mobile Web on last access date of 2018-08-16

Data for Bangladesh August 2018 desktop returns via:


start_date <- 1534118400 ## 08/13/2018 @ 12:00am (UTC)
end_date <- 1534636800 ## 08/19/2018 @ 12:00am (UTC)

bd_return_frequency_Aug18 <- do.call(rbind, lapply(seq(start_date, end_date, by=86400), function(date) {
  cat("Fetching webrequest data from ", as.character(date), "\n")
  
  query <- paste("
                SELECT '",date,"' AS date, user_agent_map['os_family'] as os_family, user_agent_map['browser_family'] as browser_family,
                 is_main_page, (unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') 
                 - unix_timestamp(wmf_last_access, 'dd-MMM-yyyy'))/86400 AS days_till_next_access, 
                 COUNT (*) AS returns_each_day
                 FROM mneisler.last_access_main_page_views_by_country
                 WHERE unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') = ", date, "
                 AND year = 2018
                 AND access_method = 'desktop'
                 AND country_code = 'BD'
                 AND length(wmf_last_access) < 15 -- length check to remove malformed data around that time
                 AND unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') < (unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') + 2764800) -- 2764800 seconds = 32 days
                 AND unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') >= (unix_timestamp(wmf_last_access, 'dd-MMM-yyyy') + 86400) -- 86400 seconds = 1 day
                 GROUP BY (unix_timestamp(CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')), 'yyyy-MM-dd') - unix_timestamp(wmf_last_access, 'dd-MMM-yyyy'))/86400, 
                user_agent_map['os_family'], user_agent_map['browser_family'], is_main_page
                 ;")
  
  cat(query)
  results <- wmf::query_hive(query)
  return(results)
}))

See Codebase for additional plots.

@Tbayer Let me know if you have any question or comments.

MNeisler updated the task description. (Show Details)Oct 24 2018, 9:05 PM
Tbayer closed this task as Resolved.Dec 15 2018, 12:17 AM