We want to measure how adding the new data center in Singapore affects readers' engagement.
Readers get significant latency improvements, but does this impact their engagement with Wikipedia? Do they tend to read more pages? Does the number of unique devices significantly increase? Is it more likely that they register as new users?
One possible approach is to build, for each of several time snapshots, a regression model that tests whether there is a relationship between a country's Wikipedia usage and its re-routing to the new data center.
We will describe each country with the following independent variables:
- Percentage increase of internet users in the country (this is to control for potential increase of unique devices due to overall internet population increase)
- Percentage increase/decrease of number of monthly unique devices (this is to see the impact on new users)
- Percentage increase/decrease of number of monthly pageviews (this is to see the impact on pageviews)
- Percentage increase/decrease of the number of monthly pageviews per unique device (this is to see the impact on reading depth)
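As a minimal sketch of how these four features could be computed for one country, assuming each percentage is taken between two consecutive snapshots (the baseline is not fixed in this proposal). The function names, dictionary keys and numbers below are hypothetical and only for illustration:

```python
def pct_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after` (positive means increase)."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (after - before) / before


def country_features(prev: dict, curr: dict) -> dict:
    """Build the four independent variables for one country.

    `prev` and `curr` hold raw monthly counts for two snapshots
    (assumed keys: internet_users, unique_devices, pageviews).
    """
    depth_prev = prev["pageviews"] / prev["unique_devices"]
    depth_curr = curr["pageviews"] / curr["unique_devices"]
    return {
        "internet_users_pct": pct_change(prev["internet_users"], curr["internet_users"]),
        "unique_devices_pct": pct_change(prev["unique_devices"], curr["unique_devices"]),
        "pageviews_pct": pct_change(prev["pageviews"], curr["pageviews"]),
        "reading_depth_pct": pct_change(depth_prev, depth_curr),
    }


# Made-up counts for a single country, not real data.
prev = {"internet_users": 1_000_000, "unique_devices": 200_000, "pageviews": 1_000_000}
curr = {"internet_users": 1_050_000, "unique_devices": 220_000, "pageviews": 1_210_000}
features = country_features(prev, curr)
print(features)
```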
The target/dependent variable will be:
- 0 if the country is not routed to the new data center at the time of the snapshot (control countries)
- 1 if the country has been re-routed at the time of the snapshot (treated countries)
We will plot the significance of each feature across the time snapshots. If engagement-related variables become predictive of the target variable after the re-routing date, this suggests a significant relationship between the switch to the new data center and readers' engagement.
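The per-snapshot model could be sketched as follows. This assumes a logistic regression (the binary target suggests it, but the proposal does not fix the model type) and uses Wald p-values as the measure of significance. The data is synthetic: the 60 countries, the 20 treated ones and the artificial pageview effect in the "after" snapshot are all made up so that the script runs end to end.

```python
import math
import numpy as np


def _probs_and_hessian(X, beta):
    """Predicted probabilities and the Hessian X' W X of the logistic log-likelihood."""
    p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
    hessian = X.T @ (X * (p * (1.0 - p))[:, None])
    return p, hessian


def fit_logit(X, y, max_iter=50, tol=1e-8):
    """Fit a logistic regression by Newton-Raphson.

    Returns coefficients and two-sided Wald p-values for the features
    (the intercept is fitted but not returned).
    """
    X = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p, hessian = _probs_and_hessian(X, beta)
        step = np.linalg.solve(hessian, X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    _, hessian = _probs_and_hessian(X, beta)
    std_err = np.sqrt(np.diag(np.linalg.inv(hessian)))
    z = beta / std_err
    p_values = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return beta[1:], p_values[1:]


feature_names = ["internet_users_pct", "unique_devices_pct",
                 "pageviews_pct", "reading_depth_pct"]

rng = np.random.default_rng(0)
n_countries = 60
treated = (np.arange(n_countries) < 20).astype(float)  # 1 = re-routed country

p_values_by_snapshot = {}
for snapshot, effect in [("before", 0.0), ("after", 4.0)]:
    # Synthetic percentage changes; only "after" carries an effect on pageviews.
    X = rng.normal(0.0, 5.0, size=(n_countries, len(feature_names)))
    X[:, 2] += effect * treated
    _, p_values = fit_logit(X, treated)
    p_values_by_snapshot[snapshot] = dict(zip(feature_names, p_values))

for snapshot, p_values in p_values_by_snapshot.items():
    print(snapshot, {name: round(float(p), 4) for name, p in p_values.items()})
```

Plotting `p_values_by_snapshot` per feature over the snapshot dates would give the significance curves described above. With few countries per group the fit can fail to converge when a feature separates the two groups perfectly, so a regularized model may be needed in practice.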
To do this, we will need:
- List of countries re-routed to Singapore and the date of re-routing: https://github.com/wikimedia/operations-dns/blob/master/geo-maps (the dates are missing there)
- Statistics on internet usage per country, to be interpolated by month: https://www.itu.int/en/ITU-D/Statistics/Pages/stat/default.aspx
- Counts of pageviews/monthly unique devices from Hive
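The monthly interpolation of the internet usage statistics could look like the sketch below. It assumes the source figures are one value per year, that each value is attributed to January of its year, and that linear interpolation is acceptable; none of this is fixed in the proposal. The function name and the numbers are hypothetical.

```python
import numpy as np


def interpolate_monthly(yearly: dict) -> dict:
    """Linearly interpolate yearly values to monthly ones.

    `yearly` maps year -> value; each value is placed at January of its year.
    Returns a dict mapping "YYYY-MM" -> interpolated value.
    """
    years = sorted(yearly)
    # Position of each known value, in months since the first year.
    known_months = np.array([12 * (y - years[0]) for y in years], dtype=float)
    known_values = np.array([yearly[y] for y in years], dtype=float)
    months = np.arange(known_months[0], known_months[-1] + 1)
    values = np.interp(months, known_months, known_values)
    labels = [f"{years[0] + int(m) // 12}-{int(m) % 12 + 1:02d}" for m in months]
    return {label: float(v) for label, v in zip(labels, values)}


# Made-up share of internet users (in percent) for one country.
yearly_internet_users = {2016: 40.0, 2017: 46.0, 2018: 49.0}
monthly_internet_users = interpolate_monthly(yearly_internet_users)
print(monthly_internet_users["2016-07"])
```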