
Percentage pageviews from Russia is too low in recent geographical breakdowns in Wikistats
Closed, Declined (Public)

Description

http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerCountryOverview.htm

Here are some figures for percentage page views from Russia:

2013Q4 4.4%
2014Q1 5.2%
2014Q2 4.9%
2014Q3 3.4%
2014Q4 3.0%

2015 Jan 3.0%
2015 Feb 2.1%
2015 Mar 0.6%
2015 Apr 0.7%
2015 May 1.2%
2015 Jun 2.1%

Some fluctuation is normal, as these figures are from 1:1000 sampled logs.
But data for Russia are clearly wrong.

Percentages for other large countries are stable: I checked Japan, Germany, France, India.
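As a rough check that sampling noise alone cannot account for the drop, the sketch below compares two monthly shares using the binomial standard error of a proportion. The number of sampled requests per month is not stated in this task, so `N_SAMPLED` is a purely hypothetical value chosen for illustration; only the percentages come from the figures above.

```python
import math

def share_drop_z_score(p_before, p_after, n_sampled):
    """Return how many standard errors separate two observed shares,
    treating each month as n_sampled independent sampled requests."""
    variance = (p_before * (1 - p_before) + p_after * (1 - p_after)) / n_sampled
    if variance == 0:
        return 0.0
    return (p_before - p_after) / math.sqrt(variance)

# Hypothetical sample size: the real monthly count of sampled log lines
# is not given in this task.
N_SAMPLED = 1_000_000

# Shares for Russia from the description: 2015 Jan (3.0%) and 2015 Mar (0.6%).
z = share_drop_z_score(0.030, 0.006, N_SAMPLED)
print(f"drop is {z:.1f} standard errors wide")
```

Under that assumed sample size the gap is far larger than anything sampling fluctuation would produce, which is consistent with a systematic cause.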

Event Timeline

ezachte raised the priority of this task from to Needs Triage.
ezachte updated the task description.
ezachte subscribed.

Wikistats traffic scripts that parse squid logs can't handle secure (HTTPS) requests. Until early 2015 only 1 or 2 percent of requests were secure, so this had little effect on the breakdown of traffic per country/region.

Since June 2015 almost all traffic has been secure. The transition happened earlier for traffic from Russia, so the percentage of traffic attributed to Russia dropped first.

There are theoretically three solutions:

  1. Collect the original IP address from the x_forwarded_for field. However, the geoiplogtag tool (written in C by Mark Bergsma) can't do that. Direct lookup in the MaxMind DB from Perl is very costly, even for a 1:1000 sampled log. Extending geoiplogtag would be complicated and possibly maintenance-prone, given that all other logic for traffic analysis resides in the Hadoop environment.
  2. As squid logs are already produced from Hadoop and the country lookup already happens there, upgrading the squid log format by appending the country code as an extra field would be much easier and faster.
  3. An alternative solution would be to do away with direct parsing of squid logs and generate the intermediate CSV files for all Wikistats traffic reports directly from Hive.
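To illustrate the second option, here is a minimal sketch of how a consumer could compute per-country shares once a country code is appended to each log line. The line layout is an assumption made for this example (whitespace-separated fields with the country code last); the real squid log format and the Wikistats scripts, which are written in Perl, differ. The function name and sample lines are hypothetical.

```python
from collections import Counter

def country_shares(lines):
    """Return {country_code: percentage of lines}, assuming the last
    whitespace-separated field of each line is the country code."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts[fields[-1]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: 100.0 * n / total for code, n in counts.items()}

# Made-up sample lines; only the trailing country code matters here.
sample = [
    "cp1001 200 /wiki/A RU",
    "cp1002 200 /wiki/B DE",
    "cp1003 200 /wiki/C RU",
    "cp1004 200 /wiki/D JP",
]
shares = country_shares(sample)
print(shares)
```

With the country resolved upstream, the parser needs no IP lookup at all, so it no longer matters whether the request was secure.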

Closing this ticket as Wikistats version 1 is dead per https://stats.wikimedia.org/Wikistats_1_announcements.htm . In case this ticket is still a valid bug report or feature request for Wikistats 2, then please reopen. Thanks a lot!