Mobile impressions / pageviews for Ireland are hovering at 70%, which is unusually low. We should investigate.
Description
Related Objects
Mentioned In
- T238560: Doubts and questions about Kerberos and Hadoop
- T154560: Spike: CentralNotice: Consider how to disable CN banners, or at least Fundraising banners, on FB's "Free Basics" service
Mentioned Here
- P4700 T152650 Temporary Hive table for querying IP address, proxy, region and impressions in IE for mobile FR campaign
- P4702 T152650 ipython munging to find impression rates by proxy and region in IE for mobile FR campaign
- T154560: Spike: CentralNotice: Consider how to disable CN banners, or at least Fundraising banners, on FB's "Free Basics" service
Event Timeline
Just some random ideas about stuff to check... Maybe look for a lack of correspondence between the country param from the beacon/impression URL and the country in geocoded_data in the Hive webrequest entry? See if other geotargeted and non-geotargeted campaigns have had the same mobile impression/pageview rate? Also, any significant differences in browser versions used there? Difference in IP versions? (Sorry for going on and on, hope this is useful ;p )
I pulled the data for Dec. 7, and indeed, Ireland is way down... just above 50%.
The following data is not detailed; I used the wmf.pageviews_hourly table in Hive, so I couldn't filter all the factors that cause CN not to run. (I think we can get more precise data from wmf.webrequest.) However, it shows that there's definitely something going on with Ireland mobile.
Date | access_method | country_code | pageviews | impressions | rate |
2016-12-07 | desktop | AU | 3602964 | 3035589 | 0.842525487348749 |
2016-12-07 | desktop | CA | 8862933 | 6816711 | 0.769125863864705 |
2016-12-07 | desktop | GB | 13133650 | 11750591 | 0.894693478202937 |
2016-12-07 | desktop | IE | 939310 | 773964 | 0.823970787067102 |
2016-12-07 | desktop | NZ | 677186 | 611892 | 0.903580404792775 |
2016-12-07 | desktop | US | 60213188 | 48814949 | 0.810701951207101 |
2016-12-07 | mobile web | AU | 3080791 | 2829559 | 0.918452111811545 |
2016-12-07 | mobile web | CA | 4568113 | 4330745 | 0.948038062981367 |
2016-12-07 | mobile web | GB | 11638393 | 10696818 | 0.919097507705746 |
2016-12-07 | mobile web | IE | 1546894 | 780968 | 0.504862000886939 |
2016-12-07 | mobile web | NZ | 480079 | 441832 | 0.920331862047705 |
2016-12-07 | mobile web | US | 52453939 | 47560001 | 0.906700276598865 |
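The rate column above is simply impressions divided by pageviews for each access_method and country_code pair. Here is a minimal Python sketch of that calculation, using three rows copied from the table. The function and variable names are illustrative and are not taken from the actual query.

```python
# Rows copied from the table above:
# (access_method, country_code, pageviews, impressions)
ROWS = [
    ("desktop", "IE", 939310, 773964),
    ("mobile web", "GB", 11638393, 10696818),
    ("mobile web", "IE", 1546894, 780968),
]


def impression_rate(pageviews, impressions):
    """Return impressions per pageview, or 0.0 when there are no pageviews."""
    return impressions / pageviews if pageviews else 0.0


def rates_sorted(rows):
    """Return (access_method, country_code, rate) tuples, lowest rate first."""
    rated = [(m, c, impression_rate(pv, imp)) for m, c, pv, imp in rows]
    return sorted(rated, key=lambda r: r[2])


for method, country, rate in rates_sorted(ROWS):
    print(f"{method:<10} {country} {rate:.4f}")
```

Sorting by rate puts the outlier (mobile web, IE) first, which is a quick way to spot it when more countries are added.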
Hi! It looks like mobile network issues are causing CN to not display banners and/or not report impressions.
Initial digging:
- IPv6 vs. IPv4: No correlation to reduced impression rate found.
- country param on beacon/impression (comes from GeoIP cookie) vs. country assigned to webrequest in Hive (directly from IP on that request): For all countries, a small number of requests to beacon/impression were from countries other than the one indicated by the GeoIP cookie (which CN uses to geolocate). Ireland was no different from other countries in this regard. It's almost certainly due to people travelling and keeping the GeoIP cookie they got in one location when they visit the site from the second location.
- browser family/major version/os: There is some relation to client platform: the most recent versions of browsers had a normal impression rate. Rate declined generally (though haphazardly) with browser major version. This isn't what you'd expect if there were some JS bug getting triggered in some browsers but not others. Overall, some browsers are quite a bit worse off than others, though.
So, it looks like browser/major version is, at least in part, a proxy for something else, like age/speed of a device, or speed of mobile network.
I hope I'm not making wrong assumptions here... but I noticed that Ireland has significantly less LTE coverage than other countries in the campaign: https://opensignal.com/reports/2016/02/state-of-lte-q4-2015/
Also, relative to other countries in the campaign, it has a higher percentage of the population living in rural areas (which, one would imagine, are the ones that have less coverage).
So, I looked at impression rates by region, to see if more rural regions had lower rates. Unfortunately, our data on regions isn't very good; Maxmind locates most users in region "Unknown". However, the rates from Unknown are staggering, accounting for basically the entire drop in impressions:
Region | Total pageviews | Impression rate |
Unknown | 1035039 | 0.297322129890758 |
Leinster | 406375 | 0.924746847123962 |
Connaught | 23929 | 0.936980233189853 |
Ulster | 9738 | 0.93787225302937 |
Munster | 71813 | 0.938841156893599 |
We don't know the conditions under which Maxmind can't figure out region based on IP in Ireland. Maybe Unknowns are all mobile networks? Or just all the bad providers? In any case, it's plausible that it would correlate with network quality.
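To check how much of the drop the Unknown region explains, one can recombine the per-region figures into a pageview-weighted rate, with and without Unknown. This is a rough sketch using the numbers from the table above; the helper name and the `exclude` parameter are made up for illustration.

```python
# Per-region figures copied from the table above:
# region -> (total pageviews, impression rate)
REGIONS = {
    "Unknown": (1035039, 0.297322129890758),
    "Leinster": (406375, 0.924746847123962),
    "Connaught": (23929, 0.936980233189853),
    "Ulster": (9738, 0.93787225302937),
    "Munster": (71813, 0.938841156893599),
}


def combined_rate(regions, exclude=()):
    """Pageview-weighted impression rate over all regions not in `exclude`."""
    kept = [(pv, rate) for name, (pv, rate) in regions.items()
            if name not in exclude]
    total_pageviews = sum(pv for pv, _ in kept)
    # pageviews * rate recovers the (approximate) impression count per region
    total_impressions = sum(pv * rate for pv, rate in kept)
    return total_impressions / total_pageviews


print(f"all regions:       {combined_rate(REGIONS):.3f}")
print(f"excluding Unknown: {combined_rate(REGIONS, exclude={'Unknown'}):.3f}")
```

With Unknown included, the combined rate comes out near the roughly 50% seen for Ireland mobile on Dec. 7; without it, the rate is in line with the other countries in the campaign.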
To confirm the hypothesis that mobile network issues are the cause, I'm trying to pull data about network speed by region for mobile devices for all the countries in the campaign, to check for a more general correlation with impression rates.
More soon!!! :)
I tried to get data by region about network quality, but so far it doesn't correlate with impression rate. I used responseStart, which we get via NavigationTiming events. While the average responseStart time is much later in the Unknown region in Ireland, indicating network slowness, the correlation doesn't hold across other regions in countries in the campaign. There are several regions that have similar degrees of mobile slowness without a drop in mobile impressions.
Maybe there are some other network issues that would correlate? RIPE seems to have a lot of data, but I don't really know how to interpret most of it. Maybe DNS slowness would correlate? Fetching a banner does require an extra DNS lookup (since it's retrieved from meta). @BBlack @ema @Krinkle any thoughts?
It might be interesting to see what BannerLoader logs have to say. However, their relationship to pageviews is more complex, since the banner is only fetched when it's expected to be shown (unlike beacon/impression, which calls home even if the banner was hidden).
Also wondering if we could ask Maxmind directly which Irish networks are put in the Unknown bin...?
Thanks!!
Here's the rough data for the above... See the second sheet and the ugly chart there...
Here's the ipython notebook with all the queries I used... Including this slightly better chart:
The blue dot in the lower right corner is the "Unknown" region in Ireland. So, really an outlier...
Just checked the data for December 18th. Unknown region in Ireland still had the lowest impression rate of all regions in the campaign.
Here is a pattern that holds in other countries that were in the campaign: of the 10 regions with the worst impression rate, four are Unknown:
pageviews | country_code | subdivision | impressions | ratio |
848178 | IE | Unknown | 328423 | 0.387210 |
11021080 | US | California | 6841676 | 0.620781 |
63530 | US | Wyoming | 52682 | 0.829246 |
192477 | US | West Virginia | 160825 | 0.835554 |
27219 | CA | Unknown | 23519 | 0.864066 |
455872 | US | Utah | 409941 | 0.899246 |
1895920 | GB | Unknown | 1710139 | 0.902010 |
930421 | US | Missouri | 847455 | 0.910830 |
1637987 | US | Virginia | 1503460 | 0.917871 |
24184 | AU | Unknown | 22207 | 0.918252 |
Also, the impression rate for California on the 18th is even lower than the already low rate from December 7th (the other day we've checked data for).
One theory we've talked about is that the low impression rate is from bots that have not been identified as such... Maybe the fact that both Ireland and (I think?) California have a high concentration of data centres supports this? I'd really like to ask Maxmind what happens when they give us an "Unknown" region... Maybe it's from IPs that are closer to some part of the backbone, so not associated with a normal ISP?
@AndyRussG Maybe filling out this form with a "GeoIP data correction request" and your contact data gets you a reply from Maxmind about that.
More indications that this is due to bots:
- The Unknown region in Ireland also has the lowest rate of unique IPs, that is, the highest rate of repeat pageviews from the same IP.
- In this case, there is a tendency across other regions with low impression rates. Several other (but not all) regions with low impression rates also have a low rate of unique IPs.
distinct_ips | country_code | subdivision | pageviews | impressions | imp_ratio | dist_ip_ratio |
65655 | IE | Unknown | 848178 | 327272 | 0.385853 | 0.077407 |
119140 | US | Unknown | 1285578 | 1183996 | 0.920983 | 0.092674 |
225527 | GB | Unknown | 1895920 | 1703326 | 0.898417 | 0.118954 |
2115296 | US | California | 11021080 | 6807169 | 0.617650 | 0.191932 |
112734 | IE | Leinster | 522659 | 496033 | 0.949057 | 0.215693 |
183499 | US | Alabama | 827200 | 785214 | 0.949243 | 0.221831 |
21484 | IE | Munster | 93401 | 89078 | 0.953716 | 0.230019 |
- Looking at the UAs for the IPs with the highest number of pageviews in these regions, I don't see direct declarations of bots. So, maybe bots intentionally trying to pass as real users?
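The dist_ip_ratio column above is the number of distinct IPs divided by the number of pageviews per region; a low value means many repeat pageviews from the same addresses. Here is a small sketch of how that could be computed from raw request records. The function name is made up, and the sample records are invented placeholders using documentation-only addresses, not real data.

```python
from collections import defaultdict


def distinct_ip_ratios(requests):
    """Map each region to (distinct IPs) / (pageviews).

    `requests` is an iterable of (region, ip) pairs, one per pageview.
    """
    pageviews = defaultdict(int)
    ips = defaultdict(set)
    for region, ip in requests:
        pageviews[region] += 1
        ips[region].add(ip)
    return {region: len(ips[region]) / pageviews[region]
            for region in pageviews}


# Invented sample: one address in "Unknown" generates most of the pageviews.
SAMPLE = [
    ("Unknown", "192.0.2.1"),
    ("Unknown", "192.0.2.1"),
    ("Unknown", "192.0.2.1"),
    ("Unknown", "192.0.2.2"),
    ("Leinster", "198.51.100.1"),
    ("Leinster", "198.51.100.2"),
]

for region, ratio in sorted(distinct_ip_ratios(SAMPLE).items(),
                            key=lambda item: item[1]):
    print(f"{region:<10} {ratio:.2f}")
```

For comparison, the IE Unknown row in the table works out as 65655 / 848178, or about 0.0774.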
Found it!!!
The lost mobile impressions in Ireland are from a little over 400 IP addresses belonging to a single, high-profile internet company. Maxmind sets Unknown region for all these addresses. If we discount requests from these addresses, the impression rate for the region is normal.
Ah, so is this a thing to report to Maxmind as a bug / request for correction via that form after all? Or is Maxmind doing it on purpose for this ISP?
Hi! It's not an ISP, it's more like from a datacenter or office of a company so big that it's its own ISP. (Erring on the side of caution, I just would like to check before making public the name of the company, since this is info gleaned from IP addresses, which are very private, and I'm not yet positive it's a bot/spider, rather than actual humans accessing the site.)
Found more details on this...
- This is a proxy; the IP addresses are not those of readers per se. So, it should be fine to discuss here.
- This is coming from Facebook, from proxy servers for its Free Basics (formerly Internet.org) service.
- The proxy is correctly detected and set in our X-Analytics header's "proxy" field, which makes it all the way into Hive.
- Querying impressions, proxy and region in Ireland for mobile web views in the English FR campaign on December 18th gives us:
pageviews | impressions | subdivision | proxy | imp_ratio |
30529 | 29688 | Connaught | | 0.972452 |
8 | 0 | Connaught | Opera | 0.000000 |
522266 | 496031 | Leinster | | 0.949767 |
393 | 2 | Leinster | Opera | 0.005089 |
93285 | 89078 | Munster | | 0.954902 |
116 | 0 | Munster | Opera | 0.000000 |
13623 | 12775 | Ulster | | 0.937752 |
4 | 0 | Ulster | Opera | 0.000000 |
347811 | 327272 | Unknown | | 0.940948 |
499263 | 0 | Unknown | IORG | 0.000000 |
1104 | 0 | Unknown | Opera | 0.000000 |
(Here are the Hive query P4700 and the ipython code P4702 used to get this data.)
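To make the effect of the proxy explicit, the rows in the table above can be regrouped by proxy alone. This is only an illustrative Python sketch over the published figures, not the code from P4702; rows with an empty proxy field are labelled "direct" here, which is my own naming.

```python
from collections import defaultdict

# Rows copied from the table above:
# (pageviews, impressions, subdivision, proxy); None means no proxy detected
ROWS = [
    (30529, 29688, "Connaught", None),
    (8, 0, "Connaught", "Opera"),
    (522266, 496031, "Leinster", None),
    (393, 2, "Leinster", "Opera"),
    (93285, 89078, "Munster", None),
    (116, 0, "Munster", "Opera"),
    (13623, 12775, "Ulster", None),
    (4, 0, "Ulster", "Opera"),
    (347811, 327272, "Unknown", None),
    (499263, 0, "Unknown", "IORG"),
    (1104, 0, "Unknown", "Opera"),
]


def totals_by_proxy(rows):
    """Sum pageviews and impressions per proxy value."""
    totals = defaultdict(lambda: [0, 0])
    for pageviews, impressions, _subdivision, proxy in rows:
        key = proxy or "direct"
        totals[key][0] += pageviews
        totals[key][1] += impressions
    return {key: tuple(value) for key, value in totals.items()}


totals = totals_by_proxy(ROWS)
all_pageviews = sum(pv for pv, _ in totals.values())
all_impressions = sum(imp for _, imp in totals.values())

print(f"overall rate:       {all_impressions / all_pageviews:.3f}")
for proxy, (pv, imp) in totals.items():
    share = pv / all_pageviews
    print(f"{proxy:<7} rate {imp / pv:.3f}, share of pageviews {share:.3f}")
```

Grouped this way, IORG carries about a third of the pageviews in this slice with zero impressions, while direct requests have an impression rate of roughly 0.95.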
I think this can be considered now fully solved! I'll close the task shortly. I haven't described all the meandering query roads I took to get here... but I'll write up, on mediawiki.org, some querying approaches I used, just in case they're useful elsewhere. I'll link here.
Also, as discussed with fr-tech, it seems we might wish to disable CN banners, or at least fundraising banners, for this proxy. Here's a task to consider that: T154560.
Finally, a bit of follow-up: the original users' IP addresses do appear in the X-Forwarded-For header. However, we are geolocating users at the proxy server location, and setting the client_ip field in the wmf.webrequest Hive table to the proxy's IP. Should probably check if that's how it's supposed to work...
Thanks so much, all!!!! :D
Looking at data from BannerLoader requests, it seems that we may not be showing banners at all over this proxy (same slice of time/place/requests as before):
loaderrequests | proxy | subdivision |
13101 | | Connaught |
221309 | | Leinster |
8 | Opera | Leinster |
39091 | | Munster |
16 | Opera | Munster |
5809 | | Ulster |
2 | Opera | Ulster |
155122 | | Unknown |
8 | Opera | Unknown |
We only trust XFF headers that have been appended by a "trusted" server. There is nothing stopping a given HTTP request from including a random XFF header (dork aside: I used to have my Firefox profile configured to always send an XFF header that claimed I was browsing from the NSA's IP space). There is a configuration file in operations/mediawiki-config.git that tracks the trusted XFF senders. See https://meta.wikimedia.org/wiki/XFF_project and https://phabricator.wikimedia.org/diffusion/ETXF/browse/master/trusted-hosts.txt
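As a rough illustration of the trusted-sender idea (not the actual MediaWiki or cache-layer logic, which I haven't reproduced here), the sketch below only steps back through the XFF chain while the address that supplied each hop is on a trusted list. The trusted range and the function names are made-up placeholders; the addresses come from documentation-only IP space.

```python
import ipaddress

# Hypothetical trusted proxy range; the real list lives in trusted-hosts.txt.
TRUSTED_PROXIES = [ipaddress.ip_network("198.51.100.0/24")]


def is_trusted(ip):
    """Return True if `ip` falls inside one of the trusted proxy ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in TRUSTED_PROXIES)


def resolve_client_ip(remote_addr, xff_header):
    """Pick the client IP, believing XFF entries only from trusted senders.

    `remote_addr` is the address that connected to us. XFF entries are read
    right to left, since each proxy appends the address it received from.
    """
    hops = [hop.strip() for hop in xff_header.split(",") if hop.strip()]
    client = remote_addr
    for hop in reversed(hops):
        if not is_trusted(client):
            break  # the current address is not trusted, so ignore what it claims
        try:
            ipaddress.ip_address(hop)
        except ValueError:
            break  # malformed entry, stop here
        client = hop
    return client


# Trusted proxy: the XFF entry is used.
print(resolve_client_ip("198.51.100.7", "203.0.113.9"))
# Untrusted sender: the XFF entry is ignored.
print(resolve_client_ip("192.0.2.1", "203.0.113.9"))
```

Under this scheme, a proxy that is not on the trusted list is itself treated as the client, which would be consistent with requests being geolocated at the proxy's location.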