Spike: Impressions abnormally low for Ireland
Closed, Resolved · Public · 4 Estimated Story Points

Assigned To

Authored By

	awight
	Dec 8 2016, 12:42 AM

Description

Mobile impressions / pageviews for Ireland are hovering at 70%, which is unusually low. We should investigate.

Related Objects

Mentioned In: T238560: Doubts and questions about Kerberos and Hadoop
T154560: Spike: CentralNotice: Consider how to disable CN banners, or at least Fundraising banners, on FB's "Free Basics" service
Mentioned Here: P4700 T152650 Temporary Hive table for querying IP address, proxy, region and impressions in IE for mobile FR campaign
P4702 T152650 ipython munging to find impression rates by proxy and region in IE for mobile FR campaign
T154560: Spike: CentralNotice: Consider how to disable CN banners, or at least Fundraising banners, on FB's "Free Basics" service

Event Timeline

awight created this task. · Dec 8 2016, 12:42 AM

Restricted Application added a subscriber: Aklapper. · Dec 8 2016, 12:42 AM

• DStrine added a project: Fundraising Sprint Waiting for Godot. · Dec 8 2016, 9:14 PM

• DStrine moved this task from Triage to Completed in Q2 1617 on the Fundraising-Backlog board.

Just some random ideas about stuff to check... Maybe look for a lack of correspondence between the country param from the beacon/impression URL and the country in geocoded_data in the Hive webrequest entry? See if other geotargeted and non-geotargeted campaigns have had the same mobile impression/pageview rate? Also, any significant differences in browser versions used there? Difference in IP versions? (Sorry for going on and on, hope this is useful ;p )

AndyRussG claimed this task. · Dec 9 2016, 8:08 PM

AndyRussG moved this task from Backlog to Doing on the Fundraising Sprint Waiting for Godot board.

AndyRussG set the point value for this task to 2.

I pulled the data for Dec. 7, and indeed, the mobile rate for Ireland is abnormally low... just above 50%.

The following data is not detailed; I used the wmf.pageviews_hourly table in Hive, so I couldn't filter on all the factors that cause CN not to run. (I think we can get more precise data from wmf.webrequest.) However, it shows that there's definitely something going on with Ireland mobile.

Date	access_method	country_code	pageviews	impressions	rate
2016-12-07	desktop	AU	3602964	3035589	0.842525487348749
2016-12-07	desktop	CA	8862933	6816711	0.769125863864705
2016-12-07	desktop	GB	13133650	11750591	0.894693478202937
2016-12-07	desktop	IE	939310	773964	0.823970787067102
2016-12-07	desktop	NZ	677186	611892	0.903580404792775
2016-12-07	desktop	US	60213188	48814949	0.810701951207101
2016-12-07	mobile web	AU	3080791	2829559	0.918452111811545
2016-12-07	mobile web	CA	4568113	4330745	0.948038062981367
2016-12-07	mobile web	GB	11638393	10696818	0.919097507705746
2016-12-07	mobile web	IE	1546894	780968	0.504862000886939
2016-12-07	mobile web	NZ	480079	441832	0.920331862047705
2016-12-07	mobile web	US	52453939	47560001	0.906700276598865
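
For reference, a minimal Python sketch of the rate calculation behind the table above, using the mobile web rows as given. The `flag_low` helper and its 0.8 threshold are illustrative assumptions, not part of the original query.

```python
# Mobile web rows from the table above: country_code -> (pageviews, impressions)
MOBILE_WEB = {
    "AU": (3080791, 2829559),
    "CA": (4568113, 4330745),
    "GB": (11638393, 10696818),
    "IE": (1546894, 780968),
    "NZ": (480079, 441832),
    "US": (52453939, 47560001),
}


def impression_rate(pageviews, impressions):
    """Impressions divided by pageviews."""
    return impressions / pageviews


def flag_low(rows, threshold=0.8):
    """Return countries whose rate is below threshold.

    The threshold is an arbitrary assumption chosen for illustration.
    """
    return sorted(
        country
        for country, (pv, imp) in rows.items()
        if impression_rate(pv, imp) < threshold
    )


if __name__ == "__main__":
    for country, (pv, imp) in sorted(MOBILE_WEB.items()):
        print(country, round(impression_rate(pv, imp), 4))
    print("below threshold:", flag_low(MOBILE_WEB))
```

With these rows, only IE falls below the assumed threshold.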

Hi! It looks like mobile network issues are causing CN to not display banners and/or not report impressions.

Initial digging:

IPv6 vs. IPv4: No correlation to reduced impression rate found.
T152650_2016-12-07_mobile_impressions_ip_version.csv (5 KB)
country param on beacon/impression (comes from GeoIP cookie) vs. country assigned to webrequest in Hive (directly from IP on that request): For all countries, a small number of requests to beacon/impression were from countries other than the one indicated by the GeoIP cookie (which CN uses to geolocate). Ireland was no different from other countries in this regard. It's almost certainly due to people travelling and keeping the GeoIP cookie they got in one location when they visit the site from the second location.
T152650_2016-12-07_mobile_impressions_country_vs_geocoded.csv (7 KB)
browser family/major version/os: There is some relation to client platform: the most recent versions of browsers had a normal impression rate. Rate declined generally (though haphazardly) with browser major version. This isn't what you'd expect if there were some JS bug getting triggered in some browsers but not others. Overall, some browsers are quite a bit worse off than others, though.
T152650_2016-12-07_mobile_pv-impressions_in_IE_by_platform.csv (20 KB)
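
The country comparison described above can be sketched roughly as follows. This is only an illustration in Python: the example URL, the `campaign` parameter and both helper names are hypothetical, and the real check was done with Hive queries rather than per-request code.

```python
from urllib.parse import urlparse, parse_qs


def country_param(url):
    """Return the 'country' query parameter of a URL, or None if absent."""
    values = parse_qs(urlparse(url).query).get("country")
    return values[0] if values else None


def country_mismatch(url, geocoded_country):
    """True when the URL's country param differs from the IP-derived country."""
    param = country_param(url)
    return param is not None and param != geocoded_country


if __name__ == "__main__":
    # Hypothetical request: the GeoIP cookie said IE, but the IP geolocates to GB.
    url = "https://en.wikipedia.org/beacon/impression?country=IE&campaign=example"
    print(country_mismatch(url, "GB"))
```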

So, it looks like browser/major version is, at least in part, a proxy for something else, like age/speed of a device, or speed of mobile network.

I may be making wrong assumptions here (hope not!)... but I noticed that Ireland has significantly less LTE coverage than other countries in the campaign: https://opensignal.com/reports/2016/02/state-of-lte-q4-2015/

Also, relative to other countries in the campaign, it has a higher percentage of the population living in rural areas (which, one would imagine, are the ones that have less coverage).

So, I looked at impression rates by region, to see if more rural regions had lower rates. Unfortunately, our data on regions isn't very good; Maxmind locates most users in region "Unknown". However, the rates from Unknown are staggering, accounting for basically the entire drop in impressions:

Region	Total pageviews	Impression rate
Unknown	1035039	0.297322129890758
Leinster	406375	0.924746847123962
Connaught	23929	0.936980233189853
Ulster	9738	0.93787225302937
Munster	71813	0.938841156893599

We don't know the conditions under which Maxmind can't figure out region based on IP in Ireland. Maybe Unknowns are all mobile networks? Or just all the bad providers? In any case, it's plausible that it would correlate with network quality.

To confirm the hypothesis that mobile network issues are the cause, I'm trying to pull data about network speed by region for mobile devices for all the countries in the campaign, to check for a more general correlation with impression rates.

More soon!!! :)

I tried to get data by region about network quality, but so far it doesn't correlate with impression rate. I used responseStart, which we get via NavigationTiming events. While the average responseStart time is much later in the Unknown region in Ireland, indicating network slowness, the correlation doesn't hold across other regions in countries in the campaign. There are several regions that have similar degrees of mobile slowness without a drop in mobile impressions.

Maybe there are some other network issues that would correlate? RIPE seems to have a lot of data, but I don't really know how to interpret most of it. Maybe DNS slowness would correlate? Fetching a banner does require an extra DNS lookup (since it's retrieved from meta). @BBlack @ema @Krinkle any thoughts?

It might be interesting to see what BannerLoader logs have to say. However, their relationship to pageviews is more complex, since the banner is only fetched when it's expected to be shown (unlike beacon/impression, which calls home even if the banner was hidden).

Also wondering if we could ask Maxmind directly which Irish networks are put in the Unknown bin...?

Thanks!!

Here's the rough data for the above... See the second sheet and the ugly chart there...

T152650_2016-12-07_mobile_pv-impressions_and_rs_by_country_and_subdivision.ods (42 KB)

Here's the ipython notebook with all the queries I used... Including this slightly better chart:

mobile_impression_rates_and_responseStart_by_region.png (433×606 px, 11 KB)

The blue dot in the lower right corner is the "Unknown" region in Ireland. So, really an outlier...

Just checked the data for December 18th. Unknown region in Ireland still had the lowest impression rate of all regions in the campaign.

Here is a pattern that holds in other countries that were in the campaign: of the 10 regions with the worst impression rate, four are Unknown:

pageviews	country_code	subdivision	impressions	ratio
848178	IE	Unknown	328423	0.387210
11021080	US	California	6841676	0.620781
63530	US	Wyoming	52682	0.829246
192477	US	West Virginia	160825	0.835554
27219	CA	Unknown	23519	0.864066
455872	US	Utah	409941	0.899246
1895920	GB	Unknown	1710139	0.902010
930421	US	Missouri	847455	0.910830
1637987	US	Virginia	1503460	0.917871
24184	AU	Unknown	22207	0.918252
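
A short Python sketch of the ranking above, using the ten rows exactly as listed. The helper names are illustrative; the original ranking came from the Hive/ipython queries, not from this code.

```python
# Rows from the table above: (pageviews, country_code, subdivision, impressions)
ROWS = [
    (848178, "IE", "Unknown", 328423),
    (11021080, "US", "California", 6841676),
    (63530, "US", "Wyoming", 52682),
    (192477, "US", "West Virginia", 160825),
    (27219, "CA", "Unknown", 23519),
    (455872, "US", "Utah", 409941),
    (1895920, "GB", "Unknown", 1710139),
    (930421, "US", "Missouri", 847455),
    (1637987, "US", "Virginia", 1503460),
    (24184, "AU", "Unknown", 22207),
]


def worst_regions(rows, n=10):
    """Return the n rows with the lowest impressions/pageviews ratio, worst first."""
    return sorted(rows, key=lambda r: r[3] / r[0])[:n]


def count_unknown(rows):
    """Count rows whose subdivision is 'Unknown'."""
    return sum(1 for r in rows if r[2] == "Unknown")


if __name__ == "__main__":
    worst = worst_regions(ROWS)
    print("Unknown regions among the worst:", count_unknown(worst))
    for pv, country, subdivision, imp in worst:
        print(country, subdivision, round(imp / pv, 6))
```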

Also, the impression rate for California on the 18th is even lower than the already low rate from December 7th (the other day for which we've checked data).

One theory we've talked about is that the low impression rate is from bots that have not been identified as such... Maybe the fact that both Ireland and (I think?) California have a high concentration of data centres supports this? I'd really like to ask Maxmind what happens when they give us an "Unknown" region... Maybe it's from IPs that are closer to some part of the backbone, so not associated with a normal ISP?

@AndyRussG Maybe filling out this form with a "GeoIP data correction request" and your contact data gets you a reply from Maxmind about that.

https://support.maxmind.com/geoip-data-correction-request/

My guess is that these IP blocks don't have any valid SWIP data but are geolocated to a country via some other means like the RIR that "owns" the parent block. MaxMind's data is an interesting mix of public data and things that they have collected from various other sources.

More indications that this is due to bots:

The Unknown region in Ireland also has the lowest rate of unique IPs, that is, the highest rate of repeat pageviews from the same IP.
In this case, the tendency holds across other regions: several other (but not all) regions with low impression rates also have a low rate of unique IPs.

distinct_ips	country_code	subdivision	pageviews	impressions	imp_ratio	dist_ip_ratio
65655	IE	Unknown	848178	327272	0.385853	0.077407
119140	US	Unknown	1285578	1183996	0.920983	0.092674
225527	GB	Unknown	1895920	1703326	0.898417	0.118954
2115296	US	California	11021080	6807169	0.617650	0.191932
112734	IE	Leinster	522659	496033	0.949057	0.215693
183499	US	Alabama	827200	785214	0.949243	0.221831
21484	IE	Munster	93401	89078	0.953716	0.230019
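
The two ratios in the table above can be recomputed with a few lines of Python. In this sketch the cutoffs in `low_on_both` are arbitrary assumptions picked only to illustrate the "low on both measures" pattern; the rows are taken from the table.

```python
# Rows from the table above:
# (distinct_ips, country_code, subdivision, pageviews, impressions)
ROWS = [
    (65655, "IE", "Unknown", 848178, 327272),
    (119140, "US", "Unknown", 1285578, 1183996),
    (225527, "GB", "Unknown", 1895920, 1703326),
    (2115296, "US", "California", 11021080, 6807169),
    (112734, "IE", "Leinster", 522659, 496033),
    (183499, "US", "Alabama", 827200, 785214),
    (21484, "IE", "Munster", 93401, 89078),
]


def ratios(row):
    """Return (imp_ratio, dist_ip_ratio) for one row."""
    distinct_ips, _, _, pageviews, impressions = row
    return impressions / pageviews, distinct_ips / pageviews


def low_on_both(rows, imp_cutoff=0.8, ip_cutoff=0.2):
    """Regions below both cutoffs (cutoff values are arbitrary assumptions)."""
    found = []
    for row in rows:
        imp_ratio, ip_ratio = ratios(row)
        if imp_ratio < imp_cutoff and ip_ratio < ip_cutoff:
            found.append((row[1], row[2]))
    return found


if __name__ == "__main__":
    print(low_on_both(ROWS))
```

With these cutoffs, IE Unknown and US California are low on both measures, while US Unknown and GB Unknown have few unique IPs but a normal impression rate.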

mobile_impression_rates_and_distinct_ip_rates.png (438×616 px, 14 KB)

Looking at the UAs for the IPs with the highest number of pageviews in these regions, I don't see direct declarations of bots. So, maybe bots intentionally trying to pass as real users?

Found it!!!

The lost mobile impressions in Ireland are from a little over 400 IP addresses belonging to a single, high-profile internet company. Maxmind sets Unknown region for all these addresses. If we discount requests from these addresses, the impression rate for the region is normal.

Ah, so is this a thing to report to Maxmind as a bug / request for correction via that form after all? Or is Maxmind doing it on purpose for this ISP?

In T152650#2913475, @Dzahn wrote:

Ah, so is this a thing to report to Maxmind as a bug / request for correction via that form after all? Or is Maxmind doing it on purpose for this ISP?

Hi! It's not an ISP; it's more like a datacenter or office of a company so big that it's its own ISP. (Erring on the side of caution, I would just like to check before making public the name of the company, since this is info gleaned from IP addresses, which are very private, and I'm not yet positive it's a bot/spider, rather than actual humans accessing the site.)

AndyRussG mentioned this in T154560: Spike: CentralNotice: Consider how to disable CN banners, or at least Fundraising banners, on FB's "Free Basics" service. · Jan 4 2017, 3:24 AM

Found more details on this...

This is a proxy; the IP addresses are not those of readers per se. So, it should be fine to discuss here.
This is coming from Facebook, from proxy servers for its Free Basics (formerly Internet.org) service.
The proxy is correctly detected and set in our X-Analytics header's "proxy" field, which makes it all the way into Hive.
Querying impressions, proxy and region in Ireland for mobile web views in the English FR campaign on December 18th gives us:

pageviews	impressions	subdivision	proxy	imp_ratio
30529	29688	Connaught		0.972452
8	0	Connaught	Opera	0.000000
522266	496031	Leinster		0.949767
393	2	Leinster	Opera	0.005089
93285	89078	Munster		0.954902
116	0	Munster	Opera	0.000000
13623	12775	Ulster		0.937752
4	0	Ulster	Opera	0.000000
347811	327272	Unknown		0.940948
499263	0	Unknown	IORG	0.000000
1104	0	Unknown	Opera	0.000000

(Here are the Hive query P4700 and the ipython code P4702 used to get this data.)
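
Separately from those pastes, here is a simplified Python sketch over the rows of the table above. It shows how the impression rate for a subdivision changes once proxied requests are left out. The function and its `include_proxied` flag are illustrative and are not taken from P4700 or P4702.

```python
# Rows from the table above: (pageviews, impressions, subdivision, proxy)
# An empty proxy string means no proxy was recorded for those requests.
ROWS = [
    (30529, 29688, "Connaught", ""),
    (8, 0, "Connaught", "Opera"),
    (522266, 496031, "Leinster", ""),
    (393, 2, "Leinster", "Opera"),
    (93285, 89078, "Munster", ""),
    (116, 0, "Munster", "Opera"),
    (13623, 12775, "Ulster", ""),
    (4, 0, "Ulster", "Opera"),
    (347811, 327272, "Unknown", ""),
    (499263, 0, "Unknown", "IORG"),
    (1104, 0, "Unknown", "Opera"),
]


def impression_rate(rows, subdivision, include_proxied=True):
    """Impressions/pageviews for one subdivision, with or without proxied rows."""
    selected = [
        r for r in rows
        if r[2] == subdivision and (include_proxied or r[3] == "")
    ]
    pageviews = sum(r[0] for r in selected)
    impressions = sum(r[1] for r in selected)
    return impressions / pageviews


if __name__ == "__main__":
    print("Unknown, all requests:", round(impression_rate(ROWS, "Unknown"), 6))
    print("Unknown, no proxy:", round(impression_rate(ROWS, "Unknown", False), 6))
```

The three Unknown rows add up to the 848178 pageviews seen for IE Unknown in the earlier December 18th tables.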

I think this can now be considered fully solved! I'll close the task shortly. I haven't described all the meandering query roads I took to get here... but I'll write up, on mediawiki.org, some querying approaches I used, just in case they're useful elsewhere. I'll link here.

Also, as discussed with fr-tech, it seems we might wish to disable CN banners, or at least fundraising banners, for this proxy. Here's a task to consider that: T154560.

Finally, a bit of follow-up: the original users' IP addresses do appear in the X-Forwarded-For header. However, we are geolocating users at the proxy server location, and setting the client_ip field in the wmf.webrequest Hive table to the proxy's IP. Should probably check if that's how it's supposed to work...

Thanks so much, all!!!! :D

AndyRussG changed the point value for this task from 2 to 4. · Jan 4 2017, 3:37 AM

AndyRussG moved this task from Doing to Done on the Fundraising Sprint Waiting for Godot board. · Jan 4 2017, 3:47 AM

Looking at data from BannerLoader requests, it seems that we may not be showing banners at all over this proxy (same slice of time/place/requests as before):

loaderrequests	proxy	subdivision
13101		Connaught
221309		Leinster
8	Opera	Leinster
39091		Munster
16	Opera	Munster
5809		Ulster
2	Opera	Ulster
155122		Unknown
8	Opera	Unknown
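
As a cross-check of the two tables, this Python sketch sums pageviews and BannerLoader requests per proxy and lists proxies that have pageviews but no loader requests at all. The row lists are copied from the tables in this task (with subdivisions dropped); the helper names are illustrative.

```python
from collections import Counter

# (proxy, pageviews) from the earlier impressions table; "" means no proxy.
PAGEVIEW_ROWS = [
    ("", 30529), ("Opera", 8), ("", 522266), ("Opera", 393),
    ("", 93285), ("Opera", 116), ("", 13623), ("Opera", 4),
    ("", 347811), ("IORG", 499263), ("Opera", 1104),
]

# (proxy, loaderrequests) from the BannerLoader table above.
LOADER_ROWS = [
    ("", 13101), ("", 221309), ("Opera", 8), ("", 39091), ("Opera", 16),
    ("", 5809), ("Opera", 2), ("", 155122), ("Opera", 8),
]


def totals(rows):
    """Sum counts per proxy."""
    counts = Counter()
    for proxy, n in rows:
        counts[proxy] += n
    return counts


def proxies_without_loader_requests(pageview_rows, loader_rows):
    """Proxies that have pageviews but no BannerLoader requests."""
    pageviews = totals(pageview_rows)
    loader = totals(loader_rows)
    return sorted(p for p, n in pageviews.items() if n > 0 and loader[p] == 0)


if __name__ == "__main__":
    print(proxies_without_loader_requests(PAGEVIEW_ROWS, LOADER_ROWS))
```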

In T152650#2915776, @AndyRussG wrote:

Finally, a bit of follow-up: the original users' IP addresses do appear in the X-Forwarded-For header. However, we are geolocating users at the proxy server location, and setting the client_ip field in the wmf.webrequest Hive table to the proxy's IP. Should probably check if that's how it's supposed to work...

We only trust XFF headers that have been appended by a "trusted" server. There is nothing stopping a given HTTP request from including a random XFF header (dork aside: I used to have my Firefox profile configured to always send an XFF header that claimed I was browsing from the NSA's IP space). There is a configuration file in operations/mediawiki-config.git that tracks the trusted XFF senders. See https://meta.wikimedia.org/wiki/XFF_project and https://phabricator.wikimedia.org/diffusion/ETXF/browse/master/trusted-hosts.txt
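
As a generic illustration of that idea (not the actual Wikimedia implementation), here is a minimal Python sketch of resolving a client IP from X-Forwarded-For while only believing entries appended by trusted hosts. The `TRUSTED_PROXIES` range is a hypothetical placeholder from the documentation address space, and the function names are made up for this example.

```python
import ipaddress

# Hypothetical trusted proxy range (documentation address space), standing in
# for whatever a real trusted-hosts list would contain.
TRUSTED_PROXIES = [ipaddress.ip_network("192.0.2.0/24")]


def is_trusted(ip, trusted=TRUSTED_PROXIES):
    """True if ip parses as an address inside one of the trusted networks."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False
    return any(addr in net for net in trusted)


def client_ip(remote_addr, xff_header, trusted=TRUSTED_PROXIES):
    """Walk the XFF chain from the nearest hop outwards.

    An XFF entry is only believed if the hop that appended it is trusted;
    the walk stops at the first untrusted or malformed address.
    """
    hops = [h.strip() for h in xff_header.split(",") if h.strip()]
    current = remote_addr
    while hops and is_trusted(current, trusted):
        candidate = hops.pop()
        try:
            ipaddress.ip_address(candidate)
        except ValueError:
            break  # malformed entry: keep the last address we could verify
        current = candidate
    return current


if __name__ == "__main__":
    # Untrusted peer: its XFF claim is ignored.
    print(client_ip("198.51.100.7", "203.0.113.9"))
    # Trusted proxy: the address it appended is used.
    print(client_ip("192.0.2.10", "203.0.113.9"))
```

Under this scheme, a proxy that is not on the trusted list is itself treated as the client, which is consistent with the proxy's IP ending up in client_ip as described above.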

• DStrine added a project: Fundraising Sprint Autotune Earphones. · Jan 5 2017, 11:47 PM

• DStrine moved this task from Completed in Q2 1617 to Current Sprint and Completed in Q3 2016-17 on the Fundraising-Backlog board. · Jan 6 2017, 10:09 PM

Ejegg moved this task from Backlog to Done on the Fundraising Sprint Autotune Earphones board. · Jan 12 2017, 1:41 AM

• DStrine closed this task as Resolved. · Jan 17 2017, 9:09 PM

• mmodell removed a subscriber: awight. · Jun 22 2017, 9:34 PM

AndyRussG mentioned this in T238560: Doubts and questions about Kerberos and Hadoop. · Nov 21 2019, 8:07 PM

	F5153382: mobile_impression_rates_and_distinct_ip_rates.png
	Dec 25 2016, 3:51 AM

	F5140270: mobile_impression_rates_and_responseStart_by_region.png
	Dec 23 2016, 4:19 AM

	F5134641: T152650_2016-12-07_mobile_pv-impressions_and_rs_by_country_and_subdivision.ods
	Dec 22 2016, 6:19 PM

	F5115104: T152650_2016-12-07_mobile_pv-impressions_in_IE_by_platform.csv
	Dec 21 2016, 5:02 AM

	F5111963: T152650_2016-12-07_mobile_impressions_ip_version.csv
	Dec 21 2016, 2:15 AM

	F5111959: T152650_2016-12-07_mobile_impressions_country_vs_geocoded.csv
	Dec 21 2016, 2:15 AM
