Spike: Impressions abnormally low for Ireland
Closed, Resolved · Public · 4 Estimated Story Points

Description

Mobile impressions / pageviews for Ireland are hovering at 70%, which is unusually low. We should investigate.

Event Timeline

Just some random ideas about stuff to check... Maybe look for a lack of correspondence between the country param from the beacon/impression URL and the country in geocoded_data in the Hive webrequest entry? See if other geotargeted and non-geotargeted campaigns have had the same mobile impression/pageview rate? Also, any significant differences in browser versions used there? Difference in IP versions? (Sorry for going on and on, hope this is useful ;p )

AndyRussG moved this task from Backlog to Doing on the Fundraising Sprint Waiting for Godot board.
AndyRussG set the point value for this task to 2.

I pulled the data for Dec. 7, and indeed, Ireland mobile is abnormally low... just above 50%.

The following data is not detailed; I used the wmf.pageviews_hourly table in Hive, so I couldn't filter on all the factors that cause CN not to run. (I think we can get more precise data from wmf.webrequest.) However, it shows that there's definitely something going on with Ireland mobile.

| Date | access_method | country_code | pageviews | impressions | rate |
| --- | --- | --- | --- | --- | --- |
| 2016-12-07 | desktop | AU | 3602964 | 3035589 | 0.842525487348749 |
| 2016-12-07 | desktop | CA | 8862933 | 6816711 | 0.769125863864705 |
| 2016-12-07 | desktop | GB | 13133650 | 11750591 | 0.894693478202937 |
| 2016-12-07 | desktop | IE | 939310 | 773964 | 0.823970787067102 |
| 2016-12-07 | desktop | NZ | 677186 | 611892 | 0.903580404792775 |
| 2016-12-07 | desktop | US | 60213188 | 48814949 | 0.810701951207101 |
| 2016-12-07 | mobile web | AU | 3080791 | 2829559 | 0.918452111811545 |
| 2016-12-07 | mobile web | CA | 4568113 | 4330745 | 0.948038062981367 |
| 2016-12-07 | mobile web | GB | 11638393 | 10696818 | 0.919097507705746 |
| 2016-12-07 | mobile web | IE | 1546894 | 780968 | 0.504862000886939 |
| 2016-12-07 | mobile web | NZ | 480079 | 441832 | 0.920331862047705 |
| 2016-12-07 | mobile web | US | 52453939 | 47560001 | 0.906700276598865 |
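
For reference, the rate column is just impressions divided by pageviews. Here is a minimal Python sketch of that arithmetic using the mobile web rows from the table above. The function names and the 0.8 cutoff are illustrative assumptions, not part of the original Hive query.

```python
# Mobile web rows copied from the table above: country -> (pageviews, impressions).
MOBILE_WEB = {
    "AU": (3080791, 2829559),
    "CA": (4568113, 4330745),
    "GB": (11638393, 10696818),
    "IE": (1546894, 780968),
    "NZ": (480079, 441832),
    "US": (52453939, 47560001),
}


def impression_rate(pageviews, impressions):
    """Impressions per pageview; 0.0 when there are no pageviews."""
    return impressions / pageviews if pageviews else 0.0


def low_rate_countries(rows, threshold=0.8):
    """Return the countries whose rate falls below an (arbitrary) threshold."""
    return sorted(
        country
        for country, (pageviews, impressions) in rows.items()
        if impression_rate(pageviews, impressions) < threshold
    )


rates = {c: impression_rate(pv, imp) for c, (pv, imp) in MOBILE_WEB.items()}
print(low_rate_countries(MOBILE_WEB))
```

With these rows, only IE falls below the cutoff.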

Hi! It looks like mobile network issues are causing CN to not display banners and/or not report impressions.

Initial digging:

  • IPv6 vs. IPv4: No correlation to reduced impression rate found.
  • country param on beacon/impression (comes from GeoIP cookie) vs. country assigned to webrequest in Hive (directly from IP on that request): For all countries, a small number of requests to beacon/impression were from countries other than the one indicated by the GeoIP cookie (which CN uses to geolocate). Ireland was no different from other countries in this regard. It's almost certainly due to people travelling and keeping the GeoIP cookie they got in one location when they visit the site from the second location.
  • browser family/major version/os: There is some relation to client platform: the most recent versions of browsers had a normal impression rate. Rate declined generally (though haphazardly) with browser major version. This isn't what you'd expect if there were some JS bug getting triggered in some browsers but not others. Overall, some browsers are quite a bit worse off than others, though.
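
The second check above (country param vs. geocoded country) could be sketched roughly like this in Python. This is only an illustration: the URL shape, the function names and the input format (pairs of URL and geocoded country) are assumptions, and the actual comparison was done in Hive.

```python
from urllib.parse import urlparse, parse_qs


def cookie_country(beacon_url):
    """Extract the `country` query parameter from a beacon/impression URL."""
    params = parse_qs(urlparse(beacon_url).query)
    values = params.get("country")
    return values[0] if values else None


def mismatch_rate(requests):
    """Fraction of requests whose URL country differs from the geocoded one.

    `requests` is an iterable of (beacon_url, geocoded_country) pairs.
    Requests without a country parameter are skipped.
    """
    compared = mismatched = 0
    for url, geocoded in requests:
        country = cookie_country(url)
        if country is None:
            continue
        compared += 1
        if country != geocoded:
            mismatched += 1
    return mismatched / compared if compared else 0.0
```

Comparing this rate per country is what shows whether Ireland stands out; in the data above it did not.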

So, it looks like browser/major version is, at least in part, a proxy for something else, like age/speed of a device, or speed of mobile network.

I may be making wrong assumptions here (hope not!)... but I noticed that Ireland has significantly less LTE coverage than other countries in the campaign: https://opensignal.com/reports/2016/02/state-of-lte-q4-2015/

Also, relative to other countries in the campaign, it has a higher percentage of the population living in rural areas (which, one would imagine, are the ones that have less coverage).

So, I looked at impression rates by region, to see if more rural regions had lower rates. Unfortunately, our data on regions isn't very good; Maxmind locates most users in region "Unknown". However, the rates from Unknown are staggering, accounting for basically the entire drop in impressions:

| Region | Total pageviews | Impression rate |
| --- | --- | --- |
| Unknown | 1035039 | 0.297322129890758 |
| Leinster | 406375 | 0.924746847123962 |
| Connaught | 23929 | 0.936980233189853 |
| Ulster | 9738 | 0.93787225302937 |
| Munster | 71813 | 0.938841156893599 |
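
To see how much of the drop the Unknown region accounts for, one can compute the pageview-weighted rate with and without it. A quick Python sketch using the figures from the table above (the function name is just for illustration):

```python
# Region -> (total pageviews, impression rate), copied from the table above.
REGIONS = {
    "Unknown": (1035039, 0.297322129890758),
    "Leinster": (406375, 0.924746847123962),
    "Connaught": (23929, 0.936980233189853),
    "Ulster": (9738, 0.93787225302937),
    "Munster": (71813, 0.938841156893599),
}


def weighted_rate(regions, exclude=()):
    """Pageview-weighted impression rate over all regions not in `exclude`."""
    kept = [(pv, rate) for name, (pv, rate) in regions.items() if name not in exclude]
    total_pageviews = sum(pv for pv, _ in kept)
    if total_pageviews == 0:
        return 0.0
    return sum(pv * rate for pv, rate in kept) / total_pageviews


overall = weighted_rate(REGIONS)                             # roughly 0.51
without_unknown = weighted_rate(REGIONS, exclude={"Unknown"})  # roughly 0.93
print(round(overall, 2), round(without_unknown, 2))
```

Leaving out Unknown brings the Irish mobile rate back in line with the other countries in the campaign.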

We don't know the conditions under which Maxmind can't figure out region based on IP in Ireland. Maybe Unknowns are all mobile networks? Or just all the bad providers? In any case, it's plausible that it would correlate with network quality.

To confirm the hypothesis that mobile network issues are the cause, I'm trying to pull data about network speed by region for mobile devices for all the countries in the campaign, to check for a more general correlation with impression rates.

More soon!!! :)

I tried to get data by region about network quality, but so far it doesn't correlate with impression rate. I used responseStart, which we get via NavigationTiming events. While the average responseStart time is much later in the Unknown region in Ireland, indicating network slowness, the correlation doesn't hold across other regions in countries in the campaign. There are several regions that have similar degrees of mobile slowness without a drop in mobile impressions.

Maybe there are some other network issues that would correlate? Ripe seems to have a lot of data, but I don't properly know how to interpret most of it. Maybe DNS slowness would correlate? Fetching a banner does require an extra DNS lookup (since it's retrieved from meta). @BBlack @ema @Krinkle any thoughts?

It might be interesting to see what BannerLoader logs have to say. However, their relationship to pageviews is more complex, since the banner is only fetched when it's expected to be shown (unlike beacon/impression, which calls home even if the banner was hidden).

Also wondering if we could ask Maxmind directly which Irish networks are put in the Unknown bin...?

Thanks!!

Here's the rough data for the above... See the second sheet and the ugly chart there...

Here's the ipython notebook with all the queries I used... Including this slightly better chart:

mobile_impression_rates_and_responseStart_by_region.png (433×606 px, 11 KB)

The blue dot in the lower right corner is the "Unknown" region in Ireland. So, really an outlier...

Just checked the data for December 18th. Unknown region in Ireland still had the lowest impression rate of all regions in the campaign.

Here is a pattern that holds in other countries that were in the campaign: of the 10 regions with the worst impression rate, four are Unknown:

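| pageviews | country_code | subdivision | impressions | ratio |
| --- | --- | --- | --- | --- |
| 848178 | IE | Unknown | 328423 | 0.387210 |
| 11021080 | US | California | 6841676 | 0.620781 |
| 63530 | US | Wyoming | 52682 | 0.829246 |
| 192477 | US | West Virginia | 160825 | 0.835554 |
| 27219 | CA | Unknown | 23519 | 0.864066 |
| 455872 | US | Utah | 409941 | 0.899246 |
| 1895920 | GB | Unknown | 1710139 | 0.902010 |
| 930421 | US | Missouri | 847455 | 0.910830 |
| 1637987 | US | Virginia | 1503460 | 0.917871 |
| 24184 | AU | Unknown | 22207 | 0.918252 |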

Also, the impression rate for California on the 18th is even lower than the already low rate from December 7th (the other day we've checked data for).

One theory we've talked about is that the low impression rate is from bots that have not been identified as such... Maybe the fact that both Ireland and (I think?) California have a high concentration of data centres supports this? I'd really like to ask Maxmind what happens when they give us an "Unknown" region... Maybe it's from IPs that are closer to some part of the backbone, so not associated with a normal ISP?

@AndyRussG Maybe filling out this form with a "GeoIP data correction request" and your contact data gets you a reply from Maxmind about that.

https://support.maxmind.com/geoip-data-correction-request/

My guess is that these IP blocks don't have any valid SWIP data but are geolocated to a country via some other means like the RIR that "owns" the parent block. MaxMind's data is an interesting mix of public data and things that they have collected from various other sources.

More indications that this is due to bots:

  • The Unknown region in Ireland also has the lowest rate of unique IPs, that is, the highest rate of repeat pageviews from the same IP.
  • In this case, there is a tendency across other regions with low impression rates. Several other (but not all) regions with low impression rates also have a low rate of unique IPs.
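| distinct_ips | country_code | subdivision | pageviews | impressions | imp_ratio | dist_ip_ratio |
| --- | --- | --- | --- | --- | --- | --- |
| 65655 | IE | Unknown | 848178 | 327272 | 0.385853 | 0.077407 |
| 119140 | US | Unknown | 1285578 | 1183996 | 0.920983 | 0.092674 |
| 225527 | GB | Unknown | 1895920 | 1703326 | 0.898417 | 0.118954 |
| 2115296 | US | California | 11021080 | 6807169 | 0.617650 | 0.191932 |
| 112734 | IE | Leinster | 522659 | 496033 | 0.949057 | 0.215693 |
| 183499 | US | Alabama | 827200 | 785214 | 0.949243 | 0.221831 |
| 21484 | IE | Munster | 93401 | 89078 | 0.953716 | 0.230019 |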
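
In the table above, dist_ip_ratio is distinct IPs divided by pageviews, so a low value means many repeat pageviews from the same addresses. A small Python sketch of that calculation on a few of the rows (the helper name is an assumption for illustration):

```python
# (country_code, subdivision, distinct_ips, pageviews), copied from the table above.
ROWS = [
    ("IE", "Unknown", 65655, 848178),
    ("US", "Unknown", 119140, 1285578),
    ("GB", "Unknown", 225527, 1895920),
    ("US", "California", 2115296, 11021080),
    ("IE", "Leinster", 112734, 522659),
]


def distinct_ip_ratios(rows):
    """Return (country, subdivision, ratio) tuples, lowest ratio first."""
    ratios = [
        (country, subdivision, distinct_ips / pageviews)
        for country, subdivision, distinct_ips, pageviews in rows
        if pageviews
    ]
    return sorted(ratios, key=lambda item: item[2])


ranked = distinct_ip_ratios(ROWS)
print(ranked[0])  # the region with the most repeat pageviews per IP
```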

mobile_impression_rates_and_distinct_ip_rates.png (438×616 px, 14 KB)

  • Looking at the UAs for the IPs with the highest number of pageviews in these regions, I don't see direct declarations of bots. So, maybe bots intentionally trying to pass as real users?

Found it!!!

The lost mobile impressions in Ireland are from a little over 400 IP addresses belonging to a single, high-profile internet company. Maxmind sets Unknown region for all these addresses. If we discount requests from these addresses, the impression rate for the region is normal.

Ah, so is this a thing to report to Maxmind as a bug / request for correction via that form after all? Or is Maxmind doing it on purpose for this ISP?

Hi! It's not an ISP, it's more like from a datacenter or office of a company so big that it's its own ISP. (Erring on the side of caution, I just would like to check before making public the name of the company, since this is info gleaned from IP addresses, which are very private, and I'm not yet positive it's a bot/spider, rather than actual humans accessing the site.)

Found more details on this...

  • This is a proxy; the IP addresses are not those of readers per se. So, it should be fine to discuss here.
  • This is coming from Facebook, from proxy servers for its Free Basics (formerly Internet.org) service.
  • The proxy is correctly detected and set in our X-Analytics header's "proxy" field, which makes it all the way into Hive.
  • Querying impressions, proxy and region in Ireland for mobile web views in the English FR campaign on December 18th gives us:
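| pageviews | impressions | subdivision | proxy | imp_ratio |
| --- | --- | --- | --- | --- |
| 30529 | 29688 | Connaught | | 0.972452 |
| 8 | 0 | Connaught | Opera | 0.000000 |
| 522266 | 496031 | Leinster | | 0.949767 |
| 393 | 2 | Leinster | Opera | 0.005089 |
| 93285 | 89078 | Munster | | 0.954902 |
| 116 | 0 | Munster | Opera | 0.000000 |
| 13623 | 12775 | Ulster | | 0.937752 |
| 4 | 0 | Ulster | Opera | 0.000000 |
| 347811 | 327272 | Unknown | | 0.940948 |
| 499263 | 0 | Unknown | IORG | 0.000000 |
| 1104 | 0 | Unknown | Opera | 0.000000 |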

(Here are the Hive query P4700 and the ipython code P4702 used to get this data.)
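
As a sanity check on the numbers above, here is a short Python sketch (not the code from P4702; the names are illustrative) that splits the Unknown region's rows by proxy and compares the rate with and without proxied requests:

```python
# Rows for the Unknown region from the table above: (proxy, pageviews, impressions).
# An empty proxy string means no proxy was detected.
UNKNOWN_ROWS = [
    ("", 347811, 327272),
    ("IORG", 499263, 0),
    ("Opera", 1104, 0),
]


def imp_ratio(rows, include_proxied=True):
    """Impressions / pageviews, optionally ignoring proxied requests."""
    kept = [row for row in rows if include_proxied or row[0] == ""]
    pageviews = sum(pv for _, pv, _ in kept)
    impressions = sum(imp for _, _, imp in kept)
    return impressions / pageviews if pageviews else 0.0


combined = imp_ratio(UNKNOWN_ROWS)                            # about 0.386
direct_only = imp_ratio(UNKNOWN_ROWS, include_proxied=False)  # about 0.941
print(round(combined, 6), round(direct_only, 6))
```

The combined figure reproduces the 0.385853 seen earlier for IE/Unknown (848178 pageviews in total), while the non-proxied requests alone come out at a normal rate.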

I think this can be considered now fully solved! I'll close the task shortly. I haven't described all the meandering query roads I took to get here... but I'll write up, on mediawiki.org, some querying approaches I used, just in case they're useful elsewhere. I'll link here.

Also, as discussed with fr-tech, it seems we might wish to disable CN banners, or at least fundraising banners, for this proxy. Here's a task to consider that: T154560.

Finally, a bit of follow-up: the original users' IP addresses do appear in the X-Forwarded-For header. However, we are geolocating users at the proxy server location, and setting the client_ip field in the wmf.webrequest Hive table to the proxy's IP. Should probably check if that's how it's supposed to work...

Thanks so much, all!!!! :D

AndyRussG changed the point value for this task from 2 to 4. (Jan 4 2017, 3:37 AM)

Looking at data from BannerLoader requests, it seems that we may not be showing banners at all over this proxy (same slice of time/place/requests as before):

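| loaderrequests | proxy | subdivision |
| --- | --- | --- |
| 13101 | | Connaught |
| 221309 | | Leinster |
| 8 | Opera | Leinster |
| 39091 | | Munster |
| 16 | Opera | Munster |
| 5809 | | Ulster |
| 2 | Opera | Ulster |
| 155122 | | Unknown |
| 8 | Opera | Unknown |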

Finally, a bit of follow-up: the original users' IP addresses do appear in the X-Forwarded-For header. However, we are geolocating users at the proxy server location, and setting the client_ip field in the wmf.webrequest Hive table to the proxy's IP. Should probably check if that's how it's supposed to work...

We only trust XFF headers that have been appended by a "trusted" server. There is nothing stopping a given HTTP request from including a random XFF header (dork aside: I used to have my Firefox profile configured to always send an XFF header that claimed I was browsing from the NSA's IP space). There is a configuration file in operations/mediawiki-config.git that tracks the trusted XFF senders. See https://meta.wikimedia.org/wiki/XFF_project and https://phabricator.wikimedia.org/diffusion/ETXF/browse/master/trusted-hosts.txt
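
To illustrate the general idea (this is a generic sketch, not the actual Wikimedia implementation), trusted-XFF resolution usually walks the header from the nearest hop outward and stops at the first address that was not reported by a trusted proxy. The function name is hypothetical and the networks used in the example are documentation ranges, not real trusted hosts.

```python
import ipaddress


def client_ip(remote_addr, xff_header, trusted_networks):
    """Resolve the client address, honouring XFF entries only from trusted hops.

    remote_addr: address of the peer that connected to us.
    xff_header: raw X-Forwarded-For value ("client, proxy1, proxy2").
    trusted_networks: iterable of CIDR strings for trusted proxies.
    """
    trusted = [ipaddress.ip_network(net) for net in trusted_networks]

    def is_trusted(addr):
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            return False
        return any(ip in net for net in trusted)

    hops = [hop.strip() for hop in xff_header.split(",") if hop.strip()]
    current = remote_addr
    # Each XFF entry was appended by the hop to its right, so an entry is
    # believed only if the hop that reported it is itself trusted.
    for hop in reversed(hops):
        if not is_trusted(current):
            break
        try:
            ipaddress.ip_address(hop)
        except ValueError:
            break  # malformed entry: keep the last valid address
        current = hop
    return current


TRUSTED = ["192.0.2.0/24"]  # hypothetical trusted proxy range
print(client_ip("192.0.2.10", "203.0.113.7", TRUSTED))
```

A request arriving directly from an untrusted address keeps that address no matter what XFF header it sends, which is why a spoofed header like the one in the Firefox anecdote has no effect.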