
Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country
Closed, ResolvedPublic

Assigned To
None
Authored By
Tbayer
Feb 12 2018, 4:50 AM
Referenced Files
F17046846: Screen Shot 2018-04-17 at 2.08.57 PM.png
Apr 17 2018, 9:12 PM
F17025595: pageviews-hourly-opera-mini-xcip.png
Apr 17 2018, 9:16 AM
F16746024: image.png
Apr 6 2018, 9:27 AM
F16721571: Screen Shot 2018-04-05 at 12.01.07 PM.png
Apr 5 2018, 7:02 PM
F16641812: Screen Shot 2018-04-03 at 1.53.17 PM.png
Apr 3 2018, 9:04 PM
Tokens
"Love" token, awarded by atgo.
"Love" token, awarded by elukey.

Description

Between 20:00 and 21:00 UTC on February 6, Opera Mini pageviews in Nigeria dropped to almost zero, and have not recovered since:

Pageviews Opera Mini Nigeria Mobile Web 2018-01-13..2018-02-11 (Pivot screenshot).png (854×1 px, 206 KB)

Other countries, browsers (apart from Opera Mobile) and access methods do not seem to be affected. That is, desktop and app traffic in Nigeria, and Opera Mini traffic globally, do not appear to have dropped substantially.

This ticket did not have an analytics tag until the 3rd of April, although two members of the Analytics Engineering team have been CCed since February 12.

Event Timeline

Tbayer renamed this task from Investigate drop in Opera Mini traffic in Nigeria in February 6 to Investigate drop in Opera Mini traffic in Nigeria on February 6.Feb 12 2018, 4:53 AM

Actually it looks like the Nigerian Opera Mini traffic was just rerouted (or reclassified in a Maxmind update), first to an "Unknown" geolocation, then (since yesterday, February 11), the US. Also from several countries besides Nigeria, including Indonesia and (although not to the same extent) India:

Pageviews Opera Mini by country Mobile Web 2018-01-13..2018-02-11 (Pivot screenshot).png (849×1 px, 184 KB)

Source: Pivot

Actually it looks like the Nigerian Opera Mini traffic was just rerouted (or reclassified in a Maxmind update), first to an "Unknown" geolocation, then (since yesterday, February 11), the US.

@Ottomata, @Milimetric I suppose we didn't roll out Maxmind updates at either of these two times (February 6 between 20:00 and 21:00 UTC, and February 11 ca. between 1-3am UTC)?

The Maxmind geoipupdate downloader script runs on the production puppetmaster once a week. The last 2 run logs have:

Sun Feb  4 03:30:01 UTC 2018: geoipupdate downloading MaxMind .dat files into /var/lib/puppet/volatile/GeoIP
Sun Feb 11 03:30:01 UTC 2018: geoipupdate downloading MaxMind .dat files into /var/lib/puppet/volatile/GeoIP

The GeoIP database files all have timestamps of Feb 11 03:30. Puppet should sync these files to client hosts within around 30 mins.

I don't know if this means the files actually changed during the download though. I think @Milimetric has a git repo of all past GeoIP dbs? If so, we could compare them to know for sure.

Yeah, /home/milimetric/GeoIP-toolbox/MaxMind-database/GeoIP is a git repository with all the backups of the GeoIP dbs that we kept over the years (once a week usually except when the scripts don't work)

Any chance this is related to work on T138505 ?

@Tbayer I'm working with @DFoy on getting a contact at Opera for this issue. If we can get updates from them, can we repair the date/update things on our side retroactively? Is there anything we can/should do to retain this data for fixes? I'm not totally familiar with what gets destroyed after 90 days, and I'm worried that we'll lose our ability to do YOY meaningfully going forward for the affected countries.

@Tbayer I'm working with @DFoy on getting a contact at Opera for this issue. If we can get updates from them, can we repair the date/update things on our side retroactively? Is there anything we can/should do to retain this data for fixes? I'm not totally familiar with what gets destroyed after 90 days, and I'm worried that we'll lose our ability to do YOY meaningfully going forward for the affected countries.

That depends on what kind of information we may get from them. If they supply us with the exact new IP ranges that Opera mini traffic from (say) Nigeria is now routed through, it may be possible, with some extra work, to do a one-time calculation of the corrected numbers since February. This would still not fix the data you see in Pivot, Superset etc. though - for that, the Analytics Engineering team would need to re-generate the pageview_hourly table and other data based on it (projectview_hourly, data in Druid...).

(The above is assuming that Maxmind's geolocation information was correct until February 6. Either way, this whole issue ultimately looks like an upstream bug with Maxmind. In that case, these retroactive corrections might be easier after they fix it, or perhaps they already have.)

atgo renamed this task from Investigate drop in Opera Mini traffic in Nigeria on February 6 to Opera mini IP addresses reassigned.Apr 2 2018, 5:51 PM

Mmm, I think Opera Mini traffic has shifted all around. See:

Screen Shot 2018-04-03 at 1.53.17 PM.png (1×2 px, 554 KB)

There are spikes in US traffic (way up) and South Africa traffic (up), and a drop in Nigeria. It started on February 6th, and changes are visible until Feb 10th/11th. Is Opera Mini moving their CDNs or dropping their X-Forwarded-For proxies? Either that or MaxMind geolocation of all Opera Mini addresses has gone haywire (this second option seems less likely: geolocation broken only for Opera Mini?).

Thanks @Nuria - this actually matches what had been discussed further above in the task. We already have an action item.

If we can get updates from them, can we repair the date/update things on our side retroactively? I

Not likely, as we would need the original requests they received, and I doubt they would give us that. What I think is happening is that Opera, instead of passing along the X-Forwarded-For of the user's device, is somehow sending the IP address of their CDN/proxy as the requestor address, so you end up with pageviews that really originate in Nigeria being "artificially" labelled (correctly) by MaxMind as South Africa. A theory.

Another theory is that something has changed on our end regarding X_forwarded_for headers.

If we can get updates from them, can we repair the date/update things on our side retroactively? I

Another theory is that something has changed on our end regarding X_forwarded_for headers.

@Nuria is that something you could check on? As I mentioned before, this is really impacting our ability to look at YOY readership in these countries, which creates problems in how we approach work in these countries.

@atgo I have checked data for January and March for Opera, and I just see us receiving IP addresses of Opera's proxy endpoints instead of IP addresses of users, so it seems to me this is an issue on Opera's end. To be clear, I think @Tbayer's assertion above that this is a MaxMind issue is incorrect, rather it is an issue with Opera's traffic: they are sending us the wrong IPs. The range 107.167.* appears all over March data but is not present in January data (before this issue started).
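The January-versus-March comparison described above can be sketched roughly as follows. This is a minimal illustration, not the query that was actually run: the function name `new_prefixes`, the /16 grouping, and the sample addresses in the usage note are all assumptions made for the example.

```python
import ipaddress


def new_prefixes(before_ips, after_ips, prefix_len=16):
    """Return network prefixes seen in after_ips but absent from before_ips.

    Both arguments are iterables of IPv4 address strings (hypothetical
    samples of the webrequest ip field for two time windows).
    """
    def prefixes(ips):
        # strict=False masks the host bits, so "107.167.109.5/16"
        # becomes the network 107.167.0.0/16.
        return {
            ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
            for ip in ips
        }

    return prefixes(after_ips) - prefixes(before_ips)
```

Called with a "before" sample that lacks any 107.167.* address and an "after" sample that contains one, the function returns a set containing the 107.167.0.0/16 network, which is the kind of signal the comment above describes.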

@atgo I have checked data for January and March for Opera, and I just see us receiving IP addresses of Opera's proxy endpoints instead of IP addresses of users, so it seems to me this is an issue on Opera's end. To be clear, I think @Tbayer's assertion above that this is a MaxMind issue is incorrect,

It seems you are misunderstanding the conversation about Maxmind above. There, we actually pretty much ruled out the possibility that this was caused by a Maxmind update (see T187014#3962774 ff.). Instead, we were talking later about the possibility that this was a change on Opera's side that Maxmind could accommodate in a future update of their database.

rather it is an issue with Opera's traffic: they are sending us the wrong IPs. The range 107.167.* appears all over March data but is not present in January data (before this issue started).

Thanks for investigating. That could still be consistent with the first hypothesis outlined on February 12 in T187014#3962774 ("Opera Mini traffic was just rerouted"). Anyway, let's see what we hear from Opera. And yes, it's totally conceivable that they will find something went wrong with XFF on their side.

Pinging @BBlack for any changes he might be aware of regarding X-Forwarded-For headers.

Partnerships has been looking for a contact at Opera. We reached out to someone yesterday who is OOO until next week. Will keep you updated.

Created a ticket in Opera's bug tracking system, internal reference: MT-3735. Thanks for all the info - we will investigate.

Hi! I'm from Opera Mini server team. We did not move any traffic between DCs on specified dates and I don't see any changes on Mini server side:

  • in number of users from Nigeria,
  • in number of Wikipedia pages (page URL matching *.wikipedia.org) rendered across all DCs - that means that we send stable number of requests from each DC.

We also haven't released any server changes on February 6th and February 11th. Mini servers send XFF header with user's IP address, but please note it's lower-case: x-forwarded-for (but it hasn't changed recently). We also send Forwarded: for="IP:PORT" header.

Normally users from Nigeria connect to a data center in Europe, however between February 17th - March 10th I see a small number of users (~5%) from Nigeria that connected somehow to US, do you see any changes then?

Do the metrics attached here show number of page views on wikipedia.org or is this a number of HTTP requests? Do the metrics include other domains too, for example some zero-rated ones?

Twitter is the best! Thanks for taking time to look into this:

@mbaluta: how about we setup a short meeting and we can go over data changes we see if something rings a bell? ping me at nuria@wikimedia.org

Do the metrics attached here show number of page views on wikipedia.org or is this a number of HTTP requests?

pageviews, not just all http requests

Do the metrics include other domains too, for example some zero-rated ones?

They include those too but normally numbers of pages served by those versus rest of traffic is quite small.

Nuria moved this task from Incoming to Data Quality on the Analytics board.

Normally users from Nigeria connect to a data center in Europe, however between February 17th - March 10th I see a small number of users (~5%) from Nigeria that connected somehow to US, do you see any changes then?

On our end we see all of a sudden US traffic coming from opera mini being a lot higher from Feb 10th but opera mini traffic overall seems mostly unchanged. See opera mini country split below

Screen Shot 2018-04-05 at 12.01.07 PM.png (1×2 px, 344 KB)

Nothing varnish-related happened on Feb 6th as far as I can see from the ops SAL: https://tools.wmflabs.org/sal/production?p=0&q=&d=2018-02-06

Varnish5 rollout might have something to do with this? https://gerrit.wikimedia.org/r/#/c/409047/ cc @ema

image.png (873×1 px, 169 KB)

From the screenshot above, I see a bump in Unknown Opera Mini traffic between the 5th and the 7th of February. No varnish5-related change was merged during that timeframe. See the full list of varnish 5 changes here. More generally, I don't see any operations/puppet patch merged in that period that might be related.

Is the current line of thinking that we are somehow receiving inaccurate X-Forwarded-For data from opera mini starting ~ Feb the 5th?

Please note that number of page views prior to 6th February seems incorrect from our perspective too - number of Opera Mini users in US is far far below India, Indonesia and Nigeria. Can you inspect headers of requests that are classified as coming from US? If you provided IP address of our server, we could at least tell whether it is coming from users of Extreme (OBML) or High (Turbo) mode.

If you provided IP address of our server, we could at least tell whether it is coming from users of Extreme (OBML) or High (Turbo) mode.

@mbaluta: I'm looking at live requests with "Opera Mini" in the User-Agent header hitting our caching proxies, and it seems that something is odd. We have multiple data centers and route clients to "the closest one" by using geoip and edns-client-subnet. Currently pretty much all traffic hitting our data center in Virginia coming from "Opera Mini" should have been routed elsewhere instead.

Do you perhaps perform some type of DNS caching/manipulation on the Opera Mini servers? As an example, it seems that the Opera Mini server 107.167.107.200 incorrectly resolves the A record of en.wikipedia.org to 208.80.154.224 instead of 198.35.26.96.

@mbaluta: note that the problem I've mentioned in my comment above is probably unrelated to the stats issue discussed here (would be good to fix it nonetheless!).

The Country field in our stats is generated by parsing X-Forwarded-For and using MaxMind to figure out which country the client IP belongs to: https://meta.wikimedia.org/wiki/Research:Page_view#Resulting_format. Statistical inaccuracies must thus be due to incorrect X-Forwarded-For values (or possibly bugs in the way we generate Country). @Nuria I think we should try to debug the code that sets Country to "United States" for User-Agent: ~ "Opera Mini" and see what's going on there.

@ema:

I think we should try to debug the code that sets Country to "United States" for User-Agent: ~ "Opera Mini" and see what's going on there.

See my comment above , maxmind is geolocating correctly but ip ranges have changed a lot from Jan to March: "The range 107.167.* appears all over March data but is not present in January data (before this issue started)."

number of Opera Mini users in US is far far below India, Indonesia and Nigeria.

Note these are "pageviews", not users.

@ema: on our end we just look at the ip passed along via varnishkafka to geolocate, not at XFF. See:

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/refine_webrequest.hql#L88

I think varnish parses the XFF chain and substitutes the ip field that will later get sent to varnishkafka. So from a request like the following

cp1065.eqiad.wmnet: 223.*.*.*, 107.167.109.* 10.64.0.102

We would expect the value of ip field in kafka as "223.*.*.* " rather than 107.167.109.* which is the opera mini proxy. Does this make sense?

https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb#L159

@ema: on our end we just look at the ip passed along via varnishkafka to geolocate, not at XFF. See:

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/refine_webrequest.hql#L88

Oh, OK. If that's what ends up in pageview, then indeed X-Forwarded-For has nothing to do with this. The ip field of webrequest contains X-Client-IP as sent by varnishkafka. That is not what this document says though: https://meta.wikimedia.org/wiki/Research:Page_view#Resulting_format

I think varnish parses the XFF chain and substitutes the ip field that will later get sent to varnishkafka.

No, X-Client-IP is either:

In both cases ip is set to the IP address of the client, which when it comes to this ticket would be the Opera Mini server. In light of this, the reason for the unexpected change in Opera Mini country stats is due to a surge in requests coming from Opera Mini servers with IPs geolocated in the USA.

No, X-Client-IP is either:

...ehhh wha? We used to collect XFF on the webrequest side, and then parse it to get ip. We removed this, because @BBlack implemented XFF parsing on the varnish side. Eh?

We're just getting confused between functional and logical definitions here. In practice in the common case, X-Client-IP (which is the source of the webrequest view of the client IP) is set from nginx or varnish based on the arriving port number directly as ema says, without any external XFF needing to influence anything. The only exception is when the initial idea of the client IP happens to match a trusted proxy. By default we don't trust any proxies and thus don't trust any XFF arriving from the outside world. Our conception of trusted proxies in Varnish is defined by Zero's proxy data, which at least in the past included some definition of OperaMini. But describing it in these terms is confusing because it's the internal viewpoint of how the traffic stack works within itself.

From point of view of other things inside the foundation (webrequest, or mediawiki), even for the simple case where the external request arrived with no external (possibly trusted or untrusted) XFF header set, we set several internal XFF header values as our proxies are traversed, which means internal software would have to parse XFF just to reject and/or look-through all the internal proxy IPs, which is non-trivial. From that functional POV, X-Client-IP is the result of Varnish having stripped all that work away. You can (and probably should) think of it as "Varnish has parsed the XFF for you, to provide the real client IP instead of some list of fake internal proxy IPs that are difficult to manage". The underlying functional truth is that unless an external trusted proxy situation is involved, Varnish doesn't have to do the work of parsing XFF to know the real client IP, unlike internal consumers. It knows things more-directly than that.

If you provided IP address of our server, we could at least tell whether it is coming from users of Extreme (OBML) or High (Turbo) mode.

@mbaluta: I'm looking at live requests with "Opera Mini" in the User-Agent header hitting our caching proxies, and it seems that something is odd. We have multiple data centers and route clients to "the closest one" by using geoip and edns-client-subnet. Currently pretty much all traffic hitting our data center in Virginia coming from "Opera Mini" should have been routed elsewhere instead.

Our servers don't use Client Subnet extensions as it doesn't make sense when our servers are clients to your web servers.

Do you perhaps perform some type of DNS caching/manipulation on the Opera Mini servers? As an example, it seems that the Opera Mini server 107.167.107.200 incorrectly resolves the A record of en.wikipedia.org to 208.80.154.224 instead of 198.35.26.96.

107.167.107.200 is a server in Ashburn, Virginia; however, the MaxMind GeoIP service says it's in San Mateo, and I guess that's the reason why it's resolved incorrectly. Still, that doesn't explain the changes in your metrics. We haven't modified our IP ranges at that time. If you are interested in the list of IP ranges used by our servers, please contact me on mbaluta at opera dot com.

...ehhh wha? We used to collect XFF on the webrequest side, and then parse it to get ip. We removed this, because @BBlack implemented XFF parsing on the varnish side. Eh?

XFF parsing on the varnish side only happens if the proxy is "trusted", which to the best of my understanding essentially means that it is defined on Zerowiki.

Given that the list of trusted proxies (/var/netmapper/proxies.json) does not currently contain anything useful on any of the cache hosts, I'm gonna assume that something is wrong with Zerowiki data, and that might well be the cause of the stats skews we're investigating here.

Right. There was a time in the past when Zerowiki definitely provided some useful data on OperaMini (and also Nokia?) proxy servers, which Varnish would consume as trusted-proxy sources. Zero no longer exports those proxy lists, so our set of trusted proxies for Varnish-XFF-decoding purposes is now the empty set. Zero was only exporting them for their own functional purposes, and I presume they dropped the data because they no longer needed them. However, Analytics may have been usefully looking at information based on that.

It could be that the curious transition date we're looking at here in this ticket is the date that Zero stopped exporting OperaMini proxy IPs to Varnish? I haven't followed all of the top of this ticket in depth yet, but if that's the scenario the change would look like this:

Assuming a request from a user with the true client IP 192.0.2.1, going through an OperaMini server which is at 192.0.2.200, and that OperaMini server was in Zero's set of defined OperaMini proxy ranges before the Zero OperaMini data poofed:

Before:
X-Client-IP would be 192.0.2.1, and X-Analytics would contain ;proxy=OperaMini (although I'm not sure historically exactly how that string was spelled)

After:
X-Client-IP would be 192.0.2.200, and X-Analytics would not contain any ;proxy=
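The before/after behaviour described above can be modeled with a short sketch. It is not the actual VCL or netmapper logic: the function `resolve_client_ip`, its parameters, and the choice of taking the right-most X-Forwarded-For entry are assumptions made for illustration only.

```python
import ipaddress


def resolve_client_ip(peer_ip, xff_header, trusted_networks):
    """Pick the address to record as the client IP.

    peer_ip: address that connected to the edge cache.
    xff_header: raw X-Forwarded-For value sent by that peer, or None.
    trusted_networks: ipaddress networks treated as trusted proxies.
    """
    peer = ipaddress.ip_address(peer_ip)
    if not any(peer in net for net in trusted_networks):
        # Untrusted peer: ignore any XFF it sent and record the peer itself.
        return peer_ip
    if not xff_header:
        return peer_ip
    # Trusted proxy: assume the right-most entry is the address the
    # proxy itself saw (an assumption of this sketch).
    candidate = xff_header.split(",")[-1].strip()
    try:
        ipaddress.ip_address(candidate)
    except ValueError:
        # Malformed entry: fall back to the peer address.
        return peer_ip
    return candidate
```

With 192.0.2.200 in the trusted set, a request from that peer carrying `X-Forwarded-For: 192.0.2.1` resolves to 192.0.2.1 (the "Before" case). With an empty trusted set, the same request resolves to 192.0.2.200 (the "After" case), so geolocation is applied to the proxy rather than the user.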

Ping @DFoy - might know better about when OperaMini proxy data dropped from the Zero data, I don't have any good insight into history of changes there.

@BBlack - not sure why OperaMini proxy IPs are no longer being exported. Can this information be re-established?

My only changes to the system this year have been to revise the JSON configuration of a couple of individual partners, which AFAIK does not have anything to do with control over Opera/trusted proxies.

ema renamed this task from Opera mini IP addresses reassigned to Proxies information gone from Zero portal.Apr 13 2018, 2:57 PM

The Opera Mini stats issue here is definitely due to missing proxy information from zero portal. Here's what is currently being returned when calling zero.wikimedia.org's api with action=zeroportal and type=proxies:

{"Test-all":[],"Test1":[]}
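A check of this kind could be automated so that an empty export is noticed early. The sketch below only inspects a response body that has already been fetched; the function name `proxies_export_is_empty` is hypothetical, and it assumes the response is a JSON object mapping names to lists, as in the output shown above.

```python
import json


def proxies_export_is_empty(payload):
    """Return True when every named entry in the export has no IP ranges.

    payload: JSON text shaped like {"name": [ranges, ...], ...}.
    An object with no entries at all also counts as empty.
    """
    data = json.loads(payload)
    return all(len(ranges) == 0 for ranges in data.values())
```

For the response quoted above, `proxies_export_is_empty('{"Test-all":[],"Test1":[]}')` returns `True`.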

Could we restore the proxies now that the nice Opera folks gave us their list? Clearly we also need to look into why/how this list disappeared. Did items on that list have an expiration time? Can we look at the storage of the API to see what happened? This data must be stored somewhere...

Yeah, ema and I discussed this after the meeting the other day. I'm not sure whether or how we can look into the history on the Zero side, but I don't see Zero at this point having a need or desire to restore that data through the previous infrastructure or add any new data, so we're planning to just stop pulling that empty data from them, and replace it with a private file that's puppet-managed/deployed by the SRE team and includes the newly-acquired OperaMini data.

we're planning to just stop pulling that empty data from them, and replace it with a private file that's puppet-managed/deployed by the SRE team and includes the newly-acquired OperaMini data.

+1 let me know when it is in place and i can help check things square again on my end

@Nuria @BBlack thanks for all the work on this! Once it's resolved, will the data for the time window that this was an issue be repaired?

@atgo, it cannot be; we no longer have the original IPs of the records that are wrongly labeled.

Change 426896 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Add varnish::trusted_proxies

https://gerrit.wikimedia.org/r/426896

Change 426913 had a related patch set uploaded (by Ema; owner: Ema):
[labs/private@master] Add fake trusted_proxy.json

https://gerrit.wikimedia.org/r/426913

Change 426913 merged by Ema:
[labs/private@master] Add fake trusted_proxies.json

https://gerrit.wikimedia.org/r/426913

Change 426896 merged by Ema:
[operations/puppet@production] Add varnish::trusted_proxies

https://gerrit.wikimedia.org/r/426896

Change 426920 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: use trusted_proxies netmapper database

https://gerrit.wikimedia.org/r/426920

Change 426920 merged by Ema:
[operations/puppet@production] VCL: use trusted_proxies netmapper database

https://gerrit.wikimedia.org/r/426920

+1 let me know when it is in place and i can help check things square again on my end

Changes deployed.

Here's pageview hourly after deploying the changes above:

pageviews-hourly-opera-mini-xcip.png (900×1 px, 188 KB)

US going down, upwards trend for India and Nigeria. This seems good to me, but I'll leave the final conclusions to @Nuria.

Screen Shot 2018-04-17 at 2.08.57 PM.png (1×2 px, 583 KB)

Indeed things look like they are coming back, Nigeria pageviews are present again and US traffic is quite low.

Nuria renamed this task from Proxies information gone from Zero portal to Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country.Apr 17 2018, 9:15 PM
Nuria moved this task from In Progress to Done on the Analytics-Kanban board.

Thanks everyone, great to see this working again!

Great, thanks everyone! But do we now know what caused the correct geolocation information to get lost starting on February 6? As already noted by @Nuria at T187014#4129858 , this is something that should be looked into.
(I just tried to catch up on the discussion from the last two weeks, having been on vacation, and the answer to this question is not quite clear to me, considering e.g. T187014#4117741 f. )

The proxy list for zero was emptied and that must have included also opera mini proxy values, those are now relocated to a different place.

The lack of tagging did not appear to be related to any config blob change on the portal at Zero:-OPERA (the last change was too far back for that to be plausible).

Rather, it looks like a classic bug of a switch case falling through. In this case, no pun intended, the original patch introducing the bug was in service of combatting piracy. See diff below. Now that there's an alternative in place, I don't suggest reverting that patch, mind you.

I'm not sure if this was discussed off thread (a la T187014#4129858), but there was/is a web-hosted dynamic list of IP addresses for Opera proxy-sourced traffic. @BBlack @ema @Nuria please let me know if you'd like me to share the info I have around that. It may be related to what you already received, or it might just be another point of information.

For the edification of others reading this task down the road, historically at least there have been other headers as described at https://dev.opera.com/articles/opera-mini-request-headers/ that might be helpful in automated decision making at the edge. One may also notice "X-OperaMini-Route" and other Opera-related things in Wikipedia Zero extension code with a case insensitive grep for opera in PHP files across extensions.

ZeroPortal $ git diff 614f1b ae2531
diff --git a/includes/ApiZeroPortal.php b/includes/ApiZeroPortal.php
index 674d451..6225b17 100644
--- a/includes/ApiZeroPortal.php
+++ b/includes/ApiZeroPortal.php
@@ -71,6 +71,9 @@ class ApiZeroPortal extends ApiBase {
                        case 'proxies':
                        case 'carriers':
                                $processor = function ( ZeroConfig $content, $title ) use ( $result, $moduleName ) {
+                                       if ( !$content->enabled() ) {
+                                               return;
+                                       }
                                        foreach ( $content->getIpsets() as $name => $ipset ) {
                                                $result->addValue(
                                                        $moduleName,
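The fall-through effect in the diff above can be modeled in a few lines. This is a simplified sketch, not the ZeroPortal code: the function `export_ipsets`, the dictionary layout, and the assumption that the proxy configuration reports itself as not enabled are all hypothetical and only illustrate how a guard added for one case label also applies to the other.

```python
def export_ipsets(kind, configs, guard_enabled):
    """Collect IP sets for a request type.

    kind: "proxies" or "carriers"; both share one branch, mirroring the
          stacked case labels in the diff.
    configs: mapping of name -> {"enabled": bool, "ipsets": list}.
    guard_enabled: True models the code after the patch, False before it.
    """
    result = {}
    if kind in ("proxies", "carriers"):
        for name, cfg in configs.items():
            # The guard was presumably meant for carriers, but because the
            # branch is shared it also skips proxy configs.
            if guard_enabled and not cfg["enabled"]:
                continue
            result[name] = cfg["ipsets"]
    return result
```

With a proxy entry such as `{"OPERA": {"enabled": False, "ipsets": ["192.0.2.0/24"]}}`, the call with `guard_enabled=False` exports the range, while the call with `guard_enabled=True` exports nothing for it.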

@dr0ptp4kt trying to understand: is this the bug that makes the request for the proxy list return empty when they should not?

@dr0ptp4kt trying to understand: is this the bug that makes the request for the proxy list return empty when they should not?

Yes.

Thanks @dr0ptp4kt for the explanation!

To add to the charts posted in T187014#4135465 and T187014#4137700 , here is one that also covers some days before the issue occurred, showing that we have returned to similar ratios - as an additional plausibility check. For the record, the affected timespan was February 6 to April 16, 2018.

I included some more countries in this chart; e.g. Bangladesh and Kenya were affected too.

And another PS (prompted by a question from @atgo): besides Opera mini, Opera Mobile traffic appears to have been affected in the same way (cf. Pivot), as were views from the regular Opera browser to the mobile site (Pivot) - but not on desktop. In any case, Opera mini has much more traffic than either of those on the mobile site.

Opera mini mobile web pageviews by country, 2018-01-29..2018-04-27.png (854×1 px, 161 KB)