
Country mapping routine for proxied requests
Closed, Declined (Public)

Description

The result set from the following query suggests that these proxied requests originate from three countries. But peeling back the X-Forwarded-For values, the leftmost IP address appears to provide the detail needed to geocode back to the actual geographic origin of the requests. It appears that geocoding is being applied to the rightmost data center IPs where the proxies reside, rather than to the expected leftmost internet-facing IP addresses.

Is this expected? Or is there something misplaced in the query?

select geocoded_data['country'], x_forwarded_for, count(1)
from webrequest where
year = 2015 and month = 10 and day = 25 and hour = 1
and is_pageview = true and agent_type = 'user' and access_method = 'mobile web'
and x_analytics_map['proxy'] = 'IORG'
group by geocoded_data['country'], x_forwarded_for;

Event Timeline

dr0ptp4kt raised the priority of this task to Needs Triage.
dr0ptp4kt updated the task description.
dr0ptp4kt moved this task to Incoming on the Analytics board.

Adding @BBlack, as Varnish might need to be adjusted to tag traffic properly when it comes via IORG.

@kevinator - we are unsure this is working properly - can you verify that we have correct countries identified when coming through a proxy?

I suspect this might be related to T89688 and T99226

In general, we shouldn't be trying to manually decode the XFF header: we should be moving towards relying on the X-Client-IP header we're setting in Varnish these days, where we've already centralized XFF-processing and related things to avoid issues with both internal and external proxy addresses. Internet.org may still be a special case we need to handle better there, though. AFAIK there's no list of internet.org trusted-proxy IPs in our proxy database, and all we're doing is setting the X-Analytics field "proxy=IORG" if the request happens to include a Via header matching (?i)Internet\.org.
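
A minimal sketch of the tagging behaviour described above, written in Python rather than VCL for readability. Only the regular expression comes from the comment; the function name `tag_proxy` and the dict-based header handling are illustrative assumptions, not the production code.

```python
import re

# Pattern quoted in the comment above; everything else is illustrative.
IORG_VIA = re.compile(r"(?i)Internet\.org")


def tag_proxy(headers, x_analytics):
    """Return a copy of x_analytics with proxy=IORG set when Via matches.

    `headers` and `x_analytics` are plain dicts here (an assumption made
    for this sketch). Note the check trusts the Via header blindly, which
    is exactly the weakness discussed in this thread.
    """
    tagged = dict(x_analytics)
    if IORG_VIA.search(headers.get("Via", "")):
        tagged["proxy"] = "IORG"
    return tagged
```

Because any client can send a matching Via header, a tag produced this way identifies a claim of Internet.org proxying, not proof of it.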

We should probably evolve this ticket a bit into resolving a few questions:

  1. What is Internet.org's proxying behavior in terms of XFF? Does it always set XFF on the way through?
  2. Is there a set of Internet.org IPs we can list in our trusted-proxy database to identify Internet.org better than through the easily-faked and uninformative "Via" header?
  3. Does it tend to end up being layered with other trusted proxies like OperaMini in some cases, which might require us to iterate on the check for trusted proxies in the Varnish XFF-decoder, and amend our output fields for the notion of naming multiple trusted proxies? Currently the code only looks for/through a single layer of trusted proxy.

I should have said, at the top of the above comment: The new X-Client-IP code should resolve most issues, but there may be remaining issues specific to Internet.org (which I was reminded of by your SQL)...

https://developers.facebook.com/docs/internet-org/platform-technical-guidelines mentions that the IP is added to XFF, and that detection can be done by checking the "Via" header for "internet.org". I think @DFoy might have some info on proxied IPs. Also, I was under the impression that a request can pass through both IORG and OperaMini proxies in a row, but this needs to be confirmed.

Yeah it kinda sucks if they can't actually give us a list of proxy IPs or networks we can maintain. The Via header is nice, but that still leaves us with the mess of detecting where some random Internet.org proxy is within the XFF chain. We can have cases like A, B, X, Y, Z, W where W is our standard WMF-end of things with our internal addresses that we decode, Y and Z are OperaMini and IORG in some unknown order, X is the real client ip, and then A and B are untrustable things sent by the real client (faking trusted proxies or our own addresses, or sending junk internal things like 127.0.0.1 or 192.168.x.x). Simple solutions like "look at the left-most" aren't really correct; we actually have to parse through it right-to-left and stop when the chain of trustable things stops.
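
The right-to-left walk described above can be sketched as follows. This is an illustration under stated assumptions, not the Varnish implementation: the function name `client_ip`, its parameters, and the use of a flat list of trusted networks are all hypothetical, and the addresses in the usage note come from the reserved documentation ranges.

```python
import ipaddress


def client_ip(xff, peer_ip, trusted_nets):
    """Walk the peer address plus XFF right-to-left and return the first
    address that is not in a trusted network.

    xff          -- raw X-Forwarded-For header value (comma-separated)
    peer_ip      -- address of the TCP peer that connected to us
    trusted_nets -- CIDR strings for our own and trusted-proxy networks
    """
    trusted = [ipaddress.ip_network(n) for n in trusted_nets]

    def is_trusted(addr):
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            return False  # junk entries are never trusted
        return any(ip in net for net in trusted)

    chain = [h.strip() for h in xff.split(",") if h.strip()] + [peer_ip]
    for addr in reversed(chain):
        if not is_trusted(addr):
            # The chain of trustable hops stops here; everything further
            # left was supplied by the client and is ignored.
            return addr
    return chain[0]  # every hop was trusted: fall back to the leftmost
```

With `trusted_nets = ["192.0.2.0/24", "10.0.0.0/8"]`, a peer of `10.0.0.1`, and the header `127.0.0.1, 198.51.100.7, 203.0.113.9, 192.0.2.10, 192.0.2.20`, the walk skips the three trusted hops and stops at `203.0.113.9`; the two entries to its left are never consulted.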

I agree with @BBlack, "leftmost" on its own isn't a good heuristic (it seems to be okay in this specific case, but not the general one). It has to go right-to-left and then rewind if it hits a private IP.

Well, even then, junk or untrusted ones we don't want to pay attention to may not always be private IPs. People also use browser extensions or other local hacks and stuff very invalid data onto the left of the stack (e.g. other legit public addresses to fake their country of origin, or to pretend they're behind one of our trusted proxies when they're not, or even that they're internal to us). The Via header is also easily faked, so we can't allow that to cause us to blanket trust whatever the client stuffed in XFF.

Basically there's no way we can trust the Internet.org proxies and take the IP address data from the XFF they send as legitimate unless we have a whitelist of IPs or networks from which Internet.org proxies send requests. The net result is that our view of X-Client-IP is always going to be the internet.org proxy endpoint's own address (and we'll set X-Analytics: proxy=IORG as well, but anyone could set that header so it's not really reliable information).

If we had a proxy whitelist for internet.org, then our next issue would be upgrading the VCL code to work reliably with layered proxies as well, but there probably aren't other cases we really care about yet, so it's otherwise not really a priority.

Would it be worth it for us to verify whether we could rely upon the route info from the query expressed in https://developers.facebook.com/docs/sharing/webmasters/crawler ? Their proxy IPs are volatile, but maybe we don't care so long as we can combine traffic sourced from any of those IPs with the prescribed Via: header.

Milimetric triaged this task as Medium priority. Mar 7 2016, 5:25 PM
Milimetric moved this task from Incoming to Analytics Query Service on the Analytics board.

@dr0ptp4kt - yes, it's possible/likely that the AS32934 info you linked could be used to verify internet.org proxying. Probably the best method for approaching this in the long run would be something like this:

  1. We make a policy decision that for XFF-like purposes (trusting XFF IP headers to identify client IPs for geolocation and other purposes), we trust Facebook as an entity in general, like we do OperaMini. Just keep in mind this trust will now extend to all software at Facebook, not just iorg proxies. I think this is ok.
  2. Assuming AS32934 is where all the internet.org proxies live (this isn't clear from documentation links we've seen so far, but very likely), set up a process to import this into our existing netmapper proxies database. We could start with a manual script even: something that does the whois query and encodes it to JSON, allowing a human to paste this into the zero portal to update the proxy whitelist for Facebook (it should properly be called Facebook rather than Internet.org in that list). While the set of routes for that AS is volatile, it shouldn't be all that rapidly volatile. Just having it manually updated once a month or so would be a good start, and then we can look at how to automate that better down the line. We should probably look at making this a general facility for other purposes (e.g. we may want a similar solution for Google's proxies, and maybe others).
  3. Finish fixing up our XFF processing in general (to better handle proxy recursion and private IPs in the middle of the list, etc), which is already in backlogged task T120121
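
The "manual script" idea in step 2 could start from something like the sketch below. It assumes the whois output is RPSL-style text with `route:` and `route6:` lines, as routing registries typically return for an origin-AS query. The function name `routes_to_json` and the JSON layout are hypothetical, since the thread does not show the format the zero portal expects.

```python
import ipaddress
import json


def routes_to_json(whois_text, name="Facebook"):
    """Collect route/route6 prefixes from whois text and emit JSON.

    The output shape ({name: [prefix, ...]}) is an assumption made for
    this sketch, not the zero portal's actual schema.
    """
    nets = set()
    for line in whois_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in ("route", "route6"):
            # ip_network validates the prefix and normalises its spelling;
            # the set removes duplicate registrations.
            nets.add(ipaddress.ip_network(value.strip()))
    # Sort IPv4 before IPv6, then by prefix, so monthly diffs stay readable.
    ordered = sorted(nets, key=lambda n: (n.version, n))
    return json.dumps({name: [str(n) for n in ordered]}, indent=2)
```

A human would run the whois query, pipe its output through this function, and paste the result into the whitelist; sorting and de-duplicating keeps successive updates easy to compare.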

At that point we'd be seeing through internet.org requests correctly for the purposes of X-Client-IP and geolocation and such, with Facebook as a proxy alongside OperaMini, and the ability to see through multiple layers of proxying in either order. Then if we want to reliably detect Internet.org, we could update the current proxy=IORG code to only believe the internet.org Via header when Facebook was in the list of proxies we detected. Since at that point the actual proxy list would be an array, we could either put the IORG indication in a separate analytics field, or we could s/Facebook/IORG/ in the proxy array when the Via header is present, whichever seems to make more sense.
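
As an illustration of the second option (substituting IORG for Facebook in the proxy array), here is a short sketch. It assumes the XFF decoder has already produced a list of detected proxy names; the function name `label_proxies` and the list representation are hypothetical.

```python
import re

# Same pattern as the existing Via check mentioned earlier in the thread.
IORG_VIA = re.compile(r"(?i)Internet\.org")


def label_proxies(detected, via):
    """Rename Facebook to IORG only when the Via header also matches.

    detected -- proxy names found by walking XFF against the whitelist
    via      -- raw Via header value, or None if absent
    """
    if "Facebook" in detected and IORG_VIA.search(via or ""):
        return ["IORG" if p == "Facebook" else p for p in detected]
    # Via alone is not believed: without a whitelisted Facebook hop the
    # header could have been set by anyone.
    return list(detected)
```

The key design point is the ordering of trust: the whitelist match gates the Via header, never the other way round.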

Analytics uses X-Client-IP, and that is what we geocode, so there is no action for the team here that I can see; let us know otherwise. Moving to radar.