We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data
Open, MediumPublic

Description

The objective of having a trusted proxy list is to be able to mark a request as coming from a proxy AND provide the request's original client IP. If we have the original client IP, it does not really matter whether the request came from a proxy, so we can do away with the proxy list if we populate x-forwarded-for (or client-ip) for user agents that match Opera Mini and googleweblight.

At the time of this writing we have more than 1000 distinct IPs from googleweblight in our data. That is not a list we can maintain by hand, and it seems that inspecting the UA is sufficient for our purposes (getting the IP of the original request).

Event Timeline

Verifiable with this query:

select ip, client_ip,x_analytics, user_agent from wmf.webrequest where year=2019 and month=09 and day=01 and hour=01 and user_agent like '%pera%' limit 100;

Nuria mentioned this in Unknown Object (Task). Sep 13 2019, 3:32 PM
Nuria renamed this task from "Client_IP and Ip are always the same , even for proxied requests for opera mini" to "Client_IP and Ip are always the same , even for proxied requests for opera mini or googleweblight". Sep 13 2019, 3:47 PM
Nuria renamed this task from "Client_IP and Ip are always the same , even for proxied requests for opera mini or googleweblight" to "We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data". Sep 13 2019, 4:56 PM
Nuria added a project: Traffic.
Nuria updated the task description.

The problem stems from the "Trust" in "Trusted Proxy". The user-agent string isn't a reliable source (can be set to anything by anyone), and ditto for the contents of X-Forwarded-For. So we can't decide to trust XFF contents in the absence of something reliable, and the UA string isn't it. This is why we need a list of source IPs / networks (and a way to keep them updated) to know who we can trust XFF data from.
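To make the trust rule above concrete, here is a minimal sketch of how a trusted-proxy check could work. It is not the actual Traffic or Varnish implementation. The network ranges are placeholder documentation ranges, the names `TRUSTED_PROXY_NETWORKS` and `resolve_client_ip` are hypothetical, and taking the rightmost X-Forwarded-For entry assumes a single trusted proxy hop.

```python
import ipaddress

# Hypothetical trusted proxy networks. These are reserved documentation
# ranges, NOT real Opera Mini or googleweblight ranges.
TRUSTED_PROXY_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def resolve_client_ip(source_ip, x_forwarded_for):
    """Return (client_ip, via_trusted_proxy).

    X-Forwarded-For is only believed when the connecting address is in a
    trusted network. The User-Agent is deliberately not consulted, since
    the sender can set it to anything.
    """
    addr = ipaddress.ip_address(source_ip)
    trusted = any(addr in net for net in TRUSTED_PROXY_NETWORKS)
    if not trusted or not x_forwarded_for:
        return source_ip, False

    # Assumption: one trusted hop, so the rightmost entry is the address
    # the trusted proxy itself observed.
    candidate = x_forwarded_for.split(",")[-1].strip()
    try:
        ipaddress.ip_address(candidate)
    except ValueError:
        # Malformed header: fall back to the connecting address.
        return source_ip, False
    return candidate, True
```

With a check like this, a request from an untrusted address keeps its connecting IP as client-ip regardless of what its headers claim.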

Right, I see the UA issue, but in the absence of IPs provided by the proxy owners themselves, what I am doing to retrieve them is just looking at UA data in the webrequest table, so effectively it is the same thing. Not sure what else we can do to get a more trustworthy list.

@BBlack
Let me add more context here: we are trying to improve our pageview dataset by tagging "automated" data. That is, "entities" that make very spiky requests in our data, say 30 pageviews per minute (pageviews, not requests). In order to do that we need a means to identify "entities", and having an IP that is not an umbrella IP (as is the case with a proxy) helps.
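As a rough illustration of the tagging described above, the sketch below flags entities that reach a pageview threshold within a single clock minute. It is not the actual analytics job. The function name `find_spiky_entities`, the input shape (pairs of entity key and epoch-second timestamp), and the fixed one-minute buckets are all assumptions; the threshold of 30 comes from the example in the comment.

```python
from collections import Counter


def find_spiky_entities(pageviews, threshold=30):
    """Return the set of entities with >= threshold pageviews in any minute.

    pageviews: iterable of (entity, epoch_seconds) pairs, where entity is
    whatever key identifies a client (for example an IP, or IP plus UA).
    """
    per_minute = Counter()
    for entity, ts in pageviews:
        # Bucket each pageview into its clock minute.
        per_minute[(entity, int(ts) // 60)] += 1
    return {entity for (entity, _minute), n in per_minute.items() if n >= threshold}
```

If the entity key is a proxy's umbrella IP, many unrelated users collapse into one entity and can cross the threshold together, which is why the original client IP matters here.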

herron triaged this task as Medium priority. Sep 18 2019, 7:20 PM

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

Should we decline this ticket?

I don't think so. This is still a valid issue even if we've not worked on it. Adding DE tag and removing analytics.