
We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data
Open, Normal, Public

Description

The objective of having a trusted proxy list is to be able to mark a request as coming from a proxy AND provide the request's original client IP. If we have the original client IP, it does not really matter whether the request came from a proxy, so we could do away with the proxy list if we populated x-forwarded-for (or client-ip) for user agents that match Opera Mini and googleweblight.

At the time of this writing we have more than 1000 distinct IPs from googleweblight in our data. That is not a list we can maintain by hand, and it seems that inspecting the UA is sufficient for our purposes (getting the IP of the original request).

Event Timeline

Nuria created this task. Sep 12 2019, 11:09 PM
Restricted Application added a subscriber: Aklapper. Sep 12 2019, 11:09 PM

Verifiable with this query:

select ip, client_ip,x_analytics, user_agent from wmf.webrequest where year=2019 and month=09 and day=01 and hour=01 and user_agent like '%pera%' limit 100;

Nuria mentioned this in Unknown Object (Task). Sep 13 2019, 3:32 PM
Nuria renamed this task from Client_IP and Ip are always the same , even for proxied requests for opera mini to Client_IP and Ip are always the same , even for proxied requests for opera mini or googleweblight. Sep 13 2019, 3:47 PM
Nuria renamed this task from Client_IP and Ip are always the same , even for proxied requests for opera mini or googleweblight to We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data. Sep 13 2019, 4:56 PM
Nuria added a project: Traffic.
Nuria updated the task description.
Restricted Application added a project: Operations. Sep 13 2019, 4:56 PM
BBlack added a subscriber: BBlack. Sep 13 2019, 5:59 PM

The problem stems from the "Trust" in "Trusted Proxy". The user-agent string isn't a reliable source (can be set to anything by anyone), and ditto for the contents of X-Forwarded-For. So we can't decide to trust XFF contents in the absence of something reliable, and the UA string isn't it. This is why we need a list of source IPs / networks (and a way to keep them updated) to know who we can trust XFF data from.
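To make the trust argument concrete, here is a minimal Python sketch of how a trusted-proxy check might work. It is only an illustration, not the actual logic used at the edge: the function names are hypothetical, and the networks listed are reserved documentation ranges standing in for a real, maintained proxy list. X-Forwarded-For is honored only when the connecting source IP is in the trusted list; the user-agent plays no part in the decision.

```python
import ipaddress

# Hypothetical trusted proxy networks. These are reserved documentation
# ranges used as placeholders, not real Opera Mini or googleweblight IPs.
TRUSTED_PROXY_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_trusted_proxy(ip):
    """Return True if ip falls inside one of the trusted networks."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    return any(addr in net for net in TRUSTED_PROXY_NETWORKS)


def resolve_client_ip(source_ip, x_forwarded_for):
    """Pick the client IP, trusting XFF only when the source is trusted."""
    # Without XFF, or with an untrusted sender, the source IP is all we have.
    if not x_forwarded_for or not is_trusted_proxy(source_ip):
        return source_ip
    hops = [hop.strip() for hop in x_forwarded_for.split(",")]
    try:
        # Walk from the hop nearest to us and stop at the first address
        # that is not itself a trusted proxy.
        for hop in reversed(hops):
            if not is_trusted_proxy(hop):
                return hop
    except ValueError:
        # Malformed XFF content: fall back to the source IP.
        return source_ip
    return source_ip
```

With this sketch, `resolve_client_ip("192.0.2.10", "203.0.113.7")` yields the forwarded address, while the same header arriving from an unlisted source such as `198.51.100.5` is ignored and the source IP is kept.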

Nuria added a comment. Sep 13 2019, 6:40 PM

Right, I see the UA issue, but in the absence of IPs provided by the proxy owners themselves, what I am doing to retrieve them is just looking at UA data in the webrequest table, so effectively it is the same thing. Not sure what else we can do to get a more trustworthy list.

Nuria added a comment. Sep 16 2019, 5:39 PM

@BBlack
Let me add more context here: we are trying to improve the data quality of our pageview dataset by tagging "automated" data, that is, "entities" that make very spiky requests in our data, say 30 pageviews per minute (pageviews, not requests). In order to do that we need a means to identify "entities", and having an IP that is not an umbrella IP (as is the case with a proxy) helps.
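As a rough illustration of the kind of tagging described here, the following Python sketch flags any entity that reaches the example threshold of 30 pageviews within a single clock minute. The function name, the input shape (pairs of entity key and Unix timestamp), and the fixed one-minute bucketing are all assumptions made for the example, not how the pageview pipeline actually does it.

```python
from collections import Counter

# Example threshold taken from the comment above: 30 pageviews per minute.
PAGEVIEWS_PER_MINUTE_THRESHOLD = 30


def find_spiky_entities(pageviews, threshold=PAGEVIEWS_PER_MINUTE_THRESHOLD):
    """Return the entities with at least `threshold` pageviews in one minute.

    `pageviews` is an iterable of (entity, unix_timestamp_seconds) pairs.
    What identifies an entity (for example the client IP) is up to the caller.
    """
    # Count pageviews per (entity, minute bucket).
    per_minute = Counter(
        (entity, int(timestamp) // 60) for entity, timestamp in pageviews
    )
    return {
        entity for (entity, _minute), count in per_minute.items()
        if count >= threshold
    }
```

If the entity key is an umbrella IP shared by many users behind a proxy, their combined pageviews land in the same bucket and can cross the threshold even though no single user is automated, which is why the original client IP matters for this use case.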

herron triaged this task as Normal priority. Sep 18 2019, 7:20 PM
ema moved this task from Triage to Caching on the Traffic board. Oct 14 2019, 6:22 PM