Page MenuHomePhabricator

Replace Analytics XFF/client.ip data with X-Client-IP
Closed, ResolvedPublic5 Estimated Story Points

Description

We now have X-Client-IP as a standard header based on thorough decoding of XFF (and related data) at the Varnish layer, which we'd like to adopt as the standard notion of the client's IP for both applayer and analytics/logging.

It looks like the steps here, if I understand correctly, will be:

  1. Add XCIP to response headers, so that varnishkafka has access to it at all
  2. Add client_ip field based on the above to the webrequest varnishkafka log format (We can probably start getting this into e.g. oxygen logs at this point as well)
  3. Add new field in hive consuming that data
  4. Start using the new field in place of the computed value in analytics, dropping old XFF-processing code.
  5. Remove the "ip" and "x_forwarded_for" fields from webrequest varnishkafka log format

Event Timeline

BBlack raised the priority of this task from to Medium.
BBlack updated the task description. (Show Details)
BBlack added projects: Traffic, Analytics.
BBlack added subscribers: BBlack, Ottomata, faidon, JAllemandou.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 252427 had a related patch set uploaded (by BBlack):
add X-Client-IP to response headers for vk to consume

https://gerrit.wikimedia.org/r/252427

Change 252928 had a related patch set uploaded (by BBlack):
Add resp.http.X-Client-IP -> webrequest:client_ip

https://gerrit.wikimedia.org/r/252928

Change 252929 had a related patch set uploaded (by BBlack):
Remove webrequest "ip" and "x_forwarded_for"

https://gerrit.wikimedia.org/r/252929

BBlack set Security to None.

Add client_ip field based on the above to the webrequest varnishkafka log format (We can probably start getting this into e.g. oxygen logs at this point as well)

This will happen as soon as varnishkafka starts sending it.

Remove the "ip" and "x_forwarded_for" fields from webrequest varnishkafka log format

Hm, not sure about this. ip, maybe. But I think that @leila was saying that they use x_forwarded_for to make it easier for them to correlate requests from the same user. (It is possible that x_forwarded_for is different but client_ip is the same for different end users, no? E.g. if they originated from behind the same NAT but used different proxies?)

@BBlack, are you aware of T113817? I'll CC you over there, since that is related. Also, (perhaps this should be in a different ticket), but yesterday analytics and researcher folks were talking about using the same rotating etcd salt key that EventLogging uses to hash IPs in refined webrequest data. This would allow for correlation of requests without knowing actual IPs. We could use the same salt to also hash x_forwarded_for if needed.

Hm, not sure about this. ip, maybe.

'ip' is mostly-pointless even now, as it's almost universally our own internal IPs from nginx passing the request into varnish for HTTPS.

But I think that @leila was saying that they use x_forwarded_for to make it easier for them to correlate requests from the same user. (It is possible that x_forwarded_for is different but client_ip is the same for different end users, no? E.g. if they originated from behind the same NAT but used different proxies?)

That seems like an extreme edge case. If a bunch of users are behind one NAT IP, there's not great odds of being able to sort them out based on one of them using a certain remote proxy and the other not. Even then that would mostly be in the pre-HTTPS world. With HTTPS, very few proxies can be used at all (well, technically you could use a CONNECT-style proxy, but then that wouldn't add anything to the XFF chain).

The common XFF cases are only going to be with special clients (e.g. OperaMini, which we're already decoding for an indicating), or with corporate proxies that use company-installed fake root certs on company machines (in which case all the company machines behind a given NAT will share the same proxy and it's still not a point of differentiation).

I could see wanting to know XFF to know about proxy usage itself, separately from knowing about users, but really we should be adding those to the netmapper proxy database if so and then they'll be in our new X-Trusted-Proxy header, which we could start sending to analytics as well if you want it. Currently it only knows about OperaMini and Nokia.

The point of the new headers is to centralize the parsing of XFF in one place with well-understood semantics that are correct, where we can adapt it moving forward as our infrastructure and the internet at large changes and thus the correct parsing of it also changes.

I'd like both @JAllemandou and @leila to review this and give the go ahead to remove ip and x_forwarded_for. If they say cool, then it is fine with me.

Hm, if we do remove the ip field, we might as well just reuse it and stick client ip there. This would then require 0 changes on the analytics side of things.

Change 252427 merged by BBlack:
add X-Client-IP to response headers for vk to consume

https://gerrit.wikimedia.org/r/252427

Change 252928 merged by BBlack:
Add resp.http.X-Client-IP -> webrequest:client_ip

https://gerrit.wikimedia.org/r/252928

Change 252929 abandoned by BBlack:
Remove webrequest "ip" and "x_forwarded_for"

Reason:
rebasing after parent revert got annoying, starting over

https://gerrit.wikimedia.org/r/252929

Change 253472 had a related patch set uploaded (by BBlack):
webrequest: Add X-Client-IP -> client_ip

https://gerrit.wikimedia.org/r/253472

Change 253473 had a related patch set uploaded (by BBlack):
webrequest: remove "ip" field

https://gerrit.wikimedia.org/r/253473

Change 253474 had a related patch set uploaded (by BBlack):
webrequest: remove X-Forwarded-For

https://gerrit.wikimedia.org/r/253474

Change 253472 merged by BBlack:
webrequest: Add X-Client-IP -> client_ip

https://gerrit.wikimedia.org/r/253472

With varnishkafka-1.0.7 deployed and the patch above merged, the webrequest stream now has a correct "client_ip" field that analytics can transition to. Do we need to add it to other varnishkafka streams as well?

(and, I just read @Ottomata's comment above - we can certainly switch the data into "ip" instead of "client_ip". That might be simpler than switching analytics code to the new field then removing the old).

No problem for me to remove and reuse ip, and remove x_forwarded_for :)

Change 256002 had a related patch set uploaded (by BBlack):
webrequest: move client_ip data to legacy "ip" field

https://gerrit.wikimedia.org/r/256002

Change 256002 merged by BBlack:
webrequest: move client_ip data to legacy "ip" field

https://gerrit.wikimedia.org/r/256002

Change 253473 abandoned by BBlack:
webrequest: remove "ip" field

Reason:
Obviated by https://gerrit.wikimedia.org/r/#/c/256002/

https://gerrit.wikimedia.org/r/253473

Change 256020 had a related patch set uploaded (by BBlack):
statsv: switch "ip" field to X-Client-IP like webrequest

https://gerrit.wikimedia.org/r/256020

Ok, the old "ip" field now has the X-Client-IP data in the webrequest logs.

The remaining pending patches here are:

the (updated) one to remove XFF data from webrequest: https://gerrit.wikimedia.org/r/#/c/253474/
a new one to make the same "ip" switch for statsv: https://gerrit.wikimedia.org/r/#/c/256020/

Change 256020 merged by BBlack:
statsv: switch "ip" field to X-Client-IP like webrequest

https://gerrit.wikimedia.org/r/256020

Change 279132 had a related patch set uploaded (by Ottomata):
Remove single use of x_forwarded_for in insert_hourly_pagecounts.hq

https://gerrit.wikimedia.org/r/279132

Change 279132 merged by Joal:
Remove single use of x_forwarded_for in insert_hourly_pagecounts.hq

https://gerrit.wikimedia.org/r/279132

Just noticed this ticket again! Can/should we move forward with this?

From my perspective, where we last stalled out is waiting for Analytics to say it's ok to merge https://gerrit.wikimedia.org/r/#/c/253474 (which removes XFF data from the webrequest stream). I'm not sure what blockers/deps/validation may be pending over in the Analytics side before that.

Nuria removed a project: Analytics.
Nuria added a project: Analytics.
Nuria moved this task from Incoming to Operational Excellence Future on the Analytics board.
Nuria set the point value for this task to 5.

Change 358603 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] Remove deprecated ClientIpUDF and deprecate Legacy Pageview code.

https://gerrit.wikimedia.org/r/358603

Ooook, I just checked some things.

  • x_forwarded_for was only being used by legacy pageview code in refinery, which itself is not used.
  • I created a hive table on top of JSON data with this field missing, and hive selected it as NULL, no problem.
  • I then refined the JSON table into parquet, and hive selected it as NULL, no problem (expected).

So, we should be good to go and merge that patch. I'll take care of that.

@BBlack, just one last double check: are you sure XFF is not useful for ops purposes? We can easily exclude this data from refined webrequest analytics data, while still keeping it in the JSON webrequest logs in Kafka.

I'm ready to merge though, so gimme one final +1 and I'll do it.

No I don't think we need it for non-immediate analysis like this. We still have zero, zeronet and proxy in the X-Analytics string too (which is also in webreq data)

Ok! Will merge this today then, thanks.

Change 253474 merged by Ottomata:
[operations/puppet@production] webrequest: remove X-Forwarded-For

https://gerrit.wikimedia.org/r/253474

Change 358603 merged by Ottomata:
[analytics/refinery/source@master] Remove deprecated ClientIpUDF and deprecate Legacy Pageview code.

https://gerrit.wikimedia.org/r/358603