Page MenuHomePhabricator

Add IP field only to schemas that need it. Remove it from EL capsule and do not collect it by default {mole}
Closed, InvalidPublic21 Estimated Story Points

Description

Add the hashed IP field only to schemas that need it. Remove it from EL capsule and do not collect it by default.

Most notably IP needs to be present on data collected by this schema per research's request: https://meta.wikimedia.org/wiki/Schema:QuickSurveysResponses

Event Timeline

Nuria raised the priority of this task from to Needs Triage.
Nuria updated the task description. (Show Details)
Nuria added subscribers: Nuria, Ottomata, leila.
Nuria renamed this task from Add IP field only to schemas that need it. Remove it from EL capsule and do not collect it by default to Add IP field only to schemas that need it. Remove it from EL capsule and do not collect it by default {mole}.Feb 9 2016, 7:44 PM
Nuria triaged this task as High priority.
Nuria set Security to None.

Schema has an optional IP field, will not be filled most likely and we will use what comes in the stream (from varnish kafka) to fill it in. Hashing is happening.

Do we need a special naming convention for this IP field so it is clear it doesn't need to be filled in by the client?

Do not use IP unless schema list it , thus IP will be empty for most schemas.

This task should also include comnunication of these changes to analytics@

Nuria renamed this task from Add IP field only to schemas that need it. Remove it from EL capsule and do not collect it by default {mole} to Add IP field only to schemas that need it. Remove it from EL capsule and do not collect it by default {mole} {21 pts].Feb 17 2016, 6:47 PM
Nuria assigned this task to madhuvishy.

(Cf. earlier discussion at https://phabricator.wikimedia.org/T119144#1820035 and subsequent comments)

Schema has an optional IP field, will not be filled most likely and we will use what comes in the stream (from varnish kafka) to fill it in. Hashing is happening.

Thanks for the clarification that "IP" still refers to hashed IPs; I have amended the task description accordingly.

Change 271728 had a related patch set uploaded (by Madhuvishy):
[WIP] Allow sensitive fields like clientIP to be collected only be schemas that need it

https://gerrit.wikimedia.org/r/271728

Thanks for pinging @Ottomata. :-)

@madhuvishy one thing that I just noticed: there is another schema that should be protected from this change and that is https://meta.wikimedia.org/wiki/Schema:QuickSurveyInitiation. Can you save it?

@leila I don't expect this change to be deployed until end of next week. And there is going to be a way for schema owners to still keep collecting IPs if they want to. It would involve adding a field, for e.g '__clientIp' to the QuickSurveyInitiation schema - I can let you know closer to it being deployed (I will also send out announcements to lists so anyone who wants to keep collecting it has time to prepare for the change) and we can coordinate updating your schema - would that be fine?

@leila we are working out the details, but most likely you will be able to protect any schema from this change by adding a special clientIp field to the event schema itself. The eventlogging processor logic will look for the presence of this special field (perhaps __clientIp, TBD) in the schema, and if present, populate it with the hashed IP value.

This does mean that the location of the clientIp field will change. Instead of in the top level 'capsule' object, it will be in the event itself. In MySQL, this would be event___clientIp (which, on typing, looks really dumb, so maybe we won't call this __clientIp :p ).

Op! Missed Madhu's response. What she said. :)

@madhuvishy: Actually to test that everything is golden (once our code is final) you could create a new version of ttps://meta.wikimedia.org/wiki/Schema:QuickSurveysResponses schema with our new __clientIP field and send events there (you can recycle incoming events already present on db) so we are sure things work as expected.

@madhuvishy @Ottomata got it.

@dr0ptp4kt can you help with adding the clientIP field in the schema if this change is deployed while we are running the survey?

@madhuvishy Our survey is scheduled to start on 2016-02-22 16:00 PST, and it will run for 2 weeks (maybe shorter if we collect enough earlier). We only have one shot for this survey this quarter, so I'd really appreciate if we don't change things that can affect the survey until the survey is over unless you are 100% sure that things go smoothly or you have no choice but making the change.

I think deploying this will be tricky, so we may want to wait until @leila's survey is over. We will probably deploy it in beta before then though.

I agree with @ottomatta, let's defer until the Research Q3 surveys are done running.

I anticipate software dev teams and people who have scheduled jobs or do ad hoc analysis are going to need several sprints to update their stuff to put the magic word in place / update their queries.

Out of curiosity, is hashing applied only to the IP, or to concatenated IP and X-Forwarded-For?

I anticipate software dev teams and people who have scheduled jobs or do ad hoc analysis are going to need several sprints to update their stuff to put the magic word in place / update their queries.

Actually given that IPs were broken for months w/o users noticing I doubt this is going to be the case, I do not think they are used often.

Out of curiosity, is hashing applied only to the IP, or to concatenated IP and X-Forwarded-For?

IP

I anticipate software dev teams and people who have scheduled jobs or do ad hoc analysis are going to need several sprints to update their stuff to put the magic word in place / update their queries.

Actually given that IPs were broken for months w/o users noticing I doubt this is going to be the case, I do not think they are used often.

Fair enough, sounds like that may mean it's mostly Research and, for the corresponding Quick Surveys piece, engineering for Quick Surveys.

Out of curiosity, is hashing applied only to the IP, or to concatenated IP and X-Forwarded-For?

IP

Thanks. Given EL bucketing is typically sampled at pseudorandom, is it safe to say duplicates with the same characteristics aren't too big of a deal in practice if timing or other request characteristics are taken into account?

Milimetric renamed this task from Add IP field only to schemas that need it. Remove it from EL capsule and do not collect it by default {mole} {21 pts] to Add IP field only to schemas that need it. Remove it from EL capsule and do not collect it by default {mole}.Feb 22 2016, 9:05 PM
Milimetric set the point value for this task to 21.
Milimetric added a subscriber: madhuvishy.

After some discussion, we've decided to drop ClientIPs from the eventlogging capsule altogether - we'll wait until QuickSurverys data collection is done to roll out this change - since it's the only usecase that needs it.

(Reason for why we are not doing this change - it introduces undue complexity in the parsing code, and adds magic to the schema - that breaks the contract the system has with schema owners and autofills some fields - since the only usecase for it will be done in a couple weeks - we might as well wait and remove it altogether. In future, other privacy sensitive fields like User agent can be taken out of the capsule easily since clients can send this data along with the the schema - no magic will be needed)

CCing @asherman as this will affect the WikimediaBlogVisit_5308166 schema, which has been using this field (assumed to be uncorrupted in that special case, cf. T119144) to calculate daily unique visitor IPs. An IP field would need to be re-added separately to that schema to keep this metric working.

@asherman @Tbayer We can solve the WikimediaBlogVisit usecase by instrumenting piwik on the blog now that we have that option (cc:@Nuria)

Uhhh didn't mean to resolve it.

@asherman, @Tbayer : IPs as an only measure to count unique users is really not the best, let's try to switch to piwik. Can you send us an estimate of the blog traffic?

@asherman, @Tbayer : IPs as an only measure to count unique users is really not the best, let's try to switch to piwik. Can you send us an estimate of the blog traffic?

It had been included by Ori back in the day to the blog's EventLogging stats reports. I think of it as a rough but robust measure (note that I said unique IPs, not unique users ;) that can be useful in various circumstances. But I am not sure how much it has been used anyway; I simply wanted to give @asherman a heads-up in case he had been relying on it (unlikely).
Thanks on behalf of the current blog team (cc @jrbs and @EdErhart-WMF ) for the offer regarding Piwik. Note though that the blog already has in-built stats from Automattic which provide e.g. daily visitor numbers; we retained the EventLogging instrumentation because it gives us additional information and the ability to directly query the logs using SQL.

I am closing this ticket but if you wish to use piwik for blog please open a phab ticket and we can get you started.

Change 271728 abandoned by Nuria:
[WIP] Allow sensitive fields like clientIP to be collected only be schemas that need it

Reason:
Not needed, IP was removed.

https://gerrit.wikimedia.org/r/271728