Description
Some of the default EventLogging fields are not needed in some schemas. They can be privacy-invasive, take up a lot of space in the database, and make the log message large (cf. T91347); the prime culprits are userAgent and clientIp. There should be a way to opt out of logging them.
Related Objects
- Mentioned In
  - T114078: Eventlogging should transparently split large event payloads
  - T91347: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?)
- Mentioned Here
  - T121550: eventlogging user agent data should be parsed so spiders can be easily identified {flea}
  - T126366: Add IP field only to schemas that need it. Remove it from EL capsule and do not collect it by default {mole}
  - T128407: Remove Client IP from Eventlogging capsule {mole}
  - T119144: EventLogging sees too few distinct client IPs {oryx} [8 pts]
  - T91347: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?)
Event Timeline
As a stakeholder of the EventLogging service provided by Analytics, I request that they decline this task.
By definition any collection of data about anyone for any purpose is "privacy-invasive". We balance that privacy invasion against our desires to understand our users and how they use our sites for the purposes of serving their needs better. I agree that providing a general-purpose opt-out for data collection is important to respect a user's desire to not have data about them collected. But I don't agree with providing many such opt-outs for many bits of the data. The complexity of user interface and implementation that results is not worth the reward. I would rather see Analytics focus its effort on a general-purpose opt-out than a specific one.
I think @Tgr is talking about per-schema opt-out in the software, rather than a user choice. If I understood him correctly, I support that.
Indeed, I was thinking of a way to disable IP/useragent collection in the schema configuration (or logEvent call or whatever works), just worded it poorly.
For user opt-out, AFAIK we disable EventLogging when the Do Not Track header is set, and that seems good enough to me.
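To make the per-schema idea concrete, here is a minimal sketch of what such an opt-out could look like on the processing side. Everything in it is an assumption for illustration: `OMITTED_CAPSULE_FIELDS`, `strip_capsule_fields`, and `ExampleSchema` are hypothetical names, not part of EventLogging, and the event is assumed to be a flat dict with a `schema` key next to the capsule fields.

```python
from typing import Any, Dict, FrozenSet

# Hypothetical per-schema configuration (not an EventLogging setting):
# maps a schema name to the capsule fields that schema opts out of.
OMITTED_CAPSULE_FIELDS: Dict[str, FrozenSet[str]] = {
    "ExampleSchema": frozenset({"userAgent", "clientIp"}),
}


def strip_capsule_fields(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the event without the fields its schema opted out of.

    Schemas with no entry in the configuration are returned unchanged.
    """
    omitted = OMITTED_CAPSULE_FIELDS.get(event.get("schema", ""), frozenset())
    return {key: value for key, value in event.items() if key not in omitted}
```

Dropping the fields before the event is stored would address both the privacy and the size concern; the same lookup could equally live in the logEvent call if that turns out to be the better place.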
ClientIp is always encrypted and takes no space, so I do not think it is an issue. Also, user-agent and IP are deleted after 90 days per our privacy guidelines.
For schemas where tracking a user across log entries isn't needed, I think this is a very good idea. The more schemas we have, the more information about the hashed IP we're storing, and in aggregate that allows us to get closer to reidentification.
So yes, I really think this should be supported.
Actually, it looks like many or almost all schemas have already involuntarily opted out of logging valid clientIPs for more than five months, and nobody noticed ;) T119144
Reworded the title to try to capture the intent of the task based on the above discussion; feel free to modify it further.
We'll preprocess the user-agent as part of T121550, so that should help with this as well.
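As a rough illustration of what preprocessing could mean here, the sketch below reduces a raw user-agent string to a single coarse flag instead of storing the string itself. It is not the approach chosen in T121550: `SPIDER_PATTERN` and `summarize_user_agent` are hypothetical names, and the keyword match is a deliberately crude stand-in for real spider detection.

```python
import re
from typing import Dict

# Assumption: a crude keyword match for illustration only; real spider
# detection would need a maintained list of known crawlers.
SPIDER_PATTERN = re.compile(r"bot|spider|crawl", re.IGNORECASE)


def summarize_user_agent(raw_user_agent: str) -> Dict[str, bool]:
    """Replace the raw user-agent string with a coarse, less identifying summary."""
    return {"is_spider": bool(SPIDER_PATTERN.search(raw_user_agent))}
```

Storing only a summary like this keeps the information needed to filter out spiders while making the field smaller and less useful for identifying individual users.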
The user agent is no longer stored raw and the IP has been dropped, so this task no longer applies.
Via EventBus, we now also support events that don't have the schema capsule.