Proposal
The EventLogging schema for Special:Investigate has been enabled and its events are pipelined to the event database (https://phabricator.wikimedia.org/T255687). However, the data is only retained for 90 days. We would like to sanitize the user-level fields (useragent, ip, geocoded_data), keep the action event logs (defined in https://meta.wikimedia.org/wiki/Schema:SpecialInvestigate), and pipeline the result to the event_sanitized database.
useragent | uuid | seqid | dt | wiki | webhost | schema | revision | topic | recvfrom | event | geocoded_data | ip | year | month | day | hour
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
sanitized | keep | keep | keep | keep | keep | keep | keep | sanitized | keep | keep | sanitized | sanitized | keep | keep | keep | keep
Descriptions of the kept columns:
- uuid: Unique event identifier
- seqid: Udp2log sequence ID
- dt: Event timestamp (datetime)
- wiki: Wiki project
- webhost: Request host
- schema: Title of the event schema
- revision: Revision ID of the event schema
- recvfrom: Hostname of the server emitting the log line
- event: The encapsulated event object
Schema documentation:
- https://meta.wikimedia.org/wiki/Schema:EventCapsule
- https://meta.wikimedia.org/wiki/Schema:SpecialInvestigate
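To make the keep/sanitized assignment in the table above concrete, here is a minimal Python sketch of the intended effect on a single row. It is an illustration only, not the actual sanitization job. It assumes that "sanitized" means the value is nulled out, and that any column not listed in the table is dropped. The function name `sanitize_row` and the sample values are hypothetical.

```python
# Column policy copied from the table in this proposal.
KEEP = {
    "uuid", "seqid", "dt", "wiki", "webhost", "schema", "revision",
    "recvfrom", "event", "year", "month", "day", "hour",
}
SANITIZE = {"useragent", "topic", "geocoded_data", "ip"}


def sanitize_row(row):
    """Return a sanitized copy of one event row (hypothetical helper).

    Assumptions: "sanitized" columns are set to None, and columns that
    are not listed in the table are dropped.
    """
    clean = {}
    for column, value in row.items():
        if column in KEEP:
            clean[column] = value
        elif column in SANITIZE:
            clean[column] = None
        # Unlisted columns are intentionally omitted.
    return clean


# Example row with made-up values.
example = {
    "uuid": "00000000-0000-0000-0000-000000000000",
    "wiki": "enwiki",
    "ip": "192.0.2.1",
    "useragent": {"browser_family": "Firefox"},
    "event": {"example_field": "example_value"},
}
sanitized = sanitize_row(example)
```

In this sketch the action data in `event` passes through untouched, while `ip` and `useragent` are present but empty in the output.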
Stages:
- Legal approval for data retention
- Add schemas to EventLogging whitelist