SpecialMuteSubmit: add schemas to EventLogging whitelist
Closed, Resolved, Public


EventLogging data for the SpecialMuteSubmit schema is only kept for 90 days. We would like to sanitize the user-level info (useragent, ip, geocoded_data), keep the action event logs (defined in the schema doc), and pipe them to the event_sanitized database.


Descriptions of the kept columns:
uuid: Unique event identifier
seqid: Udp2log sequence ID
dt: Datetime
wiki: Wiki project
webhost: Request host
schema: Title of event schema
revision: Revision ID of event schema
recvfrom: Hostname of server emitting the log line
event: The encapsulated event object
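
As a rough illustration of the proposal above, the sketch below keeps only the listed columns and drops the user-level ones. It is not the analytics/refinery implementation: the function name `sanitize_row`, the constant `KEPT_COLUMNS`, and all sample values are hypothetical and exist only to show the idea of whitelist-based sanitization.

```python
# Illustrative sketch only; not the actual analytics/refinery sanitization code.
# KEPT_COLUMNS mirrors the column list in the task description.
KEPT_COLUMNS = {
    "uuid", "seqid", "dt", "wiki", "webhost",
    "schema", "revision", "recvfrom", "event",
}


def sanitize_row(row: dict) -> dict:
    """Return a copy of `row` that contains only the whitelisted columns."""
    return {name: value for name, value in row.items() if name in KEPT_COLUMNS}


# Hypothetical raw event row; every value here is made up for the example.
raw_row = {
    "uuid": "example-uuid",
    "seqid": 1,
    "dt": "2020-09-10T19:36:00Z",
    "wiki": "examplewiki",
    "webhost": "example.org",
    "schema": "SpecialMuteSubmit",
    "revision": 0,
    "recvfrom": "example-host",
    "event": {"example_field": "example_value"},
    "useragent": "ExampleAgent/1.0",
    "ip": "192.0.2.1",
    "geocoded_data": {"country": "ZZ"},
}

sanitized_row = sanitize_row(raw_row)
```

In this sketch the user-level fields (useragent, ip, geocoded_data) are simply absent from the result, while dt and the encapsulated event object are carried over unchanged.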

schema doc:


  • Legal approval for data retention
  • Add schemas to EventLogging whitelist

Event Timeline

jwang updated the task description.
jwang changed the visibility from "Public (No Login Required)" to "All Users".
Niharika changed the visibility from "All Users" to "Public (No Login Required)". Sep 10 2020, 7:36 PM
Niharika moved this task from Untriaged to Analytics on the Anti-Harassment board.
LGoto triaged this task as Medium priority.
LGoto edited projects, added Product-Analytics (Kanban); removed Product-Analytics.
LGoto moved this task from Next 2 weeks to Blocked on the Product-Analytics (Kanban) board.
jwang updated the task description.

@APalmer_WMF, thanks.

To make it clearer, I added descriptions of the columns to the proposal.

Change 628235 had a related patch set uploaded (by DannyS712; owner: Jenniferwang):
[analytics/refinery@master] Add SpecialMuteSubmit schema to EventLogging whitelist

@JFishback_WMF, please let me know if you have any concerns.

Change 628235 merged by Nuria:
[analytics/refinery@master] Add SpecialMuteSubmit schema to EventLogging whitelist

Hello @jwang same comment as here. Is it necessary to keep the dt field? What is the purpose of keeping this data long-term (please feel free to reply off-task if it is sensitive)? Thanks!

@JFishback_WMF, thanks for the review and comment. Please find my answer here, since it is the same topic.

Generally, the best practice is to minimize the data that is collected - especially high resolution data. That said, this data appears to be LOW risk, and per @jwang there is a countervailing interest in collecting the high resolution data in order to retain sequencing and other analytical uses.

jwang updated the task description.

Thanks for the review and suggestion. Marking this as resolved.