
SpecialMuteSubmit: add schemas to EventLogging whitelist
Closed, Resolved, Public

Description

Proposal
EventLogging data for the SpecialMuteSubmit schema is only kept for 90 days. We would like to sanitize the user-level fields (useragent, ip, geocoded_data), keep the action event logs (defined in https://meta.wikimedia.org/wiki/Schema:SpecialMuteSubmit), and pipe them to the event_sanitized database.

Column          Action
ip              sanitized
useragent       sanitized
uuid            keep
seqid           keep
dt              keep
wiki            keep
webhost         keep
schema          keep
revision        keep
topic           sanitized
recvfrom        keep
event           keep
geocoded_data   sanitized
year            keep
month           keep
day             keep
hour            keep
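
As a rough illustration of the keep/sanitize assignment above, here is a minimal Python sketch. It assumes that "sanitized" means the field's value is nulled out while the column itself remains. The names `KEEP_FIELDS` and `sanitize_row` are hypothetical and are not part of analytics/refinery, and the sample row is made up.

```python
# Hypothetical sketch of allowlist-based sanitization.
# Field names come from the proposal; everything else is illustrative.
KEEP_FIELDS = {
    "uuid", "seqid", "dt", "wiki", "webhost", "schema", "revision",
    "recvfrom", "event", "year", "month", "day", "hour",
}


def sanitize_row(row: dict) -> dict:
    """Return a copy of row where every field not in KEEP_FIELDS is None.

    Assumption: "sanitized" is modelled as nulling the value.
    """
    return {
        name: (value if name in KEEP_FIELDS else None)
        for name, value in row.items()
    }


if __name__ == "__main__":
    # Made-up example row; 203.0.113.7 is a documentation-only IP address.
    sample = {
        "ip": "203.0.113.7",
        "useragent": "ExampleBrowser/1.0",
        "uuid": "abc123",
        "dt": "2020-09-10T01:06:00Z",
        "wiki": "enwiki",
        "topic": "example_topic",
        "geocoded_data": {"country": "XX"},
        "event": {"action": "example"},
    }
    print(sanitize_row(sample))
```

Using an allowlist rather than a blocklist means any field added to the capsule later would be nulled by default until someone explicitly decides to keep it.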

The description of the kept columns:
uuid: Unique event identifier
seqid: Udp2log sequence ID
dt: datetime
wiki: wiki project
webhost: Request host
schema: Title of event schema
revision: Revision ID of event schema
recvfrom: Hostname of server emitting the log line
event: The encapsulated event object

schema doc:
https://meta.wikimedia.org/wiki/Schema:EventCapsule
https://meta.wikimedia.org/wiki/Schema:SpecialMuteSubmit

Stages:

  • Legal approval for data retention
  • Add schemas to EventLogging whitelist

Event Timeline

jwang created this task. Sep 10 2020, 1:06 AM
jwang updated the task description.
jwang changed the visibility from "Public (No Login Required)" to "All Users".
Niharika changed the visibility from "All Users" to "Public (No Login Required)". Sep 10 2020, 7:36 PM
Niharika moved this task from Untriaged to Analytics on the Anti-Harassment board.
jwang updated the task description. Sep 14 2020, 11:45 PM
LGoto assigned this task to jwang. Sep 15 2020, 5:08 PM
LGoto triaged this task as Medium priority.
LGoto edited projects, added Product-Analytics (Kanban); removed Product-Analytics.
LGoto moved this task from Next 2 weeks to Blocked on the Product-Analytics (Kanban) board.

Approved by Legal.

jwang updated the task description. Sep 15 2020, 6:41 PM
jwang updated the task description.
jwang added a comment. Sep 15 2020, 6:49 PM

@APalmer_WMF , Thanks.

To make it clearer, I added descriptions of the columns to the proposal.

jwang updated the task description. Sep 15 2020, 6:49 PM

Change 628235 had a related patch set uploaded (by DannyS712; owner: Jenniferwang):
[analytics/refinery@master] Add SpecialMuteSubmit schema to EventLogging whitelist

https://gerrit.wikimedia.org/r/628235

@JFishback_WMF, please let me know if you have any concerns.

Change 628235 merged by Nuria:
[analytics/refinery@master] Add SpecialMuteSubmit schema to EventLogging whitelist

https://gerrit.wikimedia.org/r/628235

Hello @jwang, same comment as here. Is it necessary to keep the dt field? What is the purpose of keeping this data long-term (please feel free to reply off-task if it is sensitive)? Thanks!

jwang added a comment. Sep 29 2020, 5:06 PM

@JFishback_WMF, thanks for the review and comment. Please find my answer here, since both comments cover the same topic.

jwang moved this task from Blocked to Doing on the Product-Analytics (Kanban) board.
kzimmerman added a subscriber: kzimmerman.

Pending final review from @JFishback_WMF

Generally, the best practice is to minimize the data that is collected, especially high-resolution data. That said, this data appears to be LOW risk, and per @jwang there is a countervailing interest in collecting the high-resolution data in order to retain sequencing and support other analytical uses.

jwang closed this task as Resolved. Dec 16 2020, 12:26 AM
jwang updated the task description.

Thanks for the review and suggestion. Marking it as resolved.