
SpecialInvestigate: add schemas to EventLogging whitelist
Closed, Resolved · Public

Description

Proposal
The EventLogging schema for Special:Investigate has been enabled and pipelined to the event database (https://phabricator.wikimedia.org/T255687), but the data is only kept for 90 days. We would like to sanitize the user-level information (useragent, ip, geocoded_data), keep the action event logs (defined in https://meta.wikimedia.org/wiki/Schema:SpecialInvestigate), and pipeline them to the event_sanitized database.

Column          Action
useragent       sanitized
uuid            keep
seqid           keep
dt              keep
wiki            keep
webhost         keep
schema          keep
revision        keep
topic           sanitized
recvfrom        keep
event           keep
geocoded_data   sanitized
ip              sanitized
year            keep
month           keep
day             keep
hour            keep

The description of the kept columns:
uuid: Unique event identifier
seqid: Udp2log sequence ID
dt: datetime
wiki: wiki project
webhost: Request host
schema: Title of event schema
revision: Revision ID of event schema
recvfrom: Hostname of server emitting the log line
event: The encapsulated event object
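
The keep/sanitize decisions listed above can be illustrated with a minimal Python sketch. This is not the refinery sanitization job or its whitelist format, neither of which is shown in this task. The function name `sanitize_record`, the choice to null sanitized fields rather than remove them, and the sample values are all assumptions made for illustration.

```python
# Illustrative sketch only; not the actual analytics/refinery implementation.
# Field names come from the proposal table in this task.
KEPT_FIELDS = {
    "uuid", "seqid", "dt", "wiki", "webhost", "schema", "revision",
    "recvfrom", "event", "year", "month", "day", "hour",
}
SANITIZED_FIELDS = {"useragent", "topic", "geocoded_data", "ip"}


def sanitize_record(record: dict) -> dict:
    """Return a copy of `record` that retains only whitelisted values.

    Assumption: sanitized fields are set to None instead of being removed,
    and fields that appear in neither set are dropped.
    """
    sanitized = {}
    for name, value in record.items():
        if name in KEPT_FIELDS:
            sanitized[name] = value
        elif name in SANITIZED_FIELDS:
            sanitized[name] = None
    return sanitized


# Hypothetical record; none of these values come from real data.
example = {
    "uuid": "example-uuid",
    "dt": "2020-09-10T00:23:00Z",
    "wiki": "examplewiki",
    "ip": "192.0.2.1",
    "useragent": {"browser_family": "ExampleBrowser"},
    "event": {"hypothetical_field": "value"},
}
clean = sanitize_record(example)
```

In this sketch `clean` still carries `uuid`, `dt`, `wiki` and `event` unchanged, while `ip` and `useragent` are present but empty.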

Schema docs:
https://meta.wikimedia.org/wiki/Schema:EventCapsule
https://meta.wikimedia.org/wiki/Schema:SpecialInvestigate

Stages:

  • Legal approval for data retention
  • Add schemas to EventLogging whitelist

Event Timeline

jwang created this task. Sep 10 2020, 12:23 AM
Restricted Application added a subscriber: Aklapper. Sep 10 2020, 12:23 AM
jwang updated the task description.
jwang changed the visibility from "Public (No Login Required)" to "All Users".
jwang added a subscriber: Niharika.

jwang changed the visibility from "Public (No Login Required)" to "All Users".

@jwang: Hi, was this view policy change intentional? Asking as it makes no sense, given our setup that anyone can create an account anyway. :)

Aklapper changed the visibility from "All Users" to "Public (No Login Required)". Sep 10 2020, 7:48 AM
Niharika moved this task from Untriaged to Analytics on the Anti-Harassment board. Sep 10 2020, 7:35 PM
jwang updated the task description. Sep 14 2020, 11:44 PM
LGoto assigned this task to jwang. Sep 15 2020, 5:10 PM
LGoto triaged this task as Medium priority.
LGoto edited projects, added Product-Analytics (Kanban); removed Product-Analytics.
LGoto moved this task from Next 2 weeks to Blocked on the Product-Analytics (Kanban) board.

Approved by Legal.

jwang updated the task description. Sep 15 2020, 6:40 PM
jwang updated the task description. Sep 15 2020, 6:43 PM
jwang added a comment. Sep 15 2020, 6:49 PM

@APalmer_WMF , Thanks.

To make it clearer, I added descriptions of the kept columns to the proposal.

Change 628237 had a related patch set uploaded (by DannyS712; owner: Jenniferwang):
[analytics/refinery@master] Add SpecialInvestigate schema to EventLogging whitelist

https://gerrit.wikimedia.org/r/628237

@JFishback_WMF, please let me know if you have any concerns.

Hello @jwang thanks for reaching out. Is the purpose just to track usage long-term? Also, if you don't need high-precision time, I would remove dt from the schema as well. It looks like you can still track hourly precision without that field? Other than that, this looks fine to me.

For future reference if you'd like to request a privacy review from Security you can use this process (specifically step 2). That form helps to give me background on the ask. As a bonus, I run these reviews by WMF-Legal as well, so it's like a 2-for-1 deal! (Although I see you already reached out to @APalmer_WMF on this task.)

jwang added a comment. Sep 29 2020, 4:58 PM

@JFishback_WMF, thanks for the review and suggestions. Right, when I started the review conversation, the process was unclear to me. I will follow your suggestion next time.

As for dt, it is kept by default when we add a schema to the whitelist, and it is easier to extract date information from that format. Let me know if you have further concerns about it. If so, I will need to discuss with the Analytics team how to opt it out technically.

kzimmerman added a subscriber: kzimmerman.

Pending final review from @JFishback_WMF

Change 628237 merged by Mforns:
[analytics/refinery@master] Add SpecialInvestigate schema to EventLogging whitelist

https://gerrit.wikimedia.org/r/628237

Generally, the best practice is to minimize the data that is collected - especially high resolution data. That said, this data appears to be LOW risk, and per @jwang there is a countervailing interest in collecting the high resolution data in order to retain sequencing and other analytical uses.
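
The trade-off discussed above (hour-level precision from the year/month/day/hour columns versus sequencing from dt) can be sketched in Python. This is only an illustration: the ISO 8601 layout `%Y-%m-%dT%H:%M:%SZ` for dt, the helper name `partition_fields`, and the sample events are assumptions, not details confirmed in this task.

```python
from datetime import datetime

# Assumed dt layout; the real field format is not specified in this task.
DT_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def partition_fields(dt: str) -> dict:
    """Derive hour-precision partition values from a dt string."""
    parsed = datetime.strptime(dt, DT_FORMAT)
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "hour": parsed.hour,
    }


# Hypothetical events that fall within the same hour.
events = [
    {"uuid": "b", "dt": "2020-09-10T00:23:45Z"},
    {"uuid": "a", "dt": "2020-09-10T00:05:10Z"},
]

# The partition values alone cannot tell these two events apart...
same_hour = partition_fields(events[0]["dt"]) == partition_fields(events[1]["dt"])

# ...whereas dt preserves their order within the hour.
ordered = sorted(events, key=lambda e: datetime.strptime(e["dt"], DT_FORMAT))
```

Under these assumptions, dropping dt would still allow hourly usage counts, but the order of actions inside a single hour would be lost.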

kzimmerman closed this task as Resolved. Mon, Dec 21, 5:12 PM
jwang updated the task description. Mon, Dec 21, 5:32 PM