The Privacy team has developed guidelines for data publication. (private document) Given that Research is involved in collecting and publishing data and the new guidelines can affect our work, we should set aside time to review the guidelines and provide feedback.
Description
Details
- Due Date
- Jul 21 2023, 12:00 AM
Event Timeline
@fkaelin this task is ready for you to pick up whenever you're ready. By the deadline, we should provide feedback in the private document linked from the description. In discussion with @Miriam we thought you would likely want to get @Isaac's input as well, given that he frequently works on data releases. There may be other members of the team whose input you need. Please judge that on your end. : )
Hey @Htriedman - thank you for this work, we encounter this question regularly.
I used the opportunity to revive the idea of a streaming topN pageviews dataset. I found this is a good example to think through the risk tiering grid, as view counts are a standard metric to share but the time granularity is new (as it is a streaming pipeline). I made general comments in the doc itself.
The proposed dataset consists of tsv files that are snapshots of the topN pages for a given interval (e.g. 30 minutes), but technically we could also expose the API itself publicly (to retrieve the topN pages in realtime). In my opinion such an API should go through a privacy review, while the hourly snapshots wouldn't have to (based on the risk tiering grid, plus this data is generated in batch eventually too). Where is the line when it comes to requiring a privacy review?
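To make the snapshot idea concrete, here is a minimal sketch of how per-interval topN tsv snapshots could be derived from a stream of pageview events. It is only an illustration, not the actual pipeline: the `(unix_timestamp, page_title)` event shape, the function names `top_n_snapshots` and `snapshot_to_tsv`, and the `page`/`views` column names are all assumptions made up for this example.

```python
from collections import Counter, defaultdict


def top_n_snapshots(events, interval_seconds=1800, n=2):
    """Group (unix_timestamp, page_title) events into fixed intervals
    and keep the n most viewed pages per interval.

    The event shape is a hypothetical stand-in for real pageview data.
    """
    counts = defaultdict(Counter)
    for ts, page in events:
        # Align each event to the start of its interval (e.g. 30 minutes).
        bucket = ts - ts % interval_seconds
        counts[bucket][page] += 1

    snapshots = {}
    for bucket in sorted(counts):
        # Sort by views descending, then by title so ties are deterministic.
        ranked = sorted(counts[bucket].items(), key=lambda kv: (-kv[1], kv[0]))
        snapshots[bucket] = ranked[:n]
    return snapshots


def snapshot_to_tsv(rows):
    """Render one snapshot as tsv text with an assumed page/views header."""
    lines = ["page\tviews"]
    lines += [f"{page}\t{views}" for page, views in rows]
    return "\n".join(lines) + "\n"


events = [(0, "A"), (10, "B"), (20, "A"), (1800, "C"), (1900, "C"), (2000, "A"), (2100, "B")]
snapshots = top_n_snapshots(events, interval_seconds=1800, n=2)
for start, rows in snapshots.items():
    print(start)
    print(snapshot_to_tsv(rows), end="")
```

In this sketch the interval length is the only thing that separates a coarse batch-style snapshot from something close to realtime, which is why the time granularity is the part worth weighing against the risk tiering grid.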
Thanks for your comments, @fkaelin! I'll get back to you about the topN pages once we meet about it.
Thanks again for this work and for the discussion we had, @Htriedman. Closing as resolved.