Proposal for handling PII survey data
Closed, Resolved | Public

Description

  • Guidelines for separating code analyzing survey data from the data itself, in order for the code to be public
  • Write a proposal for where to store survey data, which tends to contain a lot of PII.
  • Consult with data platform to discuss options for data storage.

Details

Due Date
Sep 22 2023, 4:00 AM

Event Timeline

fkaelin set Due Date to Aug 31 2023, 4:00 AM.
fkaelin moved this task from Backlog to Staged on the Research board.
leila triaged this task as High priority. Jul 27 2023, 6:12 PM

@fkaelin is this on track to be resolved by Thursday? If not, we should probably change the due date?

Moved the due date back. We have had some initial discussion that touched on this:

  • we will use Organizer Lab code T344625 (which involves disentangling PII data and code) to
    • describe the nature of the PII data collected
    • evaluate the requirements from research for storing such data
  • reach out to data-engineering
    • suggestions for storing survey data / discuss best practices
    • evaluate storing data in the data infra (e.g. ceph / kerberos)
fkaelin changed Due Date from Aug 31 2023, 4:00 AM to Sep 22 2023, 4:00 AM. Sep 5 2023, 1:28 PM

After learning more about the nature of the PII data from @YLiou_WMF, and some informal discussions with Joseph in data engineering, this proposal for PII survey data is simple: we will continue with the current way of doing this.

  • the survey data is stored on locked-down Google Drives. The amount of data is very small, and it is backed by a privacy policy that is part of the survey process
  • data engineering has a number of ongoing efforts that will impact how PII is handled (moving away from Kerberos and HDFS to e.g. Ceph); this special use case is not a good fit for that work

Instead, the focus for this quarter is to minimize the entangling of the existing code analyzing the survey results and the PII data itself (e.g. T344625), in order for the researchers to make the code available publicly. If as part of this work we identify the need for more guidelines for future work related to PII data, we will capture that in a new task.
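As an illustration only, here is a minimal Python sketch of one way analysis code could be kept separate from the PII data. It is not the Organizer Lab code from T344625. The environment variable `SURVEY_DATA_DIR`, the helpers `survey_data_dir` and `count_answers`, and the CSV layout are all hypothetical. The idea is that the public code never contains the data or its location, and only returns aggregates.

```python
import csv
import os
from collections import Counter
from pathlib import Path

# Hypothetical environment variable: points at the private data directory,
# which is never committed to the public repository.
DATA_DIR_ENV = "SURVEY_DATA_DIR"


def survey_data_dir() -> Path:
    """Return the private data directory configured outside the code."""
    value = os.environ.get(DATA_DIR_ENV)
    if not value:
        raise RuntimeError(
            f"Set {DATA_DIR_ENV} to the private survey data directory"
        )
    return Path(value)


def count_answers(csv_path: Path, question: str) -> Counter:
    """Count the answers to one question; only the counts are returned."""
    with csv_path.open(newline="", encoding="utf-8") as f:
        return Counter(row[question] for row in csv.DictReader(f))
```

A caller would run something like `count_answers(survey_data_dir() / "responses.csv", "role")`, where the file name and column name are again placeholders. Raw rows stay on the private storage, and the public repository only holds the logic.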

Closing this as done.

@fkaelin thanks for looking into this. Should this kind of data be tracked as part of our internal tracking then? (private link) https://office.wikimedia.org/wiki/Research/Documentation#Internal_documentation . If yes, you can work with @Miriam to make sure it's tracked in this case (and possible future cases).