Problem: A number of stakeholders are interested in accessing heavily anonymized traffic data, and we would like to grant such access to a select group of people.
Who is interested in this data? Stakeholders (potential "light collaborators") include:
- Third-party organizations whose mission is aligned with movement and WMF strategy
- Researchers working on highly relevant projects that are not included in our annual plan but whose research direction is closely aligned with movement strategy and the WMF mission
- Community members and points of contact at local chapters interested in understanding the audiences in the regions they operate in
Which data? These stakeholders are asking for access to sufficiently anonymized traffic data. Examples include:
- Pageviews per article by country
- Hourly pageviews at project level by country
How can we make this happen? Solutions envisioned for this problem, as discussed with Analytics (please add/revise as needed):
- Layered reading permissions for the datasets in the data lake, enabling subsets of users to access given datasets. Potentially create a new user group (e.g. 'light-analytics-privatedata') that would have visibility only on the selected aggregated tables. Once they have signed an NDA/MOU, the 'light' collaborators would have to request access to this new user group.
- These 'light' collaborators would access the selected datasets via notebook servers rather than stat machines. This would limit resource usage to the specific datasets and would allow us to provide more clearly defined templates and best practices for using the data in the tables.
- Get dedicated notebook machines for this specific use-case.
- Restrict data visibility within a machine: allow users to see only the files in their own home directory, not those in other users' homes. This cannot be done for specific groups of people; it has to be enforced across the whole system, which is why dedicated machines look like the best solution for this problem.
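
As an illustration of the home-directory restriction in the last item, here is a minimal sketch. It assumes a POSIX filesystem where each user's home is a directory directly under one base directory. The function name `restrict_home_dirs` and the user names are hypothetical, and this is not a description of how Analytics would actually configure the machines. It sets every home directory to owner-only access (mode 0o700), which matches the point that the restriction applies to the whole system rather than to one group.

```python
import stat
import tempfile
from pathlib import Path

# Owner-only access: read/write/execute for the owner, nothing for group or others.
OWNER_ONLY = stat.S_IRWXU  # equivalent to 0o700


def restrict_home_dirs(base: Path) -> list[Path]:
    """Set every directory directly under `base` to mode 0o700.

    Returns the directories whose mode had to be changed.
    Regular files and symlinks under `base` are left alone.
    """
    changed = []
    for entry in sorted(base.iterdir()):
        if entry.is_symlink() or not entry.is_dir():
            continue
        mode = stat.S_IMODE(entry.stat().st_mode)
        if mode != OWNER_ONLY:
            entry.chmod(OWNER_ONLY)
            changed.append(entry)
    return changed


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real /home.
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        for user in ("alice", "bob"):  # hypothetical user names
            home = base / user
            home.mkdir()
            home.chmod(0o755)  # start world-readable, as many systems do by default
        for path in restrict_home_dirs(base):
            print(path.name, oct(stat.S_IMODE(path.stat().st_mode)))
```

On a real machine this would need to run with sufficient privileges and be paired with a default that creates new home directories with the same mode; both are outside the scope of this sketch.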
Open questions:
- How much can we share with non-formal collaborators?
- What are the criteria we need to evaluate when granting data access?
- How many resources do we need to 'educate' these new users on how to work with shared resources?
- What should be the duration of the access?
- How will the collaborators use the data? That is, will it be for their own private use or will it be made public?