
Enable layered data-access and sharing for a new form of collaboration
Open, Medium, Public

Description

Problem: A number of stakeholders are interested in accessing heavily anonymized traffic data, and we would like to enable such access for a selected group of people.

Who is interested in this data? Interested stakeholders (potential "light collaborators") include:

  • 3rd-party organizations whose mission is aligned with the movement's and WMF's strategy
  • Researchers working on highly relevant projects that are not included in our annual plan, but whose research direction is closely aligned with the movement strategy and the WMF mission
  • Community members and points of contact at local chapters interested in understanding the audiences in the regions they operate in.

Which data? These stakeholders are asking for access to sufficiently anonymized traffic data. Examples include:

  • Pageviews per article by country
  • Hourly pageviews at project level by country

How can we make this happen? Solutions envisioned for this problem, as discussed with Analytics (please add/revise as needed):

  • Layered reading permissions for the datasets in the data lake, enabling subsets of users to access given datasets. Potentially create a new user group (e.g. 'light-analytics-privatedata') which would have visibility only on the selected aggregated tables. The 'light' collaborators, once they have signed an NDA/MOU, would request access to this new user group. (A rough sketch of what this could look like follows this list.)
  • These 'light' collaborators would have access to the selected datasets via notebook servers, rather than stat machines. This would limit the usage of the resources to the specific datasets, and would allow us to provide more clearly defined templates and best practices for using the data in the tables.
  • Get dedicated notebook machines for this specific use-case.
  • Restrict data visibility within a machine. Allow users to only see the files in their own home directories, and not in any other user's home. This cannot be done for specific groups of people; it has to be enforced across the whole system, which is why dedicated machines look like the best solution for this problem.
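
As a very rough illustration of the layered-permission idea above (not a design decision): assuming HDFS ACLs are enabled in the cluster, a 'light-analytics-privatedata' POSIX group could be granted read-only access to individual aggregated datasets, and that access could be revoked when a time-limited collaboration ends. The dataset path below is a placeholder.

```
import subprocess

# Placeholder dataset path, purely for illustration.
GROUP = "light-analytics-privatedata"
DATASET = "/wmf/data/archive/pageviews_by_country"  # hypothetical aggregated table location


def grant_read_only(group: str, path: str) -> None:
    """Give a POSIX group recursive read-only access to one dataset via HDFS ACLs."""
    subprocess.run(
        ["hdfs", "dfs", "-setfacl", "-R", "-m", f"group:{group}:r-x", path],
        check=True,
    )


def revoke_access(group: str, path: str) -> None:
    """Drop the group's ACL entry again, e.g. when a time-limited collaboration ends."""
    subprocess.run(
        ["hdfs", "dfs", "-setfacl", "-R", "-x", f"group:{group}", path],
        check=True,
    )


if __name__ == "__main__":
    grant_read_only(GROUP, DATASET)
```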

Open questions:

  • How much can we share with non-formal collaborators?
  • What are the criteria we need to evaluate when granting data access?
  • How many resources do we need to 'educate' these new users on how to work with shared resources?
  • What should be the duration of the access?
  • How will the collaborators use the data? That is, will it be for their own private use or will it be made public?

Event Timeline

Miriam renamed this task from Enable layered data-access and sharing for new collaborators to Enable layered data-access and sharing for a new form of collaboration. Feb 21 2020, 2:57 PM
Miriam updated the task description. (Show Details)
ssingh updated the task description. (Show Details)

Hi @Miriam / @ssingh, could you please associate at least one active project with this task (via the Add Action...Change Project Tags dropdown)? Is this about Research, or maybe something else? :) This will allow others to get notified and see this task when looking at the corresponding project workboard. Thanks!

Hi @Aklapper sure! We are finalizing the task description and then will add the appropriate tags :)

Leaving a note since a lot of background work is happening: we didn't forget about this :)

The Analytics team is currently working on T243934 and T246578, to reach the following goals:

  • unification of stat/notebook functionalities so that all client nodes will have the same configuration.
  • reduction of the POSIX groups, and enforcement of some sane security defaults (like read restrictions on home directories that may contain PII data).

The above points will be the baseline to work on a solution for this use case. In the meantime, we could work on the following:

  1. definition of the datasets that are currently needed (starting from the examples that Miriam provided) - some work is needed to create all the automation to push this data to Hadoop, independently of any other access/security solution.
  2. come up with one or two examples of access requests, together with background and needs. In this way we'll have a better idea about who will require access and what tools will be needed (notebooks? scripts? Superset? etc.).

Once we have more details we'll be able to sort out a technical solution more easily. I don't want to rush anybody, just to make the point that the "access" part of this task is currently a blocker, but it is not the only technical work needed :)

Milimetric triaged this task as Medium priority. Mar 2 2020, 5:05 PM
Milimetric moved this task from Incoming to Data Exploration Tools on the Analytics board.
Milimetric subscribed.

We should have a meeting about this towards the end of this quarter / beginning of next. Food for thought until then:

Highly anonymized pageviews-per-article-per-country data is almost a contradiction, as anonymizing that data more or less deletes most of it (long tail / high cardinality). Let's talk about the approach and figure out how we might apply differential privacy in the long term. Bringing other use cases to the meeting would be useful.
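
To make "apply differential privacy" slightly more concrete, here is a minimal sketch of the standard Laplace mechanism applied to per-(article, country) counts. The epsilon, the release threshold, and the assumption that each user contributes at most one view per cell are all illustrative, not a proposal:

```
import numpy as np


def dp_release(counts, epsilon=1.0, threshold=20):
    """Add Laplace noise with scale sensitivity/epsilon (sensitivity assumed to be 1,
    i.e. each user contributes at most one view per cell), then drop cells that stay
    below a release threshold. All parameter values here are illustrative."""
    rng = np.random.default_rng()
    released = {}
    for (article, country), views in counts.items():
        noisy = views + rng.laplace(loc=0.0, scale=1.0 / epsilon)
        if noisy >= threshold:
            released[(article, country)] = int(round(noisy))
    return released


# Toy daily counts keyed by (article, country).
raw = {("Influenza", "IT"): 1532, ("Influenza", "IS"): 3, ("Zika_virus", "BR"): 874}
print(dp_release(raw))
```

Note how the long tail behaves here: the 3-view cell is essentially destroyed by the noise and the threshold, which is exactly the "anonymizing deletes most of it" problem described above.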

The internal use cases would be nice to support, and I think we can discuss that separately from how much we trust anonymizing approaches and partial sharing.

@Milimetric that is a good point.

@Miriam I suggest replacing "highly anonymized" in the task description with "sufficiently or somewhat anonymized".

@Miriam @elukey the layered permission system can have internal use-cases, too. As I mentioned during our meeting at all-hands, I'd like to advocate that we do not give blanket access to raw webrequest logs to /everyone/ on staff (who has signed an NDA) who needs to access some portion of webrequest logs, or data based on them. Layering will allow us to increase access gradually and as needed, based on the analysis needs of the specific projects. (How many layers we can introduce in practice, given the resource constraints we have, is something to discuss; maybe there will be only 2 layers, maybe more.)

What is the number of users this potential system would serve? 10/100?

@Nuria can you help me understand in what sense the answer to this question is important? Is it about RAM and Storage or some other factors?

This ask, in terms of infrastructure, is a significant one and we would like to see how many users are benefiting from it. For example, thinking about the use case of "pageviews per article per country data", I think our efforts are perhaps better used in providing, say, a differentially private dataset that the whole world can use than in rolling out a significant infra piece for a very select group of users. This is true in this particular case; now, there were other use cases (some of them from Traffic) where the tradeoffs were not so apparent.

This ask, in terms of infrastructure, is a significant one and we would like to see how many users are benefiting from it.

Am I correct in that you want to have a better sense of the impact of the work? I'm asking because the number of people who will have access will not necessarily be a good indicator of the impact of the work. (See below for more.)

For example, thinking about the use case of "pageviews per article per country data", I think our efforts are perhaps better used in providing, say, a differentially private dataset that the whole world can use than in rolling out a significant infra piece for a very select group of users. This is true in this particular case; now, there were other use cases (some of them from Traffic) where the tradeoffs were not so apparent.

Correct. And I'd say the most critical use-cases are the ones we discussed during all-hands, including the use-cases by the Traffic team. To summarize that discussion:

  • There are 3rd-party organizations whose data we use for highly critical, mission-focused work. We can/should consider supporting these organizations in return, as their strength strengthens the ecosystem we operate in.
  • There are problems that are strategic for the world and not necessarily for WMF. We want a way to accommodate research on these highly strategic problems. I'm talking at the level of COVID-19 here. In situations like the one the world is in right now with COVID-19, WMF gets approached because the data we have may help detect the spread of the disease more effectively and, as a result, help curb its further spread. We need a way to provide access for this kind of request in a way that is respectful of readers' (perceived) privacy.

@Miriam @elukey the layered permission system can have internal use-cases, too. As I mentioned during our meeting at all-hands, I'd like to advocate that we do not give blanket access to raw webrequest logs to /everyone/ on staff (who has signed an NDA) who needs to access some portion of webrequest logs, or data based on them. Layering will allow us to increase access gradually and as needed, based on the analysis needs of the specific projects. (How many layers we can introduce in practice, given the resource constraints we have, is something to discuss; maybe there will be only 2 layers, maybe more.)

I completely agree, I opened T246755 to investigate a technical option (to avoid spamming this task).

I had a chat with Miriam about this:

  • The pageview granularity request from Sukhbir should be handled as a separate task/project. Miriam will follow up.
  • I explained some ideas that the Analytics team has to experiment with a different access control policy for datasets (details in T246755).
  • We should focus on a few specific use cases first, and see if we can generalize further. Miriam is going to add to this task 2-3 specific requests that came to the Research team.
  • The users of the above use cases will likely need access to stat100x and to the cluster. The main difference with the researchers currently collaborating with them (with access to our production) is that the scope of the project is narrow and specific, and possibly limited in time. By contrast, longer and more complex research collaborations start from a more unknown set of premises and hence need more time and a broader range of datasets/resources.
  • The users that we are discussing will, most of the time, need to pull pageview-related data and correlate it with some of their own. This doesn't mean copying data to other systems, but rather having a way (notebooks, for example) to correlate different datasets with pageviews.
  • I explained how much impact new users can have on the Analytics team's resources, especially related to training (using the Hadoop cluster responsibly, its tools like Spark/notebooks, datasets, etc.). The Research team is also constrained in resources, so we should really think about how to handle these use cases properly. Their narrow focus could allow us to build good docs that do a lot of the work for us; this needs to be investigated.

Thanks @elukey for this summary.

There are two macro use-cases for the release of, or simplified access to, article pageviews by country data.

  1. Communities. Communities can use the article pageviews by country data to evaluate the impact of work by volunteer editors and organizers, as well as affiliates. Examples of projects associated with the requests for this data: targeted campaigns run by Wikimedia chapters to increase awareness about specific topics; targeted editathons to increase content and engagement on specific topics; building targeted partnerships that involve donations of sources or other data to be used in the Wikimedia projects.
  2. Researchers. Researchers can use this data to measure localized collective attention on specific topics. Relevant research projects associated with the requests for this data include:
    • analysis of outbreaks (Zika, COVID-19, Influenza) - here pageview data is useful to understand the role of Wikipedia in times of crisis, and to monitor information diffusion (useful for disinformation studies);
    • computational social science projects aiming at studying information consumption in the context of civic engagement studies, disinformation campaigns and fake news diffusion, and knowledge gaps. For most of these research works, getting the pageviews for the top-N countries per article is enough, or, vice versa, the top-N articles by country; the tail is less important. (A rough sketch of such a top-N cut follows below.)
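
As a rough sketch of what such a top-N cut could look like, assuming a hypothetical aggregated table with article, country and views columns (the database/table and column names are placeholders, and this implies no decision about where or how the data would live):

```
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("topn-pageviews-sketch").getOrCreate()

# Hypothetical aggregated input: one row per (article, country) with a views count.
views = spark.table("hypothetical_db.pageviews_by_country")

N = 10
by_article = Window.partitionBy("article").orderBy(F.desc("views"))

# Keep only the N countries with the most views for each article.
top_countries_per_article = (
    views.withColumn("rank", F.row_number().over(by_article))
         .filter(F.col("rank") <= N)
         .drop("rank")
)

top_countries_per_article.show()
```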

We have to think more about how to accomplish this, taking into account all the security implications we've discovered on the first pass.

My opinion on this request is that having not-thoroughly-supervised contributors access the data introduces too big a risk of a data leak; I think we should strive to make this data available publicly with a differentially private strategy.

I am interested in making tools with this data, but I am not familiar with the analytics infrastructure; I am more familiar with the Toolforge infrastructure. My idea is that this data could live in a MariaDB database on the tools.db.svc.wikimedia.cloud server and work like the wiki replicas: the complete data is restricted, and the "safe" data is available in a public database (one whose name ends with _p) through a view that hides the sensitive data. Or, to keep it simpler, create only a public database with just the safe data.

It would also be good if the pageview data that is currently only available via dumps and AQS could be made available in that way; SQL databases are much faster and can do many things that dumps and AQS can't. In some of my tools (example) I need a table of ptwiki article pageviews, so I had to create a script that once a day reads the pageview dumps, compares the page names with the article names to get the page IDs, and stores the pageview data in the tool database. If there were a database with this data maintained by the Analytics team, it would be simpler to develop that kind of tool.
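
For reference, roughly the kind of once-a-day job described above, written against the public hourly pageviews dump format (project code, page title, view count, byte count per line). The file name is a placeholder, and the real tool would additionally resolve titles to page IDs against the ptwiki replica and write the totals into the tool's own database, which is skipped here:

```
import bz2

# Placeholder: a locally downloaded hourly pageviews dump file.
DUMP = "pageviews-20200301-000000.bz2"


def ptwiki_views(path):
    """Yield (page_title, views) for pt.wikipedia.org from one hourly dump file,
    assuming the usual whitespace-separated format: project, title, views, bytes."""
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split(" ")
            if len(parts) >= 3 and parts[0] == "pt":  # "pt" = Portuguese Wikipedia
                yield parts[1], int(parts[2])


if __name__ == "__main__":
    totals = {}
    for title, views in ptwiki_views(DUMP):
        totals[title] = totals.get(title, 0) + views
    print(len(totals), "distinct titles")
```

A table maintained by the Analytics team (or a _p-style public view over one) would make this whole step unnecessary.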

Gehel subscribed.

Removing DPE SRE until there is a clear direction and we are needed for the implementation.