
Determine the bounds of Matomo's data collection [2hr]
Closed, Resolved · Public · Spike

Description

In T265001 we decided that Matomo was the most appropriate software for us to use on the Library Card platform to track data on how users are navigating the platform. Before implementation we determined that we need to better understand the legal issues around this - how much notification do we need to give users? Is a terms of use change required? What are the limits of the data we can collect?

The WMF legal team got back to us about this and have a few questions:

  • What data would (or could) we be collecting? Just page visits/paths?
  • Could we see which publisher content users are accessing? If so, how granular is that information (publisher, or individual articles?)
  • What options does Matomo have for retention schedules and anonymisation of data?

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Feb 17 2021, 6:06 PM
Restricted Application added a subscriber: Aklapper.
Samwalton9-WMF updated the task description.

I am starting to work on this task. I'll post the answers to the questions posed in the task description.

What data would (or could) we be collecting? Just page visits/paths?

If we use the JavaScript Tracking Client, we can collect information about page visits and events, such as when someone clicks a link or button or searches for a keyword on the site. Several functions can be found here. Matomo also has a feature (turned off by default) that asks users for consent before tracking them, which we could enable in case some Wikipedia Library users do not want to be tracked on the website. Users can also opt out of tracking.

Once the search feature is implemented, we can create an event to track how many users use it.
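
As a rough sketch of the calls described above, here is how consent, page views, events and site search might be queued through Matomo's `_paq` command array. The helper function names, the event category/action strings and the partner name are hypothetical and only for illustration. `_paq` is declared locally so the snippet also runs outside a browser, whereas on a real page it would be a global read by the Matomo tracker script.

```javascript
// Matomo's tracker reads commands from an array named _paq.
// Declared locally here so the sketch also runs outside a browser.
const _paq = [];

// Hold all tracking until the user has given consent (off by default in Matomo).
_paq.push(['requireConsent']);

// Hypothetical helper: call from a consent banner's "accept" handler.
function onConsentAccepted() {
  _paq.push(['rememberConsentGiven']);
}

// Hypothetical helper: call when a user chooses to opt out of tracking.
function onOptOut() {
  _paq.push(['optUserOut']);
}

// Hypothetical helper: record a click on a partner's access button as an event.
function trackAccessClick(partnerName) {
  // trackEvent(category, action, name); the strings are illustrative only.
  _paq.push(['trackEvent', 'Partner', 'Access click', partnerName]);
}

// Hypothetical helper: record an on-site search once the search feature exists.
function trackSearch(keyword, resultCount) {
  // trackSiteSearch(keyword, category, resultsCount); no category is used here.
  _paq.push(['trackSiteSearch', keyword, false, resultCount]);
}

// Record the page view and turn on automatic tracking of outlink clicks.
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
```

With `requireConsent` queued first, nothing should be sent to Matomo until `rememberConsentGiven` has been called for that user.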

Could we see which publisher content users are accessing? If so, how granular is that information (publisher, or individual articles?)

Matomo tracks all outlinks as seen here, so we should be able to see which publisher sites users leave the platform for. I don’t think we’ll be able to check which individual articles users are accessing unless the user clicks on the link directly from the Wikipedia Library (when the search feature is built). We cannot use cross-domain tracking because we don’t own any of the domains (publishers) we want to track.
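
To illustrate the granularity question, here is a minimal sketch that groups a list of clicked outlink URLs by destination host. It assumes outlink URLs are recorded in full. The sample URLs and the `groupOutlinksByHost` helper are hypothetical and not part of Matomo.

```javascript
// Hypothetical sample of outlink URLs clicked from the Library Card platform.
const sampleOutlinks = [
  'https://publisher-a.example/article/123',
  'https://publisher-a.example/article/456',
  'https://publisher-b.example/',
];

// Group outlink clicks by destination host (a publisher-level view) while
// keeping the full URLs, which are the finest granularity we would have.
function groupOutlinksByHost(urls) {
  const byHost = new Map();
  for (const url of urls) {
    const host = new URL(url).hostname;
    if (!byHost.has(host)) {
      byHost.set(host, []);
    }
    byHost.get(host).push(url);
  }
  return byHost;
}

const grouped = groupOutlinksByHost(sampleOutlinks);
for (const [host, urls] of grouped) {
  console.log(host, urls.length);
}
```

Article-level detail would only appear when the link clicked on our site already points at a specific article. Anything the user does after landing on the publisher's domain stays invisible to us.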

What options does Matomo have for retention schedules and anonymisation of data?

There are a lot of ways to anonymize user data. There is also a guide to configuring privacy settings here. Regarding retention schedules, assuming we have an in-house installation, we can configure the application to auto-archive reports. By default, Matomo truncates archived reports after the first 1000 rows, as stated here. We can either raise that row limit or remove the data limits entirely so that we keep all historical information. The sky’s the limit! (Not really. Our disk space is.) There is also a guide to keeping the database in check here.
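
To make the row limit concrete, here is a small illustration of what truncating a report to its top rows means. This is not Matomo's code: `truncateReport` is a hypothetical helper, and folding the remaining rows into a single "Others" row is an assumption about how the truncated remainder is summarised.

```javascript
// Keep the top `limit` rows by visits and fold the rest into one "Others" row.
// Assumption: the truncated remainder is summarised rather than dropped.
function truncateReport(rows, limit) {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...rows].sort((a, b) => b.visits - a.visits);
  if (sorted.length <= limit) {
    return sorted;
  }
  const kept = sorted.slice(0, limit);
  const othersVisits = sorted
    .slice(limit)
    .reduce((sum, row) => sum + row.visits, 0);
  kept.push({ label: 'Others', visits: othersVisits });
  return kept;
}

// Hypothetical report rows; a limit of 2 stands in for the real 1000.
const report = [
  { label: '/partners/1', visits: 5 },
  { label: '/partners/2', visits: 3 },
  { label: '/partners/3', visits: 2 },
  { label: '/partners/4', visits: 1 },
];
console.log(truncateReport(report, 2));
```

Raising the limit keeps more individual rows in each archived report, at the cost of more disk space.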

@Samwalton9 Let me know if this answers all of the questions fully. I'll move this task to the review column in the meantime.