
Set up Matomo to track page views
Closed, Resolved (Public)

Description
We're working on redesigns of the library's core workflows and want to understand what effect these are having on user behaviour. To do this we want to install some software to collect better data on how users use the library, and enable us to track changes over time.

In T265001, we decided that Matomo was the best choice of software for this purpose. The WMF Legal team confirmed that our existing privacy policy should cover our use cases without modification.

We should install an instance of Matomo, and configure it to collect the data we need. We want to answer the following questions. Any data collection options which don't help us answer these questions should be turned off, anonymised, or aggregated to preserve user privacy:

  • How many users visit the library on a daily basis?
  • What percentage of users who arrive on the homepage (logged out) log in?
  • What percentage of logged in users click to access a collection or initiate a search?

Event Timeline

  • server instance at: matomo.twl.eqiad1.wikimedia.cloud
  • web interface at: https://analytics-wikipedialibrary.wmflabs.org/
  • administrative credentials in 1password
  • nfs mount added and verified working

configured smtp for sending emails

  • force_ssl enabled as recommended

Thanks for using Matomo!

We should install an instance of Matomo

Out of curiosity, is there any reason you couldn't use the Analytics-provided instance? Your traffic seems negligible compared to the microsites there, and I suppose you both are in the wmf or nda group anyway.

I'm not questioning the decision (especially as most of the work seems to be done already), I'm just interested in your experience. Ideally we can share it with others on a wiki page, preferably https://wikitech.wikimedia.org/wiki/Analytics/Systems/Matomo itself.

Out of curiosity, is there any reason you couldn't use the Analytics-provided instance?

See T265001#6747235 and the following comment - our current traffic would make the existing tool suitable, but we expect a substantial increase when we launch T132084 which would likely exceed that tool's usage guidelines.

We met to chat through the configuration today. Notes below:

  • Unique daily visitors is very straightforward to configure
  • Logging in can be done with a Goal for 'when visitors... visit [the login URL]'
  • For accessing a collection, we decided that, due to the level of anonymity we're going for here (i.e. not tracking user paths through the site), we can stick to using the EZProxy logs we already have access to. No need to configure anything here in Matomo.
  • Referrer URL: We'd like to know where folks come from, in an aggregated way. In particular we want to know if they come from the notification.
    • Follow-up question: Do we need to do any additional work on the notification implementation to facilitate this distinction @jsn.sherman? i.e. versus users clicking from some other page on Wikipedia?
  • We're turning off all other tracking, anonymising user agents, IPs, etc.
  • Instead of using JavaScript tracking, we will read the nginx logs. We're not doing anything that would warrant JS tracking, just reading page hits to certain URLs. Jason will explore whether this means moving Matomo to a TWLight container or having TWLight dump the nginx logs to NFS for Matomo to read.
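
As a rough illustration of "just reading page hits to certain URLs", here is a minimal Python sketch that counts hits per path from access log lines. It assumes nginx's predefined "combined" log format; the tracked paths ("/" and "/oauth/login/") are hypothetical placeholders rather than confirmed TWLight URLs, and this is not how Matomo's own log import works internally.

```python
import re
from collections import Counter

# nginx "combined" format:
# $remote_addr - $remote_user [$time_local] "$request" $status
# $body_bytes_sent "$http_referer" "$http_user_agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Hypothetical paths of interest; the real URLs are not given in this task.
TRACKED_PATHS = {"/", "/oauth/login/"}


def count_hits(lines):
    """Count requests per tracked path, ignoring query strings.

    Only the request path is used; the IP and user agent fields are
    matched by the pattern but never stored.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the expected format
        path = match.group("path").split("?", 1)[0]
        if path in TRACKED_PATHS:
            counts[path] += 1
    return counts
```

Something like `count_hits(open("access.log"))` would give per-path totals for a single log file; the file name here is only an example.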

Follow-up question: Do we need to do any additional work on the notification implementation to facilitate this distinction @jsn.sherman? i.e. versus users clicking from some other page on Wikipedia?

Since the notification can be clicked from any page, we'd be looking at the combination of a referrer with a valid project FQDN (e.g. en.wikipedia.org) and an access path that includes the signature query parameters (e.g. markasread and/or markasreadwiki).
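
A minimal Python sketch of that check, for illustration only. The function name is hypothetical, and the list of project domain suffixes is an assumption that is almost certainly incomplete; only the markasread and markasreadwiki parameter names come from the comment above.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed, incomplete list of project domain suffixes.
PROJECT_SUFFIXES = (".wikipedia.org", ".wikimedia.org")

# Query parameters mentioned above as the notification's signature.
SIGNATURE_PARAMS = {"markasread", "markasreadwiki"}


def is_notification_visit(referrer, request_target):
    """Return True if a hit looks like a click on the notification.

    referrer: the Referer header value as logged (may be "-" or empty).
    request_target: the requested path including any query string.
    """
    host = (urlsplit(referrer).hostname or "").lower()
    if not host.endswith(PROJECT_SUFFIXES):
        return False
    query = parse_qs(urlsplit(request_target).query, keep_blank_values=True)
    # At least one signature parameter must be present ("and/or").
    return not SIGNATURE_PARAMS.isdisjoint(query)
```

A hit from another Wikipedia page without the query parameters returns False, which is the distinction asked about in the follow-up question.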

Just an update here: I've been reading up on the Matomo docs to learn the best way to measure the answers to our questions.

How many users visit the library on a daily basis?

Because Matomo only has the request path and the user agent (no IP address, user ID, session ID, etc.), it's treating slightly spaced-out requests from the same visitor as coming from separate unique visitors.


Since I turned on tracking yesterday, I think I'm represented by about 10 "visitors" in that period. We'll need to look into how best to measure unique visitor counts while preserving privacy.

What percentage of users who arrive on the homepage (logged out) log in?

I now have those login redirects showing up in actions in Matomo, and I created a conversion goal for users who hit the login link from the front page. Right now, Matomo can't tell the difference between a logged-in user on the homepage and an anonymous user on the homepage, so we'll need to capture at least logged-in/anonymous status in the nginx log to get the divisor for that percentage.
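
To make the arithmetic concrete, here is a small Python sketch of the percentage, assuming each log record has already been reduced to a (path, logged_in) pair. The logged-in flag does not exist in the logs yet, and the paths are hypothetical placeholders. Because this counts hits rather than visitors, it is only a hit-level approximation.

```python
def login_conversion_percent(records, homepage="/", login_path="/oauth/login/"):
    """Percentage of anonymous homepage hits followed by a login hit.

    records: iterable of (path, logged_in) pairs, where logged_in is the
    hypothetical flag that would need to be added to the nginx log.
    Returns None when there are no anonymous homepage hits to divide by.
    """
    anonymous_home = 0  # divisor: logged-out hits on the homepage
    logins = 0          # numerator: hits on the login URL
    for path, logged_in in records:
        if path == homepage and not logged_in:
            anonymous_home += 1
        elif path == login_path:
            logins += 1
    if anonymous_home == 0:
        return None
    return 100.0 * logins / anonymous_home
```

Without the flag, logged-in homepage hits would inflate the divisor and understate the conversion rate.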

What percentage of logged in users click to access a collection or initiate a search?

As we discussed earlier this week, if we want to track outbound activity, we either need to add JavaScript tracking and fire an event on click, or we'd need to add an interstitial redirect on our site to log them on their way out the door.

I was able to capture EZProxy logins fairly easily, though those sessions last for a very long time and can cover research across multiple resources.

All discussion related to individual analytics questions has been moved to subtasks.