Stand up piwik in a permanent and privacy-sensitive way
Closed, Resolved, Public

Description

This task is being requested by a variety of people with a variety of needs:

  • analytics-dev for dashboard usage tracking, wikimetrics tracking etc.
  • quarry for usage tracking
  • readership team for mobile site instrumentation
  • annual report folks for pageviews over the year
  • wikimedia store folks
  • ad-hoc research needs
  • ...

From IRC:

15:26:07 <milimetric> but in general, it seems like people would like at least some of the functionality and like the idea of reusing simple third party tools with quick integration. I think we have enough use cases to warrant standing up a productionized instance
15:26:31 <milimetric> and we have enough experience with privacy to make the right tweaks - country level only, etc. Basically, we can turn privacy settings all the way up as others were saying
15:26:59 <•DarTar> milimetric: thanks, makes sense

Event Timeline

Milimetric assigned this task to kevinator.
Milimetric raised the priority of this task from to Needs Triage.
Milimetric updated the task description.
Milimetric added a project: Analytics-Kanban.
Milimetric subscribed.

Note that any non-labs usage should be in prod, but I think we should make this puppetized properly so we can move it to prod if necessary.

https://piwik.org/docs/privacy/ is nice :)

So, privacy steps:

  1. Auto archiving / purging of any data over 3 months
  2. IP info never reaches piwik (we'll filter it out at the nginx level)
  3. Respect Do Not Track
  4. Control piwik accounts (not a free-for-all)
  5. Control shell access even more :)

And probably more!
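
To make steps 2 and 3 concrete, here is a minimal Python sketch of the logic, not the actual nginx config. It assumes a request is represented as a plain header mapping; the function names and the list of IP-carrying header names are hypothetical and only for illustration.

```python
from typing import Dict, Mapping, Optional

# Assumption: these are the headers that could leak a client IP downstream.
# The real proxy setup may use a different set.
IP_HEADERS = {"x-forwarded-for", "x-real-ip", "forwarded"}


def wants_no_tracking(headers: Mapping[str, str]) -> bool:
    """Return True when the request carries a Do Not Track header set to 1."""
    for name, value in headers.items():
        if name.lower() == "dnt":
            return value.strip() == "1"
    return False


def strip_ip_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of the headers without any IP-carrying entries."""
    return {k: v for k, v in headers.items() if k.lower() not in IP_HEADERS}


def prepare_for_tracker(headers: Mapping[str, str]) -> Optional[Dict[str, str]]:
    """Return the headers to forward, or None if the request must not be tracked."""
    if wants_no_tracking(headers):
        return None
    return strip_ip_headers(headers)
```

Checking Do Not Track first means an opted-out request is dropped before anything is forwarded at all.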

I agree, but I think if we do all that we can give anonymous users view-only rights. It seems PII wouldn't be an issue. And I think anon access would greatly increase the usefulness.

Hmm, we can figure out anon access later, I guess. Shell access should be super restricted, though :)

From Michelle and Stephen (legal):

"Generally, we think piwik should be ok if we configure it such that we are not collecting more than what's permissible under the main WMF privacy policy and TOU and the Labs privacy policy and TOU respectively. We should also make sure to configure it such that there is no data collection that is a new practice (e.g. we shouldn't use it to count uniques until the bigger conversation about uniques has been resolved in favor of counting uniques and appropriate use/form)."

The way this should go:

  1. Set up a piwik instance by hand, tweak the privacy knobs and whatnot to see how close we can get it to what we want
  2. If 1 goes ok, blow that instance away
  3. Set up a simple HHVM + nginx + mysql stack for a simple, horizontally scalable piwik setup
  4. Set up a simple deploy setup (fab + a git repo?) for doing actual deploys
  5. Have someone subscribe to the piwik list to keep abreast of updates and security issues.

yuvipanda subscribed.

unassigning from myself until I have time to directly work on it.

Stakeholders who want to use this:

  • Wikimedia store
  • Russian Wikimedia chapter
  • the reading team
  • the analytics team (we want this in labs for our purposes, and for other folks who have bots etc.)
  • labs
  • annual report

Milimetric added a subscriber: Fjalapeno.

I also added Corey, who's going to represent the Reading team's needs. And here's an update based on Yuvi's task list:

The way this should go:

  1. Set up a piwik instance by hand, tweak the privacy knobs and whatnot to see how close we can get it to what we want

Done, piwik.wmflabs.org. Settings tweaked:

  • Anonymize Visitors' IP addresses: YES
  • masking IP addresses: 2 byte(s) - e.g. 192.168.xxx.xxx (recommended)
  • Also use the Anonymized IP addresses when enriching visits: YES
  • Regularly delete old visitor logs from the database: YES, after 90 days
  • Regularly delete old reports from the database: YES, after 6 months
  • Keep basic metrics (visits, page views, bounce rate, goal conversions, ecommerce conversions, etc.): YES
  • Support Do Not Track: YES
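
For anyone wondering what the 2-byte mask and the 90-day log retention amount to, here is a rough Python sketch. It is not piwik's code: the function names are made up, and showing the masked bytes as zeros (rather than "xxx") is an assumption about representation.

```python
import ipaddress
from datetime import datetime, timedelta


def mask_ipv4(ip: str, masked_bytes: int = 2) -> str:
    """Zero out the trailing `masked_bytes` bytes of an IPv4 address."""
    if not 0 <= masked_bytes <= 4:
        raise ValueError("masked_bytes must be between 0 and 4")
    keep_bits = 32 - 8 * masked_bytes
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{ip}/{keep_bits}", strict=False)
    return str(network.network_address)


def is_expired(recorded_at: datetime, now: datetime, days: int = 90) -> bool:
    """Return True when a visitor log entry is older than the retention window."""
    return now - recorded_at > timedelta(days=days)
```

With the defaults, `mask_ipv4("192.168.12.34")` keeps only the first two bytes, which matches the "192.168.xxx.xxx" example in the settings above.
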
  2. If 1 goes ok, blow that instance away

We are using it now, so let's set up the new instance, migrate to it, and then blow this one away.

  3. Set up a simple HHVM + nginx + mysql stack for a simple, horizontally scalable piwik setup
  4. Set up a simple deploy setup (fab + a git repo?) for doing actual deploys

Yuvi, when do you think you'll have time to puppetize all this?

  5. Have someone subscribe to the piwik list to keep abreast of updates and security issues

We would love some support with this from ops. How should we approach the subject?

FYI: piwik provides a handy opt out thing, just paste this into your site: <iframe style="border: 0; height: 200px; width: 600px;" src="http://piwik.wmflabs.org/index.php?module=CoreAdminHome&action=optOut&language=en"></iframe>
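
If you need the same snippet for another language or a different host, a small Python helper like this could generate it. This is only a sketch: the helper name and its parameters are hypothetical, and the query parameters are copied from the snippet above. Note that it escapes `&` as `&amp;` in the attribute, which browsers decode back to the same URL.

```python
from html import escape
from urllib.parse import urlencode


def opt_out_iframe(base_url: str, language: str = "en",
                   width: int = 600, height: int = 200) -> str:
    """Build the opt-out iframe markup for a given piwik base URL."""
    query = urlencode({
        "module": "CoreAdminHome",
        "action": "optOut",
        "language": language,
    })
    src = f"{base_url.rstrip('/')}/index.php?{query}"
    return (
        f'<iframe style="border: 0; height: {height}px; width: {width}px;" '
        f'src="{escape(src, quote=True)}"></iframe>'
    )


print(opt_out_iframe("http://piwik.wmflabs.org"))
```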

Wait, the *reading* team? There's very little chance this will actually end up in production.

Given my current workload, I honestly do not know when I'll have the time to do this :( You can consider this uncookielicking for now - sorry about that :(

@yuvipanda, we were hoping to springboard a productionized puppet module off of a labs puppet module. I know your workload's crazy. After the current NFS craziness dies down, please take a moment and let us know when you think you'd be able to have a puppetized piwik up in labs. We can figure out our timeline from there.

Alright. I think a production piwik instance is a non-starter at our scale and definitely needs much wider discussion and consensus than just a labs one. The way they'll be set up is also going to be fairly different, so I don't know if there's much use in setting it up on labs without knowing if it's also going to go to production.

When you say "at our scale", are you thinking of English Wikipedia sending pageview data to it? The use cases are much smaller than that: some WMF chapters, the annual report, etc. The biggest user would be the mobile site. It seems to me the prod use case is very similar to the labs use case, and just setting up piwik normally should be fine. And for places where piwik runs into scalability issues, we're working on putting EL on kafka. From what I've read though, piwik will handle much more than EL currently does.

Yes, even the mobile site would be way too big.

Either way, it needs a much bigger discussion with ops as well and I don't think I can be the point person for that.

@yuvipanda, for reading, we only intend to pilot this for the mobile apps, which have a much smaller user base than mobile web. I don't think we should have scale issues with those projects.

We can work with Ops to handle issues on that end and come up with a deployment plan, but we still need someone to puppetize it to move forward. Is that something you have time to do?