
[Metrics Platform] Create Metrics Platform Schema
Open, In Progress, High, Public

Description

We will need to create a schema fragment in the secondary repository to contain the fields that are decided on as part of T275420. This fragment can live alongside the other fragments and does not need to belong to the analytics common fragment.

Event Timeline

jlinehan triaged this task as Medium priority.
jlinehan moved this task from Inbox to Doing on the Better Use Of Data board.

Change 676392 had a related patch set uploaded (by Jason Linehan; author: Jason Linehan):

[schemas/event/secondary@master] [WIP] Metrics Platform context attribute schema fragment

https://gerrit.wikimedia.org/r/676392

The code in the patch defines a session_id identifier. As of today we are not using such an identifier on webrequest traffic data, only on some events.
Not using a user_id or session_id in webrequest, and therefore in pageviews (the latter derives from the former), was discussed and agreed upon a while back, when there was a request to add this field to the webrequest dataset (I can't recall the exact period, but it was at least 3 years ago).
Back to events: given the very broad use case of the new event type, I have the feeling that we will move away from webrequest being the source of traffic data for metrics. While I think this move is great, I would also like the shift in privacy setting to be thoroughly acknowledged and broadly discussed.

I second @JAllemandou!

One related question: Are all fields specified in the schema going to be collected by default?
I recall that the collected fields would be specified in the extension's configuration? Is that correct?
Will they be enabled in groups (e.g. add all user fields), or individually?

I mention this, because the schema contains a lot of privacy-sensitive fields:
pageview_id, session_id, app_install_id, page.id, page.title, page.wikidata_id, page.revision_id, user.id, user.name, user.groups, user.edit_count, user.registration_dt.
I would argue in favor of not collecting any of those by default, even if we can delete them later, following the privacy-by-design principle.

Yes, the system has been very deliberately set up to support a privacy-by-design process. None of these fields are collected by default. The purpose of defining them is so that we can likewise define in advance the code that provides their values, and have a set menu of options that we have studied and understand from a privacy perspective. That allows a more rigorous approach using one of the many concepts of "privacy budget" out there.
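
To make the opt-in behaviour concrete, here is a minimal sketch in Python. It is not the actual Metrics Platform code: the function name `select_context_attributes`, the shape of the `available` dict, and the idea of passing the requested fields as a list of dotted names are all assumptions made for illustration. The only thing taken from this discussion is the principle that a field is included only when it has been explicitly requested.

```python
import copy
from typing import Any


def select_context_attributes(
    available: dict[str, Any], requested: list[str]
) -> dict[str, Any]:
    """Return only the explicitly requested attributes (hypothetical helper).

    `available` holds every value the client code could provide.
    `requested` lists dotted field names, e.g. "page.id" (assumed format).
    Anything not listed in `requested` is left out of the result.
    """
    selected: dict[str, Any] = {}
    for name in requested:
        parts = name.split(".")

        # Walk the dotted path through the available values.
        node: Any = available
        found = True
        for part in parts:
            if isinstance(node, dict) and part in node:
                node = node[part]
            else:
                found = False
                break
        if not found:
            continue  # Unknown or unavailable field: skip it silently.

        # Rebuild the same nesting in the output, copying the value so the
        # caller's `available` dict is never mutated.
        target = selected
        for part in parts[:-1]:
            target = target.setdefault(part, {})
        target[parts[-1]] = copy.deepcopy(node)
    return selected


# Illustrative values only; none of these come from real data.
available_values = {
    "session_id": "abc123",
    "page": {"id": 1, "title": "Main Page"},
    "user": {"id": 7, "name": "Example"},
}

nothing_requested = select_context_attributes(available_values, [])
only_page_id = select_context_attributes(available_values, ["page.id"])
```

With an empty request list the result is an empty dict, which is the "not collected by default" case. Requesting `page.id` yields only that one value, so `page.title` and the `user` fields stay out of the event.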

I agree with both of you that it is a *great* idea to set up a privacy-focused discussion. I'll make sure we do that and bother both of you to take part!!

jlinehan renamed this task from [Metrics Platform] Create schema fragment for standard fields to [Metrics Platform] Create Metrics Platform Schema. (Aug 11 2021, 2:57 PM)
jlinehan raised the priority of this task from Medium to High. (Wed, Oct 6, 2:01 PM)
jlinehan moved this task from Blocked to Work in Progress on the Metrics-Platform board.
DAbad changed the task status from Open to In Progress. (Wed, Oct 6, 2:02 PM)

For Jason to take a look and make sure the schema matches what is in the code Michael wrote. Then we can close this ticket.