Maniphest T209051

ReadingDepth schema is whitelisting both session ids and page ids
Closed, Resolved (Public)

Assigned To

Authored By

	• fdans
	Nov 8 2018, 12:50 PM

Description

Hi @HaeB and @phuedx! The whitelist for the ReadingDepth EventLogging schema permanently retains fields that contain both unique session IDs and page IDs. Per our data retention guidelines, we cannot keep those two items together for more than 90 days, so one of them should be removed from the whitelist.

Please let me know which of the following two you'd like to keep:

Session IDs => keep sessionToken
Page IDs => keep pageTitle
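To illustrate the rule described above, here is a minimal sketch of a check that flags a schema whose whitelist keeps a session identifier and a page identifier together. It is not the actual analytics/refinery tooling: the dictionary layout, the function name `conflicting_fields`, and the field `hypotheticalMetric` are assumptions made for the example. Only `sessionToken` and `pageTitle` come from this task.

```python
# Hypothetical in-memory form of a retention whitelist:
# schema name -> fields kept beyond the 90-day window.
# The real whitelist file format is not shown in this task.
SESSION_FIELDS = {"sessionToken"}
PAGE_FIELDS = {"pageTitle"}


def conflicting_fields(whitelisted_fields):
    """Return the offending fields if both a session field and a page field are kept."""
    fields = set(whitelisted_fields)
    session = fields & SESSION_FIELDS
    page = fields & PAGE_FIELDS
    if session and page:
        return sorted(session | page)
    return []


whitelist = {
    # "hypotheticalMetric" is a placeholder, not a real ReadingDepth field.
    "ReadingDepth": ["pageTitle", "sessionToken", "hypotheticalMetric"],
}

for schema, fields in whitelist.items():
    conflict = conflicting_fields(fields)
    if conflict:
        print(f"{schema}: cannot keep together beyond 90 days: {conflict}")
```

Dropping either `sessionToken` or `pageTitle` from the list makes the check pass, which corresponds to the two options listed above.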

Details

	Subject	Repo	Branch	Lines +/-
	Remove session token from whitelist for ReadingDepth schema	analytics/refinery	master	+0 -1


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• fdans	T205458 Remove sessionId, pageId pairs from whitelist
		Resolved		• Tbayer	T209051 ReadingDepth schema is whitelisting both session ids and page ids

Event Timeline

• fdans triaged this task as High priority. Nov 8 2018, 12:50 PM

• fdans created this task.

• fdans removed a project: Analytics-Kanban. Nov 8 2018, 12:56 PM

• fdans removed • fdans as the assignee of this task. Nov 12 2018, 5:35 PM

• fdans moved this task from Incoming to Radar on the Analytics board. Nov 12 2018, 5:38 PM

Still need to look into this with @ovasileva and possibly @Groceryheist .

A handful of thoughts:

The current schema has page_title, but not page_id. We were able to recover page_id from this using the page_title and the timestamps. Isn't this also a violation of the policy?

I'm not sure that I'm clear on what makes sessionToken PII and not IP address. Would it be OK to replace sessionToken with an ID of the previous page token? We could then perform any analysis that doesn't involve joining on sessionToken.

Could a reasonable option be to generate the statistics we need from the pages, aggregate or add noise to make them non-identifying and then remove the page_id column?

In T209051#4760668, @Groceryheist wrote:

A handful of thoughts:

The current schema has page_title, but not page_id. We were able to recover page_id from this using the page_title and the timestamps. Isn't this also a violation of the policy?

Page title and ID contain largely the same information, so if we whitelist one of them, the other should be fine too (and vice versa - if one of them needs to be purged, the other should too).

I'm not sure that I'm clear on what makes sessionToken PII and not IP address.

IP addresses are PII (actually they are more sensitive than session tokens), and indeed the corresponding field is not contained in the whitelist for this schema.

Would it be OK to replace sessionToken with an ID of the previous page token? We could then perform any analysis that doesn't involve joining on sessionToken.

If you mean the page token of the immediately preceding pageview in the session, that probably wouldn't make a big difference privacy-wise, because the session could still be reconstructed.
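A rough sketch of why a previous-page pointer would still allow reconstruction: if each event carries its own page token plus the token of the preceding pageview, the events form a chain that can be followed from the first pageview onward. The event layout as `(page_token, previous_page_token)` pairs and the function name are assumptions for illustration, and the sketch assumes each pageview has at most one successor (no branching across tabs).

```python
def reconstruct_sessions(events):
    """Rebuild sessions from (page_token, previous_page_token) pairs.

    previous_page_token is None for the first pageview of a session.
    Assumes page tokens are unique and each pageview has at most one successor.
    """
    events = list(events)
    # Map each token to the token of the pageview that followed it.
    next_of = {prev: tok for tok, prev in events if prev is not None}
    starts = [tok for tok, prev in events if prev is None]

    sessions = []
    for start in starts:
        chain = [start]
        while chain[-1] in next_of:
            chain.append(next_of[chain[-1]])
        sessions.append(chain)
    return sessions


# Made-up tokens: two sessions, no session identifier stored anywhere.
events = [("a", None), ("b", "a"), ("c", "b"), ("x", None), ("y", "x")]
sessions = reconstruct_sessions(events)
```

Here the pageviews fall back into the groups `a, b, c` and `x, y` even though no session token is stored, so the grouping is recoverable from the pointers alone.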

Could a reasonable option be to generate the statistics we need from the pages, aggregate or add noise to make them non-identifying and then remove the page_id column?

I think we will want to remove the session IDs instead, as (IIRC) less of our data questions depend on them. But there too we could think about calculating and storing some of the session-dependent data in aggregated form.
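One way the "aggregated form" idea could look, as a sketch only: compute a session-level statistic while the session IDs are still available, keep just the aggregate, and strip `sessionToken` from the rows that are retained. The row layout, the function name, and the choice of statistic (a histogram of pageviews per session) are assumptions, not something decided in this task. Whether such an aggregate is sufficiently non-identifying (for example, for very small counts) would still need review.

```python
from collections import Counter


def aggregate_then_drop_session(rows):
    """Return (histogram of pageviews per session, rows without sessionToken)."""
    pages_per_session = Counter(row["sessionToken"] for row in rows)
    # Number of sessions that had 1 pageview, 2 pageviews, and so on.
    session_length_histogram = dict(Counter(pages_per_session.values()))
    stripped = [
        {key: value for key, value in row.items() if key != "sessionToken"}
        for row in rows
    ]
    return session_length_histogram, stripped


# Made-up rows for illustration.
rows = [
    {"sessionToken": "s1", "pageTitle": "A"},
    {"sessionToken": "s1", "pageTitle": "B"},
    {"sessionToken": "s2", "pageTitle": "A"},
]
histogram, retained_rows = aggregate_then_drop_session(rows)
```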

For the record: @ovasileva and I decided to remove the session IDs and keep the page IDs. I'll submit the patch soon.

Ping on this @Tbayer

Change 480472 had a related patch set uploaded (by Fdans; owner: Fdans):
[analytics/refinery@master] Remove session token from whitelist for ReadingDepth schema

https://gerrit.wikimedia.org/r/480472

gerritbot added a project: Patch-For-Review. Dec 18 2018, 12:12 PM

Change 480472 merged by Nuria:
[analytics/refinery@master] Remove session token from whitelist for ReadingDepth schema

https://gerrit.wikimedia.org/r/480472

• Nuria closed this task as Resolved. Jan 8 2019, 8:09 AM

• Nuria added a subscriber: HaeB.

• Tbayer removed a subscriber: HaeB. Jan 8 2019, 1:17 PM

• Tbayer mentioned this in T216096: Whitelist sample flags and page/rev ID fields for ReadingDepth schema. Feb 14 2019, 12:52 AM

It looks like we had forgotten to whitelist the actual pageID field in addition to the page title, probably because it was only introduced shortly after this task was created (it's in the current version of the schema page but not yet deployed). I should have caught that before +2ing Nuria's patch. I submitted a fix as part of 209051, also for the related revision ID field.

(Also, the purge policy on the schema talk page was not updated with the outcome of this task - I can take care of that together with T216096.)

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:44 AM

Maintenance_bot removed a project: Patch-For-Review. Jun 10 2020, 7:12 AM
