ReadingDepth schema is whitelisting both session ids and page ids
Closed, Resolved, Public

Description

Hi @HaeB and @phuedx! The whitelist for the ReadingDepth EventLogging schema permanently retains fields that contain both unique session IDs and page IDs. Per our data retention guidelines, we cannot keep those two items together for more than 90 days, so one of them should be removed from the whitelist.

Please let me know which one of the following two you'd like to keep:

  • Session IDs => keep sessionToken
  • Page IDs => keep pageTitle

Event Timeline

fdans created this task.
Tbayer removed a subscriber: HaeB.
Tbayer added subscribers: Groceryheist, ovasileva.

Still need to look into this with @ovasileva and possibly @Groceryheist.

A handful of thoughts:

The current schema has page_title, but not page_id. We were able to recover page_id from this using the page_title and the timestamps. Isn't this also a violation of the policy?

I'm not sure that I'm clear on what makes sessionToken PII and not IP address. Would it be OK to replace sessionToken with an ID of the previous page token? We could then perform any analysis that doesn't involve joining on sessionToken.

Could a reasonable option be to generate the statistics we need from the pages, aggregate or add noise to make them non-identifying and then remove the page_id column?

> A handful of thoughts:
>
> The current schema has page_title, but not page_id. We were able to recover page_id from this using the page_title and the timestamps. Isn't this also a violation of the policy?

Page title and ID contain largely the same information, so if we whitelist one of them, the other should be fine too (and vice versa - if one of them needs to be purged, the other should too).

> I'm not sure that I'm clear on what makes sessionToken PII and not IP address.

IP addresses are PII (actually they are more sensitive than session tokens), and indeed the corresponding field is not contained in the whitelist for this schema.

> Would it be OK to replace sessionToken with an ID of the previous page token? We could then perform any analysis that doesn't involve joining on sessionToken.

If you mean the page token of the immediately preceding pageview in the session, that probably wouldn't make a big difference privacy-wise, because the session could still be reconstructed.
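To illustrate why a previous-page pointer would not help much, here is a minimal sketch of how sessions could be rebuilt by following those pointers. The field names `page_token` and `prev_page_token` and the sample events are hypothetical, not taken from the ReadingDepth schema, and the sketch assumes each pageview has at most one successor.

```python
# Hypothetical events: each pageview stores its own page token and the token
# of the pageview that immediately preceded it in the same session.
events = [
    {"page_token": "a1", "prev_page_token": None, "page_title": "Main_Page"},
    {"page_token": "b7", "prev_page_token": None, "page_title": "Cat"},
    {"page_token": "a2", "prev_page_token": "a1", "page_title": "Dog"},
    {"page_token": "a3", "prev_page_token": "a2", "page_title": "Wolf"},
]


def reconstruct_sessions(events):
    """Rebuild per-session page sequences by following prev_page_token links."""
    # Map each token to the event that names it as its predecessor.
    successor = {
        e["prev_page_token"]: e
        for e in events
        if e["prev_page_token"] is not None
    }
    sessions = []
    # A pageview without a predecessor is the first one in its session.
    for start in (e for e in events if e["prev_page_token"] is None):
        chain = [start["page_title"]]
        current = start
        while current["page_token"] in successor:
            current = successor[current["page_token"]]
            chain.append(current["page_title"])
        sessions.append(chain)
    return sessions


sessions = reconstruct_sessions(events)
```

With no session token stored at all, the chains still recover which pages were viewed together and in what order.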

> Could a reasonable option be to generate the statistics we need from the pages, aggregate or add noise to make them non-identifying and then remove the page_id column?

I think we will want to remove the session IDs instead, as (IIRC) fewer of our data questions depend on them. But there too we could think about calculating and storing some of the session-dependent data in aggregated form.
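One way the aggregated-form idea could look, as a rough sketch only: compute a session-level statistic while the token is still available, then keep the rows without it. The `sessionToken` and `pageTitle` field names come from this task. The function name, the choice of statistic (a histogram of pageviews per session) and the sample rows are assumptions for illustration.

```python
from collections import Counter


def aggregate_and_strip(events):
    """Return (session-length histogram, events without sessionToken)."""
    # Count how many pageviews each session contributed.
    views_per_session = Counter(e["sessionToken"] for e in events)
    # Histogram: session length -> number of sessions with that length.
    histogram = dict(Counter(views_per_session.values()))
    # Keep every other field, but drop the session token before long-term storage.
    stripped = [
        {key: value for key, value in e.items() if key != "sessionToken"}
        for e in events
    ]
    return histogram, stripped


# Hypothetical sample rows.
sample = [
    {"sessionToken": "s1", "pageTitle": "Cat"},
    {"sessionToken": "s1", "pageTitle": "Dog"},
    {"sessionToken": "s2", "pageTitle": "Cat"},
]
histogram, stripped = aggregate_and_strip(sample)
```

Whether an aggregate like this is coarse enough to be non-identifying would still have to be checked against the retention guidelines. Very small buckets may need to be merged or suppressed.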

For the record: decided with @ovasileva to remove the session IDs and keep the page IDs. I'll submit the patch soon.

Change 480472 had a related patch set uploaded (by Fdans; owner: Fdans):
[analytics/refinery@master] Remove session token from whitelist for ReadingDepth schema

https://gerrit.wikimedia.org/r/480472

Change 480472 merged by Nuria:
[analytics/refinery@master] Remove session token from whitelist for ReadingDepth schema

https://gerrit.wikimedia.org/r/480472

Nuria added a subscriber: HaeB.

It looks like we had forgotten to whitelist the actual pageID field in addition to the page title, probably because it was only introduced shortly after this task was created (it's in the current version of the schema page but not yet deployed). I should have caught that before +2ing Nuria's patch. I submitted a fix as part of 209051, also for the related revision ID field.

(Also, the purge policy on the schema talk page was not updated with the outcome of this task - I can take care of that together with T216096.)