Identify pending analyses needing access to data older than 90 days
Closed, Resolved (Public)

Description

Members of the Product Analytics team have been using the sanitized PrefUpdate data for analysis, and might have upcoming analyses that use it. This task is for identifying what those analyses are so we can plan accordingly, or find workarounds.

  • Growth: Newcomer Tasks analysis (T230174) will use data from 2019-11-21 onwards to exclude users who self-selected in/out of the experiment groups. The data gathering for that analysis is complete.
  • Editing: DiscussionTools analyses (T249386 and T247139) will use PrefUpdate data from 2020-03-31 onwards to determine the adoption and use of the reply tool. If possible, we'd like to retain the discussiontools-betaenable and betafeatures-auto-enroll properties for 180 days.

Event Timeline

LGoto triaged this task as Medium priority. Apr 27 2020, 4:52 PM
LGoto edited projects, added Product-Analytics (Kanban); removed Product-Analytics.
LGoto moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

Ping on this. While there is no deadline, I was hoping to resolve this issue before the end of the quarter if possible.

Just been informed that DiscussionTools data will be retained for more than 90 days, which could imply that we will need the "discussiontools-betaenable" property to be retained in the PrefUpdate schema for more than 90 days (however, this depends on when parent task T249894 is implemented). I will update this task as soon as we decide on the data retention changes with the Editing team.

Just been informed that DiscussionTools data will be retained for more than 90 days, which could imply that we will need "discussiontools-betaenable" property to be retained in the prefupdate schema for more than 90 days […]

There are no retention rules on the sanitized data in Hadoop/Hive/Superset/Turnilo. (The parent task will not change that, either.) So you should be all good to go there. This data is essentially available indefinitely. There might be an upper bound I'm not aware of, but it's at least several years.

There is, however, a hard and very real limit on how long individual pieces of personally identifiable information may be stored by the Foundation, namely 90 days. This is codified in our Privacy policy.

Questions like "How many times was the DT Beta feature enabled or disabled on any given day/wiki?" are built into the "sanitized" event data for PrefUpdate, and can thus be queried at any time for any date range, without any additional work.

More specific questions, like the number of unique users toggling the feature or the highest number of times any user toggled the feature on the same day, are not currently answered by the sanitized data for PrefUpdate.

To answer those long-term, such queries can be run against the regular PrefUpdate table in Hive/Hadoop, where they are limited to the last 90 days. These queries can be run regularly, however, and their results can most likely be stored indefinitely as well (assuming they no longer contain individual names/IDs). If this is needed on an ongoing basis, you'll want to set up a regular cron or refinery job for it, so that the information is available at any time without worrying about it.
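As a rough illustration of that kind of recurring aggregation, here is a minimal Python sketch. It is not the actual refinery job or the real PrefUpdate schema: the field names (timestamp, wiki, user_id, property) and the function summarize_toggles are hypothetical, chosen only to show how raw rows from the last 90 days could be reduced to per-day, per-wiki counts that no longer carry user identifiers.

```python
from collections import Counter, defaultdict
from datetime import date, datetime, timedelta

RETENTION_DAYS = 90  # retention limit for identifiable data, as stated in this thread


def summarize_toggles(events, today, prop="discussiontools-betaenable"):
    """Reduce raw events to per-(day, wiki) rows without user identifiers.

    `events` is an iterable of dicts with hypothetical keys:
    "timestamp" (datetime), "wiki" (str), "user_id", "property" (str).
    """
    cutoff = today - timedelta(days=RETENTION_DAYS)
    # (day, wiki) -> Counter mapping user_id -> number of toggles that day
    per_user = defaultdict(Counter)
    for ev in events:
        day = ev["timestamp"].date()
        # Skip other preferences and anything outside the retention window.
        if ev["property"] != prop or day < cutoff:
            continue
        per_user[(day, ev["wiki"])][ev["user_id"]] += 1

    # Only aggregates leave this function; user IDs are dropped here.
    return [
        {
            "day": day.isoformat(),
            "wiki": wiki,
            "toggles": sum(counts.values()),
            "unique_users": len(counts),
            "max_toggles_by_one_user": max(counts.values()),
        }
        for (day, wiki), counts in sorted(per_user.items())
    ]
```

A scheduled job could call something like this once a day and append the resulting rows to a long-lived table. Whether very small counts on small wikis are safe to keep indefinitely is a separate question that this sketch does not address.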

Since it'll take a little while longer to complete the data analysis in T230174, I updated the task description to reflect that the data gathering is complete. So as far as that is concerned, access to older data is no longer needed.

Closing this as resolved, as the analyses are either done or we have taken steps to keep data for the preferences we need.