
Welcome survey: anonymize data after one year
Open, Medium · Public · Wed, Dec 18

Description

In the process of assembling our privacy statement for this work, we decided to anonymize the data collected after one year. This is a timeline that should be long enough for any programs related to being a new editor to unfold, and also makes sure we don't keep survey data indefinitely. If our feature evolves to be more like "building a user profile", or a situation where users can change or erase their responses themselves, we can revisit this.

Here are the rules we want to implement:

  • First, a year after the welcome survey responses are given, they are archived in aggregate in a monthly summary table by @nettrom_WMF that does not keep user IDs or exact timestamps. This will facilitate longitudinal analysis on how newcomer goals shift over time.
  • Then, having been archived in aggregate, the responses are deleted from the database tables. At this point, any features that rely on welcome survey responses will (hopefully automatically) fall back to the default content shown to users who have not answered the welcome survey.
    • Implementation: old welcomesurvey-responses user option rows are deleted. It is up to whatever code needs welcome survey data to handle the case when such data does not exist for a given user. The deletion is done by a twice-monthly cron job running on the 1st and 15th, deleting data older than 11 months.
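
As a rough illustration of the first rule, here is a minimal Python sketch of a monthly aggregate that keeps neither user IDs nor exact timestamps. The record layout (`user_id`, `submitted`, `answer`), the function name, and the sample answers are all hypothetical; the layout of the actual summary table is not described in this task.

```python
from collections import Counter

def aggregate_by_month(responses):
    """Count responses per (month, answer), dropping user IDs and exact times.

    `responses` is an iterable of dicts with hypothetical keys:
    'user_id', 'submitted' (14-digit YYYYMMDDHHMMSS string) and 'answer'.
    """
    counts = Counter()
    for response in responses:
        # Keep only the year and month of the submission time.
        month = response["submitted"][:4] + "-" + response["submitted"][4:6]
        counts[(month, response["answer"])] += 1
    return dict(counts)

# Made-up sample rows, for illustration only.
sample = [
    {"user_id": 1, "submitted": "20181105120000", "answer": "edit-typo"},
    {"user_id": 2, "submitted": "20181120093000", "answer": "edit-typo"},
    {"user_id": 3, "submitted": "20181201080000", "answer": "new-page"},
]
summary = aggregate_by_month(sample)
print(summary)
```

The result contains only month and answer counts, so the per-user rows can be deleted afterwards without losing the longitudinal view.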

The maintenance script is merged and tested on cswiki and kowiki. The cronjob patches still need to be merged.
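
For a sense of how long data can linger under a schedule that runs on the 1st and 15th, the sketch below computes the longest gap between consecutive runs and adds it to the cutoff. The 335-day value is an assumed stand-in for "11 months"; the cutoff actually configured in the cron job is not stated in this task.

```python
from datetime import date

def run_dates(year):
    """Run dates on the 1st and 15th of each month, plus Jan 1 of the next year."""
    runs = [date(year, month, day) for month in range(1, 13) for day in (1, 15)]
    runs.append(date(year + 1, 1, 1))
    return runs

def max_gap_days(year):
    """Longest number of days between two consecutive runs."""
    runs = run_dates(year)
    return max((later - earlier).days for earlier, later in zip(runs, runs[1:]))

CUTOFF_DAYS = 335  # assumption: "11 months" expressed in days
gap = max_gap_days(2019)
# A response just under the cutoff at one run survives until the next run.
worst_case_age = CUTOFF_DAYS + gap
print(gap, worst_case_age)
```

Under these assumptions the longest gap is 17 days (from the 15th to the 1st across a 31-day month), so a response would be removed at most about 352 days after it was given, which stays inside the one-year limit.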

Details

Due Date
Wed, Dec 18, 8:00 AM
Related Gerrit Patches:
operations/mediawiki-config (master): Add growthexperiments dblist, for puppet usage
operations/puppet (production): mediawiki: maintenance script for purging old GrowthExperiments data
mediawiki/extensions/GrowthExperiments (master): Maintenance script for deleting old welcome survey data

Event Timeline

MMiller_WMF moved this task from Inbox to Upcoming Work on the Growth-Team board. Apr 23 2019, 8:13 PM

As I was considering closing the Welcome Survey epic, I saw and remembered that this is a task we need to complete before November 2019. I am putting it into Upcoming Work so we can decide when to do it.

MMiller_WMF set Due Date to Nov 1 2019, 7:00 AM. Apr 23 2019, 8:14 PM
Restricted Application changed the subtype of this task from "Task" to "Deadline". Apr 23 2019, 8:14 PM

We will work on this task ahead of November.

MMiller_WMF edited projects, added Growth-Team; removed Growth-Team (Current Sprint).
MMiller_WMF moved this task from Q1 2019-20 to Upcoming Work on the Growth-Team board.

It's time for this to be in Upcoming Work, because we need to finish it in the next four weeks.

MMiller_WMF renamed this task from Personalized first day: anonymize data after one year to Welcome survey: anonymize data after one year. Oct 15 2019, 5:27 PM
MMiller_WMF updated the task description.

This is ready for development. First, @nettrom_WMF needs to store aggregates of the data from the month of November 2018, but we might as well store them all the way through September 2019. Maybe the right way to do this is to store aggregates as soon as a month is complete, instead of waiting until it is 12 months old?

Then the Growth engineers can pursue deleting the November 2018 data and figure out how to delete data on an ongoing basis.

I've created a subtask for the product analytics work here: T235548: Welcome survey: store aggregates

This original task will be used for the engineering work.

LGoto triaged this task as Medium priority. Oct 21 2019, 5:15 PM
LGoto moved this task from Triage to Tracking on the Product-Analytics board.
Tgr claimed this task. Oct 28 2019, 8:40 AM
Tgr updated the task description. Oct 28 2019, 8:17 PM
Tgr updated the task description.

Change 546879 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Maintenance script for deleting old welcome survey data

https://gerrit.wikimedia.org/r/546879

Change 546894 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] Add growthexperiments dblist

https://gerrit.wikimedia.org/r/546894

Change 546896 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/puppet@production] mediawiki: maintenance script for purging old GrowthExperiments data

https://gerrit.wikimedia.org/r/546896

Note: @nettrom_WMF and I decided that the work to store aggregates in T235548: Welcome survey: store aggregates does not block deploying this deletion script.

Catrope moved this task from QA to Code Review on the Growth-Team (Current Sprint) board.

Change 546879 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Maintenance script for deleting old welcome survey data

https://gerrit.wikimedia.org/r/546879

Tgr moved this task from Code Review to In Progress on the Growth-Team (Current Sprint) board. Edited Nov 2 2019, 4:25 PM

Since the script itself is merged we should run it manually now:
(todo list moved to task description)

Tgr added a comment. Nov 6 2019, 1:49 AM
tgr@mwmaint1002:~$ mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php testwiki --cutoff 350 --verbose --dry-run
Deleting data before 20181121014408 (over 350 days old) (dry run)
  Skipping user:27425, past-cutoff survey submit date 20190607131714
  Skipping user:29950, past-cutoff survey submit date 20190724014701
  Skipping user:30833, past-cutoff survey submit date 20190918212454
  Skipping user:39901, past-cutoff survey submit date 20190607172958
  Skipping user:40269, past-cutoff survey submit date 20190709193716
  Deleting survey data for user:41446
  Deleting survey data for user:41447
  Deleting survey data for user:41448
  Deleting survey data for user:41449
  Deleting survey data for user:41450
  Deleting survey data for user:41462
  Stopping at user:41467 which has past-cutoff registration date 20181121105317
Processed users up to ID 41467
Deleted: 6, skipped: 5
27425 -> Etonkovidova
29950 -> Zilant18
30833 -> Zilant1
39901 -> MMiller (WMF)
40269 -> KHarlan (WMF)
41446 -> Zilant22
41447 -> Roantest44
41448 -> Zilant23
41449 -> Jindrat
41450 -> Jindrad
41462 -> Nov-20-2018-test-account-to-suppress

That seems reasonable.
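
The dry-run output above suggests a simple pass over users in ID order: skip users whose survey was submitted after the cutoff, delete older survey data, and stop at the first user registered after the cutoff. The sketch below is inferred from the log only and is not the code of deleteOldSurveys.php; the `User` type, `plan_deletions`, and the sample rows are hypothetical. It also assumes that user IDs increase with registration time, which is what would make the early stop safe.

```python
from typing import NamedTuple, Optional

class User(NamedTuple):
    user_id: int
    registered: str                  # 14-digit YYYYMMDDHHMMSS
    survey_submitted: Optional[str]  # None if no survey data is stored

def plan_deletions(users, cutoff):
    """Classify users the way the dry-run log appears to (assumed logic)."""
    deleted, skipped, stopped_at = [], [], None
    for user in sorted(users, key=lambda u: u.user_id):
        # Fixed-width timestamps compare correctly as plain strings.
        if user.registered >= cutoff:
            stopped_at = user.user_id  # later users cannot have older data
            break
        if user.survey_submitted is None:
            continue  # nothing stored for this user
        if user.survey_submitted >= cutoff:
            skipped.append(user.user_id)  # survey is newer than the cutoff
        else:
            deleted.append(user.user_id)
    return deleted, skipped, stopped_at

# Made-up users, for illustration only.
users = [
    User(1, "20180301000000", "20190607131714"),  # old account, recent survey
    User(2, "20181101000000", "20181101000500"),  # survey older than cutoff
    User(3, "20181110000000", None),              # never answered
    User(4, "20181121105317", None),              # registered after cutoff
    User(5, "20181122000000", "20181122000100"),  # never reached
]
deleted, skipped, stopped_at = plan_deletions(users, "20181121014408")
print(deleted, skipped, stopped_at)
```

With these sample rows the pass deletes user 2, skips user 1, and stops at user 4, mirroring the three kinds of lines in the log.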

JTannerWMF changed Due Date from Nov 1 2019, 7:00 AM to Tue, Nov 19, 8:00 AM. Tue, Nov 12, 6:27 PM
Tgr added a comment. Tue, Nov 19, 3:03 AM

Dry run:

$ for wiki in arwiki cswiki euwiki kowiki testwiki viwiki; do echo -e "== $wiki ==\n"; mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php $wiki --cutoff 350 --dry-run; echo -e "\n\n"; done

== arwiki ==

Deleting data before 20181204025146 (over 350 days old) (dry run)
Processed users up to ID 1612656
Deleted: 0, skipped: 0



== cswiki ==

Deleting data before 20181204025147 (over 350 days old) (dry run)
Processed users up to ID 435266
Deleted: 750, skipped: 8



== euwiki ==

Deleting data before 20181204025148 (over 350 days old) (dry run)
Processed users up to ID 104292
Deleted: 0, skipped: 0



== kowiki ==

Deleting data before 20181204025148 (over 350 days old) (dry run)
Processed users up to ID 541632
Deleted: 901, skipped: 2



== testwiki ==

Deleting data before 20181204025150 (over 350 days old) (dry run)
Processed users up to ID 41572
Deleted: 4, skipped: 5



== viwiki ==

Deleting data before 20181204025150 (over 350 days old) (dry run)
Processed users up to ID 645792
Deleted: 0, skipped: 1

Per T207290, kowiki and cswiki were the first wikis to use WelcomeSurvey, so this matches expectations: cswiki had ~1500 new users last November and kowiki had ~2500. This assumes that either the surveys did not have a control group initially, or that being in the control group also resulted in a user option record.
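
The "Deleting data before ..." lines in the dry run are consistent with subtracting the --cutoff value, in days, from the time of the run and formatting the result as a 14-digit timestamp. A minimal sketch, assuming that is what happens; `cutoff_timestamp` is a hypothetical helper, and the run times below are back-computed from the logged cutoffs rather than recorded in this task.

```python
from datetime import datetime, timedelta

def cutoff_timestamp(now, cutoff_days):
    """Return the YYYYMMDDHHMMSS timestamp that lies cutoff_days before now."""
    return (now - timedelta(days=cutoff_days)).strftime("%Y%m%d%H%M%S")

# Assumed run time for the arwiki dry run.
arwiki_cutoff = cutoff_timestamp(datetime(2019, 11, 19, 2, 51, 46), 350)
# Assumed run time for the earlier testwiki dry run.
testwiki_cutoff = cutoff_timestamp(datetime(2019, 11, 6, 1, 44, 8), 350)
print(arwiki_cutoff, testwiki_cutoff)
```

Both values match the cutoffs printed by the script (20181204025146 and 20181121014408), which supports reading --cutoff as a number of days.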

Tgr updated the task description. Tue, Nov 19, 3:04 AM

Mentioned in SAL (#wikimedia-operations) [2019-11-19T03:51:01Z] <tgr> T208369 ran mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php cswiki --cutoff 350

Tgr added a comment. Tue, Nov 19, 3:52 AM

I will wait a day just in case something went wrong, then repeat with kowiki.

Tgr updated the task description. Tue, Nov 19, 3:52 AM

Mentioned in SAL (#wikimedia-operations) [2019-11-20T03:16:02Z] <tgr> T208369 ran mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php kowiki --cutoff 350

Tgr changed Due Date from Tue, Nov 19, 8:00 AM to Tue, Dec 3, 8:00 AM. Wed, Nov 20, 3:17 AM
Tgr moved this task from In Progress to Code Review on the Growth-Team (Current Sprint) board.
Tgr updated the task description.
Tgr updated the task description.
Restricted Application added a subscriber: revi. Wed, Nov 20, 3:18 AM

Mentioned in SAL (#wikimedia-operations) [2019-12-04T22:21:24Z] <RoanKattouw> T208369 ran mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php cswiki --cutoff 350

Catrope changed Due Date from Tue, Dec 3, 8:00 AM to Wed, Dec 18, 8:00 AM. Wed, Dec 4, 10:23 PM

Mentioned in SAL (#wikimedia-operations) [2019-12-04T22:24:37Z] <RoanKattouw> T208369 ran mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php kowiki --cutoff 350

Change 546894 merged by jenkins-bot:
[operations/mediawiki-config@master] Add growthexperiments dblist, for puppet usage

https://gerrit.wikimedia.org/r/546894

Mentioned in SAL (#wikimedia-operations) [2019-12-11T00:35:00Z] <tgr@deploy1001> Synchronized dblists/growthexperiments.dblist: SWAT: [[gerrit:546894|Add growthexperiments dblist, for puppet usage (T208369)]] (duration: 01m 02s)

Mentioned in SAL (#wikimedia-operations) [2019-12-11T00:37:45Z] <tgr@deploy1001> Synchronized wmf-config/config: SWAT: [[gerrit:546894|Add growthexperiments dblist, for puppet usage (T208369)]] (duration: 01m 01s)

Mentioned in SAL (#wikimedia-operations) [2019-12-11T00:39:14Z] <tgr@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:546894|Add growthexperiments dblist, for puppet usage (T208369)]] (duration: 01m 00s)