
Welcome survey: anonymize data after one year
Open, Medium, Public

Description

In the process of assembling our privacy statement for this work, we decided to anonymize the data collected after one year. This is a timeline that should be long enough for any programs related to being a new editor to unfold, and also makes sure we don't keep survey data indefinitely. If our feature evolves to be more like "building a user profile", or a situation where users can change or erase their responses themselves, we can revisit this.

Here are the rules we want to implement:

  • First, a year after the welcome survey responses are given, they are archived in aggregate in a monthly summary table by @nettrom_WMF that does not keep user IDs or exact timestamps. This will facilitate longitudinal analysis on how newcomer goals shift over time.
  • Then, having been archived in aggregate, the responses are deleted from the database tables. At this point, any features that rely on welcome survey responses will be (hopefully, automatically) presented to the user with the default content that a user gets if they have not answered the welcome survey.
    • Implementation: old welcomesurvey-responses user option rows are deleted. It is up to whatever code needs welcome survey data to handle the case when such data does not exist for a given user. The deletion is done by a cron job that runs twice a month, on the 1st and 15th, and deletes data older than 11 months.
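
The schedule above implies an upper bound on how long a response can be kept. A minimal sketch of that arithmetic, assuming the job runs exactly on the 1st and 15th of every month and uses the 335-day cutoff that appears in the job logs on this task (the function names are illustrative, not part of GrowthExperiments):

```python
from datetime import date

CUTOFF_DAYS = 335  # cutoff used by the scheduled job on this task (about 11 months)


def run_dates(year):
    """Yield the 1st and 15th of every month in `year`, plus Jan 1 of the next year."""
    for month in range(1, 13):
        yield date(year, month, 1)
        yield date(year, month, 15)
    yield date(year + 1, 1, 1)


def max_age_at_deletion(year):
    """Upper bound, in days, on the age of a response when it is deleted.

    A response that is just under CUTOFF_DAYS old at one run survives
    until the next run, so the bound is the cutoff plus the longest gap.
    """
    runs = list(run_dates(year))
    longest_gap = max((later - earlier).days for earlier, later in zip(runs, runs[1:]))
    return CUTOFF_DAYS + longest_gap


print(max_age_at_deletion(2020))  # 352
```

The longest gap is 17 days (from the 15th of a 31-day month to the 1st of the next), so under these assumptions no response is older than 352 days when deleted, which stays inside the one-year limit.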

The maintenance script is merged and tested on cswiki and kowiki. The cronjob patches still need to be merged.

Details

Related Gerrit Patches:
operations/puppet (production): mediawiki: maintenance script for purging old GrowthExperiments data
mediawiki/extensions/GrowthExperiments (master): DeleteOldSurveys: sanity-check cutoff parameter
operations/mediawiki-config (master): Add growthexperiments dblist, for puppet usage
mediawiki/extensions/GrowthExperiments (master): Maintenance script for deleting old welcome survey data

Event Timeline

MMiller_WMF moved this task from Inbox to Upcoming Work on the Growth-Team board. (Apr 23 2019, 8:13 PM)

As I was considering closing the Welcome Survey epic, I saw and remembered that this is a task we need to complete before November 2019. I am putting it into Upcoming Work so we can decide when to do it.

MMiller_WMF set Due Date to Nov 1 2019, 7:00 AM. (Apr 23 2019, 8:14 PM)
Restricted Application changed the subtype of this task from "Task" to "Deadline". (Apr 23 2019, 8:14 PM)

We will work on this task ahead of November.

MMiller_WMF edited projects, added Growth-Team; removed Growth-Team (Current Sprint).
MMiller_WMF moved this task from Q1 2019-20 to Upcoming Work on the Growth-Team board.

It's time for this to be in Upcoming Work, because we need to finish it in the next four weeks.

MMiller_WMF renamed this task from Personalized first day: anonymize data after one year to Welcome survey: anonymize data after one year. (Oct 15 2019, 5:27 PM)
MMiller_WMF updated the task description.

This is ready for development. First, @nettrom_WMF needs to store aggregates of the data from November 2018, but we might as well store them all the way through September 2019. Maybe the right way to do this is to store aggregates as soon as a month is complete, instead of waiting until it is 12 months old?

Then the Growth engineers can delete the November 2018 data and figure out how to delete data on an ongoing basis.

I've created a subtask for the product analytics work here: T235548: Welcome survey: store aggregates

This original task will be used for the engineering work.

LGoto triaged this task as Medium priority. (Oct 21 2019, 5:15 PM)
LGoto moved this task from Triage to Tracking on the Product-Analytics board.
Tgr claimed this task. (Oct 28 2019, 8:40 AM)
Tgr updated the task description. (Oct 28 2019, 8:17 PM)
Tgr updated the task description.

Change 546879 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Maintenance script for deleting old welcome survey data

https://gerrit.wikimedia.org/r/546879

Change 546894 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] Add growthexperiments dblist

https://gerrit.wikimedia.org/r/546894

Change 546896 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/puppet@production] mediawiki: maintenance script for purging old GrowthExperiments data

https://gerrit.wikimedia.org/r/546896

Note: @nettrom_WMF and I decided that the work to store aggregates in T235548: Welcome survey: store aggregates does not block deploying this deletion script.

Catrope moved this task from QA to Code Review on the Growth-Team (Current Sprint) board.

Change 546879 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Maintenance script for deleting old welcome survey data

https://gerrit.wikimedia.org/r/546879

Tgr moved this task from Code Review to In Progress on the Growth-Team (Current Sprint) board. (Edited, Nov 2 2019, 4:25 PM)

Since the script itself is merged we should run it manually now:
(todo list moved to task description)

Tgr added a comment. (Nov 6 2019, 1:49 AM)
tgr@mwmaint1002:~$ mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php testwiki --cutoff 350 --verbose --dry-run
Deleting data before 20181121014408 (over 350 days old) (dry run)
  Skipping user:27425, past-cutoff survey submit date 20190607131714
  Skipping user:29950, past-cutoff survey submit date 20190724014701
  Skipping user:30833, past-cutoff survey submit date 20190918212454
  Skipping user:39901, past-cutoff survey submit date 20190607172958
  Skipping user:40269, past-cutoff survey submit date 20190709193716
  Deleting survey data for user:41446
  Deleting survey data for user:41447
  Deleting survey data for user:41448
  Deleting survey data for user:41449
  Deleting survey data for user:41450
  Deleting survey data for user:41462
  Stopping at user:41467 which has past-cutoff registration date 20181121105317
Processed users up to ID 41467
Deleted: 6, skipped: 5
27425 -> Etonkovidova
29950 -> Zilant18
30833 -> Zilant1
39901 -> MMiller (WMF)
40269 -> KHarlan (WMF)
41446 -> Zilant22
41446 -> Zilant22
41447 -> Roantest44
41448 -> Zilant23
41449 -> Jindrat
41450 -> Jindrad
41462 -> Nov-20-2018-test-account-to-suppress

That seems reasonable.
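
The dry-run output above suggests how the script walks the user table: users are visited in ID order, a user whose survey was submitted after the cutoff is skipped, and the scan stops at the first user who registered after the cutoff. The following is a minimal sketch of that control flow as inferred from the log only; it is not the actual deleteOldSurveys.php code, and the `User` type, `plan_deletion` function, and sample users are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    user_id: int
    registered: str                  # MediaWiki-style timestamp, YYYYMMDDHHMMSS
    survey_submitted: Optional[str]  # None if no survey response is stored


def plan_deletion(users, cutoff):
    """Return (ids_to_delete, ids_skipped, stopped_at) for the given cutoff.

    Fixed-width YYYYMMDDHHMMSS strings sort chronologically, so plain
    string comparison is enough here.
    """
    to_delete, skipped, stopped_at = [], [], None
    for user in sorted(users, key=lambda u: u.user_id):
        if user.registered >= cutoff:
            # Assumes later IDs registered later, so nothing after this is old enough.
            stopped_at = user.user_id
            break
        if user.survey_submitted is None:
            continue  # nothing stored for this user
        if user.survey_submitted >= cutoff:
            skipped.append(user.user_id)  # registered early, answered recently
        else:
            to_delete.append(user.user_id)
    return to_delete, skipped, stopped_at


# Hypothetical users, not taken from any wiki.
sample = [
    User(1, "20181101000000", "20181101000500"),
    User(2, "20181105000000", "20190607131714"),
    User(3, "20181110000000", None),
    User(4, "20181121105317", "20181121110000"),
    User(5, "20181122000000", "20181122000100"),
]
print(plan_deletion(sample, "20181121014408"))  # ([1], [2], 4)
```

Stopping at the first post-cutoff registration keeps each run short, at the cost of relying on user IDs increasing with registration time.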

JTannerWMF changed Due Date from Nov 1 2019, 7:00 AM to Nov 19 2019, 8:00 AM. (Nov 12 2019, 6:27 PM)
Tgr added a comment. (Nov 19 2019, 3:03 AM)

Dry run:

$ for wiki in arwiki cswiki euwiki kowiki testwiki viwiki; do echo -e "== $wiki ==\n"; mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php $wiki --cutoff 350 --dry-run; echo -e "\n\n"; done

== arwiki ==

Deleting data before 20181204025146 (over 350 days old) (dry run)
Processed users up to ID 1612656
Deleted: 0, skipped: 0



== cswiki ==

Deleting data before 20181204025147 (over 350 days old) (dry run)
Processed users up to ID 435266
Deleted: 750, skipped: 8



== euwiki ==

Deleting data before 20181204025148 (over 350 days old) (dry run)
Processed users up to ID 104292
Deleted: 0, skipped: 0



== kowiki ==

Deleting data before 20181204025148 (over 350 days old) (dry run)
Processed users up to ID 541632
Deleted: 901, skipped: 2



== testwiki ==

Deleting data before 20181204025150 (over 350 days old) (dry run)
Processed users up to ID 41572
Deleted: 4, skipped: 5



== viwiki ==

Deleting data before 20181204025150 (over 350 days old) (dry run)
Processed users up to ID 645792
Deleted: 0, skipped: 1

Per T207290 kowiki and cswiki were the first to use WelcomeSurvey, so this matches expectations (cswiki had ~1500 new users last November, and kowiki had ~2500) assuming the surveys did not have a control group initially, or being in the control group resulted in a user option record as well.

Tgr updated the task description. (Nov 19 2019, 3:04 AM)

Mentioned in SAL (#wikimedia-operations) [2019-11-19T03:51:01Z] <tgr> T208369 ran mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php cswiki --cutoff 350

Tgr added a comment. (Nov 19 2019, 3:52 AM)

I will wait a day just in case something went wrong, then repeat with kowiki.

Tgr updated the task description. (Nov 19 2019, 3:52 AM)

Mentioned in SAL (#wikimedia-operations) [2019-11-20T03:16:02Z] <tgr> T208369 ran mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php kowiki --cutoff 350

Tgr changed Due Date from Nov 19 2019, 8:00 AM to Dec 3 2019, 8:00 AM. (Nov 20 2019, 3:17 AM)
Tgr moved this task from In Progress to Code Review on the Growth-Team (Current Sprint) board.
Tgr updated the task description.
Tgr updated the task description.
Restricted Application added a subscriber: revi. (Nov 20 2019, 3:18 AM)

Mentioned in SAL (#wikimedia-operations) [2019-12-04T22:21:24Z] <RoanKattouw> T208369 ran mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php cswiki --cutoff 350

Catrope changed Due Date from Dec 3 2019, 8:00 AM to Dec 18 2019, 8:00 AM. (Dec 4 2019, 10:23 PM)

Mentioned in SAL (#wikimedia-operations) [2019-12-04T22:24:37Z] <RoanKattouw> T208369 ran mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php kowiki --cutoff 350

Change 546894 merged by jenkins-bot:
[operations/mediawiki-config@master] Add growthexperiments dblist, for puppet usage

https://gerrit.wikimedia.org/r/546894

Mentioned in SAL (#wikimedia-operations) [2019-12-11T00:35:00Z] <tgr@deploy1001> Synchronized dblists/growthexperiments.dblist: SWAT: [[gerrit:546894|Add growthexperiments dblist, for puppet usage (T208369)]] (duration: 01m 02s)

Mentioned in SAL (#wikimedia-operations) [2019-12-11T00:37:45Z] <tgr@deploy1001> Synchronized wmf-config/config: SWAT: [[gerrit:546894|Add growthexperiments dblist, for puppet usage (T208369)]] (duration: 01m 01s)

Mentioned in SAL (#wikimedia-operations) [2019-12-11T00:39:14Z] <tgr@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:546894|Add growthexperiments dblist, for puppet usage (T208369)]] (duration: 01m 00s)

Tgr added a comment. (Dec 12 2019, 6:15 PM)

I made a stupid mistake while testing the cronjob command and ran --cutoff --dry-run 335 instead of --dry-run --cutoff 335 :/
PHP, of course, will happily convert --dry-run into 0 when it's in the position of a numerical parameter, and the extra positional argument apparently does not cause Maintenance to error out. I'll make sure 0 is not a valid cutoff value, to avoid this happening in the future.
I interrupted the job when I noticed what was happening (while it was running on arwiki); a proper dry-run says

arwiki:  Deleting data before 20190111174412 (over 335 days old) (dry run)
arwiki:    Stopping at user:1696701 which has past-cutoff registration date 20190715221111

and per T226221 the welcome survey was enabled on arwiki on July 15 around 18h UTC, so that affected 81 users (or rather half of them, since there was an A/B test).
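
The failure mode described in this comment can be illustrated outside PHP. The sketch below is a rough Python imitation, not the actual patch: `lenient_int` approximates a PHP-style integer cast (leading digits, otherwise 0), and `parse_cutoff` is a hypothetical strict check of the kind the comment proposes.

```python
import re


def lenient_int(value):
    """Rough imitation of a PHP-style integer cast: leading integer, else 0."""
    match = re.match(r"\s*[+-]?[0-9]+", value)
    return int(match.group()) if match else 0


def parse_cutoff(value):
    """Strictly parse a cutoff in days; reject anything but a positive integer."""
    if not re.fullmatch(r"[0-9]+", value):
        raise ValueError(f"cutoff must be a whole number of days, got {value!r}")
    days = int(value)
    if days <= 0:
        raise ValueError("cutoff must be greater than 0 days")
    return days


# With the swapped arguments, the option value is the string "--dry-run".
print(lenient_int("--dry-run"))  # 0, i.e. "delete everything older than 0 days"
print(parse_cutoff("335"))       # 335

try:
    parse_cutoff("--dry-run")
except ValueError as error:
    print("rejected:", error)
```

Rejecting the value up front turns a silent "cutoff 0" run into an immediate error, which is the behaviour the sanity-check patch is meant to provide.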

Change 556762 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] DeleteOldSurveys: sanity-check cutoff parameter

https://gerrit.wikimedia.org/r/556762

Change 556762 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] DeleteOldSurveys: sanity-check cutoff parameter

https://gerrit.wikimedia.org/r/556762

Change 546896 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki: maintenance script for purging old GrowthExperiments data

https://gerrit.wikimedia.org/r/546896

Putting this in QA until we have seen the results from the job (on January 1, or maybe sooner by triggering the systemd job manually).

Tgr removed Due Date. (Dec 23 2019, 8:13 PM)

Doesn't really have a due date anymore since the job is now running every two weeks.

Checked the script with --dry-run in betalabs; it works as expected. It's now up to @nettrom_WMF to do a final check on the task.

Tgr added a comment. (Jan 7 2020, 11:27 PM)

Logs look reasonable:

Jan 1 03:15:01 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: -----------------------------------------------------------------
Jan 1 03:15:01 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: arwiki
Jan 1 03:15:01 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: -----------------------------------------------------------------
Jan 1 03:15:01 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: arwiki: Deleting data before 20190131031501 (over 335 days old)
Jan 1 03:15:02 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: arwiki: Processed users up to ID 1696701
Jan 1 03:15:02 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: arwiki: Deleted: 0, skipped: 0
Jan 1 03:15:02 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: -----------------------------------------------------------------
Jan 1 03:15:02 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: cswiki
Jan 1 03:15:02 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: -----------------------------------------------------------------
Jan 1 03:15:02 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: cswiki: Deleting data before 20190131031502 (over 335 days old)
Jan 1 03:16:23 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: cswiki: Processed users up to ID 439974
Jan 1 03:17:40 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: cswiki: Processed users up to ID 442327
Jan 1 03:17:40 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: cswiki: Deleted: 911, skipped: 0
Jan 1 03:17:40 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: -----------------------------------------------------------------
Jan 1 03:17:40 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: euwiki
Jan 1 03:17:40 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: -----------------------------------------------------------------
Jan 1 03:17:40 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: euwiki: Deleting data before 20190131031740 (over 335 days old)
Jan 1 03:17:41 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: euwiki: Processed users up to ID 104292
Jan 1 03:17:41 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: euwiki: Deleted: 0, skipped: 0
Jan 1 03:17:41 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: -----------------------------------------------------------------
Jan 1 03:17:41 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: kowiki
Jan 1 03:17:41 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: -----------------------------------------------------------------
Jan 1 03:17:41 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: kowiki: Deleting data before 20190131031741 (over 335 days old)
Jan 1 03:19:04 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: kowiki: Processed users up to ID 545993
Jan 1 03:20:25 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: kowiki: Processed users up to ID 547037
Jan 1 03:21:44 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: kowiki: Processed users up to ID 548405
Jan 1 03:23:02 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: kowiki: Processed users up to ID 549526
Jan 1 03:24:23 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: kowiki: Processed users up to ID 552616
Jan 1 03:24:45 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: kowiki: Processed users up to ID 553551
Jan 1 03:24:45 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: kowiki: Deleted: 273, skipped: 0
Jan 1 03:24:45 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: -----------------------------------------------------------------
Jan 1 03:24:45 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: testwiki
Jan 1 03:24:45 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: -----------------------------------------------------------------
Jan 1 03:24:45 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: testwiki: Deleting data before 20190131032445 (over 335 days old)
Jan 1 03:24:48 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: testwiki: Processed users up to ID 42264
Jan 1 03:24:48 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: testwiki: Deleted: 29, skipped: 6
Jan 1 03:24:48 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: -----------------------------------------------------------------
Jan 1 03:24:48 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: viwiki
Jan 1 03:24:48 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: -----------------------------------------------------------------
Jan 1 03:24:48 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: viwiki: Deleting data before 20190131032448 (over 335 days old)
Jan 1 03:25:46 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: viwiki: Processed users up to ID 646997
Jan 1 03:25:46 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: viwiki: Deleted: 695, skipped: 1
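
For future runs, the per-wiki summary lines can be pulled out of the journal output mechanically instead of by eye. A minimal sketch, assuming the "<wiki>: Deleted: N, skipped: M" line format seen in this log stays the same; the `summarize` helper is hypothetical and not part of the extension.

```python
import re

# Matches e.g. "cswiki: Deleted: 911, skipped: 0" anywhere in a log line.
SUMMARY = re.compile(r"(?P<wiki>\w+): Deleted: (?P<deleted>\d+), skipped: (?P<skipped>\d+)")


def summarize(log_lines):
    """Map each wiki name to its (deleted, skipped) counts from the job log."""
    totals = {}
    for line in log_lines:
        match = SUMMARY.search(line)
        if match:
            totals[match["wiki"]] = (int(match["deleted"]), int(match["skipped"]))
    return totals


PREFIX = "Jan 1 03:15:02 mwmaint1002 mediawiki_job_growthexperiments-deleteOldSurveys[149384]: "
lines = [
    PREFIX + "arwiki: Deleted: 0, skipped: 0",
    PREFIX + "cswiki: Deleting data before 20190131031502 (over 335 days old)",
    PREFIX + "cswiki: Deleted: 911, skipped: 0",
    PREFIX + "kowiki: Deleted: 273, skipped: 0",
]
counts = summarize(lines)
print(counts["cswiki"])                                  # (911, 0)
print(sum(deleted for deleted, _ in counts.values()))    # 1184
```

The sample lines reuse counts from the log above with a shared timestamp prefix for brevity; feeding in the real journal lines works the same way.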