Partially purge MobileWikiAppiOSUserHistory eventlogging schema
Closed, ResolvedPublic8 Story Points

Description

As agreed upon T192819, we'd like to partially purge MobileWikiAppiOSUserHistory after 90 days. Specifically, we'd like to purge the IP address and user agent fields, but keep os_family, os_major, os_minor, wmf_app_version from the useragent struct, country from the geocoded_data map, and everything else in the table.

Since this table will be blacklisted on MySQL (the patch hasn't been merged yet), according to this data purging doc, it seems I cannot use the purging whitelist tsv to partially purge MobileWikiAppiOSUserHistory. Also some custom scripts may be needed to extract the above mentioned fields from the struct/map. @mforns can you help with this? Thank you!

chelsyx created this task.May 21 2018, 10:59 PM
chelsyx moved this task from Triage to Tracking on the Product-Analytics board.

@chelsyx

@mforns can you help with this? Thank you!

Sure I'll try to help :]

we'd like to purge the IP address and user agent fields, but keep os_family, os_major, os_minor, wmf_app_version from the useragent struct, country from the geocoded_data map, and everything else in the table.

Regarding IP address: Sure, let's purge it.

Regarding user agent: When EL events reach Hive, they don't have an unparsed User-Agent string any more, so we'd only have to purge the sub-fields inside the useragent (map) that we don't want to keep. This is possible with the new white-list format, so no problemo. Maybe I'd advice to also purge the OS minor version, because it starts to be identifying in some cases. Is it so important? I imagine it will be useful to observe bugs, but those can be spotted within the 90 day frame, no?

Regarding "everything else in the table": I think the fields in the schema are fine with the exception of appInstallId. But this field is being discussed together with all other schemas that contain it in T178174. So, I think best will be to white-list it for now, and apply whichever process we decide in T178174 to it later on. So yea, I think we can white-list everything else.

according to this data purging doc, it seems I cannot use the purging whitelist tsv to partially purge MobileWikiAppiOSUserHistory. Also some custom scripts may be needed to extract the above mentioned fields from the struct/map.

The docs you mention refer to the fact that for MySQL you can not partially purge a map field, like useragent. But for Hive that is indeed possible (with the new white-list format), and no scripts will be needed, the white-list supports nested structures.

As the new white-list format is not yet documented, because I'm still troubleshooting the productionization of EventLogging purging in Hive, I can write that change for you, and let you review it.

mforns claimed this task.May 24 2018, 4:04 PM
mforns added a project: Analytics-Kanban.
mforns set the point value for this task to 3.
mforns triaged this task as Normal priority.
mforns moved this task from Incoming to Data Quality on the Analytics board.
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board.

Thanks @mforns !

Maybe I'd advice to also purge the OS minor version, because it starts to be identifying in some cases. Is it so important? I imagine it will be useful to observe bugs, but those can be spotted within the 90 day frame, no?

@JMinor Any thoughts about this?

Maybe I'd advice to also purge the OS minor version, because it starts to be identifying in some cases. Is it so important? I imagine it will be useful to observe bugs, but those can be spotted within the 90 day frame, no?

I'd suggest we make this decision on the basis of some specific concern or evidence this is identifying. Until then I think it has value for debugging and should be preserved.

@JMinor @chelsyx

I'd suggest we make this decision on the basis of some specific concern or evidence this is identifying.

Yea, agree. I will try to explain my concerns and the criteria I followed when proposing to purge os_minor.

The main concern in this case is that, given a known user, there exists a way to trace back and obtain data we collected about their online activity that can be harmful to their privacy. This would be the case of a subpoena asking for the data of a given user.

We assume the third party that wants the data knows that user's context, like: personal information, location and devices they use online. So they might know the user's country, OS family, and OS major and minor versions.

If we keep a record in our database that matches that user's country + OS family + OS major + OS minor, we might be giving out, in certain situations, the user's online activity in our systems. It all depends on how rare/uncommon those fields are for that user (whether the country is small enough, and the OS versions are rare enough). And also it depends on a lot of external information that we can not control that can give enough context to the third party to re-identify the user's data.

IMO it is practically impossible to matematically calculate the probability of the threat to that user's privacy, because of that external information/context that we do not control. But one thing is certain, by keeping the os_minor field (or any identifying field), we are incrementing that probability by a factor.

As a proxy to calculate that factor I use the bucket size of our OS versions data. The smaller the buckets, the more identifying. To get a quick number, I queried pageview_hourly table in Hive twice: first with OS family and OS major, and then with OS family, OS major and OS minor. See results:

without os_minor
SELECT
    user_agent_map['os_family'] AS os_family,
    user_agent_map['os_major'] AS os_major,
    SUM(view_count) AS view_count
FROM
    wmf.pageview_hourly
WHERE
    year = 2018
    AND month = 5
    AND day = 10
    AND agent_type = 'user'
    AND access_method = 'mobile web'
GROUP BY
    user_agent_map['os_family'],
    user_agent_map['os_major']
ORDER BY
    os_family,
    os_major
LIMIT 1000
;

os_family	os_major	view_count
Android	-	2625664
Android	0	3078
Android	1	4456
Android	10	215
Android	100	9
Android	1000	5
Android	100000000000	1
Android	11	42
Android	12	20
Android	13	10
Android	165	3
Android	2	663744
Android	20	3
Android	3	25569
Android	30	17
Android	4	21935656
Android	5	24879548
Android	54	1
Android	6	31933897
Android	666	1
Android	7	56569539
Android	8	17454192
Android	85	1
Android	8567854	2
Android	9	240
Android	94	2
Android	98	3
Android	99	9
Android	999	2
Android	99999	4
Android	9999999	3
Android	99999999	1
Android	999999999999	2
BREW	-	3719
BREW	1	7
BREW	3	174
BREW	4	1
BREW	5	5
Bada	-	60
Bada	1	3811
Bada	2	2335
BlackBerry OS	-	18069
BlackBerry OS	10	118271
BlackBerry OS	4	302
BlackBerry OS	5	3464
BlackBerry OS	6	5428
BlackBerry OS	7	27789
BlackBerry OS	9	3
BlackBerry Tablet OS	1	4
BlackBerry Tablet OS	2	50
Brew MP	1	993
CentOS	-	59
Chrome OS	1	1
Chrome OS	10032	199
Chrome OS	10176	409
Chrome OS	10323	3178
Chrome OS	10452	899
Chrome OS	10575	122
Chrome OS	10635	12
Chrome OS	10643	12
Chrome OS	1193	1
Chrome OS	3912	45
Chrome OS	4319	5
Chrome OS	4537	2
Chrome OS	6310	30
Chrome OS	7262	1
Chrome OS	8530	1
Chrome OS	8872	2
Chrome OS	9000	3
Chrome OS	9202	2
Chrome OS	9334	3
Chrome OS	9592	20
Chrome OS	9765	40
Chrome OS	9901	93
Debian	-	200
Fedora	-	520
Fedora	1	139
Fedora	2	92
Fedora	3	143
Fedora	4	1
Firefox OS	-	651098
Firefox OS	1	2189
Firefox OS	2	3828
FreeBSD	-	115
FreeBSD	4	4
Gentoo	-	211
Kindle	1	18
Kindle	2	67
Kindle	3	42193
Kubuntu	-	11
Linux	-	258657
Linux	2	147
Linux	3	2
Linux	4	63
Linux Mint	-	9
Linux Mint	5	15
Linux Mint	6	16
Linux Mint	7	17
Linux Mint	9	1
Mac OS	-	39
Mac OS X	-	130349
Mac OS X	10	397269
Mac OS X	537	10
Maemo	-	5742
Mandriva	-	7
Mandriva	2007	1
Mandriva	2008	7
Mandriva	2009	8
MeeGo	-	17500
NetBSD	-	29
NetBSD	1	5
NetBSD	5	1
Nokia Series 40	-	1
OpenBSD	-	1020
Other	-	3164073
Philips	2012	96
Philips	2013	72
Philips	2017	2
Red Hat	-	2
Red Hat	1	41
Red Hat	3	19
SUSE	-	266
Slackware	-	274
Slackware	13	16
Slackware	2	2
Solaris	-	161
Symbian OS	-	39394
Symbian OS	24	9
Symbian OS	6	2
Symbian OS	7	18585
Symbian OS	9	5639
Symbian^3	-	4520
Symbian^3 Anna	-	2223
Symbian^3 Belle	-	788
Ubuntu	-	15292
Ubuntu	1	6
Ubuntu	10	275
Ubuntu	11	382
Ubuntu	12	1080
Ubuntu	14	178
Ubuntu	16	1
Ubuntu	7	113
Ubuntu	8	924
Ubuntu	9	314
VRE	-	1341
WebTV	1	7
WebTV	2	6
Windows	-	46349
Windows 10	-	1474874
Windows 2000	-	1005
Windows 3.1	-	273
Windows 7	-	596499
Windows 8	-	33323
Windows 8.1	-	69100
Windows 95	-	418
Windows 98	-	29854
Windows CE	-	477315
Windows ME	-	34
Windows Mobile	-	988
Windows NT	-	40
Windows NT 4.0	-	584
Windows Phone	-	154
Windows Phone	10	504407
Windows Phone	6	111
Windows Phone	7	28227
Windows Phone	8	1157318
Windows RT	-	1791
Windows RT 8.1	-	312
Windows Vista	-	9186
Windows XP	-	232697
iOS	-	5186
iOS	10	12225973
iOS	11	88371211
iOS	12	1319
iOS	13	86
iOS	2	424
iOS	3	14913
iOS	4	106458
iOS	5	203053
iOS	6	152683
iOS	7	566268
iOS	8	611266
iOS	9	4039890
webOS	1	63
webOS	2	114
Time taken: 84.223 seconds, Fetched: 185 row(s)
with os_minor
SELECT
    user_agent_map['os_family'] AS os_family,
    user_agent_map['os_major'] AS os_major,
    user_agent_map['os_minor'] AS os_minor,
    SUM(view_count) AS view_count
FROM
    wmf.pageview_hourly
WHERE
    year = 2018
    AND month = 5
    AND day = 10
    AND agent_type = 'user'
    AND access_method = 'mobile web'
GROUP BY
    user_agent_map['os_family'],
    user_agent_map['os_major'],
    user_agent_map['os_minor']
ORDER BY
    os_family,
    os_major,
    os_minor
LIMIT 1000
;

os_family	os_major	os_minor	view_count
Android	-	-	2625664
Android	0	0	45
Android	0	1	130
Android	0	5	2902
Android	0	7	1
Android	1	0	3540
Android	1	1	2
Android	1	2	10
Android	1	3	2
Android	1	31	2
Android	1	5	328
Android	1	6	571
Android	1	7	1
Android	10	0	39
Android	10	1	16
Android	10	10	10
Android	10	2	43
Android	10	4	4
Android	10	49	72
Android	10	5	1
Android	10	6	14
Android	10	8	2
Android	10	9	14
Android	100	5	9
Android	1000	0	3
Android	1000	500	2
Android	100000000000	9	1
Android	11	0	31
Android	11	1	1
Android	11	11	1
Android	11	2	9
Android	12	0	3
Android	12	1	8
Android	12	4	9
Android	13	0	9
Android	13	1	1
Android	165	0	3
Android	2	0	3529
Android	2	1	29022
Android	2	2	38828
Android	2	3	592315
Android	2	4	1
Android	2	5	30
Android	2	9	19
Android	20	1	1
Android	20	8	1
Android	20	9	1
Android	3	0	733
Android	3	1	7821
Android	3	10	2
Android	3	2	16998
Android	3	8	1
Android	3	9	14
Android	30	0	1
Android	30	1	16
Android	4	0	597146
Android	4	03	5
Android	4	1	1118148
Android	4	2	9277048
Android	4	3	705765
Android	4	4	10237437
Android	4	42	22
Android	4	5	74
Android	4	6	2
Android	4	7	1
Android	4	8	5
Android	4	9	3
Android	5	0	6854598
Android	5	01	1
Android	5	02	4
Android	5	03	1
Android	5	1	18024105
Android	5	11	1
Android	5	12	523
Android	5	2	173
Android	5	3	13
Android	5	4	77
Android	5	5	30
Android	5	6	2
Android	5	7	18
Android	5	9	2
Android	54	34	1
Android	6	0	31919107
Android	6	01	3
Android	6	1	14580
Android	6	10	1
Android	6	2	16
Android	6	3	3
Android	6	4	30
Android	6	42	2
Android	6	5	24
Android	6	53	93
Android	6	56	4
Android	6	6	18
Android	6	60	14
Android	6	9	2
Android	666	69	1
Android	7	0	44413003
Android	7	01	4
Android	7	1	12156479
Android	7	14	1
Android	7	2	26
Android	7	4	9
Android	7	5	2
Android	7	6	1
Android	7	7	10
Android	7	9	4
Android	8	0	14514183
Android	8	0000	31
Android	8	1	2939915
Android	8	2	17
Android	8	3	11
Android	8	4	24
Android	8	5	1
Android	8	6	1
Android	8	8	8
Android	8	9	1
Android	85	0	1
Android	8567854	676854	2
Android	9	0	150
Android	9	1	47
Android	9	2	4
Android	9	3	5
Android	9	4	1
Android	9	5	4
Android	9	7	1
Android	9	8	2
Android	9	9	26
Android	94	0	2
Android	98	5	3
Android	99	0	7
Android	99	99	2
Android	999	9	1
Android	999	999	1
Android	99999	9	3
Android	99999	9999	1
Android	9999999	9	3
Android	99999999	0	1
Android	999999999999	999999999999	2
BREW	-	-	3719
BREW	1	0	7
BREW	3	1	174
BREW	4	0	1
BREW	5	0	5
Bada	-	-	60
Bada	1	0	3080
Bada	1	1	14
Bada	1	2	717
Bada	2	0	2335
BlackBerry OS	-	-	18069
BlackBerry OS	10	0	336
BlackBerry OS	10	1	1371
BlackBerry OS	10	2	7317
BlackBerry OS	10	3	109247
BlackBerry OS	4	0	51
BlackBerry OS	4	1	1
BlackBerry OS	4	3	1
BlackBerry OS	4	5	72
BlackBerry OS	4	6	174
BlackBerry OS	4	7	3
BlackBerry OS	5	0	3459
BlackBerry OS	5	1	3
BlackBerry OS	5	2	2
BlackBerry OS	6	0	5428
BlackBerry OS	7	0	10094
BlackBerry OS	7	1	17695
BlackBerry OS	9	49	3
BlackBerry Tablet OS	1	0	4
BlackBerry Tablet OS	2	1	50
Brew MP	1	0	993
CentOS	-	-	59
Chrome OS	1	22	1
Chrome OS	10032	71	1
Chrome OS	10032	75	12
Chrome OS	10032	86	186
Chrome OS	10176	66	12
Chrome OS	10176	68	6
Chrome OS	10176	72	38
Chrome OS	10176	73	4
Chrome OS	10176	76	349
Chrome OS	10323	30	2
Chrome OS	10323	46	42
Chrome OS	10323	58	4
Chrome OS	10323	62	209
Chrome OS	10323	67	2904
Chrome OS	10323	68	17
Chrome OS	10452	42	3
Chrome OS	10452	69	11
Chrome OS	10452	74	702
Chrome OS	10452	85	183
Chrome OS	10575	17	1
Chrome OS	10575	22	117
Chrome OS	10575	32	2
Chrome OS	10575	8	2
Chrome OS	10635	0	12
Chrome OS	10643	0	12
Chrome OS	1193	158	1
Chrome OS	3912	101	45
Chrome OS	4319	74	5
Chrome OS	4537	56	2
Chrome OS	6310	68	30
Chrome OS	7262	57	1
Chrome OS	8530	96	1
Chrome OS	8872	73	2
Chrome OS	9000	87	2
Chrome OS	9000	91	1
Chrome OS	9202	56	2
Chrome OS	9334	72	3
Chrome OS	9592	96	20
Chrome OS	9765	81	6
Chrome OS	9765	85	34
Chrome OS	9901	66	7
Chrome OS	9901	77	86
Debian	-	-	200
Fedora	-	-	520
Fedora	1	0	27
Fedora	1	5	109
Fedora	1	9	3
Fedora	2	0	92
Fedora	3	0	61
Fedora	3	1	5
Fedora	3	5	70
Fedora	3	6	7
Fedora	4	3	1
Firefox OS	-	-	651098
Firefox OS	1	0	66
Firefox OS	1	1	553
Firefox OS	1	2	40
Firefox OS	1	3	1513
Firefox OS	1	4	17
Firefox OS	2	0	3591
Firefox OS	2	1	237
FreeBSD	-	-	115
FreeBSD	4	4	4
Gentoo	-	-	211
Kindle	1	0	18
Kindle	2	0	16
Kindle	2	1	8
Kindle	2	3	15
Kindle	2	5	28
Kindle	3	0	42193
Kubuntu	-	-	11
Linux	-	-	258657
Linux	2	0	24
Linux	2	2	17
Linux	2	4	43
Linux	2	6	63
Linux	3	2	1
Linux	3	8	1
Linux	4	15	8
Linux	4	4	55
Linux Mint	-	-	9
Linux Mint	5	-	15
Linux Mint	6	-	16
Linux Mint	7	-	17
Linux Mint	9	-	1
Mac OS	-	-	39
Mac OS X	-	-	130349
Mac OS X	10	10	46801
Mac OS X	10	11	35455
Mac OS X	10	12	30468
Mac OS X	10	13	263269
Mac OS X	10	14	27
Mac OS X	10	15	1
Mac OS X	10	2	1
Mac OS X	10	20	2
Mac OS X	10	21	1
Mac OS X	10	22	1
Mac OS X	10	23	1
Mac OS X	10	27	2
Mac OS X	10	31	1
Mac OS X	10	32	3
Mac OS X	10	33	1
Mac OS X	10	34	1
Mac OS X	10	35	1
Mac OS X	10	37	2
Mac OS X	10	39	3
Mac OS X	10	4	345
Mac OS X	10	40	1
Mac OS X	10	42	2
Mac OS X	10	44	2
Mac OS X	10	47	3
Mac OS X	10	48	1
Mac OS X	10	49	1
Mac OS X	10	5	1004
Mac OS X	10	50	1
Mac OS X	10	52	1
Mac OS X	10	53	3
Mac OS X	10	55	1
Mac OS X	10	6	3270
Mac OS X	10	7	2212
Mac OS X	10	8	10886
Mac OS X	10	9	3495
Mac OS X	537	51	10
Maemo	-	-	5742
Mandriva	-	-	7
Mandriva	2007	0	1
Mandriva	2008	1	7
Mandriva	2009	0	2
Mandriva	2009	1	6
MeeGo	-	-	17500
NetBSD	-	-	29
NetBSD	1	6	5
NetBSD	5	0	1
Nokia Series 40	-	-	1
OpenBSD	-	-	1020
Other	-	-	3164073
Philips	2012	-	96
Philips	2013	-	72
Philips	2017	-	2
Red Hat	-	-	2
Red Hat	1	0	14
Red Hat	1	5	27
Red Hat	3	0	19
SUSE	-	-	266
Slackware	-	-	274
Slackware	13	0	9
Slackware	13	37	7
Slackware	2	6	2
Solaris	-	-	161
Symbian OS	-	-	39394
Symbian OS	24	838	9
Symbian OS	6	1	2
Symbian OS	7	0	18585
Symbian OS	9	1	18
Symbian OS	9	2	310
Symbian OS	9	3	325
Symbian OS	9	4	4986
Symbian^3	-	-	4520
Symbian^3 Anna	-	-	2223
Symbian^3 Belle	-	-	788
Ubuntu	-	-	15292
Ubuntu	1	4	1
Ubuntu	1	5	5
Ubuntu	10	04	253
Ubuntu	10	10	22
Ubuntu	11	04	370
Ubuntu	11	10	12
Ubuntu	12	04	1080
Ubuntu	14	04	2
Ubuntu	14	10	176
Ubuntu	16	04	1
Ubuntu	7	04	3
Ubuntu	7	10	110
Ubuntu	8	04	187
Ubuntu	8	10	737
Ubuntu	9	04	237
Ubuntu	9	10	65
Ubuntu	9	25	12
VRE	-	-	1341
WebTV	1	2	7
WebTV	2	6	6
Windows	-	-	46349
Windows 10	-	-	1474874
Windows 2000	-	-	1005
Windows 3.1	-	-	273
Windows 7	-	-	596499
Windows 8	-	-	33323
Windows 8.1	-	-	69100
Windows 95	-	-	418
Windows 98	-	-	29854
Windows CE	-	-	477315
Windows ME	-	-	34
Windows Mobile	-	-	988
Windows NT	-	-	40
Windows NT 4.0	-	-	584
Windows Phone	-	-	154
Windows Phone	10	0	504407
Windows Phone	6	5	111
Windows Phone	7	0	317
Windows Phone	7	5	27910
Windows Phone	8	0	1012730
Windows Phone	8	1	140038
Windows Phone	8	10	4550
Windows RT	-	-	1791
Windows RT 8.1	-	-	312
Windows Vista	-	-	9186
Windows XP	-	-	232697
iOS	-	-	5186
iOS	10	0	696996
iOS	10	1	771863
iOS	10	10	1
iOS	10	2	2416041
iOS	10	3	8341038
iOS	10	4	1
iOS	10	54	8
iOS	10	7	2
iOS	10	9	23
iOS	11	0	2928752
iOS	11	1	3524020
iOS	11	11	1
iOS	11	2	20024324
iOS	11	3	61550115
iOS	11	4	343999
iOS	12	0	1319
iOS	13	0	86
iOS	2	0	70
iOS	2	1	38
iOS	2	2	316
iOS	3	0	1342
iOS	3	1	1714
iOS	3	18	1
iOS	3	2	11856
iOS	4	0	21603
iOS	4	1	2554
iOS	4	2	51215
iOS	4	3	31086
iOS	5	0	21607
iOS	5	1	181446
iOS	6	0	46074
iOS	6	1	106607
iOS	6	3	2
iOS	7	0	139247
iOS	7	01	4
iOS	7	06	3
iOS	7	1	426998
iOS	7	2	11
iOS	7	3	5
iOS	8	0	34571
iOS	8	1	200230
iOS	8	2	45287
iOS	8	3	129688
iOS	8	4	201475
iOS	8	5	13
iOS	8	6	2
iOS	9	0	87116
iOS	9	1	137904
iOS	9	2	254096
iOS	9	3	3559348
iOS	9	4	39
iOS	9	5	1387
webOS	1	0	12
webOS	1	3	11
webOS	1	4	40
webOS	2	1	63
webOS	2	2	51
Time taken: 78.2 seconds, Fetched: 436 row(s)

The median of the bucket size without os_minor is 139, while the median with os_minor is 19.5. That indicates that adding os_minor makes the data set around 7 times more "granular", hence more identifiable. This is a proxy, just to give an idea.

Also, the fact that the discussed schema contains the appInstallId field, makes it linkable to all other schemas that share it. So by adding the os_minor field just here, we're incrementing the potential of identification by a factor for all other 17 schemas that have an appInstallId.

That was my criteria. I might be wrong, though. I'm open to your ideas.
I also proposed to purge os_minor, because I thought that 90 days of os_minor data was enough to troubleshoot bugs.

Cheers

Nuria added a comment.Jun 13 2018, 9:49 PM

ping @JMinor is info provided sufficient to quantify risk?

I'm so sorry for the late response. Things has been a bit crazy on my end...

@mforns Yes, I agree with you that with os_minor a user would be more identifiable since there are fewer users in each bucket comparing to the case without os_minor. However, the difference is smaller for iOS app, which is the only platform this table is going to get data from. I ran the following two queries for iOS app, which are similar to yours. The median of the bucket size without os_minor is 10619.5, while the median with os_minor is 2641.5. The difference is around 4 times, and the bucket size is much larger.

iOS without os_minor
SELECT
    user_agent_map['os_family'] AS os_family,
    user_agent_map['os_major'] AS os_major,
    SUM(view_count) AS view_count
FROM
    wmf.pageview_hourly
WHERE
    year = 2018
    AND month = 6
    AND day = 10
    AND agent_type = 'user'
    AND access_method = 'mobile app'
    AND user_agent_map['os_family'] = 'iOS'
GROUP BY
    user_agent_map['os_family'],
    user_agent_map['os_major']
ORDER BY
    os_family,
    os_major
LIMIT 1000
;

os_family	os_major	view_count
iOS	10	202785
iOS	11	3221609
iOS	12	17102
iOS	13	1
iOS	6	761
iOS	7	4137
iOS	8	3089
iOS	9	64618
8 rows selected (68.251 seconds)
iOS with os_minor
SELECT
    user_agent_map['os_family'] AS os_family,
    user_agent_map['os_major'] AS os_major,
    user_agent_map['os_minor'] AS os_minor,
    SUM(view_count) AS view_count
FROM
    wmf.pageview_hourly
WHERE
    year = 2018
    AND month = 6
    AND day = 10
    AND agent_type = 'user'
    AND access_method = 'mobile app'
    AND user_agent_map['os_family'] = 'iOS'
GROUP BY
    user_agent_map['os_family'],
    user_agent_map['os_major'],
    user_agent_map['os_minor']
ORDER BY
    os_family,
    os_major,
    os_minor
LIMIT 1000
;

os_family	os_major	os_minor	view_count
iOS	10	0	7023
iOS	10	1	8525
iOS	10	2	26643
iOS	10	3	160594
iOS	11	0	40573
iOS	11	1	51401
iOS	11	2	265024
iOS	11	3	1382746
iOS	11	4	1481865
iOS	12	0	17102
iOS	13	0	1
iOS	6	0	30
iOS	6	1	731
iOS	7	0	364
iOS	7	1	3773
iOS	8	0	74
iOS	8	1	921
iOS	8	2	267
iOS	8	3	633
iOS	8	4	1194
iOS	9	0	609
iOS	9	1	521
iOS	9	2	1510
iOS	9	3	61978
24 rows selected (66.257 seconds)

Additionally, I believe it's very hard to identify a user by the os_minor field in this particular table MobileWikiAppiOSUserHistory, because for anyone who has a record in this table, they have to:

  1. Agree to share their usage report with us. The approximate probability of a user agree to share their usage report with us is around 10%, according to the graph in T130432.

AND

  1. The user is in one of the top 50 countries with the most visitors in this list (from US to Egypt). This has been approved by legal in T192819#4180467.

AND

  1. To be identifiable by os_minor, the user has to be someone who hasn't update their iOS system for a while, or an iOS beta user. From the numbers above, we can see that more than 90% of them are using iOS 11. Even if we look at the os_minor, more than 80% of them are using iOS 11.3 or 11.4 (latest).

So my point is, yes, the bucket size could be very small for some subset of the users, but the chance for a user to fall into such small bucket is very very low.

Tbayer added a subscriber: Tbayer.Sat, Jun 23, 3:14 AM

Change 442076 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] Add inline comments to WhitelistSanitization

https://gerrit.wikimedia.org/r/442076

Change 442076 merged by Ottomata:
[analytics/refinery/source@master] Add inline comments to WhitelistSanitization

https://gerrit.wikimedia.org/r/442076

Sorry, this change was not meant to be linked to this task... please ignore.

Hi @chelsyx!
Also forgive me for the late response, your thorough report gave me a lot to think about.

However, the difference is smaller for iOS app, which is the only platform this table is going to get data from.

Sorry for missing that obvious fact. Indeed, the factor is smaller.

The user is in one of the top 50 countries with the most visitors in this list (from US to Egypt). This has been approved by legal in T192819#4180467.

That's great. This will prevent many small buckets to be collected in any case.

Agree to share their usage report with us. The approximate probability of a user agree to share their usage report with us is around 10%, according to the graph in T130432.

I think the graph in T130432 has changed since then. I executed the provided query and on 2017-04-20 there is a significant x8 jump up, and in 1 year since then it has grown around 75% (don't think this is the natural growth of iOS app usage). Not sure which percentage would it be, but definitely higher.

the chance for a user to fall into such small bucket is very very low.

This has made me reflect a lot. Intuitively it seems obvious to me. But on the other hand, I don't think we're currently able to determine the exact value (nor even an aproximation to the order of magnitude) of this probability. And also more importantly, we don't know how "very low" is low enough (10%? 1%? 0.1%? 0.01%). I think to do those 2 requirements we're missing a process and a clear criteria that helps us in discussions like this one. Maybe when we hire the Privacy Engineer, we'll be able to tackle this!


So my personal conclusion/guess of all this is that whitelisting os_family, os_major, os_minor, country and wmf_app_version fields is still risky. I would try to risk as few as possible and try to obtain the benefits of os_minor without keeping it for more than 90 days. If the value of os_minor is purely for troubleshooting, I guess 90 days would be enough for that, also because it's not likely to troubleshoot a problem that is not happening currently. If its value goes beyond that, I'd try to calculate metrics on it during the 90 day window, and store them in privacy-safe reports (that do not contain other sensitive data, like appInstallId).

But this is just my opinion, I don't have any authority on this subject. I just happen to have done these analyses a couple times in the last 3 years. But my opinion is clearly biased. I believe that privacy is one of the few assets of the WMF: our software is open, our data is free, any big IT company would be able to provide infrastructure to run Wikipedia. So the only thing I can see that remains "ours" (appart from us WMF people) is the trust the Wikimedia community has in the WMF, which I think can be fostered by trying to risk as few as possible privacy-wise, among others.

Heh, sorry for the philosophical divagations.
Cheers!

Vvjjkkii renamed this task from Partially purge MobileWikiAppiOSUserHistory eventlogging schema to 6icaaaaaaa.Sun, Jul 1, 1:08 AM
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii removed mforns as the assignee of this task.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed the point value for this task.
Vvjjkkii removed a subscriber: gerritbot.
JAllemandou renamed this task from 6icaaaaaaa to Partially purge MobileWikiAppiOSUserHistory eventlogging schema.Mon, Jul 2, 3:11 PM

@mforns

I think the graph in T130432 has changed since then. I executed the provided query and on 2017-04-20 there is a significant x8 jump up, and in 1 year since then it has grown around 75% (don't think this is the natural growth of iOS app usage). Not sure which percentage would it be, but definitely higher.

Yeah that's true. I don't have a better way to estimate the percentage of app users who agree to share their usage report with us and the graph in T130432 is the best approximate I can think of. @Tbayer may have a better idea.

I don't think we're currently able to determine the exact value (nor even an aproximation to the order of magnitude) of this probability. And also more importantly, we don't know how "very low" is low enough (10%? 1%? 0.1%? 0.01%). I think to do those 2 requirements we're missing a process and a clear criteria that helps us in discussions like this one.

I agree with you. Hopefully we can solve this problem in the near future!

For this particular case, let's not whitelist the os_minor field for now. We released the new version about a week ago and I need to verify that we are collecting data from users as expected. After that, I will run another analysis using the data from this table and check the bucket size to see how small it is.


In our next release (about 1-2 months later), we will add several new fields to this table and I want to whitelist them as well. Should I create a new ticket for that? Or can I whitelist them myself? Thanks!

mforns added a comment.Tue, Jul 3, 1:55 PM

For this particular case, let's not whitelist the os_minor field for now. We released the new version about a week ago and I need to verify that we are collecting data from users as expected. After that, I will run another analysis using the data from this table and check the bucket size to see how small it is.

Ok, I will create a change to the whitelist and let you review it.

In our next release (about 1-2 months later), we will add several new fields to this table and I want to whitelist them as well. Should I create a new ticket for that? Or can I whitelist them myself? Thanks!

You can totally whitelist them, just create a Gerrit change to the white-list* and add us Analytics to review it. Maybe these docs can help. If you need a Phab task to associate the change with, you can use this one, or create another one. If you want me to do the white-list change, just let me know.

(*) We are changing the location of the EventLogging white-list. Until now it lived in:
https://github.com/wikimedia/puppet/blob/production/modules/profile/files/analytics/data/sanitization/eventlogging_purging_whitelist.yaml
And now we're moving it to:
https://github.com/wikimedia/analytics-refinery/blob/master/static_data/eventlogging/whitelist.yaml
Hopefully, in 1-2 months you'll have to apply the change to the latter only.

mforns set the point value for this task to 8.Fri, Jul 6, 4:06 PM

Change 444894 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add MobileWikiAppiOSUserHistory to EL whitelist

https://gerrit.wikimedia.org/r/444894

Hi all,

I added MobileWikiAppiOSUserHistory to EventLogging sanitization white-list.
Please review that everything we talked about is reflected correctly in the Gerrit change above.

I've seen that since we started this conversation the schema has changed, and a 2-level nested field has been added.
Theoretically, nothing should fail throughout the refining and sanitization pipelines.
However, it's the first time we have such a schema being refined/sanitized and we might see issues.
I will follow up with this.
Also, I'll leave a comment related to this new nested field in T192819.

Cheers!

Nuria added a comment.Tue, Jul 10, 4:02 PM

I've seen that since we started this conversation the schema has changed, and a 2-level nested field has been added.

It is worth pointing out that when we started talking about these events we wanted them to be easily ingestable into druid and with the current schema that is not possible. Now, this might not matter cause your plan might be creating another table with this data and making that the one ingestable into druid (that would work too).

Just pointing this out to make sure we are all aware, see guidelines for schemas that can be easily ingested: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines

Thank you very much @mforns! The patch looks good to me!

It is worth pointing out that when we started talking about these events we wanted them to be easily ingestable into druid and with the current schema that is not possible. Now, this might not matter cause your plan might be creating another table with this data and making that the one ingestable into druid (that would work too).

Thanks @Nuria ! Yes I understand that nested field cannot be easily ingestable into druid. I will create another table for druid in the future.

fdans added a subscriber: fdans.Thu, Jul 12, 4:59 PM

@chelsyx shall we close this task if you think it's done?

Change 444894 merged by Mforns:
[analytics/refinery@master] Add MobileWikiAppiOSUserHistory to EL whitelist

https://gerrit.wikimedia.org/r/444894

I just merged the change, because it got a +1 from Nuria and a +1 from Chelsy.
We'll deploy that with the next refinery deployment.
I think we can close this task as resolved.

Nuria moved this task from In Progress to Done on the Analytics-Kanban board.Thu, Jul 12, 5:42 PM

@fdans Yes we can close this task as resolved.

Thanks all!

fdans closed this task as Resolved.Fri, Jul 13, 12:55 AM