[EL sanitization] Remove AppInstallIId from EventLogging purging white-list
Closed, Declined · Public · 1 Story Points

Description

A couple of EventLogging schemas have the AppInstallId field white-listed.
If those schemas have URLs, pageTitles/pageIds or other user activity indicators, they should be removed.

mforns created this task.Oct 13 2017, 3:25 PM
Restricted Application added a subscriber: Aklapper.Oct 13 2017, 3:25 PM
Nuria claimed this task.Oct 13 2017, 7:09 PM
Nuria set the point value for this task to 1.

Change 384093 had a related patch set uploaded (by Nuria; owner: Nuria):
[operations/puppet@production] Removing appInstallId from whitelist

https://gerrit.wikimedia.org/r/384093

Nuria added a comment.Oct 13 2017, 8:22 PM

Also, what about raw userAgents?

Tbayer added a subscriber: JMinor.Oct 13 2017, 9:34 PM

Folks, we spent quite a bit of time just a few months ago on a comprehensive review of the purging settings for all apps schemas, which included discussion of this field (cf. e.g. T164125). Which schemas are affected exactly? Please do not change the settings without an opportunity for the apps teams to review the tradeoffs involved (CC @JMinor).

Also, what about raw userAgents?

See T164125#3284763; it seems we lack a bit of institutional memory here. In general, as noted at T164125#3234590, app user agents contain a lot less entropy than general user agents.

The appInstallId was wrongly included on the whitelist. This was a mistake on our end when putting the whitelist together, so this is just a correction. The raw user agents are not being changed.

Also, what about raw userAgents?

See T164125#3284763; it seems we lack a bit of institutional memory here. In general, as noted at T164125#3234590, app user agents contain a lot less entropy than general user agents.

Yes, that's right. We agreed that we would keep the whole parsed user agent map for the mobile schemas until we were technically able to partially purge some fields inside the user agent map, so that only the necessary fields inside it would be kept.

The appInstallId was wrongly included on the whitelist. This was a mistake on our end when putting the whitelist together, so this is just a correction. The raw user agents are not being changed.

I agree with Nuria, this was my mistake at the earlier stages of the EL audit. So sorry for that. The AppInstallId field is a permanent cross-schema identifier. As such, it has a high potential for identification, and should always be purged after 90 days.

mforns moved this task from Next Up to In Progress on the Analytics-Kanban board.Oct 16 2017, 12:54 PM
Nuria added a comment.Oct 17 2017, 3:43 PM

Please do not change the settings without an opportunity for the apps teams to review the tradeoffs involved

Please do analyze the tradeoffs, but again, this is a mistake: this is a strong PII field that should not be there, so I doubt that on your end you were relying on it being retained beyond the 90 day period.

Nuria moved this task from In Progress to Paused on the Analytics-Kanban board.Oct 27 2017, 3:10 PM

Change 384093 merged by Ottomata:
[operations/puppet@production] Removing appInstallId from whitelist

https://gerrit.wikimedia.org/r/384093

Nuria moved this task from Paused to Done on the Analytics-Kanban board.Nov 7 2017, 7:19 PM

We don't consent to this change. Before you merge a change which will irrevocably delete data that could be vital to the long term understanding of our users, and from our perspective is being misinterpreted by your team, please give me a chance to talk to legal (we are scheduled to meet with them next week) and come back to the discussion with you.

I find these interaction patterns and choices to be counterproductive. Please do not unilaterally delete data which we have already collected (even if, as you claim, that was an oversight). I do not see the urgency here.

Nuria added a comment.Nov 8 2017, 4:37 PM

@JMinor I find yours quite an unfair representation of events.
Please be so kind as to invite us to the meeting you have with legal in this regard.

Please do not unilaterally delete data which we have already collected

We do not delete data unilaterally, ever. We delete fields that contain PII 90 days after they are collected, as required by our privacy policy. We would really like to clarify with your team that the privacy policy is driven by legal, abided by WMF and enforced on the data collection end by our team.

At no time did we agree to preserve appInstallId because, given our privacy policy, we cannot do that. That identifier points to a phone, and that is the very definition of PII. This change merely corrects an oversight on our end.

I do not see the urgency here.

Well, some of this data is out of compliance with our 90 day window, and we want all EventLogging data to be compliant by the end of this quarter.

Change has been reverted until we have our meeting with legal.

JMinor added a comment.Nov 8 2017, 6:50 PM

Thank you for holding this.

We would really like to clarify with your team that the privacy policy is driven by legal, abided by WMF and enforced in the data collection end by our team.

I think the issue is your interpretation of this policy and the claims that this is correcting an "oversight". Policy is rarely black and white, and we believe your team is interpreting, and has interpreted, this incorrectly. I understand you don't agree. I may be wildly wrong. But that requires conversation and open communication to resolve. We understand legal is the ultimate arbiter here, and that's why I would like to speak to them about this before you implement this change. If that conversation goes as you say it will, there is no need for you to be involved as I will simply comment that I was wrong on this ticket and you can proceed. If there is more nuance to it, then we can all meet and discuss how to preserve privacy to the levels and threat model you operate under while preserving the significant value we derive from understanding long term patterns of usage.

mforns added a comment.Nov 9 2017, 2:21 PM

@JMinor
Please, invite me to the meeting as well. Regardless of the outcome, I'd like to know Legal's point of view on this subject, so that we can apply it to similar scenarios in the future.

Nuria moved this task from Done to Paused on the Analytics-Kanban board.Nov 9 2017, 4:01 PM
mforns added a comment.Nov 9 2017, 4:20 PM

@JMinor
Actually, no need to invite me. We can follow up on Legal's position after the meeting. Thanks!

(To record some more information here while other conversations are ongoing:)

Folks, we spent quite a bit of time just a few months ago on a comprehensive review of the purging settings for all apps schemas, which included discussion of this field (cf. e.g. T164125). Which schemas are affected exactly?

Since there was no response here: I guess one can generate that list by searching for "appInstallID" in the current whitelist (thanks to @mforns for fixing the link on the documentation page).
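A minimal sketch of that search, for illustration only. It assumes the whitelist is a tab-separated file with one schema/field pair per line; the thread does not describe the file format, so that layout, the function name and the second schema name in the usage below are all hypothetical. The match is case-insensitive because the field is spelled several ways in this discussion (appInstallID, appInstallId, appinstallid).

```python
from typing import Iterable, List


def schemas_whitelisting(field: str, whitelist_lines: Iterable[str]) -> List[str]:
    """Return the schemas whose whitelist entry names `field`.

    Assumed (hypothetical) line format: "<schema>\t<field>".
    Lines that do not have exactly two tab-separated parts are skipped.
    """
    wanted = field.lower()
    schemas: List[str] = []
    for line in whitelist_lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # blank line, comment, or a format this sketch does not know
        schema, name = parts
        if name.lower() == wanted and schema not in schemas:
            schemas.append(schema)
    return schemas


example_lines = [
    "MobileWikiAppAppearanceSettings\tappInstallID",
    "MobileWikiAppAppearanceSettings\tid",
    "SomeOtherSchema\tappinstallid",  # hypothetical schema name
    "# a comment line",
    "",
]
affected = schemas_whitelisting("appInstallId", example_lines)
```

With a real file, the lines could come from `open(path)`; the path is left out here because the thread does not give it.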

It looks like for some of them (e.g. MobileWikiAppAppearanceSettings) the field was whitelisted by the Analytics Engineering team back in 2015 already, IIRC based on a consultation with Reading. For some other app schemas, this decision was made jointly during another such consultation in June 2017: T164125#3335485. That 2017 discussion outcome also involved purging (non-whitelisting) this field for some other schemas, or conversely purging fields like pageID that would have presented a privacy concern in connection with whitelisting the install ID.

mforns added a comment.Dec 4 2017, 1:25 PM

Hi @Tbayer and all,

Sorry for not responding to T178174#3684426, didn't catch it.

Which schemas are affected exactly?

Yes, searching for "appInstallID" in the white-list was a good catch.

It looks like for some of them (e.g. MobileWikiAppAppearanceSettings) the field was whitelisted by the Analytics Engineering team back in 2015 already, IIRC based on a consultation with Reading. For some other app schemas, this decision was made jointly during another such consultation in June 2017: T164125#3335485. That 2017 discussion outcome also involved purging (non-whitelisting) this field for some other schemas, or conversely purging fields like pageID that would have presented a privacy concern in connection with whitelisting the install ID.

Yes, at the time of the discussion in that task (June 2017) I defended the possibility of keeping the appInstallID provided the schema didn't have any fields containing user personal preferences or user personal characteristics. I now think this wasn't accurate, and I take responsibility for that.

A couple of weeks after that, though, in July 2017, we in Analytics updated the documentation regarding EventLogging purging and data retention to improve the purging rationale and other subjects. The new version states that both persistent tokens and cross-schema tokens are not suited to be kept for more than 90 days. appInstallId is both a persistent token and a cross-schema token, and in addition to that, it is a token referenced by data sources external to the WMF that point directly to a user's phone (PII). So, now I'm of the opinion that we should not keep the appInstallId field in any case.

However, I understand that you were using that token to group events generated by the same user. And that deleting it would prevent you from doing that any more (at least for some schemas that do not have a session token). That's why I suggested IIRC that you alter that field historically and transform it into a token that is non-PII, non-persistent and non-cross-schema. This way we could keep it for more than 90 days. Note, however, that the "sanitized" appInstallId would not be able to join events belonging to the same user across different schemas.

I hope this helps a bit!

Nuria added a subscriber: APalmer_WMF.EditedJan 3 2018, 10:07 PM

I know @APalmer_WMF is busy, but this has been on the back burner for a bit, so asking again:

Does legal have any guidelines here as to the appInstallId retention?

Nuria added a subscriber: keynote2k.Feb 2 2018, 5:58 PM

Ping @keynote2k, who will be handling these types of questions going forward

Hi all!
2 weeks ago I wrote in the thread with legal and proposed a modification to the appInstallId field.
Can the Reading team give their opinion, please? Would this solution be acceptable to you? Thanks!

  • Hash the appInstallId on the client side before sending it to EventLogging. This way, the hashed appInstallId does not directly point to a user's device, but is still usable to group events of the same user, and is still cross-schema. This is something that we should do in any case, I think.
  • Salt it before hashing, and rotate the salt every 3 months. Only hashing the appInstallId (without salt) would still allow someone who knows a user's appInstallId to retrieve that user's history. But if we salt it before hashing and change the salt every 3 months (and delete the old salt), that would not be possible any more. At least not for data older than 3 months.

    The salted and hashed appInstallId would be a *lot* better in terms of data privacy. The only disadvantage from the data analysis perspective is that one could only relate events of the same user if those events belonged to the same quarter. It would act like a quarterly session token (still cross-schema), but *not* pointing to a personal device.
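The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation proposed to legal: the class and function names are hypothetical, SHA-256 is an arbitrary choice of hash, and the thread does not say where the salt would be stored or who would rotate it, so the sketch simply keeps it in one in-memory object.

```python
import hashlib
import secrets
from datetime import date
from typing import Dict


def quarter_key(day: date) -> str:
    """Label the calendar quarter a date falls in, e.g. '2018-Q1'."""
    return f"{day.year}-Q{(day.month - 1) // 3 + 1}"


class QuarterlySaltStore:
    """Hypothetical salt store: one random salt for the current quarter only.

    Assumes dates are requested in chronological order.
    """

    def __init__(self) -> None:
        self._salts: Dict[str, bytes] = {}

    def salt_for(self, day: date) -> bytes:
        key = quarter_key(day)
        if key not in self._salts:
            # Rotation: replacing the dict deletes the previous quarter's salt,
            # so tokens from older quarters can no longer be recomputed.
            self._salts = {key: secrets.token_bytes(32)}
        return self._salts[key]


def sanitized_install_id(app_install_id: str, salt: bytes) -> str:
    """Salt the raw ID, then hash it; returns a 64-character hex digest."""
    return hashlib.sha256(salt + app_install_id.encode("utf-8")).hexdigest()


store = QuarterlySaltStore()
january = sanitized_install_id("example-id", store.salt_for(date(2018, 1, 15)))
march = sanitized_install_id("example-id", store.salt_for(date(2018, 3, 31)))
april = sanitized_install_id("example-id", store.salt_for(date(2018, 4, 1)))
```

Here `january` and `march` are the same token, so events within one quarter can still be grouped across schemas, while `april` differs because the salt was rotated and the old one discarded.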

(Since March, this conversation has been continuing elsewhere, mainly with @JMinor on the Readers team's side. Since it appeared not all participants were seeing benefits of being able to whitelist this field in appropriate cases, I'm posting here a sketch of use cases that Josh drafted earlier with my support:)

Understanding and improving the user value of the Wikipedia apps is a unique context, different than building for the web, and is immensely aided by being able to study installed user behavior outside the 90 day window. Examples of helpful analysis we do or plan to do include:

Detecting behavioral change over long periods
The current app strategy focuses on increasing medium and long term re-use value of the apps. To know if our tactics are advancing this goal it is helpful to see the use pattern and effects on installed user retention of features like the Explore feed, and iterations of the content in the feed. So, for example, did adding “On This Day” increase frequency of use for users who’ve had the app installed for more than 3 months?

Any question related to user counts and user actions, of the form “how many users did…”
We’ve added a significant number of features to the apps over the last 2 years but in order to determine which features to iterate or remove it is helpful to know the % of active users of that feature. For example, what % of users in the global north use the feed in a session? Does that meaningfully differ from the usage pattern in New Readers markets? Questions like this allow us to understand who is using what feature and where to invest in the future.

Looking at differences between older install populations and new installers
One of the app’s annual program goals is to run targeted awareness campaigns, to bring in new users for the apps. As we add new features meant to appeal to these new users, it’s important to understand the impact of that change on older, long-time users. For example, does making the Android app more “multi-lingual” decrease use among our existing pool of monolingual American users?

Year-over-year comparisons (identifying longer-term changes in usage patterns without seasonality effects).

@Tbayer
Thanks for posting this here. It helped me understand how you use that field.
I can totally see the benefits of keeping appInstallId, and also the drawbacks privacy-wise that we have been discussing.
I was trying to think of a way we can perform the analyses you described with alternative fields or methods.
Listed my thoughts below. Please, let me know if they make sense.

Detecting behavioral change over long periods
The current app strategy focuses on increasing medium and long term re-use value of the apps. To know if our tactics are advancing this goal it is helpful to see the use pattern and effects on installed user retention of features like the Explore feed, and iterations of the content in the feed. So, for example, did adding “On This Day” increase frequency of use for users who’ve had the app installed for more than 3 months?

I think if we store the install date (YYYY-MM-DD) in all events, we could calculate things like this using that field instead of the appInstallId, no?

Any question related to user counts and user actions, of the form “how many users did…”
We’ve added a significant number of features to the apps over the last 2 years but in order to determine which features to iterate or remove it is helpful to know the % of active users of that feature. For example, what % of users in the global north use the feed in a session? Does that meaningfully differ from the usage pattern in New Readers markets? Questions like this allow us to understand who is using what feature and where to invest in the future.

I cannot see a way of calculating this without a unique ID indeed. But we could use the appInstallId to create tsv reports (easily with reportupdater, for instance) that register the number of unique users for each feature that we want to track. After reportupdater calculates the value for the current day/week/month, the appInstallId could be removed (after 90 days), and we'd still have a historical count of unique users for each feature stored in the reports. Not as flexible as keeping the appInstallId, but an alternative.
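To make the idea concrete, here is a small sketch of that kind of aggregation. It does not use reportupdater itself; it only shows the shape of the result. The event tuple layout, the function names and the column names are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical event shape: (day, feature, app_install_id).
Event = Tuple[str, str, str]


def unique_users_per_feature(events: Iterable[Event]) -> Dict[Tuple[str, str], int]:
    """Count distinct install IDs per (day, feature); the IDs are not returned."""
    seen: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for day, feature, install_id in events:
        seen[(day, feature)].add(install_id)
    return {key: len(ids) for key, ids in seen.items()}


def to_tsv(counts: Dict[Tuple[str, str], int]) -> str:
    """Render the counts as a TSV report with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["date", "feature", "unique_users"])
    for (day, feature), count in sorted(counts.items()):
        writer.writerow([day, feature, count])
    return out.getvalue()


events = [
    ("2018-05-01", "feed", "install-a"),
    ("2018-05-01", "feed", "install-a"),  # same user again, counted once
    ("2018-05-01", "feed", "install-b"),
    ("2018-05-02", "search", "install-a"),
]
report = to_tsv(unique_users_per_feature(events))
```

One consequence of this approach: daily unique counts cannot be summed into weekly or monthly uniques, because the same user may appear on several days. Each granularity that matters would have to be computed while the appInstallId is still available.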

Looking at differences between older install populations and new installers
One of the app’s annual program goals is to run targeted awareness campaigns, to bring in new users for the apps. As we add new features meant to appeal to these new users, it’s important to understand the impact of that change on older, long-time users. For example, does making the Android app more “multi-lingual” decrease use among our existing pool of monolingual American users?

I think the install date field also would solve this case, no?

Year-over-year comparisons (identifying longer-term changes in usage patterns without seasonality effects).

Same here, I guess.

I think if we store the install date (YYYY-MM-DD) in all events, we could calculate things like this using that field instead of the appInstallId, no?

Even easier if we have the app send a bucket ([0-3], [3-6], [6-1yr]) and calculate reports aggregating on these values. Right?
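A rough sketch of such bucketing, as it might run on the client before the event is sent. The bucket edges are read as months and treated as half-open ranges, and the "12+" bucket is an addition for the example; the comment above does not say how boundaries or installs older than a year should be handled. Function names are hypothetical.

```python
from datetime import date


def months_since_install(install_day: date, event_day: date) -> int:
    """Whole months elapsed between install and event (never negative)."""
    months = (event_day.year - install_day.year) * 12 + (
        event_day.month - install_day.month
    )
    if event_day.day < install_day.day:
        months -= 1  # the latest month is not complete yet
    return max(months, 0)


def install_age_bucket(install_day: date, event_day: date) -> str:
    """Map install age to a coarse bucket so no install date needs to be sent."""
    age = months_since_install(install_day, event_day)
    if age < 3:
        return "0-3"
    if age < 6:
        return "3-6"
    if age < 12:
        return "6-12"
    return "12+"  # assumption: not specified in the thread


bucket = install_age_bucket(date(2018, 1, 15), date(2018, 4, 15))
```

Sending only the bucket carries less information than the full install date, which is the point of the suggestion: reports can aggregate on it without any per-device value.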

Nuria reassigned this task from Nuria to mforns.Jul 11 2018, 10:15 PM
mforns renamed this task from Remove AppInstallIId from EventLogging purging white-list to [EL sanitization] Remove AppInstallIId from EventLogging purging white-list.Jul 18 2018, 11:18 AM
Milimetric moved this task from Paused to Done on the Analytics-Kanban board.Aug 20 2018, 4:18 PM
Milimetric closed this task as Declined.
Milimetric added a subscriber: Milimetric.

Declining this in favor of work going on as part of T199898, like T199902