Page MenuHomePhabricator

Preserve userAgent field in apps schemas
Closed, ResolvedPublic8 Estimated Story Points

Description

(Filing this as ticket rather than editing each schema talk page, per Nuria's recommendation on Analytics-l)

Now that the auto-purging of EventLogging schemas is being implemented (and we have converted the userAgent field in T153207 to a parsed version with less information, on the grounds that it is more privacy-friendly), we need to make sure that the information on OS (Android vs. iOS) and app version number is not lost. Both are used frequently when analyzing this data, also for historical timespans.

Below is a list of affected tables (alternatively, use this search for schemas on Meta) :

mysql:research@analytics-store.eqiad.wmnet [(none)]> SHOW TABLES FROM log LIKE 'MobileWikiApp%';
+----------------------------------------------------+
| Tables_in_log (MobileWikiApp%)                     |
+----------------------------------------------------+
| MobileWikiAppAppearanceSettings_10375462           |
| MobileWikiAppAppearanceSettings_10375462_15423246  |
| MobileWikiAppAppearanceSettings_9378399            |
| MobileWikiAppArticleSuggestions_10590869           |
| MobileWikiAppArticleSuggestions_10590869_15423246  |
| MobileWikiAppArticleSuggestions_11448426           |
| MobileWikiAppArticleSuggestions_11448426_15423246  |
| MobileWikiAppArticleSuggestions_12443791           |
| MobileWikiAppArticleSuggestions_12443791_15423246  |
| MobileWikiAppArticleSuggestions_15302212           |
| MobileWikiAppArticleSuggestions_15302212_15423246  |
| MobileWikiAppCreateAccount_8240702                 |
| MobileWikiAppCreateAccount_8240702_15423246        |
| MobileWikiAppCreateAccount_9135391                 |
| MobileWikiAppCreateAccount_9135391_15423246        |
| MobileWikiAppDailyStats_12637385                   |
| MobileWikiAppDailyStats_12637385_15423246          |
| MobileWikiAppEdit_16256587                         |
| MobileWikiAppEdit_16256587_15423246                |
| MobileWikiAppEdit_8198182                          |
| MobileWikiAppEdit_8993428                          |
| MobileWikiAppEdit_8994704                          |
| MobileWikiAppEdit_9003125                          |
| MobileWikiAppEdit_9003125_15423246                 |
| MobileWikiAppFeed_15734713                         |
| MobileWikiAppFeed_15734713_15423246                |
| MobileWikiAppFeed_16432467                         |
| MobileWikiAppFindInPage_14586774                   |
| MobileWikiAppFindInPage_14586774_15423246          |
| MobileWikiAppInstallReferrer_12601905              |
| MobileWikiAppInstallReferrer_12601905_15423246     |
| MobileWikiAppIntents_15237384                      |
| MobileWikiAppIntents_15237384_15423246             |
| MobileWikiAppLangSelect_12588733                   |
| MobileWikiAppLangSelect_12588733_15423246          |
| MobileWikiAppLinkPreview_12014128                  |
| MobileWikiAppLinkPreview_12143205                  |
| MobileWikiAppLinkPreview_12143205_15423246         |
| MobileWikiAppLinkPreview_14095177                  |
| MobileWikiAppLinkPreview_14095177_15423246         |
| MobileWikiAppLinkPreview_15730939                  |
| MobileWikiAppLinkPreview_15730939_15423246         |
| MobileWikiAppLogin_8234533                         |
| MobileWikiAppLogin_8234533_15423246                |
| MobileWikiAppLogin_9135390                         |
| MobileWikiAppLogin_9135390_15423246                |
| MobileWikiAppMediaGallery_10914526                 |
| MobileWikiAppMediaGallery_10923135                 |
| MobileWikiAppMediaGallery_10923135_15423246        |
| MobileWikiAppMediaGallery_12588701                 |
| MobileWikiAppMediaGallery_12588701_15423246        |
| MobileWikiAppNavMenu_12732211                      |
| MobileWikiAppNavMenu_12732211_15423246             |
| MobileWikiAppOnboarding_9122680                    |
| MobileWikiAppOnboarding_9123466                    |
| MobileWikiAppOnboarding_9123466_15423246           |
| MobileWikiAppPageScroll_14591606                   |
| MobileWikiAppPageScroll_14591606_15423246          |
| MobileWikiAppProtectedEditAttempt_8682497          |
| MobileWikiAppProtectedEditAttempt_8682497_15423246 |
| MobileWikiAppReadingLists_15520526                 |
| MobileWikiAppReadingLists_15520526_15423246        |
| MobileWikiAppSavedPages_10375480                   |
| MobileWikiAppSavedPages_10375480_15423246          |
| MobileWikiAppSavedPages_8909354                    |
| MobileWikiAppSearch_10593635                       |
| MobileWikiAppSearch_10633564                       |
| MobileWikiAppSearch_10641988                       |
| MobileWikiAppSearch_10641988_15423246              |
| MobileWikiAppSearch_15729321                       |
| MobileWikiAppSearch_15729321_15423246              |
| MobileWikiAppSessions_14031591                     |
| MobileWikiAppSessions_14031591_15423246            |
| MobileWikiAppSessions_15522505                     |
| MobileWikiAppSessions_15522505_15423246            |
| MobileWikiAppSessions_9742902                      |
| MobileWikiAppShareAFact_10916168                   |
| MobileWikiAppShareAFact_11331974                   |
| MobileWikiAppShareAFact_11331974_15423246          |
| MobileWikiAppShareAFact_12588711                   |
| MobileWikiAppShareAFact_12588711_15423246          |
| MobileWikiAppStuffHappens_8955468                  |
| MobileWikiAppStuffHappens_8955468_15423246         |
| MobileWikiAppTabs_12453651                         |
| MobileWikiAppTabs_12453651_15423246                |
| MobileWikiAppToCInteraction_10375484               |
| MobileWikiAppToCInteraction_10375484_15423246      |
| MobileWikiAppToCInteraction_11014396               |
| MobileWikiAppToCInteraction_11014396_15423246      |
| MobileWikiAppToCInteraction_14585319               |
| MobileWikiAppToCInteraction_14585319_15423246      |
| MobileWikiAppToCInteraction_8461467                |
| MobileWikiAppWidgets_11312870                      |
| MobileWikiAppWidgets_11312870_15423246             |
| MobileWikiAppWiktionaryPopup_15158116              |
| MobileWikiAppWiktionaryPopup_15158116_15423246     |
+----------------------------------------------------+
96 rows in set (0.12 sec)

Event Timeline

The capsule includes:

{
device_family: ,

 browser_family: ,
browser_major: ,
browser_major: ,
os_family: ,
os_major,
os_minor: ,
wmf_app_version: -

}

What else is missing? Android/IOS numbers are quite different when it comes to OS. Are you asking for a field that identifies Android/IOS specifically?

@Nuria: As the task name and description say, it's about preserving the userAgent field from getting auto-purged after 90 days in this case - not about adding new elements to it.

Hi @Tbayer

We should make sure though that all these schema_revision pairs do not have any fields that can constitute - together with the user agent map - a sensitive structure. For example: user agent + URL, which could associate a potentially identifying user agent with a browsing history. We would not be able to store this kind of structures for longer than 90 days.

Some time ago, an audit was made to identify those sensitive structures and decide and agree with the schema owners on a purging strategy that would break those sensitive structures, for each schema. It took a while, a couple months... Just mentioning it here, because if we want to repeat that for these ~27 schemas it will also be a long task.

An alternative that I see here is:

  1. Write a SQL script that adds 2 columns to all those tables (if they do not have it already): platform and version [ONE-OFF]
  2. Write another script that populates those fields: platform with either "Android" or "iOS", and version with the app version number, by deducing them from the user agent map [ONE-OFF]
  3. Apply the purging script normally and delete user agent map after 90 days
  4. For new schemas, the fields platform and version should be explicit and part of the schema definition in meta. Also, they should be populated in the EL client, so that no further database patching scripts are necessary.

Storing platform and version would not be identifying and would comply with the current status of the audit. Also, we have done something very similar to keep the editCount field from being purged in many schemas, by bucketizing it, see: T108856. IMO this would mean less work and less discussions that would take some calendar time.

@mforns: Thanks for these explanations! Yes, I remember that audit and the surrounding discussions, and I'm also aware of the important privacy concerns surrounding user agents in general. That said, one thing that has changed since then is that we are now (going forward) dealing with a sanitized user agent field that has already been cleared of a lot of sensitive information, by replacing the raw string with the parsed version.

What's more, the app user agents contain a lot less entropy than general user agents (basically, because they are restricted to specific clients based on code written by WMF instead of any piece of software written since the 1990s that can retrieve HTML over the internet). On the other hand, the information is also much more directly relevant to our product development process (obviously, because it tracks our own changes). Thus I think the assumptions and privacy tradeoffs that informed the purging strategy for UAs in general don't apply here 1:1. (Let me know in case I overlooked something and the app situation was actually already considered separately back then.)

I agree that assessing all these 27 schemas individually would be quite a bit of work, that's why the task proposes to handle this in an unified way.

To be concrete, here is an example of what we are talking about:

{"os_minor": "3", "os_major": "10", "device_family": "Other", "os_family": "iOS", "browser_minor": null, "wmf_app_version": "5.4.0.1095", "browser_major": null, "browser_family": "Other"}

(Arbitrarily picked via SELECT userAgent FROM log.MobileWikiAppDailyStats_12637385 WHERE timestamp LIKE '20170501%' LIMIT 1;.)

...

An alternative that I see here is:

  1. Write a SQL script that adds 2 columns to all those tables (if they do not have it already): platform and version [ONE-OFF]
  2. Write another script that populates those fields: platform with either "Android" or "iOS", and version with the app version number, by deducing them from the user agent map [ONE-OFF]

In the parsed user agent field, these are already present as os_family and wmf_app_version. So these first two steps would be unnecessary for the new tables, and for the old tables we could just replace the raw with the parsed UA as a one-off.

  1. Apply the purging script normally and delete user agent map after 90 days

If we really find that the other parts of the parsed UA (OS versions) are too sensitive to keep, it might be the simplest solution to just delete them (or replace them by empty strings) in the parsed UA. That would be the same process for the new tables (ongoing) and the old (one-off).

  1. For new schemas, the fields platform and version should be explicit and part of the schema definition in meta. Also, they should be populated in the EL client, so that no further database patching scripts are necessary.

See above - they are indeed already part of the parsed UA JSON. The only argument I see for adding them as separate fields is the current lack of JSON support that we are discussing in a separate task right now.

Storing platform and version would not be identifying and would comply with the current status of the audit. Also, we have done something very similar to keep the editCount field from being purged in many schemas, by bucketizing it, see: T108856. IMO this would mean less work and less discussions that would take some calendar time.

Milimetric triaged this task as Medium priority.May 8 2017, 2:27 PM

Sorry for the delay @Tbayer

That said, one thing that has changed since then is that we are now (going forward) dealing with a sanitized user agent field that has already been cleared of a lot of sensitive information, by replacing the raw string with the parsed version.
...
What's more, the app user agents contain a lot less entropy than general user agents (basically, because they are restricted to specific clients based on code written by WMF instead of any piece of software written since the 1990s that can retrieve HTML over the internet).

Yes, I totally agree. The recent changes and the nature of app data significantly reduce the probability of privacy violation. But still, I think the less data we keep, the smaller the possibility of user re-identification is. And as a principle of anonymization we try to decrease probability of re-identification, no?

In the parsed user agent field, these are already present as os_family and wmf_app_version. So these first two steps would be unnecessary for the new tables, and for the old tables we could just replace the raw with the parsed UA as a one-off.

Awesome, thanks for spotting that. Will ping @elukey, who is implementing the purging script as part of T156933.

If we really find that the other parts of the parsed UA (OS versions) are too sensitive to keep, it might be the simplest solution to just delete them (or replace them by empty strings) in the parsed UA. That would be the same process for the new tables (ongoing) and the old (one-off).

Yes, as I mentioned before, when it comes to UA I think the less data we keep the better. If we have a use case for keeping OS versions and other UA data, we can study it, but if not I think the default should be purging. Now, your idea of just deleting all UA fields except os_family and wmf_app_version sounds good to me!

The only argument I see for adding them as separate fields is the current lack of JSON support that we are discussing in a separate task right now.

Thinking a bit... I guess we don't need them as separate fields if we could improve the syntax of the white-list and permit things like:

schema             field
SomeAppSchema      user_agent['os_family']
SomeAppSchema      user_agent['wmf_app_version']

So we can keep parts of a map-like field. @elukey do you think this is possible?
This way, the whole thing would be generic and maintainable and we'd need very few changes, no?

NOTE: In any case, every time a new app-related schema is created, the white-list will need to be updated to include that schema.

Trying to summarize what needs to be taken into account for the script (be patient :)

We'd need to find a way to have a more granular deletion policy for fields like the User Agent since it is a JSON structure with several fields. The option of keep/not-keep is not enough so we'd need something like "NULL all the fields in the JSON structure but not os_family and wmf_app_version", "keep all", "NULL everything".

Marcel's proposal is to add a special syntax to the whitelist file to allow the above distinction. If this is more or less the intended meaning, I think that there shouldn't be any issue implementing it. I am a bit concerned that we are trying to make a super generic config file for a single special case like the UA, but it is definitely clearer and more maintainable in the longer term.

Hi, @Tbayer!

We've been working on the script to purge EL data T156933 during the analytics off-site, and made some progress.
Sadly, we encountered some complications in implementing the white-list application for json-formatted fields like the userAgent (map).
The problem is that sanitizing the userAgent map means updating its string value one record at a time as opposed to cleaning a bunch of records with a single update statement that sets the not white-listed fields to NULL (see task T156933).

So, the temporary solution we came out with is: We'll white-list the whole userAgent field for the list of app schemas that you mentioned. This way the whole userAgent information will be kept for those schemas, until we can fix this issue.

Thanks @mforns! Good to have this taken care of for now (also in the context of T164799). Feel free to close this task, and later reopen it or otherwise ping me and the apps folks once you want to revisit that temporary solution. BTW, my understanding from conversations with @JMinor and @Dbrant is that from a product perspective there are important use cases for OS (iOS/Android) version information too.

mforns edited projects, added Analytics-Kanban; removed Analytics.
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board.

I am currently adding the new userAgent fields to the EventLogging purging white-list.
Between the 27 schemas listed in the task description, I found 7 that did not exist yet at the time of the last audit on summer 2016:

MobileWikiAppFeed
MobileWikiAppFindInPage
MobileWikiAppIntents
MobileWikiAppPageScroll
MobileWikiAppReadingLists
MobileWikiAppTabs
MobileWikiAppWiktionaryPopup

They do not have any fields white-listed yet, but I assume that you want to keep their data indefinitely given the requirements of this task.
Here's a quick privacy analysis of those schemas:

  • MobileWikiAppFeed, MobileWikiAppIntents, MobileWikiAppReadingLists and MobileWikiAppTabs schemas are not sensitive.I will add all their fields to the white-list.
  • MobileWikiAppFindInPage: Sensitive; Potentially identifying field -> appInstallID, personal sensitive fields -> pageId and findText. I suggest to purge both pageId and findText, and white-list the rest.
  • MobileWikiAppPageScroll: Sensitive; Potentially identifying field -> appInstallID, personal sensitive field -> pageId. I suggest to purge pageId and white-list the rest.
  • MobileWikiAppWiktionaryPopup: Sensitive; Potentially identifying field -> appInstallID, personal sensitive field -> text. I suggest to purge appInstallID and white-list the rest.

Sorry, I sent this comment by accident before finishing it (adding links and transcribing note-like lines to sentences).
Well, I guess it's understandable in spite of that.

@Tbayer please let me know if the purging strategy suggestions make sense to you, and I will proceed and add the fields to the white-list. Thanks!

@mforns Great point; thanks for checking, and sorry about the delayed response - I had to take a second look at some of these myself.

Your suggestions all look great to me, except that for the MobileWikiAppPageScroll, the "funnel" aspect is actually the less important part. Can we keep the pageID for that one and instead drop the install ID?

Similarly, @JKatzWMF and I reconsidered MobileWikiAppShareAFact, and there too we would like to keep the content part. I.e. add the pageID, revID, article, and text fields to the existing whitelisting for this schema, but drop the install ID.

Sorry, I sent this comment by accident before finishing it (adding links and transcribing note-like lines to sentences).
Well, I guess it's understandable in spite of that.

Sure, non worries! But just in case: One can edit one's own Phabricator comments after the fact (drop-down menu on the right) ;)

PS: There are also four older app schemas (each last revised in 2013, but still active) that were accidentally left out in the task description because they use an outdated naming schema ("MobileApp..." instead of "MobileWikiApp..."):

In principle, these should treated in the same way as the other app schemas here regarding the userAgent field. But I don't think there was any whitelisting required for other fields in any of these four, meaning that rows can be removed completely after 90 days anyway (CCing @Dbrant to be sure).

Hey @Tbayer, thanks for looking into this!

Your suggestions all look great to me, except that for the MobileWikiAppPageScroll, the "funnel" aspect is actually the less important part. Can we keep the pageID for that one and instead drop the install ID?

Yes, I think so. Will change that.

PS: There are also four older app schemas (each last revised in 2013, but still active) that were accidentally left out in the task description because they use an outdated naming schema ("MobileApp..." instead of "MobileWikiApp..."):

MobileAppCategorizationAttempts
MobileAppLoginAttempts
MobileAppShareAttempts
MobileAppUploadAttempts

In principle, these should treated in the same way as the other app schemas here regarding the userAgent field. But I don't think there was any whitelisting required for other fields in any of these four, meaning that rows can be removed completely after 90 days anyway (CCing @Dbrant to be sure).

These schemas are not in the white-list right now. It means that all fields (including userAgent) will be kept for 90 days and then purged. As I understand it, there are no action items regarding this schemas. Let me know otherwise!

Similarly, @JKatzWMF and I reconsidered MobileWikiAppShareAFact, and there too we would like to keep the content part. I.e. add the pageID, revID, article, and text fields to the existing whitelisting for this schema, but drop the install ID.

I think this is possible, too. The only issue would be if the field shareSessionToken is shared across schemas, because then one could link one event to another event of another schema with the same shareSessionToken and get the corresponding appInstallID. So, is the shareSessionToken unique to this schema or is it shared?

Cheers!

Hey @Tbayer, thanks for looking into this!

...

PS: There are also four older app schemas (each last revised in 2013, but still active) that were accidentally left out in the task description because they use an outdated naming schema ("MobileApp..." instead of "MobileWikiApp..."):

MobileAppCategorizationAttempts
MobileAppLoginAttempts
MobileAppShareAttempts
MobileAppUploadAttempts

In principle, these should treated in the same way as the other app schemas here regarding the userAgent field. But I don't think there was any whitelisting required for other fields in any of these four, meaning that rows can be removed completely after 90 days anyway (CCing @Dbrant to be sure).

These schemas are not in the white-list right now. It means that all fields (including userAgent) will be kept for 90 days and then purged. As I understand it, there are no action items regarding this schemas. Let me know otherwise!

Yes, that's what I was trying to say above; this was just about recording this understanding here and pinging @Dbrant to confirm.

Similarly, @JKatzWMF and I reconsidered MobileWikiAppShareAFact, and there too we would like to keep the content part. I.e. add the pageID, revID, article, and text fields to the existing whitelisting for this schema, but drop the install ID.

I think this is possible, too. The only issue would be if the field shareSessionToken is shared across schemas, because then one could link one event to another event of another schema with the same shareSessionToken and get the corresponding appInstallID. So, is the shareSessionToken unique to this schema or is it shared?

I'm confident it is unique to this schema (for starters, it refers to a user interaction funnel that's only tracked by this schema); confirmed by this search.

^ This is correct. The fields in the schemas you mentioned do not need to be preserved.

Hi @Tbayer and @Dbrant

I modified the white-list to reflect the discussion we had in this task.
Please review the gerrit change here: https://gerrit.wikimedia.org/r/#/c/298721/8..9/files/mariadb/eventlogging_purging_whitelist.tsv
Here's a summary of the changes:

White-list the userAgent field for the following schemas (they are already in the white-list):

MobileWikiAppAppearanceSettings
MobileWikiAppArticleSuggestions
MobileWikiAppCreateAccount
MobileWikiAppDailyStats
MobileWikiAppEdit
MobileWikiAppInstallReferrer
MobileWikiAppLangSelect
MobileWikiAppLinkPreview
MobileWikiAppLogin
MobileWikiAppMediaGallery
MobileWikiAppNavMenu
MobileWikiAppOnboarding
MobileWikiAppProtectedEditAttempt
MobileWikiAppSavedPages
MobileWikiAppSearch
MobileWikiAppSessions
MobileWikiAppStuffHappens
MobileWikiAppToCInteraction
MobileWikiAppWidgets

White-list all fields (including userAgent) of the following schemas:

MobileWikiAppFeed
MobileWikiAppIntents
MobileWikiAppReadingLists
MobileWikiAppTabs

White-list all fields (including userAgent) except event_pageId and event_findText of the schema:

MobileWikiAppFindInPage

White-list all fields (including userAgent) except appInstallID for the schemas:

MobileWikiAppPageScroll
MobileWikiAppWiktionaryPopup

Add the following fields to the whitelist: userAgent, event_pageID, event_revID, event_article, event_text to white-list and drop field event_installID, for the schema:

MobileWikiAppShareAFact

Thanks a lot!

mforns set the point value for this task to 8.Jun 16 2017, 3:13 PM

Will move this task to done, because the editing of the white-list is finished and will be merged in a Gerrit patch belonging to another task: T156933.

Nuria closed this task as Resolved.