Page MenuHomePhabricator

Android add stream `app_patroller_experience` to event sanitization allowlist
Closed, ResolvedPublic

Description

Data from Android Patroller feature is limited in quantity, we would like to preserve event data from sanitization past the usual 90 days

Columns to preserve:

  • active_interface
  • action
  • action_data
  • primary_language
  • platform
  • wiki_id
  • platform
  • primary_language
  • app_install_id
  • app_session_id
  • dt
  • is_anon

app_patroller_experience => 'schema_title' => 'analytics/mobile_apps/app_interaction', 'destination_event_service' => 'eventgate-analytics-external'

https://wikitech.wikimedia.org/wiki/Analytics/Event_Sanitization

Event Timeline

Change rWFCG101314285ba6 had a related patch set uploaded (by Cooltey; author: Cooltey):

[analytics/refinery@master] Add `app_patroller_experience` to allowlist for Android app

https://gerrit.wikimedia.org/r/1013142

Hi @SNowick_WMF, I have submitted a patch and please help to review it, thanks!

Change #1013142 merged by Aleksandar Mastilovic:

[analytics/refinery@master] Add `app_patroller_experience` to allowlist for Android app

https://gerrit.wikimedia.org/r/1013142

There is no specific QA item that needs to be tested. @SNowick_WMF Please confirm this is merged and deployed and we can move this to Ready for release.

Just checked dataset and the MIN date we have events for is 2024-04-15 so it looks like this may still be removing data daily and not preserving past the 90 days. If the allowlist was in effect when patch was approved we would have data going back to 2024-03-16. @cooltey is helping investigate this.

Also noticed a gap between when this was submitted for review 2024-03-20 and when it was verified 2024-06-13. This means we lost data that I was hoping to preserve from time of request. Next time we will need to expedite approval for allowlist datasets in order to prevent data loss.

mforns subscribed.

Hey all! I was asked to review another unrelated sanitization patch, and I saw that the sanitization snippet for app_patroller_experience was not hashing the fields app_install_id and app_session_id.
Those are long lived identifiers and should ideally be hashed. Is there a reason not to? 🙏

Change #1087453 had a related patch set uploaded (by Dbrant; author: Dbrant):

[analytics/refinery@master] Correctly hash IDs in app_patroller_experience stream.

https://gerrit.wikimedia.org/r/1087453

Hey all! I was asked to review another unrelated sanitization patch, and I saw that the sanitization snippet for app_patroller_experience was not hashing the fields app_install_id and app_session_id.
Those are long lived identifiers and should ideally be hashed. Is there a reason not to? 🙏

@mforns Good catch -- there's no reason for those to be unhashed. Please see the new gerrit patch (assuming it's ok to update the existing configuration in-place).

Change #1087453 merged by Mforns:

[analytics/refinery@master] Correctly hash IDs in app_patroller_experience stream.

https://gerrit.wikimedia.org/r/1087453

@Dbrant Thanks a lot! I merged the patch, will be deployed in the next train.

One thing to notice is that the ids will be hashed from now on.
You will still be able to group by app_install_id and app_session_id,
and you will now be able to join on those fields to other event datasets that also use the same hashing mechanism.
However, remember that the hash uses a salt that is rotated every quarter.
This means you will not be able to link events from different quarters by those fields.

Thanks again!