
Add wikifunctions_ui data stream to the sanitization allowlist
Closed, Resolved (Public)

Description

The wikifunctions_ui data stream was instrumented in T336722 using the metrics platform. By default, its data is deleted after 90 days.

We'd like to add the non-PII fields from the wikifunctions_ui data stream to the allowlist so that this data is retained longer.

This schema tracks user interactions with Wikifunctions. Below are the fields I believe need to be dropped or hashed in accordance with the data retention guidelines; please let me know if you'd suggest any changes.

Fields that should not be kept:

  • custom_data (map): zlang
  • performer.name
  • performer.edit_count
  • performer.language
  • performer.language_variant
  • performer.groups
  • user_agent_map

Fields that should be hashed:

  • performer.session_id
  • performer.pageview_id
  • performer.id
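
To illustrate the intent of the two lists above, here is a minimal Python sketch of how a keep/hash allowlist could be applied to a single event. It is not the refinery implementation: the `sanitize` helper, the salt, the choice of HMAC-SHA256, and the set of kept fields (`name`, `dt`, `performer.is_logged_in`) are all assumptions made for illustration. Values are converted to strings before hashing, which is a simplification of this sketch only.

```python
import hashlib
import hmac

# Hypothetical allowlist: anything not listed is dropped.
# The "keep" entries are illustrative; the "hash" entries follow the proposal above.
ALLOWLIST = {
    "name": "keep",
    "dt": "keep",
    "performer": {
        "is_logged_in": "keep",
        "session_id": "hash",
        "pageview_id": "hash",
        "id": "hash",
    },
}


def sanitize(event, allowlist, salt):
    """Return a copy of `event` containing only allowlisted fields.

    Fields marked "keep" are copied as-is, fields marked "hash" are replaced
    by a salted HMAC-SHA256 hex digest, and everything else is omitted.
    """
    out = {}
    for field, rule in allowlist.items():
        if field not in event:
            continue
        value = event[field]
        if isinstance(rule, dict):
            # Nested struct: recurse with the nested allowlist.
            if isinstance(value, dict):
                out[field] = sanitize(value, rule, salt)
        elif rule == "keep":
            out[field] = value
        elif rule == "hash":
            digest = hmac.new(salt, str(value).encode("utf-8"), hashlib.sha256)
            out[field] = digest.hexdigest()
    return out


# Hypothetical event, for illustration only.
example_event = {
    "name": "wf.ui.example",
    "dt": "2023-10-01T00:00:00Z",
    "user_agent_map": {"browser_family": "Firefox"},
    "performer": {
        "name": "ExampleUser",
        "edit_count": 42,
        "is_logged_in": True,
        "session_id": "abc123",
        "pageview_id": "def456",
        "id": 101,
    },
}

sanitized = sanitize(example_event, ALLOWLIST, salt=b"example-salt")
```

In this sketch `user_agent_map`, `performer.name`, and `performer.edit_count` are absent from `sanitized` because they are not in the allowlist, while the three identifier fields are replaced by digests that cannot be read back as the original values.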

Schema Documentation

Event Timeline

MNeisler triaged this task as Medium priority.
MNeisler moved this task from Triage to Current Quarter on the Product-Analytics board.

This appears to be the first metrics platform schema that will be added to the allowlist. Further discussion is needed to confirm the correct process for adding schemas created with the metrics platform to the allowlist including how to handle custom data. I will schedule a meeting once I'm back from vacation (Sept 5) to start this discussion and confirm the next steps.

Scheduled meeting on 27-September to discuss next steps

Change 962657 had a related patch set uploaded (by MNeisler; author: MNeisler):

[analytics/refinery@master] Add the wikifunctions_ui metrics platform schema to the allowlist

https://gerrit.wikimedia.org/r/962657

MNeisler added a subscriber: mforns.

@mforns - I've added you as a reviewer on this patch to add wikifunctions_ui to the allowlist. Here's a link to the current documentation about the wikifunctions_ui custom fields. Please let me know if you have any questions or suggested revisions. Thank you!

Jdforrester-WMF changed the task status from Open to In Progress. (Oct 12 2023, 7:07 PM)
Jdforrester-WMF moved this task from To Triage to In Progress on the Abstract Wikipedia team board.

Change 962657 merged by Mforns:

[analytics/refinery@master] Add the wikifunctions_ui metrics platform schema to the allowlist

https://gerrit.wikimedia.org/r/962657

Reopening as there was an error with the initial patch when computing sanitization

Note: This is currently blocked on resolution of T349121

Hello @MNeisler:

Regarding T349121, we had to make a change to the sanitization allowlist. You can find the details at T349121#9416368, but the TL;DR is:

...
performer:
    edit_count_bucket: keep
    name: hash                         <<<<<<<<<< switched from id to name due to technical limitation on hashing integers
    is_bot: keep
    is_logged_in: keep
    pageview_id: hash
    session_id: hash
...

You should still be able to do the same analytics with name instead of id. If not, let us know and we will figure out something else. Thanks!
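
As a rough illustration of why hashing name instead of id should leave most analyses unchanged, the Python sketch below counts distinct performers both ways. It assumes each id corresponds to exactly one name and that the same salt is used for all rows being compared; the `hash_value` helper, the salt, the HMAC-SHA256 choice, and the sample rows are hypothetical and not taken from the actual pipeline.

```python
import hashlib
import hmac


def hash_value(value: str, salt: bytes) -> str:
    """Deterministic salted hash: equal inputs always give equal digests."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical rows: the same user appears twice.
events = [
    {"id": 101, "name": "Alice"},
    {"id": 102, "name": "Bob"},
    {"id": 101, "name": "Alice"},
]

SALT = b"example-salt"  # assumed to be constant across the rows compared

hashed_names = [hash_value(e["name"], SALT) for e in events]

# Distinct-user counts agree because the hash is deterministic and,
# under the one-name-per-id assumption, maps users one-to-one.
distinct_by_id = len({e["id"] for e in events})
distinct_by_hashed_name = len(set(hashed_names))
```

Grouping or joining on the hashed name works the same way. The equivalence would not hold across a user rename or across rows hashed with different salts.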

OK, with the fixes in T349121 this seems to now be working as expected – running

SELECT "name" AS "eventName",
       "dt" AS "eventDatetime",
       "custom_data" AS "customData",
       "page" AS "pageData"
FROM "event_sanitized"."wikifunctions_ui"
ORDER BY dt DESC
LIMIT 10

… delivers results as expected.

@xcollazo - Thank you for the update and work to resolve T349121!
Just confirming that the change from id to name is fine, and that the data in event_sanitized.wikifunctions_ui appears to be logged as expected, as @Jdforrester-WMF confirmed above.

I think we can resolve this task now.