Evaluate the effects of IP Masking on Analytics
Open, Medium, Public

Description

IP Masking will affect lots of our products, features, tools, gadgets, etc. This task is for tracking work to analyse and adapt to the effects of IP Masking on analytics ahead of IP Masking being enabled on WMF sites.

See T326816: Update features for IP Masking, particularly What will be affected.

For examples of code that is likely to be affected, please refer to T326759. One example, by no means the only one, of such a side effect is the way instrumentation code uses isRegistered() and isAnon(). As it stands, isRegistered() returns false for anonymous users. After IP Masking is rolled out, any anonymous user who performs a single successful edit will have a temporary account created for them, resulting in isRegistered() subsequently returning true (and isAnon() returning false). The analytics teams might have to decide if this is the desired behaviour or adapt accordingly if not.
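To make the side effect concrete, here is a minimal toy model in Python of the behaviour described above. It is not MediaWiki code: the `ToyUser` class, the `is_temp` flag, the `after_first_edit` helper, and the `is_logged_out_editor` check are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyUser:
    """Toy stand-in for a wiki user; NOT the MediaWiki User class."""
    user_id: int = 0        # assumption: 0 means no account exists
    is_temp: bool = False   # hypothetical flag marking a temporary account

    def is_registered(self) -> bool:
        # Models the described behaviour: any user with an account counts as registered.
        return self.user_id != 0

    def is_anon(self) -> bool:
        return not self.is_registered()


def after_first_edit(user: ToyUser, ip_masking_enabled: bool, new_id: int = 1001) -> ToyUser:
    """Return the user's state after one successful edit (new_id is arbitrary)."""
    if user.is_anon() and ip_masking_enabled:
        # With IP Masking on, a temporary account is created for the editor.
        return ToyUser(user_id=new_id, is_temp=True)
    return user


def is_logged_out_editor(user: ToyUser) -> bool:
    """One possible adaptation: treat temporary accounts like anonymous users."""
    return user.is_anon() or user.is_temp


visitor = ToyUser()
edited = after_first_edit(visitor, ip_masking_enabled=True)
print(visitor.is_registered(), edited.is_registered(), is_logged_out_editor(edited))
```

Instrumentation that branches only on `is_registered()` would silently move such an editor from the "anonymous" bucket to the "registered" bucket after their first edit, which is the decision the analytics teams need to make explicitly.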

Event Timeline

mpopov subscribed.

Per the meeting between Anti-Harassment Tools, Product Analytics, and Data Engineering (notes here), Maya (PA) and Dan (DE) will collaborate on investigating and analyzing the impact on downstream data, which includes this work and:

mpopov triaged this task as Medium priority. Apr 11 2023, 5:12 PM
mpopov moved this task from Triage to Current Quarter on the Product-Analytics board.

We have been meeting with @Milimetric on a regular basis to be informed of upstream decisions and discuss impacts.

Proposed changes for anonymous editors
A new boolean field, user_is_temp, will be added to the upstream user table in MediaWiki. It will be true for temporary users (unregistered editors on a wiki where IP Masking has been rolled out) and false for registered users. Note: the MediaWiki user table does not store anonymous users.
This field will be added as-is to the wmf_raw.mediawiki_user table in the Data Lake.
From there, DE proposes to transform it and add it to mediawiki_history and other tables, where

  • Registered: if (event_user_id <> 0 and not event_user_is_anonymous)
  • Anonymous: if (event_user_id = 0 and event_user_is_anonymous)
  • Temporary: if (event_user_id <> 0 and event_user_is_anonymous)
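The three conditions above can be sketched as a small Python function. This is only an illustration of the proposed rule, not the Data Engineering implementation; the function name `classify_editor` and the "unknown" fallback for combinations the rule does not cover are assumptions.

```python
from typing import Optional


def classify_editor(event_user_id: Optional[int], event_user_is_anonymous: bool) -> str:
    """Classify an event's editor using the rule proposed above."""
    if event_user_id is None:
        # Assumption: a missing id is not covered by the proposed rule.
        return "unknown"
    if event_user_id != 0 and not event_user_is_anonymous:
        return "registered"
    if event_user_id == 0 and event_user_is_anonymous:
        return "anonymous"
    if event_user_id != 0 and event_user_is_anonymous:
        return "temporary"
    # event_user_id == 0 and not anonymous: not covered by the proposed rule.
    return "unknown"


for row in [(42, False), (0, True), (42, True), (0, False)]:
    print(row, classify_editor(*row))
```

Returning an explicit "unknown" rather than forcing every row into one of the three classes makes unexpected combinations visible during data QA.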

We have started discussing internally within the new Decision Science and Research (amalgamation of Product Analytics, Global Data & Insights, Research) team to understand impacts of this change to topline metrics, product feature metrics, eventlogging etc. and will add details to T332205.

We are making a list of downstream tables used by data folks at the Foundation and community that may potentially be impacted by the IP Masking project.
IP Masking impacted Data tables

  • This document (WIP) will serve as a reference for Data Engineering to determine which tables need to be altered to add the new field that identifies temp users.
  • It should also help data users understand what is changing in the tables they frequently use for calculating editing metrics.
  • It should facilitate data QA for data pipelines and the monitoring of IP and temp user counts.

From @Milimetric: For now, we're going to aggregate any activity or counts of "temp" users and "IP" users under the umbrella "anon". This is temporary while we figure out whether temporary user accounts are going to behave substantially differently from logged-out users. Internally we can differentiate between the users, and if that difference is worth bubbling out to our users, we'll update the APIs and dashboards.
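As a rough sketch of that interim aggregation, the snippet below keeps three internal classes but reports "temp" and "IP" users together as "anon". The label strings, the `PUBLIC_LABEL` mapping, and the `public_counts` helper are assumptions for illustration, not names from any existing pipeline or API.

```python
from collections import Counter

# Assumed internal class names mapped to the label exposed publicly.
PUBLIC_LABEL = {
    "registered": "registered",
    "ip": "anon",
    "temp": "anon",
}


def public_counts(internal_classes):
    """Count events per public label, folding 'ip' and 'temp' into 'anon'."""
    counts = Counter(PUBLIC_LABEL[c] for c in internal_classes)
    return dict(counts)


events = ["registered", "ip", "temp", "temp"]
print(public_counts(events))
```

Because the internal classes are preserved and only the mapping collapses them, exposing temporary accounts separately later would only require changing `PUBLIC_LABEL`.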

Expanding on T332205#8842633, I am drafting the IP Masking Impact Report (Downstream Data), which has a comprehensive list of public/private reports, dashboards, etc. that track editor data and can be used to analyze downstream data impact. This document also lists the changes to data tables that will be needed for IP Masking (as noted in the comment above).

Added the Movement-Insights team tag to accurately reflect current work. Keeping close tabs on T344919, as this task will inform the impact of IP Masking on datasets and pipelines.

Keeping the Product-Analytics tag and moving to Tracking as this task aims to capture the overall impact to both our teams.

T356701 has been opened to request that Data Platform Engineering add the user_is_temp field to the relevant downstream data tables. This field will help us distinguish temporary accounts from registered and IP accounts.
(reference: T333223)

Today Movement-Insights spoke about doing some preliminary impact analysis.
We have a report, Temporary Accounts Initiative aka IP Masking Impact Report (Data), that covers 'what' will be impacted. We now need to understand 'by how much' it will be impacted and update that report.
PS: timeline - testwiki deployment will happen before the end of June 2024.