Temporary Accounts Initiative (IP Masking) - Add user_is_temporary and user_is_permanent to data tables
Closed, Duplicate (Public)

Description

Background/Goal

The Temporary Accounts Initiative (formerly IP Masking) will have an impact on how we analyze users (editors, registrations etc.). In T333223 we asked the MediaWiki team to help us identify a temporary user in the upstream tables and they have added a new Boolean field called user_is_temp in the user table.

We need to add a similar field to the downstream data tables that rely on the MediaWiki user table, so that temporary users can be identified there. The tables and fields needed are listed below.

KR/Hypothesis(Initiative)

[Committed] Temporary Accounts for Unregistered Editors

Success metrics

  • How we will measure success

Example areas:

  • Deadlines
  • User satisfaction : querying this data becomes easier using our existing data tables
  • Performance
  • Accessibility : we will be able to clearly identify and report on which users have registered using the registration process or had a temporary account created for them to avoid displaying their IP addresses
  • Maintenance
  • Movement impact
  • Scalability
  • Data Quality
  • Integration
  • Compliance

In scope

  • known scope

The following tables should get a new field similar to user_is_temp (reference document)

image.png (310×1 px, 58 KB)

Out of Scope

  • known boundaries

Artifacts & Resources

IP Masking impacted Data tables
IP Masking Impact on Data Pipelines: Work Breakdown Structure

Link to diagrams
Link to specifications, architecture and design docs
Link to product one pagers

Event Timeline

Mayakp.wiki triaged this task as Medium priority. Feb 5 2024, 8:17 PM
Mayakp.wiki created this task.

FYI, based on the results of T337103, user_is_anonymous and its siblings should be false for temporary users.

Also, I'd suggest making the field user_is_temporary rather than user_is_temp. That would be in line with our usual practice of spelling things out fully, and also probably slightly more accessible.

mw:User account types is the best reference on the difference between IP, temporary, and permanent users.

@nshahquinn-wmf and I discussed this further today while answering a few questions in @Milimetric's document. We would prefer that every table which has the column user_is_anonymous gets a new corresponding column user_is_temp (not _temporary, to keep the naming consistent with MediaWiki).

I also want to note here that the Impacted table list is not exhaustive of all the tables. I have only included the tables I know of that are frequently used by data users to perform editor data analysis.

From a data/metrics usage perspective, the user_is_anonymous field seems to be mostly used for the binary anon/editor classification (e.g. in wikistats editor types are "Anonymous - a user that is not logged in" and "User - a registered, logged in", in research we create datasets/models that use this as a feature). In my understanding this binary nature will not change (we might want to update some naming), i.e. the temp accounts will eventually replace all anonymous edits (once the feature is rolled out to all wikis).

Looking at the user account types, the data engineering usage corresponds to the isNamed property. I agree that the data eng schema should not define concepts differently (hence user_is_anonymous is False for temp users), but only adding user_is_temp means that

  • grouping by or filtering for logged-in editors will require !user_is_anonymous && !user_is_temp
  • this extra condition adds complexity that is tied to the roll-out of a feature, i.e. eventually user_is_anonymous should always be false

What about also adding a user_is_named field that captures the semantics of what user_is_anonymous is currently used for? That would mean pipelines could just switch to using that field without other code changes.
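
To make the trade-off concrete, here is a minimal Python sketch of the two query patterns. The EditRow type, the sample rows, and the derive_user_is_named helper are hypothetical and only illustrate the proposal; they are not the actual pipeline code or table schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditRow:
    """Hypothetical row carrying only the fields discussed in this task."""
    user_text: str
    user_is_anonymous: bool
    user_is_temp: bool


def derive_user_is_named(row: EditRow) -> bool:
    # The combined condition every consumer would otherwise repeat:
    # !user_is_anonymous && !user_is_temp
    return not row.user_is_anonymous and not row.user_is_temp


# Made-up sample data: one IP editor, one temporary account, one named account.
rows = [
    EditRow("192.0.2.1", user_is_anonymous=True, user_is_temp=False),
    EditRow("~2024-12345", user_is_anonymous=False, user_is_temp=True),
    EditRow("ExampleEditor", user_is_anonymous=False, user_is_temp=False),
]

named_editors = [r.user_text for r in rows if derive_user_is_named(r)]
print(named_editors)
```

With a precomputed user_is_named column, a pipeline would filter on that single field instead of repeating the two-flag condition in every query.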

@Ottomata thanks for sharing - the intricacies of naming/classifying the user types are real!

I don't have input or opinions on what the names should be, but I feel there should be a boolean field in the schema that tells me whether an edit was done by someone with a permanent user account. Unless I misunderstand the current proposal, we won't have such a field anymore?

I think so.

FYI: Discussion about this has moved to phab:T337103. The leading proposal is now that we should treat temp users as an entirely new category which is neither unregistered/anonymous nor registered. More input is of course welcome. Neil Shah-Quinn (WMF) (talk) 20:00, 27 June 2023 (UTC)

T337103: Decide a standard approach for classifying temporary, IP and registered users

Decision

Temporary users should be considered as something separate from unregistered or registered users, because:

  • There is no way to categorize temporary users that will preserve all or even almost all existing assumptions
  • It's dangerous to treat temporary users as either registered or unregistered, because their capabilities could change significantly in the future
  • Temporary users are so paradigm-breaking that we should not try to make the change seamless for API or data consumers. Instead, we want each consumer to stop and think how they want to handle temporary users.

My expectation is that most data consumers will come to the conclusion that the signal they want is "is this a permanent user, not a temp or IP user", i.e. User::isNamed() in MediaWiki. If that is the case, why not also add a field for that purpose? That would make downstream adoption easier and less error-prone.

One additional consideration: a statement like "user is anonymous" currently expresses two things: 1) a reader who is not logged into an account, and 2) a contributor who edits the wiki without an account. When temporary accounts roll out, "user is anonymous" will only apply to (1), while "user is temp" will apply to (2). I think there are plenty of event logging schemas where it's meaningful to distinguish events based on whether 1) "user is anonymous", e.g. a reader who has not contributed on that device, or 2) "user is temp", i.e. someone who's had an account autocreated for them on the wiki via a loggable event (a successful edit or an edit attempt).

@fkaelin : I agree with your point. But our mandate here was to get an identifier for temp accounts first.
For a while we will have to use the combination of !user_is_anonymous && !user_is_temp to get a permanent (registered) user, but this is temporary.
Eventually, once the temp accounts initiative is rolled out to all wikis, the user_is_anonymous field will become obsolete, as all editors will have a username/userid, and the user_is_temp field can be used to distinguish between temporary accounts and permanent accounts.

Thanks for the clarifications @Mayakp.wiki, though in my opinion we should still consider adding a user_is_named flag as a replacement for the previous definition of user_is_anonymous, to minimize the downstream implications of this change.

Another point for discussion: the mediawiki history dumps are published as TSV files (without a header) for the community. Changing the definition of user_is_anonymous could have an impact on consumers in the community. It would likely require prior notice to the community; cc @KinneretG, who is working on an announcement about temp accounts to the research list.

> What about also adding a user_is_named field that captures the semantics of what user_is_anonymous is currently used for? That would mean pipelines could just switch to using that field without other code changes.

I think this is a good idea! @fkaelin is right that the main query pattern will be !user_is_anonymous && !user_is_temp, and we can easily provide some syntactic sugar alongside user_is_temp (we could even have it instead of user_is_temp, but boolean fields are cheap and I don't see any reason not to have two).

It should be called user_is_permanent since that's the standard name (now that we've agreed on standard names, the MediaWiki method being isNamed is a historical artifact).

> For a while we will have to use the combination of !user_is_anonymous && !user_is_temp to get a permanent (registered) user, but this is temporary.
> Eventually, once the temp accounts initiative is rolled out to all wikis, the user_is_anonymous field will become obsolete, as all editors will have a username/userid, and the user_is_temp field can be used to distinguish between temporary accounts and permanent accounts.

FWIW, there will always be analyses that cover the period before temporary users, so the user_is_anonymous field will never actually become obsolete although it will become less and less important as time goes on.

> @nshahquinn-wmf and I discussed this further today...and we would prefer that every table which has the column user_is_anonymous should have a new corresponding column user_is_temp (not _temporary, to keep the naming consistent with mediawiki)

I've actually changed my mind about this; consistency with MediaWiki internals is somewhat useful, but consistency with standard human-language names (the ones which we want to be used in documentation, research papers, and so on) is much more important. So, although it's not that important, I do think user_is_temporary would be a better name, and I think it's worth switching since we haven't started any implementation yet.

Hm, interesting! TIL about MW's user_is_permanent status. That's new then?

Should we consider adding that to mediawiki.page_change and other event data models? cc @gmodena for T374811#10221361

Are you talking about the choice of "permanent" as the standard term? No, not new: Thalia proposed it about a year ago at the end of our discussion about how to classify temp users (T337103#9143741), no one objected, and someone else changed the wiki page accordingly about a month later.

But it was barely discussed and never announced in any meaningful way 😅 so it's no wonder you didn't know about it. Maybe I should send out some belated FYI messages!

+1 on user_is_permanent. I'll make it so in all the temp accounts work. So the booleans will now be:

  • event_user_is_anonymous: meaning updated to "logged-out users before temp accounts are turned on"
  • event_user_is_temp: temporary users
  • event_user_is_permanent: has a user record in the user table and user_is_temp is false

If that's not good for any reason, please let us know.
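
As an illustration of the three definitions above, here is a minimal Python sketch that derives the flags for one user. It assumes that a missing user_id (None) stands for "no record in the user table"; that convention and the user_flags function name are hypothetical, not the actual pipeline implementation.

```python
from typing import Optional, TypedDict


class UserFlags(TypedDict):
    event_user_is_anonymous: bool
    event_user_is_temp: bool
    event_user_is_permanent: bool


def user_flags(user_id: Optional[int], user_is_temp: bool) -> UserFlags:
    """Derive the three booleans from a (hypothetical) user table lookup.

    Assumption: user_id is None when the editor has no user record,
    i.e. a logged-out editor before temp accounts were turned on.
    """
    has_user_record = user_id is not None
    return {
        "event_user_is_anonymous": not has_user_record,
        "event_user_is_temp": has_user_record and user_is_temp,
        "event_user_is_permanent": has_user_record and not user_is_temp,
    }


# Made-up inputs: an IP editor, a temporary account, a permanent account.
ip_editor = user_flags(None, False)
temp_account = user_flags(101, True)
permanent_account = user_flags(102, False)
```

Under these definitions exactly one of the three flags is true for any user, so consumers can treat them as a three-way classification rather than as independent booleans.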

Looks good to me! As I said, I would prefer event_user_is_temporary (doesn't temp seem out of place next to anonymous and permanent?), but it's not very important.

@nshahquinn-wmf I agree on temporary vs temp but the field in the mariadb schema is user_is_temp so I donno... I mean we rename a bunch of other fields here... hard call, I'll see what the team thinks.

I have no objections to changing the data models/serialization logic for mediawiki.page_change and non-legacy streams. I don't have any input or opinions regarding the naming of the new field.
Holler if you need support with the EventBus side of things, happy to help.

To follow up my previous comment:

> Another point for discussion: the mediawiki history dumps are published as TSV files (without a header) for the community. Changing the definition of user_is_anonymous could have an impact on consumers in the community. It would likely require prior notice to the community; cc @KinneretG, who is working on an announcement about temp accounts to the research list.

The temp feature will be turned on October 29th for the pilot wikis, so metrics/tools that use the user_is_anonymous field will be impacted. Is the plan for the new columns to be added before the October snapshot is generated? Or for the November snapshot as only a couple days in a few wikis would be affected? This info will also be part of the announcement for the research community.

Mayakp.wiki renamed this task from Temporary Accounts Initiative (IP Masking) - Add user_is_temp to data tables to Temporary Accounts Initiative (IP Masking) - Add user_is_temporary and user_is_permanent to data tables. Nov 1 2024, 8:04 PM

Updated the title based on recent discussions.

@Milimetric , can we get an update on the timelines for implementing these changes to the tables?
From T340001
On October 29, 2024 (T378334)

  • Czech Wikiversity
  • Igbo Wikipedia
  • Italian Wikiquote
  • Swahili Wikipedia
  • Serbo-Croatian Wikipedia
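
For analyses that span the rollout, the pilot list above could be encoded as a small lookup. This is only a sketch: the temp_accounts_enabled function is hypothetical, the wikis are keyed by the display names used in this task rather than by database names, and the whole of October 29, 2024 is treated as "after rollout" because the exact time is not given here. It knows nothing about later rollouts to other wikis.

```python
from datetime import date

# Pilot rollout date and wikis as listed in this task.
PILOT_ROLLOUT = date(2024, 10, 29)
PILOT_WIKIS = frozenset({
    "Czech Wikiversity",
    "Igbo Wikipedia",
    "Italian Wikiquote",
    "Swahili Wikipedia",
    "Serbo-Croatian Wikipedia",
})


def temp_accounts_enabled(wiki: str, edit_date: date) -> bool:
    """Return True if the edit falls on a pilot wiki on or after the rollout date."""
    return wiki in PILOT_WIKIS and edit_date >= PILOT_ROLLOUT


# Made-up example checks.
before = temp_accounts_enabled("Igbo Wikipedia", date(2024, 10, 28))
after = temp_accounts_enabled("Igbo Wikipedia", date(2024, 10, 30))
non_pilot = temp_accounts_enabled("English Wikipedia", date(2024, 10, 30))
```

A check like this lets a metric keep using user_is_anonymous for edits made before the rollout and switch to the temporary and permanent flags afterwards.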