Decide a standard approach for classifying temporary, IP and registered users
Closed, Resolved (Public)

Description

Background

Currently, users are classified as either anonymous (unregistered) or registered in a number of places, including flags on log entries, flags returned by APIs, filters in the user interface, analytics schemas, and data products such as dashboards and reports.

How should temporary users fit into this? Should they count as anon users? Should they count as registered users? Should they have their own flag?

We should try to decide this once and keep it consistent everywhere.

A previous discussion took place at mw:Talk:User_account_types.


Decision record

(template)

Title: Decide a standard approach for classifying temporary, IP and registered users

Authors: Thalia Chan, Daniel Kinzler, Neil Shah-Quinn

Status

  • Proposed on 2023-07-28

Decision-making process

  • Discussions take place on T337103 and mw:Talk:User_account_types
  • A group of stakeholders meet and reach a consensus on the current proposal
  • The proposal is shared with the developer community
  • Depending on feedback, the proposal is adopted or updated and resubmitted

Stakeholders

  • IP Masking program
  • Analytics
  • API users
  • Community software developers
  • MediaWiki developers

Context and problem statement

Currently, users are classified as either anonymous (unregistered) or registered in a number of places, including flags on log entries, flags returned by APIs, filters in the user interface, analytics schemas, and data products such as dashboards and reports. How should temporary users fit into this? Should they count as unregistered users? Should they count as registered users? Should they count as neither?

The nature of temporary accounts will change over time. The way they are implemented allows them to behave a lot like registered users (have passwords, email addresses, user groups, etc.), but at the point when we deploy, much of this behaviour will be "switched off", so they will more closely resemble unregistered users (summarized in mw:User_account_types). Over time we expect to gradually "switch on" some of these features, to allow them to more closely resemble registered users.

Risks and mitigations

A major risk is that users of the affected APIs and features make assumptions which will now be broken, causing unexpected behaviours. These assumptions will break differently depending on how temporary users are categorized, as tabulated in T337103#8946439.

We can't avoid breaking some assumptions, but we can mitigate the risks by communicating clearly with the community about the new categorisation and broken assumptions.

Options considered

  • Categorize temporary users as unregistered users (a.k.a. anonymous users, IP users)
  • Categorize temporary users as registered users
  • Categorize temporary users in a separate category (neither unregistered nor registered)

Decision

Temporary users should be considered as something separate from unregistered or registered users, because:

  • There is no way to categorize temporary users that will preserve all or even almost all existing assumptions
  • It's dangerous to treat temporary users as either registered or unregistered, because their capabilities could change significantly in the future
  • Temporary users are so paradigm-breaking that we should not try to make the change seamless for API or data consumers. Instead, we want each consumer to stop and think how they want to handle temporary users.

Consequences

Temporary users are considered as separate from registered and unregistered users, and this will be reflected in API flags, RecentChanges filters, and analytics schemas, for example. Downstream features that rely on these are audited by their maintainers and updated if necessary.

Event Timeline

Ideally, this would be decided by the team that introduced the respective flag since what's useful to show would depend on how the flags are being used.
But I don't know what's technically feasible. Would it be better to decide once how users are being flagged and keep it consistent across different features/flags?

Is it possible to generate a list of where the flags occur so we can assess what all will be impacted by this decision?

Ideally, this would be decided by the team that introduced the respective flag since what's useful to show would depend on how the flags are being used.

This sounds sensible, but it seems unlikely that the various flags were introduced by teams that still exist.

But I don't know what's technically feasible. Would it be better to decide once how users are being flagged and keep it consistent across different features/flags?

I agree - I think this would be easier to understand and more maintainable than having several teams make different decisions.

How APIs use anon to mean an IP user, based on searching for 'anon' in /core/includes/api:

  • ApiMain accepts assert param to define allowed callers (anon, user, or bot)
  • ApiQueryContributors returns anons count
  • ApiBlockInfoTrait returns blockanononly flag
  • ApiQueryBlocks accepts anononly param
  • ApiFeedRecentChanges accepts hideanons param

Additionally, the following return an anon flag:

  • ApiQueryImageInfo
  • ApiQueryLogEvents
  • ApiQueryRecentChanges
  • ApiQueryRevisionsBase
  • ApiQueryUserInfo
  • ApiQueryWatchlist

Some other places where 'anon' currently means an IP user. These are less important to update than the APIs, since they are internal implementations, but they could cause some confusion.

Configuration

  • DefaultUserOptions['watchlisthideanons']
  • DisableAnonTalk
  • RCFeeds[<somefeed>]['omit_anon']
  • RateLimits[<action>]['anon'] (also used by GrowthExperiments, PageTriage)
  • UniversalLanguageSelector: ULSAnonCanChangeLanguage
  • VisualEditor: VisualEditorDisableForAnons

Database table field names

UI messages
Many message keys use "anon". Some of these messages are currently being displayed to temp and anon users; others have separate temp versions. There are way too many to list here, but they're easily discoverable by searching through i18n JSON files.

These are less important to update than the APIs, since they are internal implementations, but they could cause some confusion.

While true, it's worth noting that discrepancies in the definition even at this level are likely to lead to new bugs and improperly classified data in the future, unless the inconsistent definitions are very well documented and will be understood by all engineers working with them.

E.g. a MW dev creating new MW PHP-side instrumentation may choose to produce data with fields named according to their own understanding of how MW works. Analysts using this data (possibly years in the future) will have a different understanding, leading to incorrect analysis of the data.

In an ideal world, yes. But the amount of work needed to make this happen (for example, renaming db fields, changing many definitions across the codebase, or deprecating APIs) is so massive that it isn't really justified by the gains mentioned here. The middle ground here is to update documentation to reflect the changes in how mw sees users; a mw dev building an instrumentation is expected to read the documentation of methods before calling them.

The middle ground here is to update documentation

unless the inconsistent definitions are very well documented

Ya, agree!

Pulling this out of Slack so it's public. @Mayakp.wiki, other folks from Product Analytics, and I have been thinking about how to adapt our public API endpoints and dashboards as well. For now, we're going to aggregate any activity or counts of "temp" users and "IP" users under the umbrella "anon". This is temporary while we figure out if temporary user accounts are going to behave substantially differently from logged-out users. Internally we can differentiate between the users and if that difference is worth bubbling out to our users, we'll update the APIs and dashboards.

The main realization is that this difference is most likely to not materialize, so it's ok to wait a bit.

For now, we're going to aggregate any activity or counts of "temp" users and "IP" users under the umbrella "anon".

This means that users of your public API endpoints will get different results for 'anon' than users of MediaWiki API endpoints.

IIUC, e.g. RecentChanges filtering for edits by Registered users will include edits by Temp users, but AQS edit count summaries will not.

(unless RecentChanges has already been modified to do the same as you say. But s/RecentChanges/OtherTool or API/.)

Thanks everyone for the input so far. Here are some of my thoughts. I'm warming to the idea of giving temporary users a separate flag, but also not categorizing them as either "registered" or "anonymous" (using the definitions of anonymous/registered from before temporary accounts existed).

Are temporary users more like "anonymous" or "registered"?

Currently, they're arguably more like "anonymous" users. But this could change in the future: the infrastructure is in place to make them more like "registered" users.

The user types table is helpful for comparing these three types of user. At the moment temporary users can't set a password, use preferences, belong to user groups, use watchlists, etc.

However, this is according to the MVP definition of how temporary accounts will work, when we first deploy. All the infrastructure is in place for them to be able to have all these various features in the future (i.e. they have a row in the user table), and this is by design, as requested by @Niharika and the other Product & Design folk working on this.

If we decide to flag temporary users as "anonymous", that might make sense now but it might make less sense in the future. However, if we flag them as "registered", that makes less sense now.

If we give them a separate flag, this allows for their definition to change, and also means that the current definitions of "registered" and "anonymous" won't change.

Can we disrupt as few people as possible?

We've had some discussion about how people might be using the "anonymous" and "registered" definitions, and whether we can use this information to disrupt as few end-users as possible.

Examples (just two of many possible examples):

  • if "anonymous" usually means "didn't make any effort to create an account" then it would be least disruptive to call temporary users anonymous
  • if "registered" usually means "represents a single person, who can receive notifications" then it would be least disruptive to call temporary users registered

However, it's difficult to determine what is least disruptive. Even if we do a detailed analysis of downstream products/features/tools that use these definitions, and even if that analysis discovers that it's overwhelmingly less disruptive one way or another, it's still the case that temporary accounts could change in the future, so whatever disruption we're putting off now could occur later.

Instead, it might be better to make sure we communicate really clearly about temporary users, and provide a separate flag for temporary users, so downstream features can decide themselves how to categorize them. Some might prefer to categorize them with "anonymous" users, some with "registered" users, some separately.
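As a rough illustration of what "deciding for themselves" could look like on the consumer side, the sketch below classifies a record into three user types and then applies a policy chosen by the consumer. The `anon` and `temp` keys, the placeholder user names, and all function names are assumptions made up for this example; they are not confirmed API output.

```python
from enum import Enum


class UserType(Enum):
    IP = "ip"
    TEMP = "temp"
    REGISTERED = "registered"


def classify(record: dict) -> UserType:
    """Classify a change record using two hypothetical boolean flags."""
    # Assumption: IP users carry an "anon" flag and temporary users carry a
    # separate "temp" flag. Neither key name is taken from a real API.
    if record.get("temp"):
        return UserType.TEMP
    if record.get("anon"):
        return UserType.IP
    # Other user types (system, external, bot, ...) are ignored here.
    return UserType.REGISTERED


def filter_records(records, wanted):
    """Keep only the records whose user type is in the consumer's chosen set."""
    return [r for r in records if classify(r) in wanted]


# Placeholder records; the user names are invented for illustration.
records = [
    {"user": "192.0.2.1", "anon": True},
    {"user": "Temp-0001", "temp": True},
    {"user": "Alice"},
]

# One consumer groups temporary users with IP users...
logged_out_like = filter_records(records, {UserType.IP, UserType.TEMP})
# ...while another treats them as a group of their own.
temp_only = filter_records(records, {UserType.TEMP})
```

The point of the sketch is that the grouping is an explicit choice in the consumer's code rather than something inherited from a two-valued flag.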

Can we avoid breaking assumptions about users?

In short, no: we can't avoid breaking any assumptions at all. But we should understand what the assumptions are, and which we might/will break. Whatever approach we take in the end, we should communicate the broken assumptions very clearly.

Some assumptions that currently exist (adapted from a conversation with @daniel):

| Assumption | If we flag temp users as "anonymous" | If we flag temp users as "registered" | If we give temp users a separate flag (neither "anonymous" nor "registered") |
|---|---|---|---|
| If the user is flagged as "anonymous", I know they didn't create an account | Holds | Holds | Holds |
| If the user is not flagged as "anonymous", I know they did create an account | Holds | Broken | Broken |
| If the user is flagged as "anonymous", I know their user name is an IP address | Broken | Holds | Holds |
| User ID 0 is equivalent to "anonymous" | Broken | Holds | Holds |
| User ID nonzero is equivalent to "registered" | Broken | Broken | Broken |
| If the user is flagged as "anonymous", I know they're not "registered" | Holds | Holds | Broken |
| If the user is flagged as "anonymous", I know they don't have password, preferences, user groups, watchlist, etc | Holds for now, but may change | Holds | Holds |
| If the user is flagged as "registered", I know they have password, preferences, user groups, watchlist, etc | Holds | Broken for now, but may change | Holds |
| I can assume the same things about an "anonymous" user as I did before | Broken | Holds | Holds |
| I can assume the same things about a "registered" user as I did before | Holds | Broken | Holds |

(Note: other user types such as system, external, bot, etc are ignored in this table.)

The final two rows show that, by introducing a separate flag for temporary users, we don't change what was previously meant by an "anonymous" or "registered" user. The assumption that we do break is that all users are either "anonymous" or "registered", so "not anonymous" no longer implies "registered".
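For anyone who wants to query the table rather than read it, here is a minimal transcription of it as data, with a helper that lists the assumptions marked "Broken" under each option. The row labels are shortened paraphrases of the first column, and the names `OPTIONS`, `ASSUMPTIONS` and `broken_under` are invented for this sketch.

```python
# Column order mirrors the table: flag as "anonymous", flag as
# "registered", separate flag.
OPTIONS = ("anonymous", "registered", "separate")

H, B = "Holds", "Broken"
H_NOW = "Holds for now, but may change"
B_NOW = "Broken for now, but may change"

# Row labels are shortened paraphrases of the table's first column.
ASSUMPTIONS = {
    "anonymous flag => did not create an account": (H, H, H),
    "no anonymous flag => did create an account": (H, B, B),
    "anonymous flag => user name is an IP address": (B, H, H),
    "user ID 0 <=> anonymous": (B, H, H),
    "user ID nonzero <=> registered": (B, B, B),
    "anonymous flag => not registered": (H, H, B),
    "anonymous flag => no password, preferences, etc": (H_NOW, H, H),
    "registered flag => has password, preferences, etc": (H, B_NOW, H),
    "anonymous means the same as before": (B, H, H),
    "registered means the same as before": (H, B, H),
}


def broken_under(option: str) -> list[str]:
    """Return the assumptions whose cell starts with 'Broken' for one option."""
    col = OPTIONS.index(option)
    return [
        name
        for name, cells in ASSUMPTIONS.items()
        if cells[col].startswith("Broken")
    ]


for option in OPTIONS:
    print(option, broken_under(option))
```

Cells marked "for now, but may change" are counted by their current state only, so the output describes the situation at first deployment and not any later one.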

Thank you @Tchanders for the concise summary of the situation!

I like the idea of treating temp users as neither registered nor anonymous. It will require quite a bit of surveying MediaWiki to make sure we handle the new distinction correctly in all cases. But I agree that it's the most future-proof option, and the one that is least likely to break things downstream.

I noticed that in MediaWiki, the use of User::isRegistered is much preferred over User::isAnon. However, in the API, there is no mention of "registered", there is just an "anon" flag. So anything interested in registered users will use the anon (or "hideanon") flags, which will not apply to temp users.

My 2c: Trying to consider a temporary user a registered user or anonymous user (=unregistered) feels like forcing a square peg into a round hole. It's none of these groups, we simply need to split our definition to three groups: Registered, temporary, unregistered.

+1. Also here's a relevant comment with some context about other user types: https://phabricator.wikimedia.org/T336176#8836758

On reflection, I also support the idea of treating temp users as neither registered nor unregistered (anonymous). Socially, that has the benefit of sending a very clear, wide signal that temp users are a totally new thing and that old assumptions and models should be reviewed.

From an analytics standpoint, that would mean:

  • updating dashboards and APIs that allow splitting between registered and unregistered users to add temp users as a third group
  • being careful not to use an umbrella term when combining groups in reports (e.g. if we count temp and registered users, we call it "temp and registered users" rather than "non-anonymous editors")
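A small sketch of the two points above: counts are kept per user type, and any combined figure gets a label that names its parts instead of an umbrella term. The `user_type` field, its three values, and the function names are assumptions for illustration, not an existing schema.

```python
from collections import Counter

# Fixed display order so that labels are stable.
ORDER = ("unregistered", "temp", "registered")


def count_edits_by_type(edits):
    """Count edits per user type, keeping temp users as their own group."""
    return Counter(e["user_type"] for e in edits)


def combined_label(types):
    """Build an explicit label such as 'temp and registered users'."""
    ordered = [t for t in ORDER if t in types]
    if not ordered:
        raise ValueError("at least one known user type is required")
    if len(ordered) == 1:
        return f"{ordered[0]} users"
    return f"{', '.join(ordered[:-1])} and {ordered[-1]} users"


def combined_count(counts, types):
    """Sum the counts of the chosen user types."""
    return sum(counts[t] for t in types)


# Hypothetical sample data.
edits = [
    {"user_type": "registered"},
    {"user_type": "temp"},
    {"user_type": "temp"},
    {"user_type": "unregistered"},
]

counts = count_edits_by_type(edits)
group = {"temp", "registered"}
print(combined_label(group), combined_count(counts, group))
```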

For the most part, this would not affect our standard metrics since they generally include only registered users. Although it would be possible to start counting and including temp users in a way that wasn't possible with unregistered users, that's a separate discussion. There are a few metrics focused on unregistered users, and it would be natural to change them to focus on temp and unregistered.

The only thing I would add is that we should take this opportunity to be very consistent and intentional about the three terms we use. Specifically, we should decide whether to use "IP" or "unregistered" and stick to it ("anonymous" is not great for reasons that many other people have mentioned).

nshahquinn-wmf renamed this task from "Decide how to distinguish between temporary, IP and named users in various flags" to "Decide how to distinguish between temporary, IP and registered users in various flags". (Jul 1 2023, 12:15 AM)
nshahquinn-wmf updated the task description.

Repeating something I previously posted on Slack:

It seems like there's broad agreement that:

  • there is no way to categorize temp users that will preserve all or even almost all existing assumptions
  • it's dangerous to treat temp users as either registered or unregistered, because their capabilities could change significantly in the future
  • temp users are so paradigm-breaking that we should not try to make the change seamless for API or data consumers. Instead, we want each consumer to stop and think how they want to handle temp users.

@Tchanders and @daniel both agreed with this.

Current status (as I understand it): we are waiting for input from @Bmueller. After that, @Tchanders will draft a decision record which we will share more widely for feedback.

nshahquinn-wmf renamed this task from "Decide how to distinguish between temporary, IP and registered users in various flags" to "Decide a standard approach for classifying temporary, IP and registered users". (Jul 21 2023, 12:37 AM)
nshahquinn-wmf updated the task description.

Has a decision been made here?
The Growth team likely needs a final decision before working on this task: T342390: Newcomers and Registered RC filters fetch temp users

@KStoller-WMF Thanks for flagging that task. This has just gone out to a couple of mailing lists, so we'll allow a bit of time for feedback on this task first

I'd say that ideally we'd break them out for clarity, but if we had to choose whether to slot them in as either anon or registered in existing APIs/logging then they're much more like anons.

In practice, unless we revisit the restrictions on preference-setting, temporary users behave as "anons who we know have edited at least once before", without any implied commitment to the wiki beyond what any other anon might have -- particularly since the accounts are temporary and device/browser-locked, so we can't say whether any given anon has been a temporary user before....

In the olden days, 'anon' meant "there's no user_id and no matching user table row, so you can't look them up in the database except for other edits via the same IP, which might not be theirs". This definitely does not apply to temporary users, as I understand it, who have a user_id and can have their individual contributions looked up.

I'm not sure it's either possible or desirable to "decide this once and keep it consistent everywhere", because different use cases will have different things they're looking for.

Off the top of my head, my own inclination is to leave stuff like this to user group/permission keys -- being a temp user should confer a specific group membership, and that group can set a specific permission key which can be checked in the API or HTML-side JavaScript. If you want to include, or exclude, temp users you check for that permission key.

If we don't have good efficient ways to check a specific group membership in API lookups, perhaps that's an area to look at. :)

But I guess we could also have a one-off specific temporary user property to query; I'm not sure it's any more/less efficient as such.

Not having a software-level vision for temp users makes answering the question harder. Do we think of IP masking as something that Wikimedia wikis will use but most others won't? Or the default setting for MediaWiki (eventually)? Or something you can't even disable?

Also we should probably consider how to end up with a nomenclature that's intuitive in the long term. E.g. the commonsense meaning of "registered" is that the user went through some kind of registration workflow. Calling temp users registered would IMO be pretty confusing to someone who is new to MediaWiki at a future point in time when IP masking is the new normal and we don't care about pre-IP-masking meanings anymore.

"Classification" in which context?

  • From a user's perspective there is not really a difference between anonymous (a.k.a. "ip") and temporary (a.k.a. "unnamed"). There is especially nothing the user can do about this. The wiki's configuration dictates which of the two gets assigned to users without an account. So for users the new "temporary account" system is the same as what we previously called "anonymous". In a sense the new system is even more anonymous and I would continue calling it like that.
  • From a dev's perspective there is not much of a difference between "named" and "unnamed" user accounts. The outlier here is the ip address. This leads to a fundamentally different classification than above.

Sorry if I'm late to the show. But unfortunately the ticket appears to confuse these two scenarios. Or I get it wrong.

In all contexts it can be relevant to be able to distinguish all three types.

  • Users care almost exclusively about the difference between registered (a.k.a. "named") users that actively created an account and everybody else that didn't do that, yet. I feel like it makes a lot of sense to continue calling this concept "anonymous".
    • However, there will be two types of anonymous users: The old ones with visible ip addresses, and the new "masked" ones. I'm not sure if it's even necessary to actively distinguish these two. It should be obvious just from the anonymous user's name.
  • Devs should almost always use User::isNamed(), with rare exceptions. In other words: What devs care about is if a user does have access to personal preferences, a talk page, and so on. So far we named this concept "registered" (as in "actively created an account"), and I think we can continue to do so. Along with "named" as a slightly less ambiguous alias.
    • "Anonymous" became rather ambiguous in this context, and is rarely a useful term now. At least in the code. I really think we should deprecate and phase out isAnon() and possibly even isRegistered() in favor of isNamed() and possibly something honest like isIP().

"Classification" in which context?

  • From a user's perspective there is not really a difference between anonymous (a.k.a. "ip") and temporary (a.k.a. "unnamed"). There is especially nothing the user can do about this. The wiki's configuration dictates which of the two gets assigned to users without an account. So for users the new "temporary account" system is the same as what we previously called "anonymous". In a sense the new system is even more anonymous and I would continue calling it like that.
  • From a dev's perspective there is not much of a difference between "named" and "unnamed" user accounts. The outlier here is the ip address. This leads to a fundamentally different classification than above.

Sorry if I'm late to the show. But unfortunately the ticket appears to confuse these two scenarios. Or I get it wrong.

The point is that the concepts should be the same for devs and users, to avoid confusion and miscommunication.

In all contexts it can be relevant to be able to distinguish all three types.

Yes, indeed. This however implies that in all contexts, we are breaking assumptions of code that so far only knows about two types.

Yea, that's apparently what needs to happen.

  • While the word "anonymous" can mean different things, I think it should be used in a way that's most meaningful for users, i.e. for users that have not set an account name. Which includes temporary users. That's different from how we used the word before.
  • I think the word "registered" should only be used for users that actively registered and intentionally created a (named) account. This is also different from how the word is currently used in the code, where it includes temporary accounts.

Personally I find isAnon most problematic, which is why I already started phasing it out – conveniently aligning with the UserIdentity interface.

I also disagree with "temporary", by the way. These accounts are not temporary: They will never be deleted. Or will they? But it's not the end of the world. From the perspective of most code it's fine to think of these accounts as "temporary".

My suggestion is to go with "named", "unnamed", and "ip". Don't overthink it.

Re: "temporary", my understanding is that this is referring to how they're only temporarily (and per-device) linked to that user -- the only reason the user has access to them is because of the cookie, and once that cookie expires they'll never be able to access that account again.

I don't actually like the named/unnamed distinction, because the temporary accounts do have usernames. They're semi-randomly-assigned gibberish, to be sure, but they're still unique names, and people in discussions are going to use them to refer to that account. (I.e. "what's the name of that unnamed account?" would have an unintuitive answer...)

Re: "temporary", my understanding is that this is referring to how they're only temporarily (and per-device) linked to that user -- the only reason the user has access to them is because of the cookie, and once that cookie expires they'll never be able to access that account again.

That sounds correct to me.

I don't actually like the named/unnamed distinction, because the temporary accounts do have usernames. They're semi-randomly-assigned gibberish, to be sure, but they're still unique names, and people in discussions are going to use them to refer to that account. (I.e. "what's the name of that unnamed account?" would have an unintuitive answer...)

Same, thanks for articulating this. I also find isNamed() to be confusing for this reason.

"auto-created" might be a useful description of temporary users, except that is what we already use in the context of full user accounts that visit wikis where they don't have a local account.

Given that we are implementing a bunch of restrictions on what temporary users can do, maybe a wording distinction like "regular" vs "limited" is closer to what we are using these account types for, in practice. So e.g. $user->isRegularAccount() and $user->isLimitedAccount() where "limited" also implies the temporary, time limit, as well as the restrictions on things that the user can do.

Not having a software-level vision for temp users makes answering the question harder. Do we think of IP masking as something that Wikimedia wikis will use but most others won't? Or the default setting for MediaWiki (eventually)? Or something you can't even disable?

I can answer this from the AHT/IP Masking perspective, having spoken to @Niharika about this. For the IP Masking project we're only going as far as to make temporary accounts a configurable option, and to switch them on for our production sites. There's no ask for us to ensure that the software can't display IPs as usernames, just to make sure that WMF sites don't.

From the future-of-MediaWiki perspective, I guess it's up for debate. I do agree it makes the question harder, but for now at least I think we need to think of IP users as something that can exist in MediaWiki.

Thanks everyone for the comments.

Categorization

Summary: We have consensus that temporary users are a separate category.

I'm not seeing anyone strongly opposing the general idea that temporary users be considered as something separate from (1) users who are recognised by their IP address, and (2) users who have been through a full registration process.

It seems that we have a consensus to start working on plans to flag temporary users separately in the various places where we might need to. I suggest that we can bring that part of the discussion to a close.

Terminology

Summary: We have yet to decide terminology. Let's try to decide here. If we can't, let's spin this out to a separate discussion.

We have yet to agree on the actual terminology used to refer to these user types. Here's where I think the discussion stands. I've added some proposals in italics based on the above discussion - comments welcome:

| Description | Suggestions | State of discussion | My proposal (documentation) | My proposal (code) |
|---|---|---|---|---|
| Contributors who are recognised by their IP address | IP, Anonymous, Logged-out, Unregistered | There seems to be a consensus to use 'IP users' | IP users | Rename User::isAnon to User::isIP or remove it altogether |
| Users whose accounts were autocreated when performing some action | Temporary, Unnamed, Limited | Although some have disagreed, 'temporary users' seems to have mostly been accepted | Temporary users | Keep User::isTemp, or rename to User::hasTempAccount |
| Users who have been through a full registration process | Registered, Named, Regular | Contentious - there's a lot of opposition both to 'registered' and to 'named' | Permanent users | Rename User::isNamed to User::isPermanent or User::hasPermanentAccount |
| Users who have a row in the user table | Registered, Something involving 'account' | Contentious - the meaning of 'registered' is disputed | Users who have an account | Rename User::isRegistered to User::hasAccount, or remove it and check User::getID instead |
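To make the four descriptions in the table concrete, the sketch below models them as predicates over a toy record. It is written in Python purely for illustration and is not MediaWiki's User class; the `UserRecord` type, its fields, and the rule that ID 0 means "no row in the user table" are assumptions of this sketch, and the method names only loosely follow the proposals above.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class UserRecord:
    name: str
    user_id: int             # assumption: 0 means no row in the user table
    temporary: bool = False  # assumption: set when the account was autocreated

    def has_account(self) -> bool:
        """Users who have a row in the user table."""
        return self.user_id != 0

    def is_temp(self) -> bool:
        """Users whose accounts were autocreated when performing some action."""
        return self.has_account() and self.temporary

    def is_permanent(self) -> bool:
        """Users who have been through a full registration process."""
        return self.has_account() and not self.temporary

    def is_ip(self) -> bool:
        """Contributors who are recognised by their IP address."""
        if self.has_account():
            return False
        try:
            ipaddress.ip_address(self.name)
        except ValueError:
            # No account, but the name is not an IP address either.
            return False
        return True


# Placeholder users; the names and IDs are invented.
ip_user = UserRecord("192.0.2.1", 0)
temp_user = UserRecord("Temp-0001", 7, temporary=True)
permanent_user = UserRecord("Alice", 3)
```

Checking the name in `is_ip` keeps "has no account" and "is identified by an IP address" as separate questions, so a record with no account and a non-IP name matches none of the three user types.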

Where would we see this terminology?

  • PHP classes (e.g. TempUserCreator)
  • Method names in various classes, e.g. User::isRegistered, User::isNamed, User::isTemp, User::isAnon - may need renaming/removing
  • Filter names, API param names, database fields - if can't be renamed, need to be documented well
  • Documentation, e.g. https://www.mediawiki.org/wiki/User_account_types
  • Tasks and discussions about these user types

Rename User::isRegistered to User::hasAccount, or remove it and check User::getID instead

User::exists() would match the naming convention used for other "smart record" type classes.

Thanks for this. I like the idea of using exists, consistent with other classes, and also moving away from user-type specific terms ('anon' and 'registered' cause confusion, and isIP is also confusing since users with no entry in the user table might not be IP users...)

I'm going to close this task, based on no further feedback following T337103#9143741.

Categorization:

  • Let's adopt the decision as described in the task description

Terminology:

  • I'll start filing some tasks in line with the above discussion, including a tracking task (link TBC).