
MediaWiki user types
Open, Needs Triage · Public

Description

MediaWiki has various user types/classifications: system, temp, anonymous, imported, etc. The way these user types are identified varies between hardcoded and configurable user name patterns, and database table booleans. This makes it difficult to add or modify user types.

The current (2023) example of this problem is the need for the IP masking project to be able to identify temp users. More context on this problem can be found in T332420: How should temporary users be recognised? and T333223: Adding user_is_temp to the user table and other linked tickets.

It is clear that there is a need to address this problem in a more holistic and general way. A general solution would make it easier for teams to add MediaWiki user types, and easier to export this information to data systems outside of MediaWiki (analytics, search indexes, public dumps, etc.).

A team needs to own this decision and document the chosen solution. A short-term temporary solution followed by a long-term one would be fine, as long as the intention is well documented.

Possible solutions

Do nothing

Teams have to figure this out ad hoc.

Standardized process for adding boolean types to user table

MediaWiki documentation on how to add new user types by making schema changes to add new boolean fields to the user table.

user / actor type field

Add an enum or string field to the user or actor tables that can be used to classify users. This makes it possible to add new user types without schema changes, and makes it easier to automate exporting this data without requiring code to understand all the possible user types MediaWiki has.

other?

[... please add other possible solutions]

Event Timeline


I'm particularly interested in this decision because Event Platform is about to 'release' a new mediawiki.page_change stream, in which we are modeling user types as booleans. If MediaWiki were to model user types in a more flexible and comprehensive way, we'd want the event data model to match this.

Making event schema changes like these is difficult, because doing so is basically a breaking API change. It will be less work for us later if we can decide on a user type data model solution in MW core soon.

One thing to note is that some of these user types (temp users, system users, normal users) are stored in both the user table and the actor table, while others (anonymous users, imported users) only have an actor table record.

I support the effort to introduce the concept of a user type (or, more appropriately, as Tgr pointed out, actor type) into MediaWiki. Currently, code that needs to distinguish between different kinds of users needs to pull the relevant information together from different places, which is error prone and hard to maintain. It is also currently undocumented which user types exist, or what guarantees (or lack thereof) come with each kind.

Intuitively, I would say that the type should be stored in the actor table, so we can query/filter by it. But ultimately, I don't care as much about how or where this information is stored, as I care about how it is accessed. There should be a single source of truth for it. Currently, MediaWiki doesn't have a good place for it. We'd probably want to introduce a UserRecord (or ActorRecord?) interface, similar to PageRecord and RevisionRecord.

Introducing the new user type could be done iteratively. For example, the field could use the value 100 for "logged in user", and 0 for "anonymous". To represent temporary users, some of the 100s would be converted to 101s. And to represent system users, some of the 0s become 1. This way, an increasing number of user types can be differentiated, without the need to cover everything at once.
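A minimal sketch of that iterative numbering, written in Python purely for illustration (MediaWiki itself is PHP). The class name `ActorType`, the member names, and the helper `is_logged_in` are hypothetical; only the values 0, 1, 100 and 101 come from the comment above.

```python
from enum import IntEnum


class ActorType(IntEnum):
    """Hypothetical type codes; the gaps leave room for later refinements."""

    ANONYMOUS = 0    # first iteration: every actor that is not logged in
    SYSTEM = 1       # later iteration: split out of the 0s
    LOGGED_IN = 100  # first iteration: every logged-in user
    TEMP = 101       # later iteration: split out of the 100s


def is_logged_in(code: int) -> bool:
    # Assumption of this sketch: codes >= 100 refine "logged in" and
    # codes < 100 refine "anonymous", so older callers can keep using a
    # coarse range check while new codes are introduced.
    return code >= 100
```

Under that assumption, code written against the first iteration keeps working after some 100s become 101s, because `is_logged_in(ActorType.TEMP)` is still true.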

While I definitely support bringing some order to this mess, I need to warn about a lot of conceptual complexities on the way. Let me give you an example: You mentioned bot as a user type. Bot is not a user type: it's a user right, it's a user group, and it's a change flag.

  • It's a user group: There is a group of users that *currently* have the bot right.
  • It's a user right: Admins also have that right
  • It's an action flag: An admin or a bot can mark their edit as a bot edit or not.

Complexities:

  • Many members of bot group lose their right if they are inactive for certain period of time (as a wiki policy, that duration depends on wiki's decision)
  • Bots sometimes make non-flagged actions: My bot has an ML-based vandalism revert functionality and its reverts are not marked as bot so they show up in RC.
  • Admins can make bot actions, if you want to mass revert someone. You can go to someone's Special:Contributions and add ?bot=1 and when clicking on rollback, they will be marked as bot.
  • People can give bot right to themselves and run bots
  • People run bots without flags but at low frequencies, sometimes under their own account.
  • many more.

And this is just on the "bot" type. There are all sorts of other complexities, such as how you'd set an imported user that's also a temp user. Adding a full column for system users is quite wasteful. A lot of these distinctions are not mutually exclusive.

My recommendation is to look at each case separately and decide what to do with each one based on the complexity and specifics of that case. For example, bots shouldn't have a marker in the database at all, as it's more fluid and non-binary than it looks.

Leaving aside bot, which isn't really a user type, I think the meaningful types are normal user, system user, temp user, imported user, (maybe) central user:

  • temp user: matches the temp username pattern (configurable, currently an * prefix, that will probably change). Has no auth options other than session cookie and/or user token.
  • system user: created by User::newSystemUser. Has no auth options whatsoever.
  • imported user: has a > in the name. (The part before it is either an interwiki prefix or a wiki ID, it's not consistent.) No user record.
  • central user: does have a central account but no local account. Currently these have neither a user nor an actor record (so are somewhat out of scope), but can sometimes be represented by a User object.
  • normal user: not any of the above.

A normal user is by definition not any other type. A central user could be a temp or system user but central users are not represented in the local DB anyway so it doesn't really matter. The other three are theoretically non-exclusive but it doesn't really matter:

  • An imported user has no user entry so whether or not we declare it a system user or a temp user makes no difference. It's not a real account, so all the special behaviors related to temp and system accounts are irrelevant. In practice, post-SUL we shouldn't have imported users of these types unless content is imported from a non-Wikimedia wiki, in which case we don't care about tempness / systemness.
  • Turning a temp user into a system user is just not something that realistically happens, even though it's technically possible. Since system accounts cannot be logged into, if a temp account did get turned into a system account somehow, it wouldn't be a temp account anymore for practical purposes.

So a shortint in the actor table would work fine.
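The rules listed in this comment could be collapsed into one exclusive value roughly as follows. This is a sketch under stated assumptions, in Python for illustration only: `ActorType`, `classify_actor` and all parameter names are hypothetical, the numeric values are arbitrary, and the caller is assumed to already know whether the account was created as a system user. Central users and anonymous actors are left out because the comment does not define local storage for them.

```python
from enum import IntEnum
from typing import Optional


class ActorType(IntEnum):
    # Hypothetical codes small enough for a short integer column.
    NORMAL = 0
    TEMP = 1
    SYSTEM = 2
    IMPORTED = 3


def classify_actor(
    name: str,
    user_id: Optional[int],
    is_system: bool = False,
    temp_prefix: str = "*",  # configurable pattern; "*" is the prefix cited above
) -> ActorType:
    """Map an actor to exactly one type, resolving overlaps by precedence."""
    if user_id is None:
        # Imported actors have ">" in the name and no user record.
        if ">" in name:
            return ActorType.IMPORTED
        raise ValueError("actor without a user record is outside this sketch")
    # System wins over temp: a system account cannot be logged into, so
    # for practical purposes it is no longer a temp account.
    if is_system:
        return ActorType.SYSTEM
    if name.startswith(temp_prefix):
        return ActorType.TEMP
    return ActorType.NORMAL
```

Checking "imported" and "system" before "temp" is one way to encode the argument that the theoretical overlaps do not matter in practice.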

I agree that "bot" is not a user type. The unfortunate reality is that "bot" is a permission that *allows* a given user to flag edits as bot edits. It does not mean that the user necessarily "is" a bot. How far it does or doesn't imply a "real" bot account is up to individual communities to decide. MediaWiki currently doesn't support the concept of a "bot user", just of a user who may perform automated ("bot") actions.

Changing this would indeed require quite a bit of effort.

Removing bot from the description. I think I put it in there because we have an is_bot boolean in the event schema.

What I'm hearing is that a user's 'user type' cannot really change (under normal circumstances), right?

It can change when a normal user is converted into a system user (User::newSystemUser with the 'steal' flag) although it's probably very rare in practice.

An imported user changing into a normal user makes sense conceptually (the user is now active on the wiki and wants to assume ownership of their contributions) although we don't provide any such functionality currently.

I don't think any of the other transitions are realistic or meaningful.

Okay, Event Platform needs to release the mediawiki.page_change.v1 stream, and I'd love to do the right thing in its schema before we release it. To anticipate a change like this in the future, I think I'd like to change the user type booleans to a list<string> field with restricted values. That way we can anticipate the ability for users to have multiple types, and be a little bit more future proof with respect to whatever is decided here.
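To make the proposal concrete, here is a rough sketch of how boolean fields could map onto a list of strings with restricted values. It is in Python for illustration only; the field names `is_temp`, `is_system` and `is_imported`, the allowed values, and both function names are hypothetical and are not taken from the actual event schema.

```python
# Hypothetical restricted vocabulary for the list<string> field.
ALLOWED_USER_TYPES = ("temp", "system", "imported")


def user_types_from_booleans(flags: dict) -> list:
    """Convert is_<type> booleans into a list of type names."""
    return [t for t in ALLOWED_USER_TYPES if flags.get(f"is_{t}", False)]


def validate_user_types(types: list) -> None:
    """Reject any value outside the restricted vocabulary."""
    unknown = sorted(set(types) - set(ALLOWED_USER_TYPES))
    if unknown:
        raise ValueError(f"unknown user types: {unknown}")
```

With this shape, adding a new type means extending the allowed values rather than adding a new field, which is the kind of future-proofing described above.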

I know most of you probably don't care what the event schema is, but let me know if you have any objections to this.

Anybody got a quick and easy better name for this concept than 'user type'? Somehow 'type' doesn't seem quite right.

  • properties (maybe conflicts with user pref/settings?)
  • kind
  • class
  • family
  • genus
  • quality
  • trait
  • characteristic
  • attribute

Looking at the User class, there seem to be quite a few different things that a user could be or properties a user could have, even if they are different than the lower level ways described above. E.g. isNewbie, isLocked, isHidden, etc.?

After learning about the complexity (types of users, cases when we have no record in user, bots handling) I understand why something is needed to expose a well-defined User representation. Plus, if I remember right, some permissions can be wiki-based too (e.g. one user can have a different set of permissions on each wiki), which adds extra complexity.

@Mooeypoo and @CCicalese_WMF worked on Authentication project and they have pretty good insights on User management, maybe they can suggest something.

@Ottomata I like the idea of having a list<string> per user. This list could open us for easy changes, for example:

  • admin -> a global admin
  • en_wiki:admin -> admin on en wiki
  • group:something -> member of a given group
  • imported -> imported user

and the list could go on (just an idea). And then we could have some factories that, based on this list, could define objects, e.g. Factory::getUserGroupsFromEvent( event ).
Also, after long thinking, this list looks a bit like a metadata.

Thanks @pmiazga

I like the idea of having a list<string> per user

FWIW, Likely this would just be the event schema implementation; whereas MediaWiki might have something like an int bitmask, or a list of enum constants or something.
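As a sketch of how the two representations could coexist, the following assumes a hypothetical bitmask on the MediaWiki side and derives the event-side list of strings from it. It is Python for illustration only; `UserTypeFlags`, its members, and `to_event_types` are invented names, not existing MediaWiki or Event Platform APIs.

```python
from enum import IntFlag


class UserTypeFlags(IntFlag):
    # Hypothetical bit assignments, one bit per type.
    TEMP = 1
    SYSTEM = 2
    IMPORTED = 4


def to_event_types(mask: int) -> list:
    """Expand a stored bitmask into lowercase type names for an event."""
    return [flag.name.lower() for flag in UserTypeFlags if flag & mask]
```

For example, `to_event_types(UserTypeFlags.TEMP | UserTypeFlags.IMPORTED)` yields the names for the two set bits in declaration order, and a mask of `0` yields an empty list.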

Also, after long thinking, this list looks a bit like a metadata.

Interesting, this sounds like it would open the 'user types' idea up to all of the different properties a user might have: groups, wiki auth permissions, etc. This would definitely make the proposal much more complicated, as e.g. a user or another wiki admin could change a user's list of properties. Sounds maybe cool, but I'm not sure it is worth the complexity?

No strong opinions either way, but it seems to me that the examples that @pmiazga provides aren't all user types. "admin", "group:something", or "en_wiki:admin" are not really user types. They're memberships or privileges, which can grow and change as needed. Isn't there another way to represent those? I would imagine that a user type is something that's largely permanent, while privileges (through memberships) are expected to change.

@Ottomata I like the idea of having a list<string> per user. This list could open us for easy changes, for example:

That's essentially what user groups are, or could be made to be.

Having a single user type that is mutually exclusive makes it much easier to implement behavior that is specific to a given type of user. If you have to consider all possible combinations of types, things get a lot more complicated.

Don't get me wrong, the "set of types" approach will work; it covers all information conveyed by the "single type" approach. But the handling is exponentially more complex.

No strong opinions either way, but it seems to me that the examples that @pmiazga provides aren't all user types. "admin", "group:something", or "en_wiki:admin" are not really user types. They're memberships or privileges, which can grow and change as needed. Isn't there another way to represent those? I would imagine that a user type is something that's largely permanent, while privileges (through memberships) are expected to change.

I just shared the idea. This seems like an easy solution to work with, something that could be easily extended and that is flexible enough to support user types, permissions, groups, flags, etc. I can see problems with that solution too: filtering, data size (such a list of strings could grow a lot), and language support (if a value is a word, someone might translate it or pass in translations).

FWIW, Likely this would just be the event schema implementation; whereas MediaWiki might have something like an int bitmask, or a list of enum constants or something.

It doesn't matter how MediaWiki implements those, as this would be an implementation detail. What we're looking for is something easy to read/work with. I have a couple of concerns and probably with more time I could come up with more problems with passing those details as a list of strings.

Don't get me wrong, the "set of types" approach will work; it covers all information conveyed by the "single type" approach. But the handling is exponentially more complex.

Could you elaborate a little bit here, please, on why it would be exponentially more complex? I totally understand that it would be more complex: MW would have to provide some Adapters to handle this list or update Domain models based on the list. But those should be easy to test. Conflicts (like two groups that are mutually exclusive) have to be handled anyway, and this is most likely already handled in MW. The Adapters might be dumb and expect the level above (Models/Services) to take care of data consistency.

Could you elaborate a little bit here, please, on why it would be exponentially more complex?

If the behaviors associated with these types are independent of each other, then there isn't much of a problem. But if they interact, the number of combinations to consider grows exponentially with the number of types supported.
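The growth can be checked with a few lines of Python. This is only an illustrative count; the function names are hypothetical, and the model assumes each of n types is an independent flag, with "no flag set" standing for a normal user.

```python
from itertools import combinations


def single_type_cases(n_types: int) -> int:
    # One exclusive value: each special type, plus "none of them".
    return n_types + 1


def type_set_cases(n_types: int) -> int:
    # Any subset of the flags may be set; enumerate the subsets explicitly.
    flags = range(n_types)
    return sum(1 for r in range(n_types + 1) for _ in combinations(flags, r))
```

With 3 types this gives 4 cases versus 8, and with 10 types 11 versus 1024, since the number of subsets is 2 to the power of n.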

With a single-type system, you can for instance easily pick a UI or data schema appropriate for a specific type. With a multi-type system, you need a mechanism to mix and match UIs and schemas. This is very flexible, but also very complex.

Similarly, if we need to store the type in a relational database, storing a multi-valued field is much more expensive and much slower to query.
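The difference in query shape can be seen in a small in-memory SQLite sketch. It is Python for illustration only, and every table and column name here is hypothetical rather than actual MediaWiki schema. The single-type model filters on one column, while the multi-type model needs a second table and a join. The sketch shows only the shape of the queries and does not measure cost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Single-type model: one small integer column on the actor table.
    CREATE TABLE actor_single (
        actor_id INTEGER PRIMARY KEY,
        actor_name TEXT NOT NULL,
        actor_type INTEGER NOT NULL
    );
    -- Multi-type model: one row per (actor, type) in a separate table.
    CREATE TABLE actor_multi (
        actor_id INTEGER PRIMARY KEY,
        actor_name TEXT NOT NULL
    );
    CREATE TABLE actor_multi_type (
        actor_id INTEGER NOT NULL REFERENCES actor_multi (actor_id),
        type_name TEXT NOT NULL,
        PRIMARY KEY (actor_id, type_name)
    );
    INSERT INTO actor_single VALUES (1, 'Alice', 0), (2, '*Temp 1', 1);
    INSERT INTO actor_multi VALUES (1, 'Alice'), (2, '*Temp 1');
    INSERT INTO actor_multi_type VALUES (2, 'temp');
""")

# Single-type: a plain filter on one column (1 = temp in this sketch).
temp_single = [row[0] for row in conn.execute(
    "SELECT actor_name FROM actor_single WHERE actor_type = 1"
)]

# Multi-type: the same question requires a join.
temp_multi = [row[0] for row in conn.execute(
    "SELECT a.actor_name FROM actor_multi AS a "
    "JOIN actor_multi_type AS t ON t.actor_id = a.actor_id "
    "WHERE t.type_name = 'temp'"
)]
```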

If MediaWiki were to model user types in a more flexible and comprehensive way, we'd want the event data model to match this.

FYI, after thinking and waffling more about this, we've decided (once again) to stick with the boolean fields in the event data model for now. I'm going to update the event schema docs to point to this ticket, with the intention of eventually converging on a cleaner data model that matches whatever MW one day decides to do.

Here is a reason why working on this sooner rather than later is important.

Analysts and users will call temp users 'unregistered', but MW will call them 'registered'. It looks like 'temp' is a very bad term for what these users are.

This is going to confuse a lot of people.