Page MenuHomePhabricator

Adding user_is_temp to the user table
Closed, ResolvedPublic

Description

I'm opening up this task to discuss why we should or shouldn't do this, as part of the IP Masking project.

Adding user_is_temp

In the proposed IP Masking architecture, the User::isNamed and User::isTemp functions use a name pattern to establish whether a record in the User table is temp or regular. There are two main downsides of this, for consumers of replicated data and bulk access patterns. Consumers of replicated data would have to copy the name pattern and make sure to stay in sync with any changes (however infrequent). And queries selecting a set of users and wishing to filter by their temporary / permanent nature would have to apply this simple name pattern logic. We think a user_is_temp column would help with these kinds of problems and also be generally useful. However, we acknowledge that this is not a simple issue and open it up for analysis here. Please do feel free to edit the analysis section, especially to compact information from comments below.

Analysis

  • a conversation on user types between @tstarling and @daniel is referenced here: T308017#8309324, might be useful to link/summarize here
  • from @Ladsgroup, about the schema change itself
    • would take about 2-3 months to apply (MariaDB can't do tricks like fast defaults)
    • would add <<insert calculation here... 50MB?>> of storage (but this is probably smaller than the additional space temp users are going to take in the user table in general
    • has a process that we could drive / coordinate patches / etc, for context about 90 of these were done in 2022
  • [Data Engineering] Without this column, we would have to build events that capture the is_temp information as close to a user account being created as possible, so we don't run the risk of propagating bad data downstream. So we would probably build a copy of the user table externally and keep it updated with a Lambda mix of batch and real-time jobs. This would take some doing, and would slow down our ability to adapt to the IP masking changes. We could just use the name pattern, but we try hard to limit replicating logic from MW, preferring data instead.
  • for context, around half of actors are anonymous in any given database, with more in enwiki and less in arwiki, for example
  • one use case I thought we could have for user_is_temp is to facilitate deletion of temporary users after some period of inactivity (5 years?). There were two interesting problems with this, worth noting:
    • for all tables that join to actor instead of user, this is possible, but there are still tables joining directly to user. So that code would have to be refactored before we could delete records from the user table
    • even if we could delete users, the user_is_temp column would not help compared to the name pattern, because there wouldn't be an index on it and reading+applying the pattern would be almost just as fast as reading and checking a boolean

Background on Data Pipelines

We use data from mediawiki (MW) replicas to build data pipelines serving a variety of use cases. So far this has been mostly reports and dashboards, but we're starting to build pipelines that serve production use cases. We rely on the MW schema and adapt our extraction jobs when it changes. With our events work, we hope to find better ways to subscribe to changes MW wishes to make public. But for the foreseeable future, we will continue to want a parallel batch pipeline for accuracy and reliability. We currently differentiate between anonymous and registered user for most of our editing metrics. We are initially guessing that anonymous and temporary users will be somewhat similar, but if big differences arise in these populations, we will re-evaluate.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Yup, adding a new column is not that hard. There are some documentation on this https://wikitech.wikimedia.org/wiki/Schema_changes and https://www.mediawiki.org/wiki/Manual:Schema_changes

It'll take a bit but not much. We can speed it up (by doing db switchovers back to back) if needed. If it goes in before dc switchovers, it can be done way faster. But before doing so, I rather pass it by Tim beforehand.

I'll note that from a product perspective even though we would *like* to not change the prefix pattern on the temp user, it still is a config variable that is subject to change in the future depending on user feedback. We will do our very best to do due diligence and conduct user testing before deciding the prefix but like we discovered with the asterisk, mediawiki is full of surprises (context: T332805: Decide the prefix character for temporary usernames).

I think going with a dedicated column to indicate temp users is the more reliable approach.

Yup, adding a new column is not that hard. There are some documentation on this https://wikitech.wikimedia.org/wiki/Schema_changes and https://www.mediawiki.org/wiki/Manual:Schema_changes

It'll take a bit but not much. We can speed it up (by doing db switchovers back to back) if needed. If it goes in before dc switchovers, it can be done way faster. But before doing so, I rather pass it by Tim beforehand.

There's no need for switch overs. Adding a column is an online DDL so it can be also done on the primary masters

oh thanks. That means it'll take a month or two at most (after schema change patch getting merged in core)

@Milimetric We created https://phabricator.wikimedia.org/T332420 to talk about how to identify temp users but I failed to notify everyone involved. Given that both tickets are addressing the same topic, what would you like to do? Should we merge them? Which one should we remove?

A dedicated, indexed column would make life simpler - without an index there are some tricky edge cases you need to handle in queries. E.g. with a naive query, paging the list of registered users will die horribly as you get to usernames starting with + (the next character after *) as the query would have to walk through all temp user rows first. There's probably no way to write a performant user iteration query without hardcoding knowledge of the specific username pattern used.

I assume adding a new index would be significantly more effort than just adding a new column, though.


Given how time-consuming it is to do DB changes, IMO we should be forward-looking and make it an actor_type column instead:

  • Make it an enum instead of a bool since we already have multiple user types (anonymous, temp, normal, imported, central, system) and might well have more in the future. In terms of DB usage it wouldn't make much difference, booleans will be mapped to TINYINT anyway.
  • Make it a field on the actor table instead of the user table as some user types result in an actor row but no user row so this makes it more generic. For most purposes, there isn't much difference between adding the field to user vs. actor (it's one more or fewer join, but joins are cheap).

we should be forward-looking and make it an actor_type column instead

+1

However, in a relevant conversation in T308017#8309324, @tstarling said:

I don't think it's future-proof to add user_type since if we decide we want such a concept in MW core, the details may differ from what you decide on here.

Daniel backed off from asking for a user type after he saw the code. I implemented it, but he didn't like it.

In the newly designed event state change model, we went with individual boolean fields for each of the current user types. It is still not TOO late for us to change this to an enum, if MW could soon settle on this.

CC also @gmodena

@Tgr:

  1. on the lack of an index in my proposed user_is_temp column. Our purpose is solely to have a record of whether or not a user account is temporary. We copy and query this data in a totally different system, so the performance doesn't affect us. In suggesting it, I assumed it would not affect the existing or planned queries against the user table as those could just ignore any new nullable columns. If queries would need to use this column, I assume the price of building an index would be worthwhile. But please let me know if I'm missing something.
  1. on the actor_type proposal. I like it from a data correctness / representation perspective, but it's really not my say. If other MW folks think it would be a good idea, then great. @tstarling and @Tchanders might have thoughts. For us, ultimately, it accomplishes the same goal: it gives us a concrete, historically stable user type, just like user_is_temp would (in combination with the current way we infer "anon" and "registered" from the schema).

@Ladsgroup pointed out elsewhere that LIKE queries where the wildcard is the last thing in the expression can use the index, so my statment above about performance problems with iterating registered users isn't true. I guess if someone configured the anonymous username pattern in a sufficiently weird way on their wiki, it could happen, but that's not something to worry about. So no need for an index on user_is_temp.

+1 to T333223#8766755. I'm mostly speaking as a consumer of the Toolforge replicas, where we won't have the handy MW methods like User::isTemp. Having to use a LIKE / REGEXP to fetch temporary accounts might still be efficient enough, but feels quite odd. It would also mean, as currently implemented, we couldn't build a tool intended to work on any MW installation, as the pattern for temporary accounts is configurable. Some sort of boolean/enum column, be it on user, actor, or even a dedicated new table seems more appropriate to me and would make it easier for existing tools to be updated.

IMO we should be forward-looking and make it an actor_type column instead

What is the timeline on making this decision? If this becomes the chosen decision, we might like to change our mediawiki/entity/user event data model before we go to far with it.

Someone needs to own the decision. I can try to take a look and make the decision but 1- It needs to be explicitly given to me, I don't want to go around make decisions on behalf of teams 2- I need to talk to Daniel and Tim on why they decided against it in the first place and whether those information/decision still makes sense (with new usecases, etc.)

A note on how long it takes to run a schema change: A similar one took 23 days: T333332 I could have gone faster though. Noting that schema change is different than data migration, data migrations take way longer

I'm mostly looking for a decision ASAP so we can adjust our data model now rather than later when people are actually using it. :)

I've read the discussion in T308017: Design Schema for page state and page state with content (enriched) streams and I think user_is_temp field on user table is the better option than:

  • user types field on user table: because as Tim said, we really don't know what the future user types might look like. Schema changes are not that hard to implement and so there is higher risk of complicating the schema more than it needs to be.
    • That being said, any read to this field inside mw should be encapsulated into one single class/function so refactor in the future wouldn't be messy.
  • or similar field on actor table:
    • Actor table is taller than user table and adding a column is going to add way more junk space to the table than user.
    • On top of that, letting temp user set emails or have pref or get notification is in horizon and would make it quite complicated to implement if the field is on actor table.
    • On top of that, concept of being a temp user is conceptually coupled to the concept of user and not actor. Like a simple question: Is an IP in actor table a temp user or not? Technically isn't but it also can be considered a temp user too given all of the similarities? To avoid all of the confusion and logical weirdness. It's better to make it bound to user table instead.

Hope that'd be useful

Thanks @Ladsgroup . I'd be happy to go with this, but before we do, I'd like to hear from @tstarling and/or @daniel first, since they originally decided against it. Do either of you have any outstanding concerns regarding adding a user_is_temp field, in light of the discussion on this task?

Thanks @Ladsgroup . I'd be happy to go with this, but before we do, I'd like to hear from @tstarling and/or @daniel first, since they originally decided against it. Do either of you have any outstanding concerns regarding adding a user_is_temp field, in light of the discussion on this task?

No, I don't have any outstanding concerns. Adding user_is_temp should be fine. The reason I decided against adding it originally is because it was work that nobody was asking for. Now that someone is asking for it, that rationale is no longer relevant.

@tstarling do you have a preference for user_is_temp boolean vs user_type with value?

@tstarling do you have a preference for user_is_temp boolean vs user_type with value?

Maybe this comment would answer your question?

I think is_temp is fine. I don't think it's future-proof to add user_type since if we decide we want such a concept in MW core, the details may differ from what you decide on here.

Right, but in that comment 'here' is in relation to the event schema we were designing. I think we are considering such a concept in MW core now?

I believe it's still relevant here, it's out of scope for the work at hand. If we wanted to add a user_type attribute, we would need much more work and discussion than the conversation we just had on this ticket. It's something that should happen on a separate workstream so it doesn't affect this project.

If we are thinking about adding user_type, I'd rather add actor_type to the actor table. That way, we'd finally have a good way to identify system users (so far, it's "one of the names on this configurable list over here") and external (imported) users (currently identified by a predix). The set of actor types should include: anon, system, external, temp, and regular.

If we wanted to add a user_type attribute, we would need much more work and discussion than the conversation we just had on this ticket. It's something that should happen on a separate workstream so it doesn't affect this project.

Maybe, but it is pretty relevant to how this project would implement this. Perhaps that's motivation to do it now? A user type would solve this problem more comprehensively, rather than another one off boolean.

That said, I'm not doing the work! ;) But, I would like to know if a user/actor type will be added asap, so I can change our event model of this before its too late.

I think the folks driving IP masking should also push for a decision on this.

I brought up the user_type/actor_type flag because my perception was that schema changes of that kind are prohibitively slow, so not doing it now when we need to do a schema change anyway would be a missed opportunity. @Ladsgroup says that isn't the case, so we can drop this topic IMO, and return to it once there is a use case.

That way, we'd finally have a good way to identify system users (so far, it's "one of the names on this configurable list over here")

They are identifiable via User::isSystemUser(). See also T212720: System users should be in a dedicated user group for some dedicated discussion.

They are identifiable via User::isSystemUser()

Right, but this has the same problem as a user is temp regex: it is only available in MediaWiki PHP. If we want to know if a bulk of edits have been made in the past by a system user when looking at a dump or a processed SQL snapshot (e.g. mediawiki history), we'd have to write custom code that implements whatever MW does.

List of system users is quite small and doesn't change often. We can simply expose it via an API call (if it's not already) so you wouldn't need to rely on mw logic.

The conversation about user_type/actor_type is out of scope for the work that we are currently doing. The problem we are trying to solve with this ticket is to identify temporary users on the database, which is solved by the user_is_temp column and we are going to go ahead with that solution. If someone wants to pick up the problem of how to identify all types of users/actors, we don't expect it to have a negative impact on our work, but it's a much larger problem than the one we are trying to solve here and we won't be looking into it. Please, file another ticket for that discussion and keep this ticket for future reference for the user_is_temp column and any discussion related to it.

The conversation about user_type/actor_type is out of scope for the work that we are currently doing. The problem we are trying to solve with this ticket is to identify temporary users on the database, which is solved by the user_is_temp column and we are going to go ahead with that solution. If someone wants to pick up the problem of how to identify all types of users/actors, we don't expect it to have a negative impact on our work, but it's a much larger problem than the one we are trying to solve here and we won't be looking into it. Please, file another ticket for that discussion and keep this ticket for future reference for the user_is_temp column and any discussion related to it.

If you introduce user_is_temp now, and someone introduces actor_type later, this may mean that we will want to remove user_is_temp again, to avoid duplication and inconsistency.

Introducing an actor_type field doesn't mean it has to provide info about all types of users immediately. It could, at first, just be 0 for non-temp and 1 for temp (ot something). This would make it easy to add information about other user types later.

This seems more future proof, and no extra work. Am I missing something?

This seems more future proof, and no extra work. Am I missing something?

We went back and forth on this a bit, there are cases that it might not be the case. For example, if the future user type wouldn't be mutually exclusive to a temp user so then you have to add multiple values for that same user type to reflect that. For example, if being X would be a user_type but applicable to both temp and non-temp users, you'd end up with actor_type being 0 for non-temp, 1 for temp, then 2 for non-temp that's X, and 3 for temp that's X. And unless we are sure about future cases and how they would look like, we can't say that's future proof. I wrote a bit about it in T333223#8802705

This seems more future proof, and no extra work. Am I missing something?

We went back and forth on this a bit, there are cases that it might not be the case. For example, if the future user type wouldn't be mutually exclusive to a temp user so then you have to add multiple values for that same user type to reflect that.

I agree that introducing such a field would be comitting to the idea that user types are mutually exclusive. There could be no such thing as an "imported temp user" or "registered system user".

I think making this commitment would be a good thing. But I have not investigated whetehr it would conflict with any existign logic. I do not believe so, but I am not 100% certain. Checking that would indeed be some extra work that would have to be done.

It seems that both user.user_is_temp vs actor.actor_type (bool) have the same fundamental problems:

  • They're both oddly specific to temp users
  • They'll both raise the question of why user types were partially implemented (the answer being a sort of historical artefact, that we felt there was enough of a need to flag temp users at this time, but not enough of a need to resource / wait for designing a way to distinguish all user types)
  • They both introduce yet another way to recognise a type of user (check the name pattern (e.g. for a '>'), check some config, and now check a DB field...)
  • They both introduce the logical weirdness of assigning users who are neither temporary accounts nor fully registered accounts an is_temp flag (e.g. in either table, system users would be flagged as "not temp") - a little like the hot dog / not hot dog app from Silicon Valley.

It sounds like either option could create more work later, but that either way it's not a huge amount of work, according to @Ladsgroup. In which case I have no strong opinions over which one we pick, though perhaps user.user_is_temp has the slight advantage as it will take up less space, being on a shorter table?

Please, file another ticket for that discussion and keep this ticket for future reference for the user_is_temp column and any discussion related to it.

@JayCano okay, done: T336176: Define MediaWiki user types

To all: please edit ^ at will, my understanding of the problem and solutions is not as strong as yours.

Change 918513 had a related patch set uploaded (by Tchanders; author: Tchanders):

[mediawiki/core@master] schema: Add user_is_temp column to the user table

https://gerrit.wikimedia.org/r/918513

Change 918514 had a related patch set uploaded (by Tchanders; author: Tchanders):

[mediawiki/core@master] Set user.user_is_temp when creating a temporary user

https://gerrit.wikimedia.org/r/918514

I've added a couple of patches for adding user.user_is_temp, in case the discussions on T336176: Define MediaWiki user types don't get anywhere in time for Data Engineering and the other stakeholders who want temporary users flagged in the DB.

Change 918513 merged by jenkins-bot:

[mediawiki/core@master] schema: Add user_is_temp column to the user table

https://gerrit.wikimedia.org/r/918513

Here is a reason why working on T336176: Define MediaWiki user types sooner rather than later is important.

Analysts and users will call temp users 'unregistered', but MW will call them 'registered'. It looks like 'temp' is a very bad term for what these users are.

This is going to confuse a lot of people.

TBH, any new concept is going to be confusing for a while.

Change 991836 had a related patch set uploaded (by Tchanders; author: Tchanders):

[mediawiki/core@master] Reword documentation for user.user_is_temp to clarify purpose

https://gerrit.wikimedia.org/r/991836

Change 991836 merged by jenkins-bot:

[mediawiki/core@master] Reword documentation for user.user_is_temp to clarify purpose

https://gerrit.wikimedia.org/r/991836

Change 918514 merged by jenkins-bot:

[mediawiki/core@master] Set user.user_is_temp when creating a temporary user

https://gerrit.wikimedia.org/r/918514

This is done I think

@Ladsgroup Is this added to the user tables in All the wikis? Do you know exactly when this was completed? Is Jan 22, 2024 the correct date based on the merge comment?
IIUC we will not be seeing any data in the user table for this field as the program is not yet rolled out.

The schema was deployed long time ago, we just started writing to the field but the defualt value was 0 and now we are still writing 0 all the time so the change is technically noop. We don't have any temp users yet so everything should be zero.

Where are you not seeing the field? It exists in mariadb, if you're looking for hadoop, that's a different story.

thanks @Ladsgroup. yes I'm able to query the mariadb table for a few wikis and can see '0' in the user_is_temp column. Ive opened T356701 and requested the Data Platform Engineering team to add this field to the tables required for data analysis.