I'm opening up this task to discuss why we should or shouldn't do this, as part of the IP Masking project.
Adding user_is_temp
In the proposed IP Masking architecture, the User::isNamed and User::isTemp functions use a name pattern to establish whether a record in the User table is temp or regular. There are two main downsides of this, for consumers of replicated data and bulk access patterns. Consumers of replicated data would have to copy the name pattern and make sure to stay in sync with any changes (however infrequent). And queries selecting a set of users and wishing to filter by their temporary / permanent nature would have to apply this simple name pattern logic. We think a user_is_temp column would help with these kinds of problems and also be generally useful. However, we acknowledge that this is not a simple issue and open it up for analysis here. Please do feel free to edit the analysis section, especially to compact information from comments below.
Analysis
- a conversation on user types between @tstarling and @daniel is referenced here: T308017#8309324, might be useful to link/summarize here
- from @Ladsgroup, about the schema change itself
- would take about 2-3 months to apply (MariaDB can't do tricks like fast defaults)
- would add <<insert calculation here... 50MB?>> of storage (but this is probably smaller than the additional space temp users are going to take in the user table in general
- has a process that we could drive / coordinate patches / etc, for context about 90 of these were done in 2022
- [Data Engineering] Without this column, we would have to build events that capture the is_temp information as close to a user account being created as possible, so we don't run the risk of propagating bad data downstream. So we would probably build a copy of the user table externally and keep it updated with a Lambda mix of batch and real-time jobs. This would take some doing, and would slow down our ability to adapt to the IP masking changes. We could just use the name pattern, but we try hard to limit replicating logic from MW, preferring data instead.
- for context, around half of actors are anonymous in any given database, with more in enwiki and less in arwiki, for example
- one use case I thought we could have for user_is_temp is to facilitate deletion of temporary users after some period of inactivity (5 years?). There were two interesting problems with this, worth noting:
- for all tables that join to actor instead of user, this is possible, but there are still tables joining directly to user. So that code would have to be refactored before we could delete records from the user table
- even if we could delete users, the user_is_temp column would not help compared to the name pattern, because there wouldn't be an index on it and reading+applying the pattern would be almost just as fast as reading and checking a boolean
Background on Data Pipelines
We use data from mediawiki (MW) replicas to build data pipelines serving a variety of use cases. So far this has been mostly reports and dashboards, but we're starting to build pipelines that serve production use cases. We rely on the MW schema and adapt our extraction jobs when it changes. With our events work, we hope to find better ways to subscribe to changes MW wishes to make public. But for the foreseeable future, we will continue to want a parallel batch pipeline for accuracy and reliability. We currently differentiate between anonymous and registered user for most of our editing metrics. We are initially guessing that anonymous and temporary users will be somewhat similar, but if big differences arise in these populations, we will re-evaluate.