How should temporary users be recognised?
Open, Needs Triage, Public

Description

Problem Statement

Our current implementation of temporary user accounts (part of IP Masking) identifies these accounts by matching a regex against the username. This decision was made because adding a column such as is_temp to the user table would be a significant schema change. Such a change would require a public consultation and could cause performance issues, because the user table is queried and joined frequently on the wikis. There is prior practice in MediaWiki of using regexes for identification purposes, for example to identify external users.

In conversation with us, Data Engineering pointed out that identifying users with a regex poses challenges to their process. In particular, we expect to see 230k new users every month, and processing an unbounded regex can impose significant delays on the data pipelines.
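As a rough illustration of what regex-based identification looks like on the pipeline side, here is a minimal Python sketch. The pattern `~<year>-<digits>` is purely hypothetical (this task does not state the real username format), and `is_temp_user` is an illustrative helper, not an existing MediaWiki or pipeline function. It uses a full-string match, so a name that merely contains the pattern is not flagged.

```python
import re

# ASSUMPTION: hypothetical temp-account format "~<4-digit year>-<digits>".
# The real pattern is not given in this task.
TEMP_USER_PATTERN = re.compile(r"~\d{4}-\d+")


def is_temp_user(username: str) -> bool:
    """Return True if the whole username matches the assumed temp pattern."""
    # fullmatch requires the pattern to cover the entire string.
    return TEMP_USER_PATTERN.fullmatch(username) is not None


usernames = ["~2024-12345", "Alice", "Bob~2024-1", "~2024-"]
temp_flags = {name: is_temp_user(name) for name in usernames}
```

Every consumer of the data has to carry the same pattern and apply it to every username it processes, which is the cost Data Engineering is concerned about.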

Two other problems were initially identified: having to deal with different regexes for different wikis, and the regex changing at some point in the future. Neither is still considered a problem. We use CentralAuth across wikis, and having different patterns for different wikis would break this compatibility. Changing the regex in the future would break history, so we will not allow it.

Potential Solutions

There are currently two solutions that have been proposed:

  1. Add a column on the user table to identify temp users.
  2. Add a field on the user created event to identify temp users.

The first solution would be the most reliable one. However, as mentioned earlier, it would require coordinating with WMF, the community, and third-party developers to make sure that the column would not break anyone’s work. There is also the possibility of a performance hit for users, given that the user table is frequently queried and joined.

Additionally, all the work we’ve done in the last few months for IP Masking relies on the username regex to identify temp users. Adding a column to the user table would require us either to re-architect the current solution or to become comfortable with having two sources of truth for the temp status, which could get out of sync.
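To make the "two sources of truth" concern concrete, here is a minimal sketch using an in-memory SQLite table. It assumes a hypothetical `is_temp` column next to `user_id` and `user_name`, and the same hypothetical `~<year>-<digits>` username format. Neither is confirmed by this task, and `find_out_of_sync_users` is an illustrative helper only.

```python
import re
import sqlite3

# ASSUMPTION: hypothetical temp-account username format; the real one is not
# stated in this task.
TEMP_USER_PATTERN = re.compile(r"~\d{4}-\d+")


def find_out_of_sync_users(conn: sqlite3.Connection) -> list[tuple[int, str]]:
    """Return (user_id, user_name) rows where the column and the regex disagree."""
    rows = conn.execute(
        'SELECT user_id, user_name, is_temp FROM "user" ORDER BY user_id'
    ).fetchall()
    return [
        (user_id, user_name)
        for user_id, user_name, is_temp in rows
        if bool(is_temp) != (TEMP_USER_PATTERN.fullmatch(user_name) is not None)
    ]


conn = sqlite3.connect(":memory:")
# ASSUMPTION: is_temp is the hypothetical new column discussed above.
conn.execute(
    'CREATE TABLE "user" ('
    "user_id INTEGER PRIMARY KEY, "
    "user_name TEXT NOT NULL, "
    "is_temp INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    'INSERT INTO "user" (user_id, user_name, is_temp) VALUES (?, ?, ?)',
    [
        (1, "Alice", 0),
        (2, "~2024-12345", 1),
        (3, "~2024-67890", 0),  # name looks temporary, but the column was never set
    ],
)
out_of_sync = find_out_of_sync_users(conn)
```

If we kept both the regex and the column, a check along these lines would have to run regularly, and we would need to decide which source wins when they disagree.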

The second solution would be simpler to implement: it would not need a public consultation and would not require us to reconsider our existing work. However, this solution would just move the table with the temp status to our data lakes, which could lead to inconsistencies between the wiki DBs, the event log, and the data lakes. There is also the risk of events being lost or mishandled.
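A minimal sketch of the second solution, assuming a hypothetical `is_temp` field on a simplified user-created event. The field names and the `UserCreatedEvent` shape are illustrative and do not reflect the actual event schema. The consumer reads the flag directly instead of applying a regex, and counts events that lack the flag separately, since lost or mishandled events are one of the risks noted above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class UserCreatedEvent:
    """Simplified, hypothetical event shape; not the real schema."""

    user_id: int
    user_name: str
    is_temp: bool  # ASSUMPTION: proposed flag, name chosen for illustration


def serialize_event(event: UserCreatedEvent) -> str:
    """Serialize the event to JSON, as a producer might before emitting it."""
    return json.dumps(asdict(event), sort_keys=True)


def count_temp_users(serialized_events: list[str]) -> tuple[int, int]:
    """Return (temp_count, unknown_count) based only on the is_temp flag."""
    temp_count = 0
    unknown_count = 0
    for raw in serialized_events:
        payload = json.loads(raw)
        flag = payload.get("is_temp")
        if flag is None:
            # Event predates the flag or was mishandled upstream.
            unknown_count += 1
        elif flag:
            temp_count += 1
    return temp_count, unknown_count


events = [
    serialize_event(UserCreatedEvent(1, "~2024-12345", True)),
    serialize_event(UserCreatedEvent(2, "Alice", False)),
    json.dumps({"user_id": 3, "user_name": "Carol"}),  # flag missing
]
temp_count, unknown_count = count_temp_users(events)
```

The unknown bucket is where the inconsistency risk shows up: any event without the flag still has to be resolved against the wiki DBs or the username regex.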

Questions

  1. Are there any other problems with identifying users by their username regex?
  2. Are there any other solutions we haven’t considered?
  3. What are the estimated costs of adding an extra column to the user table (e.g. performance loss, effort required to accommodate the solution)?
  4. What would be the expected costs and the performance impact of processing the regex on our data lakes, leaving things as they are?
  5. If we decided to use a flag on the creation event, what are the risks and how can we address them?