Page MenuHomePhabricator

Don't use actor IDs for private CheckUser events when these actions are performed by IP addresses
Open, Needs TriagePublic3 Estimated Story Points

Description

The CheckUser tool allows actions to be performed by IP addresses that leave no on-wiki record outside the CheckUser tables, except in the actor table. This can occur on events such as a failed login attempt to an account and password reset requests.

These events should not create a temporary account for the user, as they have made no on-wiki write action outside what is stored in CheckUser for only 3 months. However, currently this means that an CannotCreateActorException exception will be thrown if an IP has a failed login attempt to an account or even attempting to reset their password.

An example of these entries is shown below:

image.png (192×1 px, 82 KB)

To avoid this issue CheckUser should make the actor column in the cu_private_event table nullable and then write NULL if the actor ID could not be generated. Using NULL over 0 makes it clearer in non-MediaWiki contexts, and if we consider this as a foreign key we should be using NULL to indicate it is not relevant (as 0 when the foreign key constraints are applied will cause an error).

Acceptance criteria

Blockers for testwiki release of temporary accounts:

  • Update the the cupe_actor column in the cu_private_event table to be nullable
  • Write NULL to cupe_actor if the actor is an IP address, does not have an actor ID(?), and temporary accounts are enabled
  • Stop writing to cu_changes for events which require cuc_actor to be NULL (as the column is not nullable)
  • Handle a value of cupe_actor being NULL in Special:CheckUser

Blockers for a production release of temporary accounts:

  • Handle a value of cupe_actor being NULL in Special:Investigate and the CheckUser API

Related Objects

Event Timeline

Discussed with TSP team and decided that we will now write to the actor table when temporary accounts are enabled. Instead we will either:

  • Create a special system user which is given an actor ID. For all actions performed by an IP this actor ID is assigned to the actor column. When CheckUser displays the results the extension will display the IP address in cuc_ip (or other table specific IP column) instead of the username for the special system user
  • The actor row will be assigned the value 0 which will indicate to CheckUser that action was performed by an IP and to instead use the IP in the IP column.

This is being done to avoid creating new actor rows when autocreation of temporary accounts is enabled as the actor table is public.

Going with the idea to use a value of 0 for the actor column to avoid the need to create a system user that would appear in Special:ListUsers. Using 0 as the value is not necessarily unprecedented as this was used for the cuc_user column before actor migration to indicate an IP address (and external users).

Change 1000082 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/CheckUser@master] [WIP] Don't acquire actor IDs for IP addresses with temp accounts enabled

https://gerrit.wikimedia.org/r/1000082

Change 928962 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/CheckUser@master] Move multiple Hooks.php methods to CheckUserInsert service

https://gerrit.wikimedia.org/r/928962

Dreamy_Jazz renamed this task from Determine who will be the performer of failed login attempts if automatic creation of temporary accounts is enabled to Use a system user prevent saving IP addresses to the actor table.Feb 14 2024, 8:26 PM
Dreamy_Jazz renamed this task from Use a system user prevent saving IP addresses to the actor table to Use a system user prevent saving IP addresses to the actor table for private CheckUser events.
Dreamy_Jazz updated the task description. (Show Details)

Change 928962 merged by jenkins-bot:

[mediawiki/extensions/CheckUser@master] Move multiple Hooks.php methods to CheckUserInsert service

https://gerrit.wikimedia.org/r/928962

  • The actor row will be assigned the value 0 which will indicate to CheckUser that action was performed by an IP and to instead use the IP in the IP column.

I've been having a little look into this following discussions on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/1000082, in particular the concerns expressed about Special:Investigate by @Dreamy_Jazz.

I've had a quick look at the CompareService, haven't had time to look at the TimelineService yet...

  • CompareService uses actor_name in a WHERE clause in this function, but only if a second param is passed to the function. The function never seems to be called with the second param, so looks like we could remove this altogether.
  • CompareService groups by username, IP address and user agent, which would essentially become IP address and user agent if the username is null. This seems OK to me. Without temp accounts, failed logins from the same IP address would also group together (there are no degrees of freedom between the username and the IP address because the username is the IP address, so grouping by the username and IP is the same as just grouping by the IP).

With a very minimal change applied (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/1006993) the Compare table looks something like this:

image.png (435×487 px, 38 KB)

image.png (435×487 px, 38 KB)

It would be helpful to show the IP address in the Username column, so that the tools would work:

image.png (225×249 px, 14 KB)

This could lead to a little bit of confusion when temp accounts are first enabled, because there would still be IP actors in the results and the grouping by name, IP and agent would group ('null', '127.0.0.1', 'agent') and ('127.0.0.1', '127.0.0.1', 'agent') differently - but in the table they would have the same user name, IP and agent. But after $wgCUDMaxAge had passed, we shouldn't get results for ('127.0.0.1', '127.0.0.1', 'agent') any more.

I've had a quick look at the CompareService, haven't had time to look at the TimelineService yet...

I've looked into this now and I think this is looking doable too. TimelineRowFormatter needs a change along the lines of this update in order to avoid an error, and we could similarly show the IP address as the name.

Having discussed the proposed solution of using a system user, we decided that this was too hacky a solution considering that:

  • Outside of MediaWiki contexts, DB queries have to bear this special use case in mind
  • New interfaces to the CheckUser interfaces will need to bear in mind this special case

Furthermore, using an actor ID of 0 is not good either as the actor column is expected to be a valid ID in the actor table. This also is a special usecase which has the same issues as above but also means we need to use a LEFT JOIN. While we can probably reasonably expect that no rows in the actor table have a NULL actor_name column, we cannot assume this for the actor_user column (which is currently NULL if the performer does not have a row in the user table). As such, this adds complication in a slightly less bad way to the system user approach.

Therefore, choosing between the existing solutions is kind of choosing the less worst/hacky option.


That brings me to the idea that avoids these hacky problems but needs a schema change, which I have discussed and expanded on with @Tchanders. We could instead create a new result table for CheckUser which stores log events where we cannot have an actor ID for the performer (for the time being only if the action was performed by an IP address when temporary accounts are enabled). We should not need a table for edits as all edits should already exist in the revision table and therefore have an actor ID (we should therefore also ensure we first call ::findActorId instead of ::acquireActorId for IPs when temporary accounts are enabled for this situation).

The schema of the new table would be similar to cu_private_event but would have no actor ID column and instead store the IP address used in hex form (as we have determined that the ip and ip_hex columns do not provide any different information and only using one should reduce the overall size of the table on disk). We should be able to do this for the other IP based columns of other CheckUser tables. We can convert the hex to a human-readable form easily, so there is little need to store the human-readable form in the DB.

The schema would likely look like the following (the exact table name is up for debate). We may also want to take this opportunity to better normalise the table by creating a table for user agent strings and/or XFF column (T326379 / T305930).

MariaDB [my_database]> describe cu_private_event_no_actor;
+-------------------+---------------------+------+-----+---------+----------------+
| Field             | Type                | Null | Key | Default | Extra          |
+-------------------+---------------------+------+-----+---------+----------------+
| cupeno_id         | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| cupeno_namespace  | int(11)             | NO   |     | 0       |                |
| cupeno_title      | varbinary(255)      | NO   |     |         |                |
| cupeno_log_type   | varbinary(32)       | NO   |     |         |                |
| cupeno_log_action | varbinary(32)       | NO   |     |         |                |
| cupeno_params     | blob                | NO   |     | NULL    |                |
| cupeno_comment_id | bigint(20) unsigned | NO   |     | 0       |                |
| cupeno_page       | int(10) unsigned    | NO   |     | 0       |                |
| cupeno_timestamp  | binary(14)          | NO   | MUL | NULL    |                |
| cupeno_ip_hex     | varbinary(255)      | YES  | MUL | NULL    |                |
| cupeno_xff        | varbinary(255)      | YES  |     |         |                |
| cupeno_xff_hex    | varbinary(255)      | YES  | MUL | NULL    |                |
| cupeno_agent      | varbinary(255)      | YES  |     | NULL    |                |
| cupeno_private    | mediumblob          | YES  |     | NULL    |                |
+-------------------+---------------------+------+-----+---------+----------------+

Thanks for the summary @Dreamy_Jazz . What I like about this approach is that it doesn't have to lie about there being an actor.

We discussed that this will work with Special:CheckUser, but not with Special:Investigate or the API until they are migrated to the multi-table schema. However, these migrations do need to be done regardless of whether we add this table.

I'll also note that we discussed creating a temporary account for each unsuccessful login (similar to what is being proposed in T334623 and T358806), but that there would be far too many. It's also unnecessary, as we're in CheckUser so the IP address doesn't need hiding.

Dreamy_Jazz renamed this task from Use a system user prevent saving IP addresses to the actor table for private CheckUser events to Don't use actor IDs for private CheckUser events.Mar 6 2024, 11:08 AM
Dreamy_Jazz updated the task description. (Show Details)
Dreamy_Jazz updated the task description. (Show Details)
Dreamy_Jazz renamed this task from Don't use actor IDs for private CheckUser events to Don't use actor IDs for private CheckUser events when these actions are performed by IP addresses.Mar 6 2024, 1:58 PM
Dreamy_Jazz updated the task description. (Show Details)

In {T359309#9615111}, the DBAs declined the creation of the new table recommending that we instead allow the cupe_actor column to be NULL.