Page MenuHomePhabricator

How should we represent actors that do something loggable but don't create a temp account?
Closed, ResolvedPublic

Description

This question has arisen a few times and been discussed on other tasks: T334623, T353953.

On those tasks, we've discussed separate solutions for each feature, since CheckUser and AbuseFilter save user information and/or create logs using their own tables. However, some features log via core's ManualLogEntry (e.g. SpamBlacklist - T358806), so some solution is needed centrally.

Background

With temp accounts enabled, it is possible for a person who is logged out to do something on the wiki that results in a log being made, of which they are the performer, but doesn't result in a temporary account being made. Who should be logged as the performer?

We can't log the IP (the current status quo), because we can't save a new IP actor in MediaWiki (ActorStore throws an error).

We can't log a temporary account because it hasn't been made yet.

Possible solutions

Log the temporary user as the performer

Options 2 and 3 from T334623#9653885:

2. Create account at beginning of internalAttemptSave and don't worry about detached user accounts

  • Create account at beginning of internalAttemptSave, before any hooks/constraints

Advantages:

  • Creates a user for an edit attempt which allows for easier logging constraint checks

Disadvantages:

  • Something on the order of 55k accounts created per month, of which some percentage won't be used at all
  • Poor UX for users who failed to edit (but are actually logged-in). They'll see some form of CentralAuth / session errors, unless we implement some fix around it. They'll also not be able to visibly see that they're logged in on receiving an error (unless we do something about that too)
  • Inability for these users to be seen as the same user when visiting other wikis

3. Create account at beginning of internalAttemptSave and find way to perform top-level login on failure

  • Create account at beginning of internalAttemptSave, before any hooks/constraints
  • Perform a top-level login on failed edits

Advantages:

  • Same as 2, but we also ensure the user is attached to loginwiki, and that there is an improved UX in failure modes

Disadvantages:

  • Something on the order of 55k accounts created per month, of which some percentage won't be used at all
  • Wikitext editors lose their changes on failed attempts (due to top level redirect)
  • Modifying EditPage.php to support parsing error messages into query params for top-level login will be painful and clunky. Same goes for API driven editing.
  • Would need to modify wikitext editor UX and API driven editing (VisualEditor/DiscussionTools/Wikibase) UX to support the top-level redirect workflows on edit failures

It may be the case in the future that CentralAuth's SSO changes (T345249, T348388), in which case option 3 becomes more palatable, and would require less refactoring.

Log the IP actor as the performer

This would involve saving the IP address to core's actor table. I don't think this is a good idea, but including it here to be exhaustive.

The actor table would need to be regularly purged to remove the names. MediaWiki would go back to saving IP addresses by default (whereas without this, you'd only save IP addresses in special extensions like CheckUser and AbuseFilter). Anything that joins on the actor table would need to be updated to handle a missing name.

Define a new type of actor for unsaved temp users

This new type of actor would have a row in the actor table, but not in the user table (like IP actors, system actors). This idea is explored in T334623#9656689.

Conceptually, we do have a new type of actor: one that represents a real person (so can't be a system user[1]), who does something loggable (so must have a row in actor), but can't be an IP actor (because the IP can only be kept temporarily), and can't be a temp user (because they didn't do a $wgAutoCreateTempUser['actions'] action).

Actor/user "types" in MediaWiki are currently defined by checking their usernames (until T336176) - either by matching them to a regex (IP, temp) or a config (system), so in practice this might mean being able to recognize a new pattern. Or solving T336176 first.


[1] We explored representing these actions using a new system user, but decided that it would be confusing to have a single system user represent a very large number of actions performed by lots of real users - e.g. CheckUser checks would reveal many unrelated IP addresses for what looked like one single user, and analytics would find that one user did a high proportion of actions.

Event Timeline

Change #1014002 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/core@master] [WIP] Temp accounts: Allow logging non-existent temp accounts

https://gerrit.wikimedia.org/r/1014002

The idea of storing a row in actor but not user (https://gerrit.wikimedia.org/r/1014002) is interesting.

I had a quick test with the patch and here are some (non-exhaustive!) thoughts:

Works well

MW is happy to handle a userless actor that's not an IP, and so a lot of things work nicely with this patch, e.g.:

  • The entry is saved to checkuser_log_event with an actor ID, so IP reveal works
  • LogPager works with the actor ID, so Special:Log?type=spamblacklist&user=<theUnsavedTempUsername> finds the log entry
  • Page exports can handle an actor with a non-IP username but no user: export/XmlDumpWriter.php

Needs some work

We're changing some assumptions about temp/anon/registered users though, e.g. these would now treat this user as anon, whereas running the name through User::isRegistered would return true:

The block link in the log entry takes you to Special:Block in an error state, since the user doesn't exist, so we'd need to remove that.

The idea of storing a row in actor but not user (https://gerrit.wikimedia.org/r/1014002) is interesting.
....
Needs some work

We're changing some assumptions about temp/anon/registered users though, e.g. these would now treat this user as anon, whereas running the name through User::isRegistered would return true:

Agreed.

On commonswiki the user Conversion script has an actor row but no user row as shown below. As such having a username which isn't an IP address in the actor_name column when actor_user is NULL isn't entirely unprecedented.

wikiadmin2023@10.64.130.11(commonswiki)> select * from actor where actor_name = 'Conversion script';
+----------+------------+-------------------+
| actor_id | actor_user | actor_name        |
+----------+------------+-------------------+
| 23191898 |       NULL | Conversion script |
+----------+------------+-------------------+
1 row in set (0.001 sec)

Isn't the actor_user != 0 comparison wrong? The actor_user column can be any positive integer other than 0, or NULL. As such, this seems to not filter out anything?

Isn't the actor_user != 0 comparison wrong? The actor_user column can be any positive integer other than 0, or NULL. As such, this seems to not filter out anything?

I filed T360925 for that.

Change #1014476 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/core@master] Linker: Hide block and talk page links for unsaved temp accounts

https://gerrit.wikimedia.org/r/1014476

Change #1014481 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/core@master] [WIP] Linker: Display unsaved temp accounts differently

https://gerrit.wikimedia.org/r/1014481

The block link in the log entry takes you to Special:Block in an error state, since the user doesn't exist, so we'd need to remove that.

I've done that in this patch. I also explored modifying the appearance of unsaved temp user accounts in this patch.

Example of unsaved temp user in Special:Log. The link on the username goes to https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Temporary_accounts#Unsaved:

image.png (494×1 px, 98 KB)

Hello everyone,

I was pinged by @kostajh to take a look at this task and provide feedback. Personally, I'm against using a separate type of actor for unsaved edits. Certain parts of MediaWiki might expect logged events to have "full actors" (convertable to an User object) assigned to them, and they would be unable to work with a "partial temporary account". This might result in certain tools breaking for those partial temporary accounts.

For example, the patchset Kosta linked makes it impossible to checkuser for the "partial temporary account" (a cu_changes row is logged, but CheckUser does not recognize this new type of actor yet), see screenshots below:

image.png (574×2 px, 168 KB)

image.png (1×2 px, 198 KB)

In CheckUser's case, the row would get displayed, but only when querying via IP. However, the displaying would not show the "unsaved" remark we can see in Special:AbuseLog (because of missing handling in CheckUser itself, I presume). As a result, the partial and full temporary accounts look exactly the same in Special:CheckUser, but only one of them can be checkusered on directly. This is definitely a bug, and a super-confusing one (especially to someone who would be unaware of the change that's proposed here), as the issue is not apparent unless one dives fairly deeply into how everything works and integrates together.

Likely, there are also other tools that would break with the introduction of the new type of actor. Such breakage might be often non-apparent (unless one's testing against a particular workflow or is systematically going through all usages). We would likely need to conduct a full audit of all the codebases, to ensure each works fine with partial temporary actors (similar to what we did/are doing with temporary accounts themselves). This seems like a lot of work.

I recommend using a full temporary account for unsaved edits, as they're unlikely to trigger similar issues. While it is true they likely won't be used, I do not consider that a problem (majority of regular accounts are unused as well).

Hope this helps.

Hello everyone,

I was pinged by @kostajh to take a look at this task and provide feedback. Personally, I'm against using a separate type of actor for unsaved edits. Certain parts of MediaWiki might expect logged events to have "full actors" (convertable to an User object) assigned to them, and they would be unable to work with a "partial temporary account". This might result in certain tools breaking for those partial temporary accounts.

For example, the patchset Kosta linked makes it impossible to checkuser for the "partial temporary account" (a cu_changes row is logged, but CheckUser does not recognize this new type of actor yet), see screenshots below:

image.png (574×2 px, 168 KB)

image.png (1×2 px, 198 KB)

In CheckUser's case, the row would get displayed, but only when querying via IP. However, the displaying would not show the "unsaved" remark we can see in Special:AbuseLog (because of missing handling in CheckUser itself, I presume). As a result, the partial and full temporary accounts look exactly the same in Special:CheckUser, but only one of them can be checkusered on directly. This is definitely a bug, and a super-confusing one (especially to someone who would be unaware of the change that's proposed here), as the issue is not apparent unless one dives fairly deeply into how everything works and integrates together.

Likely, there are also other tools that would break with the introduction of the new type of actor. Such breakage might be often non-apparent (unless one's testing against a particular workflow or is systematically going through all usages). We would likely need to conduct a full audit of all the codebases, to ensure each works fine with partial temporary actors (similar to what we did/are doing with temporary accounts themselves). This seems like a lot of work.

I recommend using a full temporary account for unsaved edits, as they're unlikely to trigger similar issues. While it is true they likely won't be used, I do not consider that a problem (majority of regular accounts are unused as well).

Hope this helps.

Thanks for this.

In light of the reasoning I think I would oppose doing anything other than just creating the account. The issues with a top-level redirect are unfortunate, but are not as much of an issue as not having a user account.

In CheckUser's case, the row would get displayed, but only when querying via IP. However, the displaying would not show the "unsaved" remark we can see in Special:AbuseLog (because of missing handling in CheckUser itself, I presume).

I agree that is confusing, but I don't think it would be that difficult.

Likely, there are also other tools that would break with the introduction of the new type of actor. Such breakage might be often non-apparent (unless one's testing against a particular workflow or is systematically going through all usages). We would likely need to conduct a full audit of all the codebases, to ensure each works fine with partial temporary actors (similar to what we did/are doing with temporary accounts themselves). This seems like a lot of work.

I am not sure about that. It seems like the use case here is pretty well contained, and we already have cases of actors existing without a user, so I don't think it is entirely a paradigm shift. I do agree that if we had to do a full audit of actor/user handling code and tooling, this approach would be more trouble than it is worth.

In light of the reasoning I think I would oppose doing anything other than just creating the account. The issues with a top-level redirect are unfortunate, but are not as much of an issue as not having a user account.

It is a really poor UX, though, and it seems like it would be a lot of work to build something that is presentable to a good faith user. I'd worry that any good faith user who inadvertently triggers a constraint that fails an edit would be put off by the UX such that they'd abandon their editing work.

The only solution I could think of to improve the UX would be to create the account even earlier (opening the editor creates a temp account and does the top level redirect; making an API edit requires an API call to obtain a temp account first). That has a drawback of removing the context for the temp account creation (the proposed text edit) which makes anti-abuse tooling around temp account creation harder, but as we are not concerned about unused temp account creations, I guess that would be OK.

Likely, there are also other tools that would break with the introduction of the new type of actor. Such breakage might be often non-apparent (unless one's testing against a particular workflow or is systematically going through all usages). We would likely need to conduct a full audit of all the codebases, to ensure each works fine with partial temporary actors (similar to what we did/are doing with temporary accounts themselves). This seems like a lot of work.

I am not sure about that. It seems like the use case here is pretty well contained, and we already have cases of actors existing without a user, so I don't think it is entirely a paradigm shift. I do agree that if we had to do a full audit of actor/user handling code and tooling, this approach would be more trouble than it is worth.

Maybe I am missing some, but AFAIK, the only no-user actor which can trigger logged actions is an IP actor (which we're moving away from). There are special-purpose actors for imported revisions, which are fairly rare and they also do not make sense in most contexts (such as in Special:CheckUser or Special:Log), while the newly-proposed actor type would need to be supported there.

Unless I am missing something, this proposal also means a single event can be represented in two possible ways. If I first make a regular edit logged-out, I get a temporary account, and if I then make something prohibited, it would be (presumably) attributed to that temporary account. However, if I do not make the regular edit first, then it gets attributed to the special actor type, rather than to a temporary account. I am worried this would result in confusing outputs, which would be hard to explain (without understanding the technical background).

I'd also like to highlight that using the new actor type would make it significantly harder to use the resulting CU data in Special:CheckUser. If I understand the plan correctly, the special actor would only have a single action logged (=whichever action was prohibited). Since to declare something a CU match, multiple datapoints are needed (to avoid mistakes), it would be fairly difficult to draw conclusions from the special actor (as that'd only have a single event logged on it).

Likely, there are also other tools that would break with the introduction of the new type of actor. Such breakage might be often non-apparent (unless one's testing against a particular workflow or is systematically going through all usages). We would likely need to conduct a full audit of all the codebases, to ensure each works fine with partial temporary actors (similar to what we did/are doing with temporary accounts themselves). This seems like a lot of work.

I am not sure about that. It seems like the use case here is pretty well contained, and we already have cases of actors existing without a user, so I don't think it is entirely a paradigm shift. I do agree that if we had to do a full audit of actor/user handling code and tooling, this approach would be more trouble than it is worth.

Maybe I am missing some, but AFAIK, the only no-user actor which can trigger logged actions is an IP actor (which we're moving away from). There are special-purpose actors for imported revisions, which are fairly rare and they also do not make sense in most contexts (such as in Special:CheckUser or Special:Log), while the newly-proposed actor type would need to be supported there.

I was referencing what is in T360870#9658668, though, I imagine that Commons "Conversion script" actor doesn't generate log entries.

Unless I am missing something, this proposal also means a single event can be represented in two possible ways. If I first make a regular edit logged-out, I get a temporary account, and if I then make something prohibited, it would be (presumably) attributed to that temporary account. However, if I do not make the regular edit first, then it gets attributed to the special actor type, rather than to a temporary account. I am worried this would result in confusing outputs, which would be hard to explain (without understanding the technical background).

IMO the distinction make sense: failed first edit attempt results in the partially saved temp account, and if the same session results in a successful edit, it is converted to a regular temp account.

If we create temp accounts on all edit attempts, we'll still need to explain the meaning of temp accounts without any attached edits to functionaries.

I'd also like to highlight that using the new actor type would make it significantly harder to use the resulting CU data in Special:CheckUser. If I understand the plan correctly, the special actor would only have a single action logged (=whichever action was prohibited). Since to declare something a CU match, multiple datapoints are needed (to avoid mistakes), it would be fairly difficult to draw conclusions from the special actor (as that'd only have a single event logged on it).

Taking a step back, I am wondering why one would want to use CheckUser on the partially saved temp account, given that CheckUser users would already have access to the IP used in the attempt. Is it possible that this is functionality we don't particularly need for this actor type?

To clarify my position on this: I'd be happy to go with the "create temp accounts for all edit attempts" approach if we can figure out a way to make a non terrible UX for good faith users who trigger an AbuseFilter/SpamBlacklist or EditConstraint in core. It just seems that the actor type to represent partially saved temp accounts is overall less work.

Maybe I am missing some, but AFAIK, the only no-user actor which can trigger logged actions is an IP actor (which we're moving away from). There are special-purpose actors for imported revisions, which are fairly rare and they also do not make sense in most contexts (such as in Special:CheckUser or Special:Log), while the newly-proposed actor type would need to be supported there.

I was referencing what is in T360870#9658668, though, I imagine that Commons "Conversion script" actor doesn't generate log entries.

Aha, thanks for the clarification. I am wondering why Conversion script is recorded in this way at Commons. It appears to be created as a system account, which should have – in theory – an user row attached. Considering the script is very old (predating the actor system by a long shot), I'm not sure how much it makes sense to rely on how it looks in the actor table now.

FWIW, this actor has a log entry, which displays as if it was an IP-initiated logged entry (with the contributions or the block links saying "This user does not exist"). In my opinion, this signals that while the actor exists, it is apparently not working properly (and not frequently used), so it likely doesn't make a lot of sense to use as an example. There are newer system accounts, which do have a full row in user (Maintenance script, AbuseFilter's system user, etc.).

Either way: Even if actor_user = NULL makes sense for system accounts, those accounts make no sense in Special:CheckUser, since they always edit from 127.0.0.1. Since nothing but system accounts (or sysadmin-level actions) use that IP (or even that /8 range), it is highly unlikely that system accounts would ever show up in investigations. As far as I know, the only way this happened in the past is through a software bug, which resulted in certain CU events being attributed to loopback and/or private addresses, which did end up confusing checkusers.

Unless I am missing something, this proposal also means a single event can be represented in two possible ways. If I first make a regular edit logged-out, I get a temporary account, and if I then make something prohibited, it would be (presumably) attributed to that temporary account. However, if I do not make the regular edit first, then it gets attributed to the special actor type, rather than to a temporary account. I am worried this would result in confusing outputs, which would be hard to explain (without understanding the technical background).

IMO the distinction make sense: failed first edit attempt results in the partially saved temp account, and if the same session results in a successful edit, it is converted to a regular temp account.

It does make sense if you know the technical background behind it. Without knowing about the implementation details (and without knowing what the end user actually did on their end), it is difficult to understand why the same action is represented in two different ways (as the order of execution, which is the deciding factor here, is unknown to whoever is looking at the logs).

If we create temp accounts on all edit attempts, we'll still need to explain the meaning of temp accounts without any attached edits to functionaries.

Why? In my opinion, accounts that do not have any edits connected to them are pretty normal (in fact, 71% of enwiki accounts have edit count of zero, see P60385), and having temporary accounts behaving in a similar way would not be a surprise, I think. Plus, just like regular accounts, it would always have a logged entry associated with it.

Taking a step back, I am wondering why one would want to use CheckUser on the partially saved temp account,

I agree it likely doesn't make a sense to start a CheckUser investigation on the partially saved temporary account itself. However, partially saved temp accounts will eventually come up in the CheckUser investigations that are carried out for other reasons.

Generally, CheckUsers first do a "breadth first search" using the IP data (direct IP match, range search). During that BFS search, they start on one or two accounts they wish to investigate (the starting nodes), query range(s) it edits on (the edges), get to a bunch of new accounts (newly discovered nodes) and continue until sufficient data gathers at the CheckUser's screen (after that, they analyze what they see, draw conclusions and close the investigation).

During that process, partial temporary accounts will show up, and checkusers will likely attempt to query them to see any additional data they might have. To support that workflow, CheckUser should support partial temporary accounts, so that if you run into them, it operates normally.

given that CheckUser users would already have access to the IP used in the attempt.

Can you clarify what you mean by this, please?

Is it possible that this is functionality we don't particularly need for this actor type?

For the reasons I described above, I do think Special:CheckUser needs to handle partial temporary accounts correctly, as they would come up in investigations.

Side note: CU was meant to be an example, not the only issue. The point is that anything that works with actors potentially needs to be fixed to work with partial temporary accounts (=accounts that are temporary, but not registered or named), in a similar manner like things needed to be fixed for full temporary accounts.

To clarify my position on this: I'd be happy to go with the "create temp accounts for all edit attempts" approach if we can figure out a way to make a non terrible UX for good faith users who trigger an AbuseFilter/SpamBlacklist or EditConstraint in core. It just seems that the actor type to represent partially saved temp accounts is overall less work.

We have a very similar position, except from a different angle :). I'm not in principle opposed to partial temporary accounts if that's what will turn out to be the easiest solution for this problem. However, the resulting product needs to have a reasonable UX for the functionaries who would be consuming that data later on, and I'm afraid partial temporary accounts make that very challenging to do.

and we already have cases of actors existing without a user, so I don't think it is entirely a paradigm shift

For my own curiosity, I did a (poor) attempt on finding such actors:

mysql:research@dbstore1007.eqiad.wmnet [cswiki]> select * from actor where actor_user is null and actor_name not like "%:%" and actor_name not like "%.%" and actor_name not like "%>%" limit 100;                                                
+----------+------------+------------------------------------------+
| actor_id | actor_user | actor_name                               |
+----------+------------+------------------------------------------+
|   463290 |       NULL | Template namespace initialisation script |
|   797707 |       NULL |                                          |
|   832583 |       NULL | Conversion script                        |
|  1298157 |       NULL | Unknown user                             |
+----------+------------+------------------------------------------+
4 rows in set (1.989 sec)

mysql:research@dbstore1007.eqiad.wmnet [cswiki]>

The is not like conditions in the SQL correspond to IP actors (IPv6 and IPv4 and imported users (> is used for separating the interwiki prefix for remote users).

Unless the conditions filter out other cases, this means the only actors without an user are IP actors, imported users and historical system users. Out of those, only IP actors represent human actions (imported users represent an artificially inserted action into the wiki's past, while maintenance users represent a system-taken action).

and we already have cases of actors existing without a user, so I don't think it is entirely a paradigm shift

For my own curiosity, I did a (poor) attempt on finding such actors:

mysql:research@dbstore1007.eqiad.wmnet [cswiki]> select * from actor where actor_user is null and actor_name not like "%:%" and actor_name not like "%.%" and actor_name not like "%>%" limit 100;                                                
+----------+------------+------------------------------------------+
| actor_id | actor_user | actor_name                               |
+----------+------------+------------------------------------------+
|   463290 |       NULL | Template namespace initialisation script |
|   797707 |       NULL |                                          |
|   832583 |       NULL | Conversion script                        |
|  1298157 |       NULL | Unknown user                             |
+----------+------------+------------------------------------------+
4 rows in set (1.989 sec)

mysql:research@dbstore1007.eqiad.wmnet [cswiki]>

The is not like conditions in the SQL correspond to IP actors (IPv6 and IPv4 and imported users (> is used for separating the interwiki prefix for remote users).

Unless the conditions filter out other cases, this means the only actors without an user are IP actors, imported users and historical system users. Out of those, only IP actors represent human actions (imported users represent an artificially inserted action into the wiki's past, while maintenance users represent a system-taken action).

Thanks for checking that.

We have a very similar position, except from a different angle :). I'm not in principle opposed to partial temporary accounts if that's what will turn out to be the easiest solution for this problem. However, the resulting product needs to have a reasonable UX for the functionaries who would be consuming that data later on, and I'm afraid partial temporary accounts make that very challenging to do.

Ack. Let me go back to the create-temp-user-on-save-attempt approach and see if I can better clarify some of the issues there (e.g. which might be blockers to the implementation and which are nice-to-haves).

Change #1014002 abandoned by Kosta Harlan:

[mediawiki/core@master] Temp accounts: Allow for attributing logs to unsaved temp accounts

Reason:

We are back to Ib6765f828681e70d798363338910a54c7de4ed67. This patch works well for this particular use case but introduces more complexity in adding a potentially confusing actor type, which would require auditing a bunch of code and documentation.

https://gerrit.wikimedia.org/r/1014002

Change #1014476 abandoned by Kosta Harlan:

[mediawiki/core@master] Linker: Hide block and talk page links for unsaved temp accounts

Reason:

https://gerrit.wikimedia.org/r/1014476

Change #1014481 abandoned by Kosta Harlan:

[mediawiki/core@master] [WIP] Linker: Display unsaved temp accounts differently

Reason:

https://gerrit.wikimedia.org/r/1014481

Ack. Let me go back to the create-temp-user-on-save-attempt approach and see if I can better clarify some of the issues there (e.g. which might be blockers to the implementation and which are nice-to-haves).

This was done in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1008530

I think this particular task can skip QA, since it was about figuring out a solution. The solution was "Log the temporary user as the performer", by creating a temporary account on edit attempts, not just edit success.