
RFC: Create temporary accounts for anonymous editors
Open, Needs Triage · Public

Description

  • Affected components: MediaWiki core, CheckUser extension.
  • Engineer for initial implementation: TBD.
  • Code steward: TBD.

Motivation

The existing method of attributing edits from anonymous users to their current IP address is inadequate, for several reasons:

  1. Exposure of a user's IP address to the public is a privacy problem (e.g. prosecution by a repressive regime, public embarrassment, stalking and harassment, revealing real-world identity and location; see also Exposure of user IP addresses).
  2. Edits from the same anonymous session cannot reliably be found by other users, due to varying IP addresses. This makes it difficult to review content and deal with on-wiki abuse ("User contributions" and user blocking).
  3. The user cannot easily find their own edits. ("My contributions").
  4. The user cannot reliably communicate to others, or be communicated with, or receive notifications ("the talk page problem").

IP addresses change regularly for various reasons:

  • IPv6 users regularly change IP addresses due to SLAAC (even when their location does not change).
  • Mobile users regularly change IP addresses when moving closer to another cell tower.
  • Users regularly change IP addresses when switching between networks (e.g. home WiFi, cellular, train WiFi, office WiFi).

Also, when an IP editor is asked to register and does so, their new account is detached from their former contributions.

Requirements

(Specify the requirements that a proposal should meet.)

  • Edits by unregistered users are attributed to an identifier that is not based on personal information (such as IP address or Geo location).
  • Edits by unregistered users are attributed to an identifier that remains consistent within a browser session.

Exploration

Proposal

Attribute edits by unregistered users to a session ID instead of the current IP address.

Open questions:

  1. What will the session ID be based on?
    • The first IP address used during that session. (Rejected for privacy reasons.)
    • Auto-increment? UUID? Random? Random human-readable (e.g. diceware)?
  2. To what extent should these sessions act like real accounts?
  3. Should these be convertible to real accounts? If so, under what circumstances do we allow that, and how would that work?
  4. How can anti-abuse tools and workflows be adapted?
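For the first open question, the trade-off between opaque and human-readable identifiers can be sketched as follows. This is an illustrative comparison only; the word list and function names are hypothetical, not part of any proposal here.

```python
import secrets
import uuid

# Hypothetical word list for illustration; a real deployment would need a
# much larger, localisable vocabulary.
WORDS = ["apple", "brisk", "cedar", "dusk", "ember", "flint", "grove", "heron"]

def opaque_session_id() -> str:
    """UUID-based: globally unique, but as unreadable as an IPv6 address."""
    return uuid.uuid4().hex

def readable_session_id(n_words: int = 3) -> str:
    """Diceware-style: easier for humans to recognise in histories and logs,
    at the cost of fewer possibilities per character of identifier."""
    return "-".join(secrets.choice(WORDS) for _ in range(n_words))
```

A diceware-style name addresses the same UI complaint raised below about IPv6 addresses being "ridiculously long and opaque", while a UUID maximises collision resistance.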

Benefits:

Prior art:

Related:


Original task description at T172477 by @tstarling

In T171382 it was asserted that some IPv6 users regularly change IP addresses within a /64 block, due to SLAAC (RFC 4862). As such, the existing method of attributing edits to anonymous users seems inadequate.

I did some queries on recent anonymous IPv6 edits in the enwiki recentchanges table. My impression is that this does indeed happen, but the problem is worse than described: some IPv6 users use a mobile connection, and in fact routinely move around a block much larger than /64.

I've long dreamed of attributing anonymous edits to a session ID instead of an IP address, since this would fix T20981: Allow anonymising of unregistered users ("IP editors") and T12957: Allow logged in user to reclaim previous anon edits, but due to abuse control considerations, it seems unlikely that this will win community support. This proposal is a compromise, fixing only one of those two bugs, by attributing edits to a session ID which is publicly associated with the first IP address used during that session.

I use the term "session" loosely; this might be an ID associated with a long-lived cookie.

The proposal in detail:

  • On page save, if there is no existing session:
    • Create the session, and store the current IP address in the session
    • Search the actor table (T167246) for this IP address, and add a suffix to the IP address so as to make a unique username.
    • Create the actor row. actor_text would be the suffixed IP address and actor_user would be NULL.
  • On account creation, attributing the existing edits in the same session to the newly created account could be as simple as updating actor_user and actor_text in the existing actor row.
  • Blocks would be applied to the session via its public identifier (the suffixed IP), solving T152462: Add cookie when blocking anonymous users.
  • When an anonymous session is blocked, an autoblock would be applied to the last IP address actually used by the anonymous user in question, exactly analogous to the way logged-in users are blocked.
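The save-time flow above can be sketched roughly as follows. All names here (on_page_save, actor_rows, etc.) are hypothetical stand-ins, not actual MediaWiki APIs; actor_rows models the actor table as a mapping from actor_text to actor_user.

```python
def unique_suffixed_name(ip: str, taken: set) -> str:
    """Append a numeric suffix to the IP until the username is unused."""
    n = 1
    while f"{ip}-{n}" in taken:
        n += 1
    return f"{ip}-{n}"

def on_page_save(session: dict, current_ip: str, actor_rows: dict) -> str:
    """On the first save in a session, store the IP in the session and
    create the actor row (actor_user is NULL, i.e. None, for an anon)."""
    if "actor_text" not in session:
        session["ip"] = current_ip
        name = unique_suffixed_name(current_ip, set(actor_rows))
        actor_rows[name] = None
        session["actor_text"] = name
    return session["actor_text"]

def on_account_creation(session: dict, actor_rows: dict,
                        username: str, user_id: int) -> None:
    """Reattribute the session's edits: in SQL this would be a single
    UPDATE of actor_text and actor_user on the existing actor row."""
    old = session.pop("actor_text")
    del actor_rows[old]
    actor_rows[username] = user_id
```

Note that later saves in the same session keep the original suffixed name even if the underlying IP has changed, which is what makes the session's contributions findable.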

As an alternative, suffixing of the IP address could be omitted. In that case, to be feasible, I think you would have to have a single actor row per IP address, so you would not be able to solve T152462 or T12957. But at least you could have fewer user talk pages for anons who regularly migrate to a different IP address.

This was discussed on IRC, the log is at https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-08-02-21.05.log.html


Original task description at T133452 by @Tgr

For anonymous edits, MediaWiki makes the IP address available to everyone forever, which is a poor privacy practice and can cause various problems for the user, from public embarrassment to outing to prosecution by a repressive regime. In the various discussions about this (see Exposure of user IP addresses for an overview) one option that came up was to automatically create temporary accounts for anonymous users and allow them to be converted to real accounts later. This task is for the discussion of the technical and social feasibility of that option.

See also:

Related Objects

Event Timeline


Claiming this for MWPT at least for some infrastructure work.

> This task is for the discussion of the technical and social feasibility of that option.

I think the improvement this will have in terms of privacy has tremendous merit, but on the surface the social feasibility is in my opinion low. This will dramatically impair counter-vandalism. Echoing T133452#2248903, being able to see long-term abuse from a given IP or range is critical on big wikis like enwiki. Instead of being able to take one single action to stop the disruption, we'll be playing an endless game of whac-a-mole, one account at a time (though not as bad with autoblocks). Much of the work would have to be shifted to CheckUsers, who even then would only be able to see 90 days back, and not be able to tell that there has been years of continued abuse, worthy of a lengthy block. In order to feasibly keep up with the abuse (the need for range blocks, especially), we'd need to appoint a LOT more CheckUsers, which sort of defeats the purpose of it being a highly restricted right.

As another example, I sometimes use AbuseFilter to prevent harassment (or any disruption, for that matter) from a particular IP range (because the abuser only edits while logged out). We can effectively stop the abuse and still allow others in those IP ranges to continue editing freely. Similarly, you may block an abusive range instead of using page protection, where the latter shuts out innocent editors. This will not be possible if everyone is behind an account, no? And what about the account creation throttle? If I continually open up the wiki in private browsing mode, can I continue to keep making these pseudo-accounts, and convert them to real accounts?

My thoughts are to simply make the fact that you're editing as an IP more prominent, perhaps even requiring confirmation. I realize the specifics of the temporary account system still aren't well-defined, but if it will mean we can't see long-term contributions from a single end user or IP range, I think the Foundation should be prepared for the possibility that the community will not be able to keep up with the influx of abuse.

> I think the improvement this will have in terms of privacy has tremendous merit, but on the surface the social feasibility is in my opinion low. This will dramatically impair counter-vandalism. Echoing T133452#2248903, being able to see long-term abuse from a given IP or range is critical on big wikis like enwiki. Instead of being able to take one single action to stop the disruption, we'll be playing an endless game of whac-a-mole, one account at a time (though not as bad with autoblocks). Much of the work would have to be shifted to CheckUsers, who even then would only be able to see 90 days back, and not be able to tell that there has been years of continued abuse, worthy of a lengthy block. In order to feasibly keep up with the abuse (the need for range blocks, especially), we'd need to appoint a LOT more CheckUsers, which sort of defeats the purpose of it being a highly restricted right.

Why does CheckUser need to be a highly restricted right? Anyone who edits logged-out by accident immediately has their IP address exposed to the public, which implies that we don't really value IP address privacy that highly. So why do we need to put up high barriers against giving access? Better to at least have some sort of password protection rather than just giving that information to everyone. It could just be given to all admins who sign an NDA.

Why should the expiry time have to stay at 90 days? This figure was set with little community consultation.

Currently, if someone edits logged-out by accident, with a talk page signature, the best we can do is oversight the whole revision. My idea is to split IP addresses into a separate table so that it's not so awkward to delete that information or control access to it.

Using the IP address as the username was a terrible user interface idea. IPv6 addresses in particular are ridiculously long and opaque. This is not a friendly UI design. I want to hide that detail from non-technical users.

I understand that some people think CheckUser access should be restricted on the basis that privacy is for "us", not for "them". The trouble is, vandals create accounts too, and registered users edit logged-out by accident. It's not really a clean us-versus-them separation.

> Why does CheckUser need to be a highly restricted right? Anyone who edits logged-out by accident immediately has their IP address exposed to the public, which implies that we don't really value IP address privacy that highly. So why do we need to put up high barriers against giving access?

It gives you access to seasoned editors' IP addresses, too. Of course everyone deserves just as much privacy, but we could lose long-term prolific editors to outing. I certainly wouldn't mind more CUs as it is, but I'm not sure all are going to be OK with that. Currently the policy is quite strict, for better or worse. Logged-out users at least are meant to be aware that their contributions are being recorded as their IP. If they don't know, that can easily be improved, no? I think requiring confirmation on the first edit per session would suffice, and that should help with people accidentally editing while logged out. The extra clicks might mean some abandon their edit, but all things considered it's probably still an improvement, and an experiment we could conduct now without too many developer resources, I assume.

> Why should the expiry time have to stay at 90 days? This figure was set with little community consultation.

Can we increase this now? :) If we had temporary accounts and no IP edits, we'd need the CU logs to go back maybe a year or so (complete guess). Beyond that we could probably go off of the block log to deduce that there's long-term abuse. It's still tricky for ranges, since the contributions could be scattered across a large number of IPs. The block logs currently don't do a great job at reporting blocks of subranges and individual IPs therein.

> Using the IP address as the username was a terrible user interface idea. IPv6 addresses in particular are ridiculously long and opaque. This is not a friendly UI design. I want to hide that detail from non-technical users.

I agree, it seems mighty odd, but it sort of worked out in terms of counter-vandalism! I don't think StackOverflow (perhaps a bad example) and other sites with temporary accounts have to deal with the abuse we do.

The only thing I beg for is this part:

> if it will mean we can't see long-term contributions from a single end user or IP range, I think the Foundation should be prepared for the possibility that the community will not be able to keep up with the influx of abuse

There are a lot of things that will need improving. Having to go through the CheckUser interface is going to slow down the workflow and make day-to-day counter-vandalism quite a pain. Every time a "temporary user" vandalizes, am I meant to run checks and see if it's an IP or range I should block? The underlying IPs would really need to be exposed automatically, built right into Special:Contributions. I think we should also rework AbuseFilter so that CUs can implement filters to act on specific IPs or ranges (this would be splendid to have now, even, e.g. account creation). Mind you also that non-admins (not just non-CUs) do a lot of work reporting long-term abuse and identifying abusive ranges.

Overall, if we're given sufficient counter-vandalism tools, then I think the temporary account system makes perfect sense. It's hard to weigh having bad privacy (although accounts are an option) with a clean wiki, versus good privacy with a wiki that possibly can no longer be relied upon. We should think long and hard about this, and with ample input from the people who devote significant time and energy ensuring we have a stable wiki for all to enjoy.

I'm increasingly getting the feeling that the only way this has a chance is if it is maximally conservative. We could attribute anonymous edits to a temporary account, but continue to display the IP address of anonymous users publicly.

There is the ip_changes table which currently supports IP range queries in Special:Contributions for anons only. We could carry on populating that. It has a key on rev_id so we could join on it in Special:Contributions and action=history to get an IP address for display. We could even use a tooltip, for UseMod nostalgia. In ChangesList, rc_ip could be used for display instead, avoiding the join for efficiency.

So Special:Contributions for an IP address would show you all the old anonymous edits for that IP address, plus any made by new temporary accounts using that IP address.
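The display join sketched above can be pictured in miniature as follows. In MediaWiki this would be an SQL join between the revision and ip_changes tables on rev_id; the helper name and sample data here are illustrative only.

```python
def contributions_with_ips(revisions, ip_by_rev_id):
    """Attach the recorded IP to each anonymous revision, e.g. for
    display in a tooltip next to the session name."""
    return [dict(rev, ip=ip_by_rev_id.get(rev["rev_id"])) for rev in revisions]

# Two edits by the same temporary session, made from different IPs.
revisions = [
    {"rev_id": 101, "actor_text": "Anon-12345"},
    {"rev_id": 102, "actor_text": "Anon-12345"},
]
ip_changes = {101: "198.51.100.7", 102: "198.51.100.8"}  # rev_id -> IP
```

The session name keeps the contributions grouped, while the per-revision IP remains available for display and for range queries.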

Krinkle added subscribers: Milimetric, Anomie, daniel and 9 others.

This task is effectively a superset of T172477. I've merged that into here, tagged as RFC, and incorporated part of its task description here (the problem statement).

Moving to "under discussion" on the RFC board, since this is well fleshed out, and has seen some discussion in the past.

I don't have any thoughts or opinion on this, just a question. Does this or T172477 have a dependency or interaction with T167246 ? E.g. anonymous editors would be handled as a separate type of actor, or something. Or is it a completely separate issue? Thanks.

> Does this or T172477 have a dependency or interaction with T167246 ? E.g. anonymous editors would be handled as a separate type of actor, or something. Or is it a completely separate issue? Thanks.

Probably, yes. It was certainly on our minds when we talked about the actor work back in the day. Not sure how real-world implementation would turn out, though.

I think the more interesting question is when anonymous user accounts should be created. We cannot create them for visits which don't result in a page save attempt or similar, for obvious scaling reasons. If we create them on write (ie. when an actor ID needs to be inserted somewhere), the user will be detached from their contribution if the user agent does not persist the session (e.g. browsing with cookies disabled) without us being able to detect it beforehand and warn them. If we create them just before write (e.g. whenever a CSRF token is obtained, like the user opening the edit form), that means doing stateful work on GET.

Krinkle renamed this task from Create temporary accounts for anonymous editors to RFC: Create temporary accounts for anonymous editors.Apr 4 2020, 2:33 AM
Pppery added subscribers: Liuxinyu970226, Pppery.

Sorry, misclicked.

> If we create them on write (ie. when an actor ID needs to be inserted somewhere), the user will be detached from their contribution if the user agent does not persist the session (e.g. browsing with cookies disabled) without us being able to detect it beforehand and warn them.

I don't imagine that a warning would have much effect anyway. Any user worried about getting detached from their contributions would presumably create an account.

>> Why does CheckUser need to be a highly restricted right? Anyone who edits logged-out by accident immediately has their IP address exposed to the public, which implies that we don't really value IP address privacy that highly. So why do we need to put up high barriers against giving access?

> It gives you access to seasoned editors' IP addresses, too. Of course everyone deserves just as much privacy, but we could lose long-term prolific editors to outing. I certainly wouldn't mind more CUs as it is, but I'm not sure all are going to be OK with that. Currently the policy is quite strict, for better or worse.

What if we had a cloak flag which could be set on user accounts, hiding them from normal CheckUser results? It could be granted to users like a group, and given liberally to good-faith users. There could be a public request process on-wiki, and a private process, say in OTRS.

Then you would have a basic CheckUser right (checkuser-uncloaked) which would be given to admins. The traditional checkuser right would allow functionaries to determine the IP address of cloaked users.

The idea and terminology is inspired by Freenode.

> Having to go through the CheckUser interface is going to slow down the workflow and make day-to-day counter-vandalism quite a pain. Every time a "temporary user" vandalizes, am I meant to run checks and see if it's an IP or range I should block? The underlying IPs would really need to be exposed automatically, built right into Special:Contributions.

How about if we expose uncloaked IP addresses to people with checkuser-uncloaked via Special:Contributions and RC. Access to cloaked contributions would still require that you go to Special:CheckUser and enter a reason.

Every time a "temporary user" vandalizes, am I meant to run checks and see if it's an IP or range I should block? The underlying IPs would really need to exposed automatically, built right into Special:Contributions.

I support Tim's idea of a cloak. I think it would make a good transitional phase at the very least, as it allows us to decouple the problem of access to IP information from the problem of having something more stable and anonymous as the main and only representation of an IP editor. E.g. the new "temporary user" would solve a lot of problems with regard to unstable IPs (e.g. talk pages, contributions continuity over IP changes, possibility to upgrade a temporary user into a real user etc.). It would also solve the problem of IP data being forever public. The other problems regarding counter-vandalism and transparency etc. could remain as today, since we would not limit access to the IP info very much at first.

Beyond the transitional phase though, I think we can do better in the long run, and that there really shouldn't be any need for counter-vandalism to involve IP addresses. I hope that in time when that is addressed, these can then be folded back into CheckUser essentially.

@MusikAnimal Would you agree that this need is no different for registered users? What is the difference between a newbie account with username today, and a "temporary user" we assign to an IP user in the possible future? We don't expose their IP to patrollers today, right?

Note, I don't deny the need you describe. I get it. (Also as being maintainer of RTRC, GUC, and CVNBot.) Rather, I think the focus on the IP information is a bit too short-term and is a distraction from the issue that really we just have very bad counter-vandalism tooling as soon as someone signs up. This is a problem today as well. I think it's worth focussing on that and exploring more the space of how we can empower patrollers and admins to do more with less. For example, we have autoblock today, which acts on IPs without admins needing to know the IP they are acting on.

Instead of a cloak flag what if it was just a new user right (e.g. hide-ips) that could be assigned to any user group, like extended confirmed users? That way each wiki could tailor it to their specific anti-vandalism needs and capacities.

> Note, I don't deny the need you describe. I get it. (Also as being maintainer of RTRC, GUC, and CVNBot.) Rather, I think the focus on the IP information is a bit too short-term and is a distraction from the issue that really we just have very bad counter-vandalism tooling as soon as someone signs up. This is a problem today as well. I think it's worth focussing on that and exploring more the space of how we can empower patrollers and admins to do more with less. For example, we have autoblock today, which acts on IPs without admins needing to know the IP they are acting on.

It's true that the value reviewers get from the IP does not really come from staring at a bunch of numbers, and maybe we are placing too much emphasis on that. What they really want is the reverse DNS, whois, proxy checks, etc. They want to know more about the context of a contribution, to help them decide how to respond to it. Is it COI? Is it a child in school? Is it a known troll hopping around a mobile ISP? Informing admins who want to place range blocks is one aspect, but that's not a merely binary decision -- the admin needs to set the reason text and the block options, which may well depend on what sort of range it is.

> Rather, I think the focus on the IP information is a bit too short-term and is a distraction from the issue that really we just have very bad counter-vandalism tooling as soon as someone signs up. This is a problem today as well. I think it's worth focussing on that and exploring more the space of how we can empower patrollers and admins to do more with less.

@Krinkle - That's exactly what the Anti-Harassment Tools team is working on. You can read more about it at https://meta.wikimedia.org/wiki/IP_Editing:_Privacy_Enhancement_and_Abuse_Mitigation#Tools and https://meta.wikimedia.org/wiki/IP_Editing:_Privacy_Enhancement_and_Abuse_Mitigation/Improving_tools.

If we use seamless signup like this, we'd be giving "anons" a user ID. That would mean that all filters that distinguish between anon edits and logged in edits would no longer function. We'd have to provide filters based on user group (or absence of user group) instead. And we'll have to ensure such filters don't create performance issues for heavy duty queries (e.g. recentchanges).
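A hypothetical replacement predicate for such a filter might test for the absence of any user group rather than the absence of a user ID. This is a sketch of the idea only; neither the function nor the edit fields correspond to actual MediaWiki code.

```python
def is_temporary_user(edit: dict) -> bool:
    """Hypothetical RC filter predicate: with seamless signup, an edit
    counts as coming from a 'temporary user' when the actor has a user ID
    but belongs to no user groups (an alternative would be a dedicated
    'temporary' flag on the account)."""
    return edit.get("user_id") is not None and not edit.get("user_groups")
```

The performance concern above applies directly: "has no user groups" is a negative condition over a join, which is harder to index than the current single anon flag.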

The cloak idea sounds like a solid compromise and a step in the right direction. If at least admins can freely see IPs of "unregistered" users, and use the same tools we have today, we'll probably be okay. I think for now the overall focus should be not to get rid of IPs, but to reduce their visibility and the need to see them.

> It's true that the value reviewers get from the IP does not really come from staring at a bunch of numbers, and maybe we are placing too much emphasis on that. What they really want is the reverse DNS, whois, proxy checks, etc. They want to know more about the context of a contribution, to help them decide how to respond to it. Is it COI? Is it a child in school? Is it a known troll hopping around a mobile ISP? Informing admins who want to place range blocks is one aspect, but that's not a merely binary decision -- the admin needs to set the reason text and the block options, which may well depend on what sort of range it is.

Yes, precisely. It being an IP address is irrelevant, rather it's the information we get from the IP that we care about. Often it's the only means to establish a pattern of abuse, and blocking the range is the only means to stop it. That's why I kind of liked T227733: Draft: Masking IP addresses for increased privacy, as it seemingly wouldn't break our workflows, instead focusing on obfuscating the IPs.

Proxies are another major point – we really should be blocking those globally. That seems like something we could do fairly easily now (i.e. promote enwiki's proxy blocking bots to the steward level), and would get rid of one of the problems that we currently can only solve by looking at IPs.

Given there's a large active community discussion about this problem on wiki as Kaldari pointed to, it might be a good idea not to fork the discussion.

There's an aspect to the UI design which I'm calling "cowbell", by which I mean the highly visible ways in which we tag the contributions of certain users as needing extra review. The current system has two kinds of cowbell: having a name which is a bunch of numbers, and having a name which is a red link. Having a name which follows a pattern like "Anon 12345" is a kind of cowbell. If the pattern is localised by the content language, global sysops and other small wiki patrollers may have trouble identifying anonymous users. We could have extra CSS or icons to assist in understanding.

If something is useful as a cowbell, it makes sense for it to be usable as a filter in RC and watchlists. As Daniel points out, that has traditionally been the case with anonymous users.

You can't filter by whether the user page exists, which reflects the fact that the red link cowbell is the unloved result of a UI accident. Nobody wants to actually edit the user page of a user whose edits they are reviewing, which is supposedly the purpose of red links. Anonymous user links currently go to the contributions page, which is more useful. In the current proposal, if User::isAnon() is true, links would naturally go to the contributions page, as they do for UseMod imports. If User::isAnon() is false (the automatic user creation variant of the proposal), then user links in changes lists would naturally be red.

My point is that we should reconsider the styling of usernames in change lists as part of this work.

One potential solution we can borrow from Google docs is to assign random names to users. This would be trickier at our scale than on a document shared with a few dozen people, but could be possible. Of course, we allow actual users to have any name they want, so styling still comes into play.