
Create temporary accounts for anonymous editors
Open, Needs Triage · Public

Description

Problem statement

The existing method of attributing edits from anonymous users to their current IP address seems inadequate, because:

  1. Exposure of a user's IP address to the public is a privacy problem (e.g. prosecution by a repressive regime, public embarrassment, stalking and harassment, revealing real-world identity and location; see also Exposure of user IP addresses).
  2. Edits from the same anonymous session cannot reliably be found by other users, due to varying IP addresses. This makes it difficult to review content and deal with on-wiki abuse ("User contributions" and user blocking).
  3. The user cannot easily find their own edits. ("My contributions").
  4. The user cannot reliably communicate to others, or be communicated with, or receive notifications ("the talk page problem").

IP addresses change regularly for various reasons:

  • IPv6 users regularly change IP addresses due to SLAAC (even when their location does not change).
  • Mobile users regularly change IP addresses when moving closer to another cell tower.
  • Users regularly change IP addresses when switching between networks (cellular to WiFi and between WiFi, e.g. home WiFi, cellular, train WiFi, office WiFi).

Also, when an IP editor is asked to register and does so, they get detached from their former contributions.

Proposal 1

Attribute anonymous edits to a session ID instead of the current IP address.

Open questions:

  1. What will the session ID be based on?
    • The first IP address used during that session. (Rejected, per privacy reasons)
    • Auto-increment? UUID? Random? Random human-readable (e.g. diceware)?
  2. To what extent should these sessions act like real accounts?
  3. Should these be convertible to real accounts? If so, under what circumstances do we allow that, and how would that work?
  4. How can anti-abuse tools and workflows be adapted?
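
The candidate ID schemes from open question 1 can be compared with a minimal sketch. Everything here is an assumption for illustration: the function names, the `anon-` prefix, and the eight-word list (a real diceware list has 7776 words). Nothing in it reflects a decision made in this task.

```python
import secrets
import uuid
from itertools import count

# Hypothetical, tiny word list for illustration only.
WORDS = ["amber", "birch", "cedar", "delta", "ember", "flint", "grove", "heron"]

_counter = count(1)


def auto_increment_id() -> str:
    """Sequential ID: short, but reveals creation order and volume."""
    return f"anon-{next(_counter)}"


def uuid_id() -> str:
    """Random UUID (version 4): unique without coordination, but long."""
    return str(uuid.uuid4())


def random_hex_id(nbytes: int = 8) -> str:
    """Random hex string: compact, not human-friendly."""
    return secrets.token_hex(nbytes)


def diceware_id(nwords: int = 3) -> str:
    """Random human-readable ID built from words, diceware style."""
    return "-".join(secrets.choice(WORDS) for _ in range(nwords))
```

With a word list this small, three words give only 8^3 = 512 combinations, so collisions would be frequent; a realistic list and a uniqueness check against existing names would be needed.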

Benefits:

See also:


Original task description at T172477 by @tstarling

In T171382 it was asserted that some IPv6 users regularly change IP addresses within a /64 block, due to SLAAC (RFC 4862). As such, the existing method of attributing edits to anonymous users seems inadequate.

I did some queries on recent anonymous IPv6 edits in the enwiki recentchanges table. My impression is that this does indeed happen, but the problem is worse than described: some IPv6 users use a mobile connection, and in fact routinely move around a block much larger than /64.
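
As a rough illustration of that kind of check (not the actual queries, which are not shown here), the sketch below groups IPv6 addresses by network prefix using Python's standard `ipaddress` module. The helper names are hypothetical and the sample addresses are invented values from the 2001:db8::/32 documentation range, not data from recentchanges.

```python
import ipaddress
from collections import defaultdict


def prefix_of(addr: str, prefix_len: int = 64) -> ipaddress.IPv6Network:
    """Return the network of the given length that contains addr."""
    return ipaddress.IPv6Network(f"{addr}/{prefix_len}", strict=False)


def group_by_prefix(addrs, prefix_len: int = 64):
    """Group addresses by the prefix they fall into."""
    groups = defaultdict(list)
    for addr in addrs:
        groups[prefix_of(addr, prefix_len)].append(addr)
    return dict(groups)


# Invented example addresses (documentation range).
edits = [
    "2001:db8:1:2:aaaa:bbbb:cccc:dddd",
    "2001:db8:1:2:1111:2222:3333:4444",
    "2001:db8:1:9:5555:6666:7777:8888",
]
```

Here `group_by_prefix(edits)` yields two /64 groups, while `group_by_prefix(edits, 48)` puts all three addresses in one group, which is the situation described for mobile users who move around a block larger than /64.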

I've long dreamed of attributing anonymous edits to a session ID instead of an IP address, since this would fix T20981: Allow anonymising of unregistered users ("IP editors") and T12957: Allow logged in user to reclaim previous anon edits, but due to abuse control considerations, it seems unlikely that this will win community support. This proposal is a compromise, fixing only one of those two bugs, by attributing edits to a session ID which is publicly associated with the first IP address used during that session.

I mean the term "session" loosely; this might be an ID associated with a long-lived cookie.

The proposal in detail:

  • On page save, if there is no existing session:
    • Create the session, and store the current IP address in the session
    • Search the actor table (T167246) for this IP address, and add a suffix to the IP address so as to make a unique username.
    • Create the actor row. actor_text would be the suffixed IP address and actor_user would be NULL.
  • On account creation, attributing the existing edits in the same session to the newly created account could be as simple as updating actor_user and actor_text in the existing actor row.
  • Blocks would be applied to the session via its public identifier (the suffixed IP), solving T152462: Add cookie when blocking anonymous users.
  • When an anonymous session is blocked, an autoblock would be applied to the last IP address actually used by the anonymous user in question, exactly analogous to the way logged-in users are blocked.
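
The steps above can be sketched roughly as follows. This is an in-memory stand-in, not MediaWiki code: only `actor_text` and `actor_user` come from the proposal, while the class names, the session dict keys and the `~N` suffix format are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Actor:
    actor_id: int
    actor_text: str            # public name: suffixed IP, later the account name
    actor_user: Optional[int]  # None (NULL) until an account is created


class ActorTable:
    """In-memory stand-in for the actor table."""

    def __init__(self) -> None:
        self.rows: Dict[int, Actor] = {}
        self._next_id = 1
        self._suffix_count: Dict[str, int] = {}

    def create_anon_actor(self, ip: str) -> Actor:
        # Append a per-IP counter so the name is unique (assumed "ip~N" format).
        n = self._suffix_count.get(ip, 0) + 1
        self._suffix_count[ip] = n
        actor = Actor(self._next_id, f"{ip}~{n}", None)
        self.rows[actor.actor_id] = actor
        self._next_id += 1
        return actor

    def convert_to_account(self, actor_id: int, user_id: int, user_name: str) -> Actor:
        # Account creation only updates the existing row, so earlier edits follow.
        actor = self.rows[actor_id]
        actor.actor_user = user_id
        actor.actor_text = user_name
        return actor


def on_page_save(session: dict, ip: str, actors: ActorTable) -> int:
    """Return the actor ID for this save, creating the session actor if needed."""
    if "actor_id" not in session:
        session["first_ip"] = ip
        session["actor_id"] = actors.create_anon_actor(ip).actor_id
    session["last_ip"] = ip  # the address an autoblock would target
    return session["actor_id"]
```

A second save from the same session keeps the same actor even if the IP address has changed, and a different session from the same first IP gets the next suffix.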

    As an alternative, suffixing of the IP address could be omitted. In that case, to be feasible, I think you would have to have a single actor row per IP address, so you would not be able to solve T152462 or T12957. But at least you could have fewer user talk pages for anons who regularly migrate to a different IP address.

    This was discussed on IRC, the log is at https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-08-02-21.05.log.html

Original task description at T133452 by @Tgr

For anonymous edits, MediaWiki makes available the IP addresses to everyone forever, which is a poor privacy practice, and can cause various problems to the user, from public embarrassment to outing to being prosecuted by a repressive regime. In the various discussions about this (see Exposure of user IP addresses for an overview) one option that came up was to automatically create temporary accounts for anonymous users and allow them to be converted to real accounts later. This task is for the discussion of the technical and social feasibility of that option.

See also:

Event Timeline

Tgr created this task. · Apr 23 2016, 3:00 PM
Restricted Application added a subscriber: Aklapper. · Apr 23 2016, 3:00 PM
Zppix added a subscriber: Zppix. · Apr 23 2016, 4:29 PM

I agree with this task, however... it would also mean that bot and/or server resources would be used up by temp account creation... unless you don't mean for this to be automated, which in turn brings up the issue you're stating again.

ZhouZ added a subscriber: ZhouZ.
Restricted Application added a subscriber: JEumerus. · Apr 25 2016, 5:35 PM
ZhouZ moved this task from Backlog to Assigned on the WMF-Legal board. · Apr 25 2016, 5:35 PM
jayvdb added a subscriber: jayvdb. · Apr 26 2016, 4:02 AM

IMO this is a dup of T20981, and T2556 - i.e. this is bug 556, a verrry old request, and T95144: MediaWiki RFC: Exposure of user IP addresses

If these accounts don't get expired after a period of time, then there is the problem of shared IPs.

Like, if a vandal uses an IP to vandalise, then that shows up in the IP's contributions. If a legitimate user then edits under that IP later, e.g. in schools, then it shows up in the same contributions. If there were accounts for these, then it would look like 2 different people using the same account, which would cause confusion and annoy the legitimate users, especially if those 2 users used talk pages, because it would look like they were the same user because the (temporary) account would have a name.

Also, this would make it harder to ban problem school IPs, etc.

I think this would be hard to achieve from a technical point of view too.

Besides, users know that their IP is being published in the page history. There is a notice about it when they edit: 'Your IP address will be publicly visible if you make any edits'. They are making that decision that they want their IP published when they edit the page that they are editing.

Just my 2p.

Zppix added a comment. · May 2 2016, 5:29 PM

@tom29739 agreed there as well... but the task creator has a point.

People can simply register to hide their IPs, people with privacy issues can ask relevant revs to be hidden/suppressed. Honestly I don't see pros while I see the biggest cons ever: the end of countervandalism as we know it.

cscott added a subscriber: cscott. · Nov 18 2016, 6:06 PM

Perhaps we can think of this initially in terms of a refactoring. We have been too casual about IP information inside mediawiki. What if we took as a first step factoring out all IP-related code from the core db and pushing it into a separate db. So instead of "IP edits" we have some sort of automatically-generated pseudonym *but also record the IP address associated with this pseudonym in a separate database* (perhaps this functionality is actually in an extension, not in core mediawiki, so core mediawiki was totally "IP free"). Now we preserve all our abilities to track down sock puppets or do IP blocks, just requiring one indirection through the separate "IP database" which associates pseudonyms with IPs.

We can then take steps to further protect/limit/purge this IP address database independent of the core mediawiki database, and we don't have "hidden gotchas" in the core code because the core code doesn't manipulate IPs any more. And folks who do routine tasks like processing archive dumps of the core db don't stumble across IPs. And small or third-party or closed wikis without vandalism concerns can use "core mediawiki" without any IP tracking at all.
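
A minimal sketch of that indirection, assuming two separate stores. All class and method names are hypothetical, and in-memory dicts and lists stand in for the two databases.

```python
import secrets
from typing import Dict, List, Set, Tuple


def new_pseudonym() -> str:
    """Automatically generated pseudonym (the format is an assumption)."""
    return "Anon-" + secrets.token_hex(4)


class CoreWiki:
    """Core store: knows pseudonyms only, never IP addresses."""

    def __init__(self) -> None:
        self.revisions: List[Tuple[int, str, str]] = []  # (rev_id, pseudonym, text)

    def save(self, pseudonym: str, text: str) -> int:
        rev_id = len(self.revisions) + 1
        self.revisions.append((rev_id, pseudonym, text))
        return rev_id

    def revisions_by(self, pseudonyms: Set[str]) -> List[int]:
        return [rev_id for rev_id, name, _ in self.revisions if name in pseudonyms]


class IpStore:
    """Separate store (e.g. an extension) mapping pseudonyms to IP addresses."""

    def __init__(self) -> None:
        self._ips: Dict[str, Set[str]] = {}

    def record(self, pseudonym: str, ip: str) -> None:
        self._ips.setdefault(pseudonym, set()).add(ip)

    def pseudonyms_for_ip(self, ip: str) -> Set[str]:
        return {name for name, ips in self._ips.items() if ip in ips}

    def purge(self, pseudonym: str) -> None:
        """Drop IP data for one pseudonym without touching the core store."""
        self._ips.pop(pseudonym, None)
```

Finding all edits from an IP becomes `core.revisions_by(ip_store.pseudonyms_for_ip(ip))`: one extra lookup, with the IP data purgeable or access-controlled independently of the revisions.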

I don't think phab is the right place for this. This discussion has been going in circles for like 10 years, and the arguments aren't usually implementation related. If you can get some set of security requirements (both cvn wise and privacy wise) out of the wikipedians, then it would be the time to have a phab discussion.


Huji added a subscriber: Huji. · Dec 6 2016, 7:23 PM
Tgr updated the task description.
Nirmos added a subscriber: Nirmos. · Sep 9 2017, 7:20 AM
tstarling added a subscriber: tstarling.

Claiming this for MWPT at least for some infrastructure work.

Ltrlg added a subscriber: Ltrlg. · Jun 14 2018, 6:38 AM

This task is for the discussion of the technical and social feasibility of that option.

I think the improvement this will have in terms of privacy has tremendous merit, but on the surface the social feasibility is in my opinion low. This will dramatically impair counter-vandalism. Echoing T133452#2248903, being able to see long-term abuse from a given IP or range is critical on big wikis like enwiki. Instead of being able to take one single action to stop the disruption, we'll be playing an endless game of whac-a-mole, one account at a time (though not as bad with autoblocks). Much of the work would have to be shifted to CheckUsers, who even then would only be able to see 90 days back, and not be able to tell that there have been years of continued abuse, worthy of a lengthy block. In order to feasibly keep up with the abuse (the need for range blocks, especially), we'd need to appoint a LOT more CheckUsers, which sort of defeats the purpose of it being a highly restricted right.

As another example, I sometimes use AbuseFilter to prevent harassment (or any disruption, for that matter) from a particular IP range (because the abuser only edits while logged out). We can effectively stop the abuse and still allow others in those IP ranges to continue editing freely. Similarly, you may block an abusive range instead of using page protection, where the latter shuts out innocent editors. This will not be possible if everyone is behind an account, no? And what about the account creation throttle? If I continually open up the wiki in private browsing mode, can I continue to keep making these pseudo-accounts, and convert them to real accounts?

My thoughts are to simply make the fact that you're editing as an IP more prominent, perhaps even requiring confirmation. I realize the specifics of the temporary account system still aren't well-defined, but if it will mean we can't see long-term contributions from a single end user or IP range, I think the Foundation should be prepared for the possibility that the community will not be able to keep up with the influx of abuse.

I think the improvement this will have in terms of privacy has tremendous merit, but on the surface the social feasibility is in my opinion low. This will dramatically impair counter-vandalism. Echoing T133452#2248903, being able to see long-term abuse from a given IP or range is critical on big wikis like enwiki. Instead of being able to take one single action to stop the disruption, we'll be playing an endless game of whac-a-mole, one account at a time (though not as bad with autoblocks). Much of the work would have to be shifted to CheckUsers, who even then would only be able to see 90 days back, and not be able to tell that there have been years of continued abuse, worthy of a lengthy block. In order to feasibly keep up with the abuse (the need for range blocks, especially), we'd need to appoint a LOT more CheckUsers, which sort of defeats the purpose of it being a highly restricted right.

Why does CheckUser need to be a highly restricted right? Anyone who edits logged-out by accident immediately has their IP address exposed to the public, which implies that we don't really value IP address privacy that highly. So why do we need to put up high barriers against giving access? Better to at least have some sort of password protection rather than just giving that information to everyone. It could just be given to all admins who sign an NDA.

Why should the expiry time have to stay at 90 days? This figure was set with little community consultation.

Currently, if someone edits logged-out by accident, with a talk page signature, the best we can do is oversight the whole revision. My idea is to split IP addresses into a separate table so that it's not so awkward to delete that information or control access to it.

Using the IP address as the username was a terrible user interface idea. IPv6 addresses in particular are ridiculously long and opaque. This is not a friendly UI design. I want to hide that detail from non-technical users.

I understand that some people think CheckUser access should be restricted on the basis that privacy is for "us", not for "them". The trouble is, vandals create accounts too, and registered users edit logged-out by accident. It's not really a clean us-versus-them separation.

MusikAnimal added a comment. · Edited · Jun 26 2018, 4:23 AM

Why does CheckUser need to be a highly restricted right? Anyone who edits logged-out by accident immediately has their IP address exposed to the public, which implies that we don't really value IP address privacy that highly. So why do we need to put up high barriers against giving access?

It gives you access to seasoned editors' IP addresses, too. Of course everyone deserves just as much privacy, but we could lose long-term prolific editors to outing. I certainly wouldn't mind more CUs as it is, but I'm not sure all are going to be OK with that. Currently the policy is quite strict, for better or worse. Logged out users at least are meant to be aware that their contributions are being recorded as their IP. If they don't know, that can easily be improved, no? I think requiring confirmation on the first edit per session would suffice, and that should help with people accidentally editing while logged out. The extra clicks might mean some abandon their edit, but all things considered it's probably still an improvement, and an experiment we could conduct now without too many developer resources, I assume.

Why should the expiry time have to stay at 90 days? This figure was set with little community consultation.

Can we increase this now? :) If we had temporary accounts and no IP edits, we'd need the CU logs to go back maybe a year or so (complete guess). Beyond that we could probably go off of the block log to deduce that there's long-term abuse. It's still tricky for ranges, since the contributions could be scattered across a large number of IPs. The block logs currently don't do a great job at reporting blocks of subranges and individual IPs therein.

Using the IP address as the username was a terrible user interface idea. IPv6 addresses in particular are ridiculously long and opaque. This is not a friendly UI design. I want to hide that detail from non-technical users.

I agree, it seems mighty odd, but it sort of worked out in terms of counter-vandalism! I don't think StackOverflow (perhaps a bad example) and other sites with temporary accounts have to deal with the abuse we do.

The only thing I beg for is this part:

if it will mean we can't see long-term contributions from a single end user or IP range, I think the Foundation should be prepared for the possibility that the community will not be able to keep up with the influx of abuse

There are a lot of things that will need improving. Having to go through the CheckUser interface is going to slow down the workflow and make day-to-day counter-vandalism quite a pain. Every time a "temporary user" vandalizes, am I meant to run checks and see if it's an IP or range I should block? The underlying IPs would really need to be exposed automatically, built right into Special:Contributions. I think we should also rework AbuseFilter so that CUs can implement filters to act on specific IPs or ranges (this would be splendid to have now, even, e.g. account creation). Mind you also that non-admins (not just non-CUs) do a lot of work reporting long-term abuse and identifying abusive ranges.

Overall, if we're given sufficient counter-vandalism tools, then I think the temporary account system makes perfect sense. It's hard to weigh having bad privacy (although accounts are an option) with a clean wiki, versus good privacy with a wiki that possibly can no longer be relied upon. We should think long and hard about this, and with ample input from the people who devote significant time and energy ensuring we have a stable wiki for all to enjoy.

I'm increasingly getting the feeling that the only way this has a chance is if it is maximally conservative. We could attribute anonymous edits to a temporary account, but continue to display the IP address of anonymous users publicly.

There is the ip_changes table which currently supports IP range queries in Special:Contributions for anons only. We could carry on populating that. It has a key on rev_id so we could join on it in Special:Contributions and action=history to get an IP address for display. We could even use a tooltip, for UseMod nostalgia. In ChangesList, rc_ip could be used for display instead, avoiding the join for efficiency.

So Special:Contributions for an IP address would show you all the old anonymous edits for that IP address, plus any made by new temporary accounts using that IP address.
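
A rough sketch of that join, using Python's built-in `sqlite3` with a simplified, hypothetical schema. The column names (`rev_actor_name`, `ipc_ip`) and the "TempUser-N" names are illustrative stand-ins and do not match the real MediaWiki tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_actor_name TEXT, rev_page TEXT);
CREATE TABLE ip_changes (ipc_rev_id INTEGER PRIMARY KEY, ipc_ip TEXT);
""")
conn.executemany("INSERT INTO revision VALUES (?, ?, ?)", [
    (1, "198.51.100.7", "Sandbox"),   # old-style anonymous edit
    (2, "TempUser-1", "Sandbox"),     # temporary account, same IP
    (3, "TempUser-2", "Main Page"),   # temporary account, different IP
    (4, "Alice", "Main Page"),        # registered user: no ip_changes row
])
conn.executemany("INSERT INTO ip_changes VALUES (?, ?)", [
    (1, "198.51.100.7"),
    (2, "198.51.100.7"),
    (3, "203.0.113.5"),
])


def contributions_for_ip(conn: sqlite3.Connection, ip: str):
    """Old anonymous edits plus temporary-account edits made from this IP."""
    return conn.execute(
        """
        SELECT r.rev_id, r.rev_actor_name, i.ipc_ip
        FROM revision AS r
        JOIN ip_changes AS i ON i.ipc_rev_id = r.rev_id
        WHERE i.ipc_ip = ?
        ORDER BY r.rev_id
        """,
        (ip,),
    ).fetchall()
```

In this toy data, `contributions_for_ip(conn, "198.51.100.7")` returns revisions 1 and 2: the old IP-attributed edit and the temporary-account edit made from the same address.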

awight added a subscriber: awight. · Jan 29 2019, 11:11 PM
Krinkle added a project: TechCom-RFC. · Edited · Mar 28 2019, 5:37 PM
Krinkle added subscribers: Milimetric, Anomie, daniel and 9 others.

This task is effectively a superset of T172477. I've merged that into here, tagged as RFC, and incorporated part of its task description here (the problem statement).

Krinkle updated the task description. · Mar 28 2019, 5:52 PM
awight updated the task description. · Mar 29 2019, 12:25 AM
daniel moved this task from Inbox to Under discussion on the TechCom-RFC board. · Apr 10 2019, 7:51 PM

Moving to "under discussion" on the RFC board, since this is well fleshed out, and has seen some discussion in the past.

94rain added a subscriber: 94rain. · Apr 11 2019, 11:52 PM

I don't have any thoughts or opinion on this, just a question. Does this or T172477 have a dependency or interaction with T167246? E.g. anonymous editors would be handled as a separate type of actor, or something. Or is it a completely separate issue? Thanks.

Does this or T172477 have a dependency or interaction with T167246? E.g. anonymous editors would be handled as a separate type of actor, or something. Or is it a completely separate issue? Thanks.

Probably, yes. It was certainly on our minds when we talked about the actor work back in the day. Not sure how real-world implementation would turn out, though.

Tgr added a comment. · Apr 12 2019, 7:08 PM

I think the more interesting question is when anonymous user accounts should be created. We cannot create them for visits which don't result in a page save attempt or similar, for obvious scaling reasons. If we create them on write (i.e. when an actor ID needs to be inserted somewhere), the user will be detached from their contribution if the user agent does not persist the session (e.g. browsing with cookies disabled) without us being able to detect it beforehand and warn them. If we create them just before write (e.g. whenever a CSRF token is obtained, like the user opening the edit form), that means doing stateful work on GET.
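
The create-on-write trade-off can be illustrated with a toy model. This is a sketch under stated assumptions: the `Server` class, its `save` method and the cookie format are all hypothetical, and a plain dict stands in for session storage.

```python
import secrets
from itertools import count
from typing import Dict, List, Optional, Tuple


class Server:
    """Toy server that creates a temporary account lazily, on first write."""

    def __init__(self) -> None:
        self._ids = count(1)
        self.sessions: Dict[str, int] = {}       # cookie value -> account ID
        self.edits: List[Tuple[int, str]] = []   # (account ID, text)

    def save(self, cookie: Optional[str], text: str) -> Tuple[str, int]:
        """Store an edit; return the cookie the client should keep and the account ID."""
        if cookie is None or cookie not in self.sessions:
            # No usable session: create a new temporary account on this write.
            cookie = secrets.token_hex(8)
            self.sessions[cookie] = next(self._ids)
        account_id = self.sessions[cookie]
        self.edits.append((account_id, text))
        return cookie, account_id


server = Server()

# Client that persists the cookie: both edits land on one account.
cookie, first = server.save(None, "edit A")
_, second = server.save(cookie, "edit B")

# Client with cookies disabled: every edit creates a fresh account.
_, third = server.save(None, "edit C")
_, fourth = server.save(None, "edit D")
```

In this model the server only learns that the cookie was dropped when the next save arrives without it, which is too late to warn the user; the earlier edit is already attributed to an account they can no longer reach.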

Krinkle updated the task description. · Apr 15 2019, 4:01 PM
Tgr updated the task description. · Apr 15 2019, 6:44 PM
Samat added a subscriber: Samat. · Wed, Apr 24, 10:37 AM