Page MenuHomePhabricator

RFC: Overhaul the CheckUser extension
Open, LowPublic

Description

CheckUser (CheckUser · rECHU) is probably one of the oldest extensions still working in Wikimedia Projects.

Designed c. 2005, the extension is one of the critical tools that helps us to deal with problematic cases of abuse such as sock puppetry, vandalism and spam (some history here). As time went by, the needs of the projects increased, and as we can see in the CheckUser work board, the bugs accumulated without being resolved, primarily because its code, albeit old, also seems hard for developers to read and work with (refer to T132892: CheckUser UI revamp and its related tasks, as well as the work board I've linked above).

The lack of an active maintainer for the Extension in question lowers development productivity in this area (i.e.: bug resolution, testing and extension development ex Phabricator). To resolve this issues, I think we need to think about overhauling the CheckUser extension. That overhaul should also be an opportunity for us to make the extension work with all the new features and code MediaWiki has at its current state. We can also take this opportunity to gather opinions from CheckUsers on which new functions the new CheckUser extension should have, etc.

I think that we can start this big task by making a UI revamp, and later explore if new features could be added to the extension as well. If someone or various people could also volunteer to be active maintainers of the extension, that'd also be fantastic.

I'd like to thank all of those who created the extension and have worked with it so far.
Sorry if the format is wrong or I missed something. It's the first RFC I've filed here. Let me know if there's something that needs fixing and I'll try to do that.

Best regards.

Edit: since rewritting from scratch is a bad option according to experienced developers below, modified some parts of the intro as well.

Related Objects

View Standalone Graph
This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Hello,

Let's do a simple exercise:
What is your role on the wiki where you use CheckUser?

Wikimedia Steward and CheckUser at two other projects.

What prompts you to use CheckUser? (Vandalism, suspected sockpuppetry, etc.)

To deter and prevent instances of vandalism, spam, crosswiki abuse and abuse of multiple accounts, across many Wikimedia projects due to my role as steward.

What frustrates you when you try to use CheckUser?

May I point you to the workboard?

The only specific complaint I see listed here is that the interface is bad, which is a very valid complaint. If that's so, there should be a task for re-designing the interface, with an assessment of who uses the tool and what the tool is used for so that the interface can be optimized for its use cases.

I refer you to the introduction of this RFC. There is a bug for this T132892: CheckUser UI revamp.

This is a very strong assertion, and I do not see specifically why this is the case. Is there something that makes it structurally incompatible with MediaWiki? Does it not work at all? Does it have an unacceptable false positive or false negative rate?

There have been attempts to create a global checkuser tool. I can't quote vermatim from the source because it's prohibited, but we've been told that with the current state of the tool is impossible to do so.

So, to recap, we have two great issues: the first is the UI redesign which will be a great improvement, but we also need the second, which is a global CU tool; and to the best of my knowledge, the second one can't happen just with just improving the UI.

An analysis of the age and turnover of the bugs tagged CheckUser would make this a much stronger RFC. Would you be willing to do that work?

Yes, I can create a table listing the open bugs with the date of creation. So far, we've 42, which are (ordered by ticket number):

It'll get me some more time to get the creation dates of the bug, as many of them were filled under the old Bugzilla service.

Best regards.

List of pending tasks ordered by ticket number
TaskCreated (y/m/d)
T11858: Recursive checkuser feature needed2007-05-09
T16699: More versatile searching in CheckUser log2008-07-01
T18306: Allow user to specify block settings from CU form2008-11-11
T21796: Checkuser watchlist feature2007-07-17
T21964: Provide a link to the user on account creation2009-07-28
T24119: Enhance Blocking and tagging from checkuser interface: add new page2010-06-16
T24120: Enhance Blocking and tagging from checkuser interface: add dropdowns2010-06-16
T24890: Checkuser mass blocker does not affect already blocked users2010-03-29
T26231: Allow CheckUsers to filter checks for account creations only2010-07-02
T26232: Allow CheckUsers to filter results to new accounts only2010-07-02
T26411: List recent User-Agents for a user or IP2010-07-17
T26552: Wiki-email to store a hash of the message text2010-07-26
T27053: Replace user / talk pages feature of "Get users" creates a user/talk page if it doesnt exist2010-09-04
T29431: Drop legacy installer script from CheckUser extension (update.php hooks should already exist)2011-02-15
T33712: CheckUser User-Agent check feature2011-10-14
T36372: New User hook broken for PG, breaks normal logging2012-02-13
T36438: Reverse DNS lookup2012-02-16
T39612: CheckUser displays 2001:6a0:200:121::2/128 and 127.0.0.1/32 on PostgreSQL2012-06-15
T39613: CheckUser records XFF: 0 on PostgreSQL2012-06-15
T39971: No results on CheckUser query from 127.0.0.1 on PostgreSQL2012-06-26
T41013: Incomplete i18n for log entries in CheckUser2012-08-03
T46800: IPv6 data loaded from recentchanges table is in compact instead of full address format2013-02-08
T49505: CheckUser results (userlinks) cannot be customized2013-04-22
T54849: Checkuser option "get users" should point out when user password is changed2013-08-14
T61677: FlaggedRevs reviews should be visible in CheckUser2014-01-05
T65011: the messages checkuser-email-action and checkuser-reset-action need to support GENDER2014-03-24
T69811: CheckUsers "Get users" design needs improvements2014-07-10
T94735: Rewrite Special:CheckUser using server side templates2015-04-01
T99056: Improve GENDER support for checkuser logs2015-05-14
T106282: action=query list=checkuserlog API response is inconsistent with other action=query responses2015-07-19
T107651: Mark accounts that hit limit of account creation in one IP for functionaries2015-08-01
{T112751}2015-09-16
T113205: checkuserlog API documentation references non-existent "culstart" and "culend"2015-09-21
T117801: Checkuser should provide option to expose registered email address2015-11-04
T121190: Add an option to show only IPs or accounts in get users2015-12-11
T122296: Allow checking by last 64 bits of an IPv6 address2015-12-23
T123114: CheckUser stores two rows for account creations on WMF wikis2016-01-08
T125324: Allow checkusers to compare the watchlists of two accounts 2016-01-31
T131207: Create a checkuser entry for global rename requests2016-03-29
T132892: CheckUser UI revamp2016-04-28
T139809: Bad output of AbuseFilter blocks in CU get edits2016-07-08

Since some of them are quite old, I think that we need to clear that list of no longer valid bugs, just in case the issues reported got already resolved. Then focus on the UI and i18n bugs IMHO.

"Rewrite from scratch" is one potential solution to big problems with an existing codebase. I'd still love to see an improved explanation providing specific problems with the current extension.

@MarcoAurelio: T139810#2444212 states that "CU is obsolete, and has flaws that require a major rewrite to fix it." This feels like a statement that welcomes more elaboration - due to which specific architecture issues or specific bugs / missing functionality did you come to this conclusion?

Relatedly, T139810#2485937 asks "What's the most important problem to solve from your perspective?" and T139810#2487919 asks "What frustrates you when you try to use CheckUser?"
These are good questions, as every non-trivial piece of software has bugs and unfulfilled feature requests. Hence 'just' looking at the entire list of open tasks does not provide a strong argument for the proposed 'solution' (rewriting from scratch; which might introduce new regressions as explained in the link in T139810#2443633).

T139810#2443688 mentions that the "most useful addition would be a global checkuser". Feasibility whether this could be achieved via changes to the existing codebase should be discussed in a dedicated task. (Is T131207: Create a checkuser entry for global rename requests about this?)
And the initial description also mentions T132892: CheckUser UI revamp.
Currently these seem to be the only two specific issues addressing the question.

(In addition, there might be (a perception of?) a low level of maintainership. This has been challenged by provided numbers for Git activity in T139810#2460521.)

I also don't think we need to start from scratch and write a new extension. Trying to do that would almost certainly be more work than working on refactoring and improving the current code. It's also not like CheckUser is the only extension deployed on our wikis with bugs that are not quickly fixed (and so needing completely new extensions). The lack of an active maintainer also doesn't seem to be much of an issue during this year so far as I have received reviews quicker than previously. I think I will have some free time in the upcoming weeks so assuming that others (both devs and users) are interested in this, I can attempt to work on improving the extension as promised on the other task.

At Wikimania I had the opportunity to sit down with @Ajraddatz and understand how he uses CU, and got some good notes and workflow ideas from it.

Is it possible to share it with us? Maybe in a private paste if it cannot be made public.

(In addition, there might be (a perception of?) a low level of maintainership. This has been challenged by provided numbers for Git activity in T139810#2460521.)

I'm not sure what data git-blame uses to create such info, but I think that table is not accurate since nearly all commits are translation updates to the i18n files which a bot do import from translatewiki.net. "Human" commits to the repo are just 36 this year and not 840 as mentioned by Tim above; however I accede to the point that my affirmation that "nobody" works on the extension was not accurate; and of course let me say again that the contributions are appreciated.

"Rewrite from scratch" is one potential solution to big problems with an existing codebase. I'd still love to see an improved explanation providing specific problems with the current extension.

Well, I think we all concurr that the current status of the UI needs an update and that we also would benefit from new features in the future, which are being discussed separately. Can we focus on UI for now at least, if a rewrite from scratch is not optimal (I don't know, I'm not an engineer again).

T139810#2443688 mentions that the "most useful addition would be a global checkuser". Feasibility whether this could be achieved via changes to the existing codebase should be discussed in a dedicated task. (Is T131207: Create a checkuser entry for global rename requests about this?)

Nope, global CU is a wanted new tool. I'm not sure why there's not a task created about it but work was done. Sadly the entire project is deceased. What is being requested there is that the software should create a CU entry and register global rename requests requested through Special:GlobalRenameRequest.

I don't think listing ever single task in the E:CheckUser workboard is the best way forward for this task.

I think the list should be split up in the goals, for example, Tasks that should be worked in the Short/Medium/Long terms, It should also be looked at what would have benefits to the end users of the tools, back-end users (eg: ops team for maintenance scripts) and non wmf-cluster users.

Also we should talk about the scope of the tasks that should be looked at, For example there are several PostgreSQL related tasks so those probably shouldn't be classed as a priority in a WMF priority schedule (If they are fixed in the course of other tasks and improvements such as Database abstraction improvements, great!) as that DB back-end is not used on the cluster.

There is also a couple of tasks that at quick scan, I don't believe would align with our Privacy Policy, Legal guidelines or have consensus (but could be desired by other external users) so they should be prioritized lower as well.

MarcoAurelio renamed this task from RFC: New CheckUser extension to RFC: Overhaul the CheckUser extension.Jul 31 2016, 2:55 PM
MarcoAurelio updated the task description. (Show Details)
MarcoAurelio updated the task description. (Show Details)

I am pragmatic (that was a disclaimer), and therefore I see little value discussing whether we should rewrite from scratch or modify the existing code when the end product is unclear. I think instead of using this task to discuss philosophical questions about rewrite vs modify or is anyone contributing to this code at all, we should use it to reach consensus about what CU tool should look like. And depending on what we agree on, we can decide if a rewrite is needed or not.

Again, I am pragmatic! So I don't see much value in telling others what we should do; I see value in actually doing that thing, in this case in actually drawing a picture of what CU should look like.

I think the CU tool should allow the following (it currently does not):

  • Sorting the results by IP (I often want to find the high level "ranges" used by a user, that gives me an easy way to screen if two users are similar at all (if their ranges are not even close, then they are not). Right now, results are always returned sorted by date; I need to be able to sort by IP.
  • A list of distinct UAs used by a user. In fact, I think the "get IPs" function should be replaced by a "get summary" page which shows to you distinct IPs (sortable by time or by IP), distinct UAs (sortable by time or UA), and distinct IP-UA combos (sorted by time).
  • More decision support for UAs is needed. I don't think it'll be that much extra time to replicate what http://useragentstring.com/ or similar websites do. In the ideal world, I want CU tool to analyze the UA for me and already show me information like:
    • What browser, OS, etc does this UA represent
    • What year and month was that particular browser version released, and what year and month was a next version released (I find the date at which a user upgrades their browser a good clue for matching accounts or rejecting their similarity)
    • Does it look like a valid or a forged UA?
  • I know I might be asking for something that will never happen but: more decision support for IPs! I want CU tool to already tell me which country an IP belongs to, which ISP, etc. I know this information changes over time, and is not publicly and freely available, but I think at the very least, the CU code should be modified such that you could "extend" it by providing a CSV file containing IP-to-country and/or IP-to-ISP mappings. That way, we don't need to publish such a mapping as part of the code, but major users of the CU tool such as WMF can pay for proprietary lists like that and install them for their wiki. I envision a day that I run the CU tool and don't have to run fifty WHOISes right after.

I can think of many other front- and backend improvements. I encourage us to discuss those here and mock up the CU 2.0 rather than discussing what to do with old code.

Hi @Huji, thanks for the many clarifications, and bringing this back to a specific design. Given that this is not going to be an easy RFC (where "easy" means: very clear question, very little prose necessary), it seems like this should be drafted on mediawiki.org. Here's a deep link to the submission template. We can use this task to keep track of the state of the RFC.

daniel subscribed.

Dropping this off the RFC board, as per discussion during the ArchCom meeting. There is no architectural or strategic/cross-cutting technical issue to be discussed here. Once there are concrete proposals for implementations, these can be covered by RFCs, if appropriate.

The intent behind this ticket seems to be to get resources to overhaul an old part of the software, which currently has no owner, and has been unmaintained for a while. This is indeed a problem, but it's not one that can be solved by an ArchCom RFC. It's an organizational issue, not a technical one. RFCs should come with a commitment of resources for implementation, they cannot be used to acquire resources.

I think the fact that admin tools currently have no clear owner among the WMF engineering teams is an issue that needs to be addressed by the VP of Product.