
RFC: Overhaul the CheckUser extension
Closed, ResolvedPublic

Description

CheckUser (CheckUser · rECHU) is probably one of the oldest extensions still in use on Wikimedia projects.

Designed c. 2005, the extension is one of the critical tools that help us deal with problematic cases of abuse such as sock puppetry, vandalism and spam (some history here). As time went by, the needs of the projects increased and, as we can see in the CheckUser work board, the bugs accumulated without being resolved, primarily because its code, besides being old, seems hard for developers to read and work with (refer to T132892: CheckUser UI revamp and its related tasks, as well as the work board I've linked above).

The lack of an active maintainer for the extension in question lowers development productivity in this area (i.e.: bug resolution, testing and extension development ex Phabricator). To resolve these issues, I think we need to consider overhauling the CheckUser extension. That overhaul should also be an opportunity for us to make the extension work with all the new features and code MediaWiki has in its current state. We can also take this opportunity to gather opinions from CheckUsers on which new functions the overhauled CheckUser extension should have, etc.

I think that we can start this big task with a UI revamp, and later explore whether new features could be added to the extension as well. If one or more people could also volunteer to be active maintainers of the extension, that'd also be fantastic.

I'd like to thank all of those who created the extension and have worked with it so far.
Sorry if the format is wrong or I missed something. It's the first RFC I've filed here. Let me know if there's something that needs fixing and I'll try to do that.

Best regards.

Edit: since rewriting from scratch is a bad option according to the experienced developers below, I have modified some parts of the intro as well.


Event Timeline


I also don't think we need to start from scratch and write a new extension. Trying to do that would almost certainly be more work than refactoring and improving the current code. It's also not as if CheckUser is the only extension deployed on our wikis with bugs that are not quickly fixed (and so needing a completely new extension). The lack of an active maintainer also doesn't seem to have been much of an issue so far this year, as I have received reviews more quickly than before. I think I will have some free time in the upcoming weeks, so assuming that others (both devs and users) are interested in this, I can attempt to work on improving the extension as promised on the other task.

At Wikimania I had the opportunity to sit down with @Ajraddatz and understand how he uses CU, and got some good notes and workflow ideas from it.

Is it possible to share them with us? Maybe in a private paste if they cannot be made public.

(In addition, there might be (a perception of?) a low level of maintainership. This has been challenged by provided numbers for Git activity in T139810#2460521.)

I'm not sure what data git-blame uses to create such info, but I think that table is not accurate, since nearly all commits are translation updates to the i18n files which a bot imports from translatewiki.net. "Human" commits to the repo number just 36 this year and not 840 as mentioned by Tim above. However, I concede that my assertion that "nobody" works on the extension was not accurate, and of course let me say again that the contributions are appreciated.

"Rewrite from scratch" is one potential solution to big problems with an existing codebase. I'd still love to see an improved explanation providing specific problems with the current extension.

Well, I think we all concur that the current state of the UI needs an update and that we would also benefit from new features in the future, which are being discussed separately. Can we focus on the UI for now at least, if a rewrite from scratch is not optimal? (I don't know; again, I'm not an engineer.)

T139810#2443688 mentions that the "most useful addition would be a global checkuser". Whether this could be achieved via changes to the existing codebase should be discussed in a dedicated task. (Is T131207: Create a checkuser entry for global rename requests about this?)

Nope, global CU is a wanted new tool. I'm not sure why no task was created for it, but work was done. Sadly the entire project is dead. What is being requested there is that the software should create a CU entry for global rename requests made through Special:GlobalRenameRequest.

I don't think listing every single task in the E:CheckUser workboard is the best way forward for this task.

I think the list should be split up by goal, for example into tasks that should be worked on in the short, medium and long term. We should also look at what would benefit the end users of the tools, back-end users (e.g. the ops team, for maintenance scripts) and non-WMF-cluster users.

We should also talk about the scope of the tasks to be looked at. For example, there are several PostgreSQL-related tasks; since that DB back-end is not used on the cluster, those probably shouldn't be classed as a priority in a WMF priority schedule (if they are fixed in the course of other tasks and improvements, such as database abstraction improvements, great!).

There are also a couple of tasks that, at a quick scan, I don't believe would align with our Privacy Policy or legal guidelines, or have consensus (but could be desired by other external users), so they should be prioritized lower as well.

MarcoAurelio renamed this task from RFC: New CheckUser extension to RFC: Overhaul the CheckUser extension. Jul 31 2016, 2:55 PM
MarcoAurelio updated the task description. (Show Details)

I am pragmatic (that was a disclaimer), and therefore I see little value in discussing whether we should rewrite from scratch or modify the existing code when the end product is unclear. I think that instead of using this task to discuss philosophical questions about rewrite vs. modify, or whether anyone is contributing to this code at all, we should use it to reach consensus about what the CU tool should look like. Depending on what we agree on, we can then decide whether a rewrite is needed or not.

Again, I am pragmatic! So I don't see much value in telling others what we should do; I see value in actually doing that thing, in this case in actually drawing a picture of what CU should look like.

I think the CU tool should allow the following (it currently does not):

  • Sorting the results by IP. I often want to find the high-level "ranges" used by a user; that gives me an easy way to screen whether two users are similar at all (if their ranges are not even close, then they are not). Right now, results are always returned sorted by date; I need to be able to sort by IP.
  • A list of distinct UAs used by a user. In fact, I think the "get IPs" function should be replaced by a "get summary" page which shows you distinct IPs (sortable by time or by IP), distinct UAs (sortable by time or UA), and distinct IP-UA combos (sorted by time).
  • More decision support for UAs is needed. I don't think it would take that much extra time to replicate what http://useragentstring.com/ or similar websites do. In an ideal world, I want the CU tool to analyze the UA for me and already show me information like:
    • What browser, OS, etc does this UA represent
    • What year and month was that particular browser version released, and what year and month was a next version released (I find the date at which a user upgrades their browser a good clue for matching accounts or rejecting their similarity)
    • Does it look like a valid or a forged UA?
  • I know I might be asking for something that will never happen, but: more decision support for IPs! I want the CU tool to already tell me which country an IP belongs to, which ISP, etc. I know this information changes over time, and is not publicly and freely available, but I think that at the very least the CU code should be modified so that you could "extend" it by providing a CSV file containing IP-to-country and/or IP-to-ISP mappings. That way, we don't need to publish such a mapping as part of the code, but major users of the CU tool such as WMF can pay for proprietary lists like that and install them for their wiki. I envision a day when I run the CU tool and don't have to run fifty WHOISes right after.
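
To make the list above a bit more concrete, here is a rough sketch of the "get summary" idea and the CSV-based IP lookup. It is written in Python purely for illustration and is not CheckUser code; the Entry record, the function names and the CSV column names (cidr, country, isp) are all assumptions made up for this example, and the addresses come from the reserved documentation ranges.

```python
import csv
import io
import ipaddress
from collections import namedtuple

# Hypothetical log row: one entry per logged action.
Entry = namedtuple("Entry", ["timestamp", "ip", "ua"])


def ip_sort_key(ip):
    """Sort key that orders addresses numerically, IPv4 before IPv6."""
    addr = ipaddress.ip_address(ip)
    return (addr.version, int(addr))


def summarize(entries):
    """Return (distinct IPs sorted by IP, distinct UAs, IP-UA combos by first use)."""
    ips = sorted({e.ip for e in entries}, key=ip_sort_key)
    uas = sorted({e.ua for e in entries})
    first_seen = {}
    for e in sorted(entries, key=lambda e: e.timestamp):
        # Keep only the earliest timestamp for each IP-UA combination.
        first_seen.setdefault((e.ip, e.ua), e.timestamp)
    combos = sorted(first_seen, key=first_seen.get)
    return ips, uas, combos


def load_ranges(csv_text):
    """Parse an operator-supplied CSV with assumed columns cidr,country,isp."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(ipaddress.ip_network(r["cidr"]), r["country"], r["isp"]) for r in reader]


def lookup(ip, ranges):
    """Return (country, isp) for the most specific matching range, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [r for r in ranges if r[0].version == addr.version and addr in r[0]]
    if not matches:
        return None
    _, country, isp = max(matches, key=lambda r: r[0].prefixlen)
    return country, isp


# Made-up sample data.
entries = [
    Entry("2016-07-01T10:00", "198.51.100.7", "UA-B"),
    Entry("2016-07-02T09:00", "192.0.2.10", "UA-A"),
    Entry("2016-07-03T12:00", "192.0.2.9", "UA-A"),
]
ranges = load_ranges("cidr,country,isp\n192.0.2.0/24,XX,Example ISP\n")
ips, uas, combos = summarize(entries)
for ip in ips:
    print(ip, lookup(ip, ranges))
```

Sorting on the numeric value of the address rather than on the string keeps neighbouring ranges together (192.0.2.9 sorts before 192.0.2.10), which is the point of the "sort by IP" request. Keeping the mapping in a separate CSV means a proprietary list can be installed per wiki without being shipped with the code.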

I can think of many other front- and backend improvements. I encourage us to discuss those here and mock up the CU 2.0 rather than discussing what to do with old code.

Hi @Huji, thanks for the many clarifications, and bringing this back to a specific design. Given that this is not going to be an easy RFC (where "easy" means: very clear question, very little prose necessary), it seems like this should be drafted on mediawiki.org. Here's a deep link to the submission template. We can use this task to keep track of the state of the RFC.

daniel subscribed.

Dropping this off the RFC board, as per discussion during the ArchCom meeting. There is no architectural or strategic/cross-cutting technical issue to be discussed here. Once there are concrete proposals for implementations, these can be covered by RFCs, if appropriate.

The intent behind this ticket seems to be to get resources to overhaul an old part of the software, which currently has no owner and has been unmaintained for a while. This is indeed a problem, but it's not one that can be solved by an ArchCom RFC. It's an organizational issue, not a technical one. RFCs should come with a commitment of resources for implementation; they cannot be used to acquire resources.

I think the fact that admin tools currently have no clear owner among the WMF engineering teams is an issue that needs to be addressed by the VP of Product.

Trust and Safety Product Team actively maintains CheckUser and has made numerous improvements to it over the last year. While there is still a lot that can be done to improve the extension, I think the way forward is to continue with individual feature requests and bug reports, rather than look at an overhaul. As such, I'm going to close this task, but if you disagree, please reopen and let's discuss.