
Phabricator should suggest possible duplicates when creating a new task
Open, Low · Public · Feature

Description

Wikimedia’s issue tracker used to be powered by Bugzilla. With Bugzilla, when we created a new report, we got a list of possible duplicates. I find it useful. It might be even more useful for new users filing what is probably going to be an obvious duplicate, so that fewer duplicates are reported.

Context:

It seems about 6.57% of tickets are marked as duplicates in Wikimedia Phabricator (see comment from aklapper).

What is needed

Recognition of linguistic or semantic similarity (or something comparable), ideally at both of these points:

  • Right after a title is entered
  • After the description is entered (when the report is complete)

Old Upstream ticket: https://secure.phabricator.com/T4828

Details

Reference
fl74


Event Timeline

There are a very large number of changes, so older changes are hidden.

Imagine a world where James Forrester spends his day just marking my tasks here as duplicate.

Restricted Application removed a subscriber: Mjbmr. · Apr 13 2016, 12:11 AM

This project is selected for the Developer-Wishlist voting round and will be added to a MediaWiki page very soon. To the subscribers, or proposer of this task: please help modify the task description: add a brief summary (10-12 lines) of the problem that this proposal raises, topics discussed in the comments, and a proposed solution (if there is any yet). Remember to add a header with a title "Description," to your content. Please do so before February 5th, 12:00 pm UTC.

Last update in https://secure.phabricator.com/T4828#106500 from Apr 11 2015:

If we did https://secure.phabricator.com/T7805 first and got a generic ApplicationSearch endpoint out of it, I'd be open to writing this as an extension CustomField and then disavowing all knowledge of it. The results UI wouldn't be custom, but maybe that's fine. We might need to pay down some infrastructure debt to let installs put this immediately underneath the "Title" field; I think a couple of the fields are still hard-coded.

This suggestion got the most votes in the Developer Wishlist (with a quite confident lead). @Qgil @greg is either of you interested in making this an official WMF goal?

I'm already working on Phabricator search stuff, so this is not that far out of scope...

@mmodell @greg I am assuming that you have this ball in your court. If you think that sponsoring the development of this feature might help, please let me know. Depending on the cost, we might be able to cover it during this fiscal year (before the end of June).

(I still would like to see the completion of T136213 before jumping on new funded tasks, though).

Thinking through the ways of addressing this. Will post more when we're ready to commit :)

Qgil raised the priority of this task from Lowest to Low. · Apr 20 2017, 8:12 AM

Open, Low :P

But seriously, this is not on RelEng's radar (and our Q2 goals are already fleshed out, and there are too many of them ;) ). It does not look like it is in upstream's current plans either (based on the lack of updates there).

We really need this. I just merged the seventh duplicate task of T259565, and they were all created within a span of a few hours, some just minutes apart. It would really be good if Phabricator were smarter on this end. (I was about to create a duplicate of this task too, before recalling that I had seen something like this.)

Jdlrobson subscribed.

Is now a good time to reconsider the Low priority? The lack of this creates a lot of work for triaging bugs and managing fragmented conversations, in particular for user-facing products.

As long as I see neither a good NLP / AI algorithm for the English language (more relevant to me) nor much research showing that suggesting potential duplicates significantly lowers the number of created duplicates (less relevant to me), this feels low/lowest priority to me. Maybe it's just my usually disappointing personal (anecdotal) experience with such "proposals" in GitLab and Bugzilla instances which makes me reluctant.
An algorithm might be way more successful if it gave way more weight to recently created (or edited?) tickets, I guess?
In any case, regarding WMF I doubt that there are currently resources to tackle such a huge project. :-/ Feels like upstream territory.
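
As a rough illustration of the recency idea above, here is a minimal Python sketch. It assumes some text-similarity score between 0 and 1 already exists; the function names and the 7-day half-life are hypothetical and not taken from Phabricator or Phorge.

```python
from datetime import datetime, timedelta


def recency_weight(created: datetime, now: datetime,
                   half_life_days: float = 7.0) -> float:
    """Exponential decay: 1.0 for a brand-new ticket, 0.5 after one half-life.

    half_life_days is an assumed tuning knob, not a value from this thread.
    """
    age_days = max((now - created).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def weighted_score(text_similarity: float, created: datetime,
                   now: datetime) -> float:
    """Scale a 0..1 text-similarity score by how recently the ticket was created."""
    return text_similarity * recency_weight(created, now)


now = datetime(2020, 8, 5, 12, 0)
print(weighted_score(0.8, now - timedelta(hours=3), now))   # recent ticket keeps most of its score
print(weighted_score(0.8, now - timedelta(days=60), now))   # old ticket is strongly discounted
```

Whether such weighting would actually reduce duplicates is an open question in this discussion; the sketch only shows how the weighting could be expressed.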

I just merged the seventh duplicate task of T259565

Looking at the task summaries, I have trouble finding language-based patterns that would have allowed proposing the "right" existing ticket.
I see the root parse three times, the word flow in two ticket summaries, mobile four times, and history five times. Hmm. There is some cross-matching, but it is still quite mixed for a pool of 10 items.

T259565: [Regression] Unparsed wikitext in various JavaScript messages
T259696: Footnote in Flow messages in not parsed
T259602: Last edit indicator is broken on Minerva skin
T259601: History box error on Mobile Web for enwiki
T259584: Link to history broken
T259583: Revision History not accessible on mobile
T259581: Mobile page history "footer" showing raw URL
T259575: [regression -wmf.2] Homepage - SE filter "Create a new article" description displays ulr -encoded text not a link
T259580: "flow-wikitext-editor-help-and-preview" message is broken on flow pages on all wikis
T259571: Page history log bug
T259579: "Last modified" footer on mobile unparsed date and user links
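
The cross-matching described above can be reproduced with a short Python sketch that counts how many of the listed summaries contain each word root. The helper name is hypothetical, and plain case-insensitive substring matching is an assumption made for illustration; it is not how Phabricator's search works.

```python
# Task summaries copied verbatim from the list above.
TITLES = {
    "T259565": '[Regression] Unparsed wikitext in various JavaScript messages',
    "T259696": 'Footnote in Flow messages in not parsed',
    "T259602": 'Last edit indicator is broken on Minerva skin',
    "T259601": 'History box error on Mobile Web for enwiki',
    "T259584": 'Link to history broken',
    "T259583": 'Revision History not accessible on mobile',
    "T259581": 'Mobile page history "footer" showing raw URL',
    "T259575": '[regression -wmf.2] Homepage - SE filter "Create a new article" '
               'description displays ulr -encoded text not a link',
    "T259580": '"flow-wikitext-editor-help-and-preview" message is broken '
               'on flow pages on all wikis',
    "T259571": 'Page history log bug',
    "T259579": '"Last modified" footer on mobile unparsed date and user links',
}


def tickets_containing(root: str, titles: dict) -> list:
    """Return the IDs of tickets whose summary contains `root` (case-insensitive)."""
    return sorted(tid for tid, title in titles.items() if root in title.lower())


counts = {root: len(tickets_containing(root, TITLES))
          for root in ("parse", "flow", "mobile", "history")}
print(counts)
```

No single root covers more than about half of the pool, which is the difficulty the comment points at.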

Maybe we could do something as simple as showing a list of all the most recently submitted tasks on the submission page? That might catch some things.

Maybe we could do something as simple as showing a list of all the most recently submitted tasks on the submission page? That might catch some things.

I don't think many people want to get a list of 50 tasks into their face and then spend time reading that list every single time.
It might catch a few things.
It will also condition basically everybody to scream and quickly scroll down.

Looking at the last 10000 tickets created, 4.19% of tickets are marked as duplicates.
Might be biased (too recently created to have been triaged?), so looking at all tickets created since launching Phab, 6.57% of tickets are marked as duplicates.

-- Status counts for the last 10000 tickets created (task IDs above 249776)
SELECT t.status,COUNT(t.id) FROM phabricator_maniphest.maniphest_task t WHERE t.id > 249776 GROUP BY t.status;
+-----------+-------------+
| status    | COUNT(t.id) |
+-----------+-------------+
| declined  |         213 |
| duplicate |         419 |
| invalid   |         324 |
| open      |        5158 |
| resolved  |        3800 |
| stalled   |          86 |
+-----------+-------------+

-- Status counts for all tickets created since launching Phab (task IDs above 75682)
SELECT t.status,COUNT(t.id) FROM phabricator_maniphest.maniphest_task t WHERE t.id > 75682 GROUP BY t.status;
+-----------+-------------+
| status    | COUNT(t.id) |
+-----------+-------------+
| declined  |       12500 |
| duplicate |       12088 |
| invalid   |        9638 |
| open      |       36926 |
| resolved  |      112026 |
| stalled   |         901 |
+-----------+-------------+
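
For reference, the two percentages follow directly from the query results above. A small Python sketch of the arithmetic, where the variable and function names are only illustrative:

```python
def duplicate_share(counts: dict) -> float:
    """Percentage of tasks whose status is 'duplicate'."""
    return 100.0 * counts["duplicate"] / sum(counts.values())


# Status counts copied from the two result tables above.
last_10000 = {"declined": 213, "duplicate": 419, "invalid": 324,
              "open": 5158, "resolved": 3800, "stalled": 86}
since_launch = {"declined": 12500, "duplicate": 12088, "invalid": 9638,
                "open": 36926, "resolved": 112026, "stalled": 901}

print(round(duplicate_share(last_10000), 2))    # 419 of 10000 tasks
print(round(duplicate_share(since_launch), 2))  # 12088 of 184079 tasks
```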

Any changes that would make this possible? ^_^

See my previous comments here; has some situation changed, or have new arguments arisen?

Another root problem: it's difficult, if not impossible, for newcomers to get a "big picture" after landing on a bug reporting form. That of course causes some of these duplicates.

I really like the idea of "minimal forms", but I'd like to reduce the feeling of «Welcome to this form, put your complaint/request here in this box, we'll find duplicates for you».

In the case of forms with at least one Tag, it would make sense if there was a way to visit that Tag. I'm definitely taking it very far, but it's a problem. A side effect would be to encourage people to be curious, discover other things, and help each other, "finding many friends on this journey" or something like that.

Chealer renamed this task from Phabricator should suggest possible duplicates when creating a new task to Phabricator does not warn about (apparent) duplicates when creating a new task. · Jul 8 2025, 2:42 PM
Chealer updated the task description.

@Chealer: Please do not rename feature request titles in this way. It's not a bug ("does not warn"). It's a feature ("should suggest"). Thanks for your understanding.

Aklapper renamed this task from Phabricator does not warn about (apparent) duplicates when creating a new task to Phabricator should suggest possible duplicates when creating a new task. · Jul 8 2025, 3:13 PM
Chealer changed the subtype of this task from "Task" to "Feature Request". · Jul 8 2025, 3:35 PM

@Aklapper: This is not a feature but, like the vast majority of tickets, an issue report, with an implicit (sometimes explicit) request to solve that issue. The corrected title does not imply this issue is a bug any more than "Andre does not do my dishes" is a bug. It merely describes the current situation.
We try to keep reports functional and focused on problems. An even better title would be "ITS (Phabricator) allows creation of too many duplicate tickets". Analyzing which solutions are best is a second step.
Thanks for your understanding.

Hi maintainers,

I’m Safrin, a beginner contributor interested in Phabricator / Wikimedia tooling. I’ve read through the full discussion and understand the concerns around NLP quality, feasibility, and maintenance cost.

Instead of attempting a full AI-based duplicate detection, I was thinking of starting with a very small and practical step, such as:

While typing the task title, show a small list of recently created tasks with simple keyword similarity (e.g. trigram / basic search ranking).

This would be only a suggestion panel, not enforcement.

The goal would be to reduce very obvious duplicates created within short time windows.
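
To make the proposal a bit more concrete, here is a minimal Python sketch of trigram-based title matching against a list of recent tasks. Everything in it is an assumption made for illustration: the function names, the Jaccard measure, the padding scheme and the 0.2 threshold are hypothetical, and none of it reflects the actual Phorge search code or API.

```python
def trigrams(text: str) -> set:
    """Set of character trigrams of the lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if not normalized:
        return set()
    padded = f"  {normalized} "  # padding so word starts and ends form trigrams too
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two trigram sets, from 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def suggest(draft_title: str, recent_tasks: list,
            limit: int = 5, threshold: float = 0.2) -> list:
    """Return up to `limit` (score, task_id, title) tuples, best match first.

    recent_tasks is a list of (task_id, title) pairs; tasks scoring below the
    (assumed) threshold are not suggested at all.
    """
    scored = [(similarity(draft_title, title), tid, title)
              for tid, title in recent_tasks]
    scored = [entry for entry in scored if entry[0] >= threshold]
    scored.sort(key=lambda entry: entry[0], reverse=True)
    return scored[:limit]


# A few task summaries quoted earlier in this discussion.
recent = [
    ("T259584", "Link to history broken"),
    ("T259583", "Revision History not accessible on mobile"),
    ("T259602", "Last edit indicator is broken on Minerva skin"),
]
for score, tid, title in suggest("History link is broken on mobile", recent):
    print(f"{score:.2f}  {tid}  {title}")
```

A prototype along these lines would only catch duplicates whose titles share wording; as noted earlier in the thread, the duplicates of T259565 had fairly little lexical overlap.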

I would like to first prototype this as a minimal UI + search integration experiment and share results before proposing anything larger.

If this sounds reasonable, I’d be happy to work on a proof-of-concept or discuss the best direction to start.

Thanks for your time and guidance.

Hi and welcome! Wikimedia Phabricator is an instance of the Phorge software (with some customizations), so any such potential development should happen in the upstream codebase at https://we.phorge.it/ and not here in Wikimedia Phabricator.
There seems to be no ticket about this feature proposal in upstream yet: https://we.phorge.it/maniphest/query/S.1.IWmOkHzE/#R
This is not a good beginner task, unless you are willing to spend a lot of time trying to understand the existing Phorge codebase.

(e.g. trigram / basic search ranking).

FYI the search code within Phorge already supports stemming etc.

Personally, I don't see convincing reasons to implement this (see previous comments), but that is just my own opinion: based on experience, I do not expect many people to spend time reading through yet another list of items, or to read instructions in general.