Fix provided search results in Wikimedia Phabricator
Closed, ResolvedPublic

Description

Now that we have 75k tasks and a direct comparison with Bugzilla, it is clear that our Maniphest search doesn't perform at an acceptable level. The fact that such search is powered by Elasticsearch makes this problem more puzzling, because we know that the engine is capable of great results.

Whether this is a Phabricator problem to be resolved upstream, something wrong in our instance, or both, we need to do something about it.

For starters let's attach here specific issues as blocking tasks, so we have a better idea of what kind of problem we have.

demon added a comment.Jan 16 2015, 5:32 PM

I'm pretty sure it's fallout from T75743. I don't think it's working quite as well in practice as it did in testing.

@Chad: Does anybody have capacity to investigate and fix this in the next one or two weeks?

Currently it is impossible to search for words at all (e.g. have to fall back to using old-bugzilla for searching for duplicates).

demon added a comment.Jan 28 2015, 9:05 PM

It's just a matter of backing out upstream D11011. We'll lose the fix for T75743 but I think that's better than the status quo.

atgo added a subscriber: atgo.Jan 30 2015, 3:45 PM
Joe raised the priority of this task from High to Unbreak Now!.Feb 5 2015, 4:21 PM

I have large difficulties working with phabricator day to day.

This ticket has been around long enough.

Please unbreak now.

Qgil added a comment.Feb 5 2015, 8:57 PM

What if we use the default backend instead of Elasticsearch? The problems we are having currently are supposed to be fixed in Phabricator's native backend, and we are not gaining anything from having Elasticsearch. Upstream won't take bugs specific to the Elasticsearch backend.

At the beginning we decided to go for Elasticsearch under the assumption that we could leverage our expertise and obtain better results. However, in practice this puts all the pressure on exactly 1.5 people (mainly @chad, also @Manybubbles to a certain extent). These guys are hyperbusy in several critical Wikimedia missions, and I'm the first one not willing to ask them to put more of their time here. To be more precise, if I had a minute of Chad, I would put it first on the Diffusion / gitblit-deprecate project (which is what he is doing).

So what about this: install Phabricator's default search backend, and if there are any bugs we can report them upstream.

demon added a comment.Feb 5 2015, 9:16 PM

I thought the reason for it was that Elasticsearch supported things that MySQL just wouldn't support? I can't find the task offhand where it was originally requested.

Tgr added a comment.Feb 6 2015, 2:29 AM
In T75854#783942, @Qgil wrote:

I'm after a problem related to terms in task titles, but I haven't found the exact description of the problem yet. Basically, searching for one or more words in the title of a task brings poor results, and I even suspect no results (sometimes?). This needs specific tests, because words in the title are usually mentioned in the body as well. Has anybody tried to search for words appearing only in the title, not in the body?

Here is a recent example: I searched via the quick search for "sentry coverage". The only ticket actually containing those two words in its title is result number 62. None of the top 10 results have both words in their text (including body); I'm pretty sure none of the other results do either, because "sentry" is a pretty specific term and the few tickets related to it have nothing to do with coverage. Result number 4 does not include either word and seems to be related in absolutely no way to the query.

In general, searching based on anything but projects is beyond useless, and I have taken to the habit of listing all tasks of a project and then using the browser's builtin search (Ctrl-F) to try to find what I am looking for.

Is there a way to see what request is sent to elasticsearch and play with alternatives?
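
For anyone who wants to play with alternatives once the actual request is known, here is a rough sketch of the kind of query body one could try by hand. It is not what Phabricator sends; the field names `title` and `body` and the boost value are assumptions for illustration only. It uses the Elasticsearch `multi_match` query with `operator` set to `and`, so that a document has to match every search term, and boosts title matches over body matches.

```python
import json


def build_title_query(terms, title_field="title", body_field="body", title_boost=3):
    """Build an Elasticsearch query body that requires all terms to match.

    The field names and the boost are hypothetical; the real index schema
    used by Phabricator may differ.
    """
    return {
        "query": {
            "multi_match": {
                "query": terms,
                # "field^N" boosts matches in that field by a factor of N.
                "fields": [f"{title_field}^{title_boost}", body_field],
                # "and" requires every term to match, instead of any term.
                "operator": "and",
            }
        }
    }


body = build_title_query("sentry coverage")
print(json.dumps(body, indent=2))
```

The printed JSON could then be sent to the index's `_search` endpoint and compared against the results of the stock query.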

jayvdb added a subscriber: jayvdb.Feb 6 2015, 4:12 AM

have to fall back to using old-bugzilla for searching for duplicates

Time to reconsider T934? (And some other old-bugzilla changes which I proposed somewhere, related to T240#833612.)

What if we use the default backend instead of Elasticsearch?

Seems a sensible short-term solution.

I don't think we are going to ever stop using old-bugzilla search, because phabricator is intrinsically and proudly less semantic. So, we must admit that using old-bugzilla is an integral part of the phabricator workflow, and spend the little effort needed to make it easier to use.

and we are not gaining anything from having Elasticsearch.

The default backend now supports stemming / substrings? I probably missed that change.

Aklapper renamed this task from Fix search in Wikimedia Phabricator to Fix provided search results in Wikimedia Phabricator.Feb 6 2015, 12:58 PM

As per initial description, this task is about search results, not about the search UI. Clarified the summary.

scfc added a subscriber: scfc.Feb 6 2015, 1:34 PM
Tbayer added a subscriber: Tbayer.Feb 6 2015, 6:32 PM
Tgr added a comment.Feb 6 2015, 11:10 PM

Another recent example: searching for "boomerang" (a JS library that I remembered mentioning in some discussion but had otherwise no clue where that discussion happened). The actual ticket is around place 50; not a single one in the first 20 search results contains that word. Instead they have something remotely similarly sounding (like "zoom" or "book") in a prominent field (title or description). So I would guess the problem (or part of it) is that the search score cutoff for word similarity should be much sharper.
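
To illustrate what a sharper cutoff would mean, here is a small sketch using Python's `difflib`. This is not how Elasticsearch scores fuzzy matches; the word list and both cutoff values are made up for the example. It only shows that a loose similarity threshold lets "zoom" and "book" count as matches for "boomerang", while a strict one does not.

```python
from difflib import SequenceMatcher


def similar_terms(query, vocabulary, cutoff):
    """Return the words whose similarity ratio to `query` is at least `cutoff`."""
    matches = []
    for word in vocabulary:
        ratio = SequenceMatcher(None, query, word).ratio()
        if ratio >= cutoff:
            matches.append(word)
    return matches


# Hypothetical vocabulary of indexed words.
vocabulary = ["boomerang", "zoom", "book", "sentry"]

loose = similar_terms("boomerang", vocabulary, cutoff=0.4)
strict = similar_terms("boomerang", vocabulary, cutoff=0.8)

print(loose)   # ['boomerang', 'zoom', 'book']
print(strict)  # ['boomerang']
```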

awight added a subscriber: awight.Feb 8 2015, 5:38 PM

I'm having a great time with our Phabricator instance, it's really made work more pleasant and sane, but yeah this bug is turning it into a black hole for me... Especially since I don't have permissions to create tags!

Qgil added a comment.Feb 9 2015, 10:45 AM

and we are not gaining anything from having Elasticsearch.

The default backend now supports stemming / substrings? I probably missed that change.

This is fair, and the task to follow this feature upstream is https://secure.phabricator.com/T6740

If you are not convinced about changing the backend (I have no solid arguments myself other than conservatism), then an option is to do what Chad said a couple of weeks ago:

In T75854#999306, @Chad wrote:

It's just a matter of backing out upstream D11011. We'll lose the fix for T75743 but I think that's better than the status quo.

Oh, I'm totally fine with temporarily changing the backend to the SQL-backed standard search. @Chad also proposes that in https://lists.wikimedia.org/pipermail/wikitech-l/2015-February/080592.html for the time being.

It's just not clear to me which steps are needed on our side to do that, and who will perform them.

Very sick today but in general I can do this, probably wed as it will
create an unknown period of search outage. I wanted to give @Springle a
heads up since any fallout likely affects him.

So sure let's do it but may be a day or two.

Elitre added a subscriber: Elitre.Feb 9 2015, 1:05 PM

Very sick today but in general I can do this, probably wed as it will
create an unknown period of search outage. I wanted to give @Springle a
heads up since any fallout likely affects him.

So sure let's do it but may be a day or two.

Did this and talked to @Springle a bit about it. We should probably raise the aria_pagecache_buffer_size but it requires restart and is not a dynamic change.

wOOT! Thanks for all the great work, search results look exceptionally healthy today.

Is the backend change live? I'm not sure I'm seeing any improvements with example searches yet. For example, this task starts with "Fix provided search". I would expect a search for those words even without quotes to yield this as the first result. However, the first set of results only match one of the words in the title, and aren't relevant to the search. So generally the ranking still seems to be in dire need of improvement.

Question to other users: How valuable are the commit results? I find that I most frequently search for tasks, and it might be reasonable to exclude commit searches from default search. But I wonder how very active devs are using the tool and how important commit searches are to day-to-day workflows.

Question to other users: How valuable are the commit results? I find that I most frequently search for tasks, and it might be reasonable to exclude commit searches from default search. But I wonder how very active devs are using the tool and how important commit searches are to day-to-day workflows.

That's T76273.

I was trying to find T69476 which is currently titled

Add wiki title case sensitivity flag (is_sensitive) to meta_p.wiki to support jbo.wp and wiktionary tools

Searching open tasks only for 'jbo wp' and 'jbo.wp' returns no results!

However 'jbo.wp case' returns results, but not T69476, even after all 300 results:
https://phabricator.wikimedia.org/search/query/26monNitFr8X/?offset=300

After many iterations, and a few swear words, I found it at the bottom of the first page of 'jbo wp title case' .

So I went back and tried 'jbo wp case' to see if that includes it; yes, it shows it after 400 results:
https://phabricator.wikimedia.org/search/query/FDjUwEs7K.dB/?offset=400

Search has to be the worst aspect of Phab for people who use it a lot, and know what they are looking for. It is not a productivity tool; it is a time sink. ;-(
Is there a document somewhere that describes the search algorithm in depth, so we can adjust how we remember bugs and search for them accordingly?

The top maniphest adv search result for Contains Words: interwiki google search (open & stalled only) is this task (T75854).

I was searching for T28115 (Interwiki redirects via URL and other 301s are indexed by Google search), and it did not appear in the search results at all.

This task (T75854) doesn't mention Interwiki or Google at all.

I was searching for T28115 (Interwiki redirects via URL and other 301s are indexed by Google search), and it did not appear in the search results at all.

This task (T75854) doesn't mention Interwiki or Google at all.

It almost sounds like the search is doing the opposite of what it should, i.e. it ignores specific keywords and it promotes "stopwords" / very common words.
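
For comparison, this is roughly what the expected behaviour looks like: a minimal sketch of inverse document frequency (IDF) weighting, where a word that appears in every document counts for nothing and a rare word counts for a lot. It is a toy model, not the scoring used by either backend; the three titles are taken from tasks mentioned in this thread, and the function names are made up.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Weight each term by log(N / number of documents containing it)."""
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))
    return {term: math.log(n / count) for term, count in doc_freq.items()}


def score(query, document, weights):
    """Sum the IDF weights of the query terms that occur in the document."""
    doc_terms = set(document.lower().split())
    return sum(weights.get(t, 0.0) for t in query.lower().split() if t in doc_terms)


titles = [
    "Interwiki redirects via URL and other 301s are indexed by Google search",
    "Fix provided search results in Wikimedia Phabricator",
    "Phabricator search does not search substrings",
]

weights = idf_weights(titles)
query = "interwiki google search"
ranked = sorted(titles, key=lambda t: score(query, t, weights), reverse=True)

print(weights["search"])  # 0.0, because "search" occurs in every title
print(ranked[0])
```

With this weighting the common word "search" contributes nothing, so the title that contains "interwiki" and "google" ranks first.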

Question to other users: How valuable are the commit results? I find that I most frequently search for tasks, and it might be reasonable to exclude commit searches from default search. But I wonder how very active devs are using the tool and how important commit searches are to day-to-day workflows.

As a developer, my experience is the same. When I use Phabricator search, I am almost always searching for tasks (important note, this might change somewhat if/when we use Phabricator for code review).

If I do need to search for a commit message (or code in a commit), I normally use the git command line. I can also use Gerrit for searching commit messages (not great performance-wise), or explicitly use the Commit document type in Phabricator search.

Tgr added a comment.Feb 16 2015, 6:12 AM

Commit search in Phabricator is utterly useless as it is now. Commits are not tagged by project, so you can only search by author name and description, which means heaps of false positives. (You can filter by repo via the super-user-hostile "callsign" field in Diffusion search, but that's a different interface not affected by the default setting for general search.) Not to mention that commits only appear after they have been merged.

Once code review happens on Phabricator, there will be a new entity which is the rough equivalent of open changesets on Gerrit, so commits are not going to be the "tasks" of code review, and I imagine they will remain useless. Either way, I don't think it makes sense to keep a setting that is confusing for new users just because it might become useful in a year or so.

In T75854#1040795, @Tgr wrote:

Commit search in Phabricator is utterly useless as it is now. Commits are not tagged by project, so you can only search by author name and description, which means heaps of false positives. (You can filter by repo via the super-user-hostile "callsign" field in Diffusion search, but that's a different interface not affected by the default setting for general search.) Not to mention that commits only appear after they have been merged.

Once code review happens on Phabricator, there will be a new entity which is the rough equivalent of open changesets on Gerrit, so commits are not going to be the "tasks" of code review, and I imagine they will remain useless. Either way, I don't think it makes sense to keep a setting that is confusing for new users just because it might become useful in a year or so.

Agreed. I don't think commits need to be part of the default search.

Aklapper closed this task as Resolved.Feb 20 2015, 2:07 PM
Aklapper claimed this task.

As written above, default search scope (tasks vs commits) is discussed in T76273 instead.

I was trying to find T69476 [...] Searching open tasks only for 'jbo wp' and 'jbo.wp' returns no results!

Searching all documents (the default search scope) for "jbo wp" lists it on 7th place for me now that we use the MySQL backend again instead of ElasticSearch. → Seems to work now.

top maniphest adv search result for Contains Words: interwiki google search (open & stalled only) [...]
I was searching for T28115 (Interwiki redirects via URL and other 301s are indexed by Google search), and it did not appear in the search results at all.

Using the Search field in the upper right corner on https://phabricator.wikimedia.org/maniphest/ and entering "interwiki google search" I get two results: T28115 and this task. → Seems to work now.

Things seem to work now that we're back on the MySQL backend (allocating resources to deeply dive into the ElasticSearch backend to fix its obvious issues and switch back to ElasticSearch is a different task).

Hence I'm closing this task as resolved.

If there are still specific issues with the current search, please bring them up in dedicated tasks with clear steps to reproduce. Thanks everybody!

Field-testing and inspection of the dozens of reports about search issues provided some information about the current search, which I tried to summarise in the docs: https://www.mediawiki.org/w/index.php?title=Phabricator%2FHelp&type=revision&diff=1650699&oldid=1650693
Please revise.

Field-testing and inspection of the dozens of reports about search issues provided some information about the current search, which I tried to summarise in the docs: https://www.mediawiki.org/w/index.php?title=Phabricator%2FHelp&type=revision&diff=1650699&oldid=1650693
Please revise.

Thank you, @Nemo_bis! Only thought: I'm not sure what "search keyword" means in this context and I'm afraid others might face the same problem. What is the difference to a "search term" entered?

What is the difference to a "search term" entered?

Dunno, I guess you're right that "search term" is the standard phrase in English.

Restricted Application added a project: Discovery.Jun 14 2015, 12:20 AM
Nnemo awarded a token.Feb 22 2016, 9:01 PM
Nnemo added a subscriber: Nnemo.
awight removed a subscriber: awight.Mar 13 2016, 4:42 PM
Restricted Application added a project: Discovery-Search.Apr 28 2016, 3:23 AM
Restricted Application added subscribers: Southparkfan, Luke081515, TerraCodes, Urbanecm.
Danny_B changed the status of subtask T679: Phabricator search does not search substrings from Duplicate to Resolved.Jul 5 2016, 3:40 PM
Restricted Application added a subscriber: Jay8g.Nov 29 2016, 7:54 PM