Page MenuHomePhabricator

Re-evaluate how we implement phabricator's search engine
Closed, ResolvedPublic

Description

IMPORTANT: Experimental elasticsearch support has been enabled, to try it out please do the following. Enter something in the searchbar then you will get a url like https://phabricator.wikimedia.org/search/query/.trvn2LNszXn/#R please replace #R with ?elastic=1 which will give you elasticsearch results, example elasticsearch url https://phabricator.wikimedia.org/search/query/.trvn2LNszXn/?elastic=1

When we originally deployed Phabricator, we started out using the Elasticsearch back-end for searching phabricator Tasks / Objects. Unfortunately, after launching with Elasticsearch there were quite a few user complaints about unexpected search behavior and we eventually made the (IMO, somewhat hasty) decision to switch to using the MySQL Full-text search backend.

Now that we reached the scalability limit of MyISAM fulltext search, we've been sort of forced to switch to Innodb Fulltext search (see T146673)... The fulltext search feature in innodb is not as mature as MyISAM fulltext. The ranking algorithm is very simplistic and I expect results to be lower quality than before.

ElasticSearch should be far superior to mysql for implementing phabricator search. Lets try again to solve the problems with phabricator's elasticsearch integration rather than settling for a much worse search experience in Innodb.

A bit of history

There is quite a bit of history behind Phabricator's search implementation. Some of the more relevant tasks I was able to dig up are linked below:

Original ElasticSearch task

T95: Use ElasticSearch backend for Maniphest to get stemming feature

Stuff about switching to MySQL

T75854: Fix provided search results in Wikimedia Phabricator
T86805: Lots of unrelated results when searching for specific string

Problems after switching to Mysql

T89274: Mysql search issues flagged by Phabricator setup
T146789: Phab Advanced Search no longer showing typical results

Revisions and Commits

rPHEX phabricator-extensions

Related Objects

StatusSubtypeAssignedTask
Resolved mmodell
Resolvedjcrespo
Resolved mmodell
ResolvedNone
ResolvedAklapper
Resolved demon
Resolved demon
Resolved chasemp
Resolved chasemp
Declined chasemp
Declined chasemp
Resolved chasemp
ResolvedFriedhelmW
Resolved mmodell
DeclinedNone
Resolved ksmith
ResolvedNone
Resolved chasemp
Resolved mmodell
Resolved mmodell
Resolved mmodell
Resolved mmodell
Resolved mmodell
Resolved mmodell
DuplicateNone
Resolved mmodell
Resolved mmodell
DeclinedNone
ResolvedMarostegui
Resolved mmodell

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

This would also reduce the load on the database. Also it would improve searching. There may be cons and pros but elasticsearch seems to be better now then when we tried it with phabricator in 2014.

@Paladox: I know you are trying to be helpful, but broad unspecific statements like "this would also reduce the load on the database" generally don't help when discussing and evaluating options.

I search Phabricator for "self-rename". The results are all complete garbage. Here's a screenshot:

Screen Shot 2016-11-23 at 1.47.32 PM.png (2×3 px, 575 KB)

I have to use Google search to find T17212: Allow self-renames from https://encrypted.google.com/search?hl=en&q=self-renames%20bugzilla%20wikimedia, where the first result is https://static-bugzilla.wikimedia.org/bug15212.html.

Why is Phabricator search still so awful?

It seems like it gets interpreted as self -rename (ie. "self AND NOT rename") whereas "self-rename" works (although it does not find 17212 for lack of stemming; "self-renames" does).

I found searching for the phabricator project no longer shows, but if i type in phabricatorr and then take the extra r out it finds phabricator.

It seems like it gets interpreted as self -rename (ie. "self AND NOT rename") whereas "self-rename" works (although it does not find 17212 for lack of stemming; "self-renames" does).

Can you screenshot the search results you're seeing? Both "self-rename" and "self-renames" produce the same terrible results for me.

Can you screenshot the search results you're seeing? Both "self-rename" and "self-renames" produce the same terrible results for me.

With the quotes? I get this task for "self-rename" and this and 17212 for "self-renames". (Links are for normal search but Maniphest search gives the same results too.)

It seems like it gets interpreted as self -rename (ie. "self AND NOT rename") whereas "self-rename" works (although it does not find 17212 for lack of stemming; "self-renames" does).

That was wrong. Compare self, self-rename and self -rename (limited to the renameuser project so that result size is manageable) - they all seem to return the same resultset. Compare also an unlimited search for self-rename - the first hit is T151376 which has lots of self but zero rename. Compare also self rename which gives reasonable results.

So what seems to go wrong is that

  1. the tokenizer is too aggressive, splitting self-rename into self and rename
  2. the + hack (to turn OR-based search into AND-based) is applied before full tokenization, which means that any query containing a word with a hyphen in it will revert the query to OR-based (and usually completely useless).
  3. (as we already know) the SQL-based search does not sort search results in any meaningful way.
MZMcBride raised the priority of this task from Medium to High.Nov 23 2016, 10:37 PM

Can you screenshot the search results you're seeing? Both "self-rename" and "self-renames" produce the same terrible results for me.

With the quotes? I get this task for "self-rename" and this and 17212 for "self-renames". (Links are for normal search but Maniphest search gives the same results too.)

I see the same right now. I continue to think the search index is incomplete and mentioning the task (cf. T17212#2818527) triggers it being inserted into the search index. It would be nice to audit the search index and figure out how comprehensive it is.

I tried searching with a preceding "+" and I'm fairly sure I tried with quotation marks. I also tried toggling to only open tasks.

I was also trying to do something completely unrelated. This task is too frustrating and is too much of a workflow impediment to be only normal priority. I doubt it'll make any difference, but pretending as though working search is an optional or nice-to-have feature in Maniphest is just insane.

Apparently some users have been working around this issue by using e-mail archives or by generating lists of open tasks in a Web browser window and using in-browser page search. Ugh.

So what seems to go wrong is that

  1. the tokenizer is too aggressive, splitting self-rename into self and rename
  2. the + hack (to turn OR-based search into AND-based) is applied before full tokenization, which means that any query containing a word with a hyphen in it will revert the query to OR-based (and usually completely useless).
  3. (as we already know) the SQL-based search does not sort search results in any meaningful way.

I'd add to this:

  1. Possibly incomplete search index.
  2. Issues with stemming (renames v. rename).
  3. No ability to limit search to only parts of a Maniphest task (just title, just description, just comments).

I was also trying to do something completely unrelated. This task is too frustrating and is too much of a workflow impediment to be only normal priority. I doubt it'll make any difference, but pretending as though working search is an optional or nice-to-have feature in Maniphest is just insane.

Agreed. It would be nice to see more organizational support for this (making it into a Q3 goal etc).

Apparently some users have been working around this issue by using e-mail archives or by generating lists of open tasks in a Web browser window and using in-browser page search. Ugh.

I tend to use workboards - as long as you can guess which project the task belongs to it's pretty effective.

Agreed. It would be nice to see more organizational support for this (making it into a Q3 goal etc).

We're (RelEng) drafting our goals now and this is definitely on the list and unless there is something odd in our future it will make the cut for Q3. The plan is, roughly,

  1. Identify test cases where we get bad (and good) results with the current system
  2. Setup a secondary index on ElasticSearch, either on the production Phab host or elsewhere, depending on complexity
    • This will probably not have a nice UI integration; it's purpose is simply to test known bad searches against ES with real data
  3. Do some testing, tune the ES index, assess if we should switch and how

This will hopefully reduce the (already annoying, agreed) disruption.

tl;dr: yeah, we agree.

Aklapper lowered the priority of this task from High to Medium.Nov 23 2016, 10:54 PM

Restoring previous priority as per T146843#2678289 to reflect reality.

Thanks Greg. This is much needed.

I was also trying to do something completely unrelated. This task is too frustrating and is too much of a workflow impediment to be only normal priority. I doubt it'll make any difference, but pretending as though working search is an optional or nice-to-have feature in Maniphest is just insane.

Agreed. It would be nice to see more organizational support for this (making it into a Q3 goal etc).

Since Release-Engineering-Team is the only team currently supporting Phabricator, we are considering to do a trial run of ElasticSearch as a Q3 goal for the team. We would be very dependent on support from Discovery-Search to get ElasticSearch up and running though.

Already tested on phab-01, pretty easy to switch once I figured out what setting to do it. Just there were concerns about security.

MZMcBride raised the priority of this task from Medium to High.Nov 24 2016, 12:23 AM

Restoring previous priority as per T146843#2678289 to reflect reality.

Excuse me, not having usable search is not "normal" priority. That's the reality. The linked comment is irrelevant.

FTR: High is fine and really how it's being interpreted by us. It's not UBN!, mostly due to other things we're working on before EOY that we promised, sadly. I apologize for that and for us getting in this situation now.

Looks like upstream has started working on innodb fulltext search and has also created a stemmer https://secure.phabricator.com/D16943

a stemmer would help a lot but I don't think it's enough to make innodb usable.

mmodell changed the task status from Stalled to Open.Nov 29 2016, 7:03 PM

Change 325333 had a related patch set uploaded (by 20after4):
Add search.elastic.host to settings for limited elasticsearch testing

https://gerrit.wikimedia.org/r/325333

Change 325333 merged by Dzahn:
phabricator: Add search.elastic.host for limited elasticsearch testing

https://gerrit.wikimedia.org/r/325333

I've re-read the timeline and I can't see if and how the errors are debugged. For example, in T146843#2684248, @epriestley wrote:

[…]
One possibly substantial issue I notice is that mysql.ft_boolean_syntax appears to be misconfigured or set to the default value, or possibly isn't respected by InnoDB. (There should be a setup warning that gets raised about this, but it may not be respected by InnoDB anyway.)
[…]

I did not see any reply to this.

Assuming that "reindexing" means that Phabricator takes the text of a task and builds a dataset "keyword → task", and taking my test case from T151500#2861939, https://phabricator.wikimedia.org/search/query/dxp3d.8.DDWY/#R will run some type of SQL query (or queries) which apparently does not return task 91231 (here without "T" to avoid a mention triggering something). That means:

  • Does the dataset for task 91231 include the word "databases"?
    • Yes: The query has some programming error and needs to be fixed to include task 91231.
    • No: Does "reindexing" (or editing or whatever) task 91231 cause the dataset for task 91231 to change to the correct data?
      • Yes: Something is munging or deleting data.
      • No: The task → dataset function has some error.

If there is no expertise for this part of Phabricator at WMF yet, I suggest to pay Phacility for an hour of co-debugging where someone can advise where to look and what queries to run.

Switching to a different backend just on the hunch that maybe the change will alter some mysterious code path that makes these bugs disappear and letting the Phabricator users wait for several weeks or months until that happens, is IMHO not an efficient way to fix a bug.

Does the dataset for task 91231 include the word "databases"?

Here's how to answer this question:

  • Find the task PHID for the task. You can do this by calling maniphest.search from the web UI, with {"ids":[91231]} as the constraints. In this case, the value is PHID-TASK-vebpammdqa5zxsm6xugc.
  • Query the index like this:
SELECT * FROM phabricator_search.search_documentfield WHERE phid = "<PHID of the object>"

This query requires database access to run, so I can't run it against this install, but here's similar data for a task on secure.phabricator.com:

mysql> SELECT * FROM secure_search.search_documentfield WHERE phid = 'PHID-TASK-loqyurfmfrikd727zooa'\G
*************************** 1. row ***************************
         phid: PHID-TASK-loqyurfmfrikd727zooa
     phidType: TASK
        field: titl
      auxPHID: NULL
       corpus: Make the setup warning count cache database-backed
stemmedCorpus: make the setup warn count cach databas back
*************************** 2. row ***************************
         phid: PHID-TASK-loqyurfmfrikd727zooa
     phidType: TASK
        field: body
      auxPHID: NULL
       corpus: Currently, the setup warning count is stored on the host, usually in APC. This is used to show the red "setup warnings" banner.

When multiple web nodes are in service in a cluster configuration, this warning can be cached on all of them and need to be cleared individually. This is annoying.

Better would be to make a database cache authoritative, and hit that if we see the flag in APC, then clear the flag from APC if it has already been cleared by another host.
stemmedCorpus: current the setup warn count store host usual apc thi us show red warn banner when multipl web node ar servic cluster configur can cach all them and need clear individu annoi better would make databas cach authorit hit that see flag then clear from ha alreadi been anoth
2 rows in set (0.00 sec)

The specific query to run on this install should be:

mysql> SELECT * FROM phabricator_search.search_documentfield WHERE phid = "PHID-TASK-vebpammdqa5zxsm6xugc";

(Note that mentioning the PHID of the task may also trigger mentions and reindexing, so no one should mention either T... or PHID-TASK-... in plain text for that task until the query is run, if the results are to be diagnostically useful.)

On the mysql.ft_boolean_syntax issue, the resolution was:

  • InnoDB does not support mysql.ft_boolean_syntax, nor does it have a similar flag (the boolean syntax is hardcoded).
  • This install applied a local series of patches to convert query x y into query +x +y (or similar).
  • The upstream eventually adopted more formal versions of those patches (https://secure.phabricator.com/D16938, https://secure.phabricator.com/D16939) which parse and rewrite queries more exhaustively. The unit tests have some examples of how query compilation operates in the upstream today.
  • I believe this install hasn't picked up the upstream changes yet, but either set of patches should resolve the boolean mode issue in simple cases, and the behavior with no patches, the local patches, or the upstream patches is the same for queries with only one word and no quotes or operators, like databases (sort of: upstream now does additional, fancier things to this query).

@epriestley @mmodell pulled in the recent update you did and I applied it to the test instance phab-03 and these are the error's I got T152902 so we will have to fix those before updating prod.

@Paladox: There was no need to ping epriestley about this, it's not his problem.

At this point, I'm pretty close to scraping every task, dumping them in a directory as HTML, and using grep. Some users are basically doing this already with a Bugzilla dump since Phabricator search is so deficient. It's pretty bad that we can't rely on Phabricator search.

I've now started this. I can't take the horrible search any longer. Even with the bullshit "older changes hidden" content obfuscation, I think curl and grep will work better than the built-in search functionality.

Hi, @mmodell has added experimental support for elasticsearch in phabricator.

To try it out you type in the search bar click enter then after that append your url

for example your url may look like

https://phabricator.wikimedia.org/search/query/.trvn2LNszXn/#R

Replace #R with ?elastic=1 which will use elasticsearch.

https://phabricator.wikimedia.org/search/query/.trvn2LNszXn/?elastic=1

@mmodell hi elasticsearch seems to be getting better results probably want to add a little check box or add a setting in user preference that allows it to be enabled without them having to keep adding ?elastic=1 to the url please.

@mmodell hi elasticsearch seems to be getting better results probably want to add a little check box or add a setting in user preference that allows it to be enabled without them having to keep adding ?elastic=1 to the url please.

No, we'll just switch over to the Elastic Search backend by default for all users once it's ready.

Hi, @mmodell has added experimental support for elasticsearch in phabricator.

To try it out you type in the search bar click enter then after that append your url

for example your url may look like

https://phabricator.wikimedia.org/search/query/.trvn2LNszXn/#R

Replace #R with ?elastic=1 which will use elasticsearch.

https://phabricator.wikimedia.org/search/query/.trvn2LNszXn/?elastic=1

Was this code [security] reviewed being lunched into the production instance? With the countless amount of private & confidential tasks we have around for not only SRE but other areas such as acl*WMF-FR, matters around indexing tasks is something we need to be extremely careful about.

@Peachey88 would you be able to try searching for this task whilst logged out please? Just to see if they act like before and are not found in the search index as you said.

@mmodell has started implementing a setting called ElasticSearch under the developer panel. It's experimental and will allow users to enable elastic search without us needing to do this globally for now. It will be enabled globally soon. This will allow users to help test and report back. This will also allow us to fix any bugs or do any improvements.

The commits that implement this are rPHAB766f8c8bab0783dbc8a7ec5bb04b33f9ff86a215 and rPHEX227e1568b7b71f26d8f8950a3e62f545d295886e

Phabricator has been updated.

You can find the new pref option to enable elasticsearch in settings->Developer Settings

I've added T156905 as a subtask, because I think it directly relates to search functionality. Please have a look at it and see how it relates (search crashes database). Remove if unnecessary.

This project is selected for the Developer-Wishlist voting round and will be added to a MediaWiki page very soon. To the subscribers, or proposer of this task: please help modify the task description: add a brief summary (10-12 lines) of the problem that this proposal raises, topics discussed in the comments, and a proposed solution (if there is any yet). Remember to add a header with a title "Description," to your content. Please do so before February 5th, 12:00 pm UTC.

Tgr updated the task description. (Show Details)

Is this still actionable or basically done?

Our hands were forced, we are now on elasticsearch. Some of the subtasks are still active, however.