Page MenuHomePhabricator

[epic] CirrusSearch: Abandon query_string
Closed, DuplicatePublic

Description

We've been patching and patching and patching to work around Lucene's query_string being finicky for a year and a half now. It must stop. We should write our own query parser which replicates the old behavior but supports things like "functio* progra*" and doesn't blow up on "foo OR".

I _think_ the right place to do this is an Elasticsearch plugin. And its a good choice after Elasticsearch 1.6 gets index sealing. I _think_. We could also do this in PHP but there are few good parser generators in PHP and in Java they are legion. Also, we'd still have to do Elasticsearch work to expose any extra kinds of queries. OTOH doing it in a plugin means we'll have to decide how/if we are going to support clients that don't install the plugin.

This is a good chunk of work - like a month minimally. And it sucks out a huge chunk of technical debt. But ultimately it only fixes a few super duper corner cases.
Stakeholder:

  1. Cirrus engineers
  2. Search power users
  3. Elasticsearch community at large

Benefit:

  1. Cirrus issues are easier to fix
  2. Certain queries that looks like they should work will work/can be made to work. For example "flat fo*" OR "bumby fo*". This is not hypothetical. People try these and the don't work for them and we _can't_ fix them with the current setup. Most of them are fixed simply by integrating with this project.
  3. We use query_string and we're public about how our setup works. And query_string is a trap that no one should use. People will copy our mistakes.

Estimate: One or Two Months

Related Objects

Event Timeline

Manybubbles raised the priority of this task from to Low.
Manybubbles updated the task description. (Show Details)
Manybubbles moved this task to Search on the Discovery-ARCHIVED board.
Manybubbles raised the priority of this task from Low to Medium.
Manybubbles set Security to None.
Manybubbles moved this task from Search to On Sprint Board on the Discovery-ARCHIVED board.

I'm deprioritising this work for now. Our goal to reduce the zero-results rate is likely to have higher impact than this work, and must be prioritised. This task can be revisited later.

Raw notes from the meeting that led to the de-prioritization:

We currently have a fragile query parsing solution based on lots of regular expressions. We have kept adding more and more RE's to add functionality and fix bugs, but that is unsustainable. Nik has built a new parser (query string plus plus plus), which is much better, but it doesn't fall into a quarterly goal.

Primary customers are Russian wp and power users. And us, to (hopefully) make bugs easier to fix.
Readers typically don't file bug reports that would be helped by this-- Editors do.
Deep set math on categories isn't supported with our existing solution, and has been requested by some.

Cutting over to qsppp would be a lot of work. Exposing as a beta feature first would probably be required, due to risks of behavior changes.

Could it be considered "Technical Debt"?

  • Yes, but it's not a simple drop-in, so should probably go through beta, which involves lots of communication with community

The Discovery team can work on issues that are not related to goals, but we should not put goals at risk by doing so.

This has already been set aside for now. It could be picked up by anyone, so it is not necessary to work on it before Nik leaves.

This issue has already been marked (in phabricator) as blocking some reported bugs, and may block additional issues in the future. At some point, if a critical bug is blocked, that might be the justification to perform this work. It is also possible that someone else might continue working on this new parser.

Deskana added a subscriber: EBernhardson.

@EBernhardson says this should really be done in Elastic by Elastic, not by us. In line with that, declining.

demon subscribed.

Changing to stalled. Per IRC, this is hard. Really hard. However, the way Cirrus uses query_string is not a long-term viable design and eventually needs to be fixed. Nobody has near enough cycles to go down this path right now though.

Deskana lowered the priority of this task from Medium to Lowest.Dec 3 2015, 6:05 PM

Changing to stalled. Per IRC, this is hard. Really hard. However, the way Cirrus uses query_string is not a long-term viable design and eventually needs to be fixed. Nobody has near enough cycles to go down this path right now though.

Thanks! Fixing priority to reflect this.