
Create Elasticsearch filter so we can do aggressive_splitting without causing an invalid token order
Open, High, Public

Description

User story: As an Elasticsearch developer, I want to be able to add useful filters in a logical order without having to worry about how they might interact to create an invalid token order.

Notes: As outlined in the parent task (T268730) and related comments, homoglyph_norm creates multiple overlapping tokens and aggressive_splitting splits tokens, so the two can interact to create tokens in an invalid order if homoglyph_norm comes before aggressive_splitting. For example, they can produce a stream of tokens with offsets (0-5, 6-7, 0-5, 6-7), which should properly be ordered as (0-5, 0-5, 6-7, 6-7).
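
The ordering problem in the example can be checked mechanically. Below is a minimal Python sketch, assuming the rule being violated is that a token's start offset must never be smaller than the start offset of the token before it. The helper name is hypothetical and is not part of Elasticsearch or Lucene.

```python
from typing import List, Optional, Tuple

# Illustrative token representation: (start_offset, end_offset).
Offset = Tuple[int, int]


def first_backward_offset(offsets: List[Offset]) -> Optional[int]:
    """Return the index of the first token whose start offset goes
    backwards relative to the previous token, or None if none does."""
    for i in range(1, len(offsets)):
        if offsets[i][0] < offsets[i - 1][0]:
            return i
    return None


# The stream from the example above: the third token (index 2) jumps back to 0.
broken = [(0, 5), (6, 7), (0, 5), (6, 7)]
fixed = [(0, 5), (0, 5), (6, 7), (6, 7)]
```

With these inputs, `first_backward_offset(broken)` points at the third token, while `first_backward_offset(fixed)` finds nothing to report.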

The short-term solution is to swap their order, but that is not the logical order in which they should be applied, even though the outcome is the same in the majority of cases (but not all).

There is a specific and a generic approach to solving the problem:

  • Specific: recreate either aggressive_splitting or its component word_delimiter in such a way that it doesn't create out-of-order tokens. This would require buffering incoming tokens so that no token is emitted ahead of one arriving immediately after it that belongs earlier in the stream.
  • Generic: create a general-purpose reordering filter that would take a token stream and put out-of-order tokens back into a valid order (up to some reasonable limit; it shouldn't have to handle a thousand tokens in reverse order, for example).
    • Alternatively, it could clobber highlighting and possibly some other features by simply changing the offset information to be "acceptable", as word_delimiter_graph does. So, (0-5, 6-7, 0-5, 6-7) would become (0-5, 6-7, 6-6, 6-7). It's not right, but at least it isn't broken.
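
The two generic options can be sketched in Python to make the difference concrete. This is an illustration under stated assumptions, not a proposed implementation: tokens are reduced to (start, end) offset pairs, the function names and the `window` parameter are hypothetical, and position information is ignored entirely.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Illustrative token representation: (start_offset, end_offset).
Offset = Tuple[int, int]


def reorder_bounded(tokens: Iterable[Offset], window: int = 8) -> Iterator[Offset]:
    """Reorder tokens by start offset using a lookahead buffer of at most
    `window` tokens. Ties keep their arrival order. Disorder that spans
    more than `window` tokens is NOT repaired."""
    heap: List[Tuple[int, int, int]] = []  # (start, arrival index, end)
    for seq, (start, end) in enumerate(tokens):
        heapq.heappush(heap, (start, seq, end))
        if len(heap) > window:
            s, _, e = heapq.heappop(heap)
            yield (s, e)
    while heap:  # flush whatever is still buffered at end of stream
        s, _, e = heapq.heappop(heap)
        yield (s, e)


def clamp_offsets(tokens: Iterable[Offset]) -> Iterator[Offset]:
    """Keep token order but force offsets to be 'acceptable': a start
    offset never goes backwards and an end offset is never before its start."""
    last_start = 0
    for start, end in tokens:
        start = max(start, last_start)
        end = max(end, start)
        last_start = start
        yield (start, end)


stream = [(0, 5), (6, 7), (0, 5), (6, 7)]
reordered = list(reorder_bounded(stream, window=2))
clamped = list(clamp_offsets(stream))
```

On the example stream, `reordered` comes out as (0-5, 0-5, 6-7, 6-7) and `clamped` as (0-5, 6-7, 6-6, 6-7), matching the two outcomes described above. A real filter would likely also need to adjust position increments when it moves tokens, which this sketch does not attempt.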

The generic case would allow us to reorder tokens for the existing aggressive_splitting and could be useful in future situations, but is probably more difficult to code and possibly noticeably slower.

Acceptance Criteria:

  • We can order homoglyph_norm before aggressive_splitting without causing errors on known-troublesome tokens such as Tolstoу's (with Cyrillic у).