startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards
Closed, Resolved · Public · 5 Estimated Story Points

Description

User Story: As an Elasticsearch developer, I want to be able to use the homoglyph_norm filter with the aggressive_splitting filter (or others that might generate split tokens), without breaking Elasticsearch, and without having to specify the exact order for all analysis chains.

Notes:

Using the aggressive_splitting filter together with the homoglyph filter causes errors that prevent reindexing the wikis:

Reindex task was not successfull: Failed: [{"index":"commonswiki_content_1606298972","type":"page","id":"14924","cause":{"type":"illegal_argument_exception","reason":"startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=1163,endOffset=1165,lastStartOffset=1175 for field 'text'"},"status":400}]
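
To make the failure concrete, here is a minimal sketch of the invariant the error message describes. It is not the actual Lucene/Elasticsearch code; the function name `check_offsets` and the representation of tokens as `(start, end)` pairs are assumptions for illustration only.

```python
def check_offsets(tokens):
    """Validate (start_offset, end_offset) pairs in the order they are emitted.

    Hypothetical helper: mirrors the three conditions named in the error
    message, not the real indexing code.
    """
    last_start = 0
    for start, end in tokens:
        # The three conditions from the message: non-negative start,
        # end >= start, and start never smaller than the previous start.
        if start < 0 or end < start or start < last_start:
            raise ValueError(
                "startOffset must be non-negative, and endOffset must be "
                ">= startOffset, and offsets must not go backwards "
                f"startOffset={start},endOffset={end},"
                f"lastStartOffset={last_start}"
            )
        last_start = start


# A token starting at 1163 emitted after one starting at 1175, as in the log.
try:
    check_offsets([(1175, 1180), (1163, 1165)])
except ValueError as err:
    print(err)
```

The second token in the usage example starts before the previous one, which is the "offsets must not go backwards" case reported for the 'text' field above. The end offset 1180 of the first token is made up; only 1163, 1165 and 1175 come from the log.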

Some findings by Trey: https://phabricator.wikimedia.org/T222669#6642626

The Plan:

  • Change 643523 disables the homoglyph plugin for English and Italian; that should solve the immediate problems for any index that needs to be reindexed now (such as with T268372)
  • Next, create a list of incompatible post-filters (currently just aggressive_splitting), and change the code to insert homoglyph_norm after any incompatible post-filters (or at the beginning of the filter list, as it currently does, if none are present).
  • As a follow up, we could create a new version of aggressive_splitting that doesn't have this problem, or create a generic filter that re-orders moderately out-of-order tokens. (See sub-task T268788.)
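
The second step of the plan can be sketched as follows. This is an illustration only, not the CirrusSearch patch (which is PHP): the function name `insert_homoglyph_norm`, the constant `INCOMPATIBLE_POST_FILTERS`, and the example filter names `lowercase` and `asciifolding` are assumptions. It also assumes the filter list does not already contain homoglyph_norm.

```python
# Hypothetical list of filters that homoglyph_norm must not precede;
# per the plan, currently just aggressive_splitting.
INCOMPATIBLE_POST_FILTERS = {"aggressive_splitting"}


def insert_homoglyph_norm(filters, incompatible=INCOMPATIBLE_POST_FILTERS,
                          name="homoglyph_norm"):
    """Return a new filter list with `name` placed after the last
    incompatible filter, or at the beginning if none are present."""
    last_incompatible = -1
    for index, filter_name in enumerate(filters):
        if filter_name in incompatible:
            last_incompatible = index
    # last_incompatible + 1 is 0 when no incompatible filter was found.
    position = last_incompatible + 1
    return filters[:position] + [name] + filters[position:]


print(insert_homoglyph_norm(["lowercase", "asciifolding"]))
print(insert_homoglyph_norm(["aggressive_splitting", "lowercase"]))
```

Placing homoglyph_norm after the last incompatible filter, rather than the first, keeps it behind every filter on the list if more are added later.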

AC:

  • English and Italian wikis can be reindexed with the homoglyph plugin activated

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 643523 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Disable homoglyph plugin for English and Italian language wikis

https://gerrit.wikimedia.org/r/643523

I've uploaded a patch to just disable homoglyphs for English- and Italian-language wikis. That gives us time to think about and test a more permanent solution that allows the homoglyph plugin to be enabled without breaking things.

Not sure if this needs to move to "needs review" since this is just a stop-gap.

Change 643523 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Disable homoglyph plugin for English and Italian language wikis

https://gerrit.wikimedia.org/r/643523

Change 661806 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Insert homoglyph_norm after incompatible filters

https://gerrit.wikimedia.org/r/661806

Change 661806 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Insert homoglyph_norm after incompatible filters

https://gerrit.wikimedia.org/r/661806