User Story: As an Elasticsearch developer, I want to be able to use the homoglyph_norm filter with the aggressive_splitting filter (or others that might generate split tokens), without breaking Elasticsearch, and without having to specify the exact order for all analysis chains.
Notes:
Using the aggressive_splitting filter together with the homoglyph_norm filter causes errors that prevent the wikis from being reindexed:
Reindex task was not successfull: Failed: [{"index":"commonswiki_content_1606298972","type":"page","id":"14924","cause":{"type":"illegal_argument_exception","reason":"startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=1163,endOffset=1165,lastStartOffset=1175 for field 'text'"},"status":400}]
Some findings by Trey: https://phabricator.wikimedia.org/T222669#6642626
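To make the failure above concrete, here is a minimal sketch of the three conditions named in the error message. The function name `offsets_are_valid` is hypothetical and this is only a paraphrase of the message text, not the actual Lucene/Elasticsearch code:

```python
def offsets_are_valid(start_offset, end_offset, last_start_offset):
    """Return True if a token's offsets satisfy the conditions in the error message.

    Hypothetical helper: it restates the message, it is not the real check.
    """
    return (
        start_offset >= 0                       # startOffset must be non-negative
        and end_offset >= start_offset          # endOffset must be >= startOffset
        and start_offset >= last_start_offset   # offsets must not go backwards
    )


# Values taken from the reindex failure quoted above.
print(offsets_are_valid(1163, 1165, 1175))
```

With the values from the log, the third condition fails: the token starts at 1163 even though an earlier token already started at 1175.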
The Plan:
- Change 643523 disables the homoglyph plugin for English and Italian; that should solve the immediate problems for any index that needs to be reindexed now (such as with T268372)
- Next, create a list of incompatible post-filters (currently just aggressive_splitting), and change the code to insert homoglyph_norm after any incompatible post-filters (or at the beginning of the filter list, as it currently does, if none are present).
- As a follow-up, we could create a new version of aggressive_splitting that doesn't have this problem, or create a generic filter that re-orders moderately out-of-order tokens. (See sub-task T268788.)
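The placement rule in the second step of the plan could look roughly like the sketch below. It is written in Python for illustration only; the names `INCOMPATIBLE_POST_FILTERS` and `insert_homoglyph_norm` are hypothetical and do not come from the existing code, and the filter list is assumed to be a plain ordered list of filter names:

```python
# Hypothetical list of filters that homoglyph_norm must come after.
# Per the plan, this currently contains only aggressive_splitting.
INCOMPATIBLE_POST_FILTERS = {"aggressive_splitting"}


def insert_homoglyph_norm(filters, incompatible=INCOMPATIBLE_POST_FILTERS):
    """Return a new filter list with homoglyph_norm placed after the last
    incompatible filter, or at the beginning if none are present."""
    # Drop any existing homoglyph_norm so the function is safe to call twice.
    filters = [name for name in filters if name != "homoglyph_norm"]

    # Find the index of the last incompatible filter (-1 if there is none).
    last_incompatible = -1
    for index, name in enumerate(filters):
        if name in incompatible:
            last_incompatible = index

    # Position 0 when nothing is incompatible, otherwise just after it.
    position = last_incompatible + 1
    return filters[:position] + ["homoglyph_norm"] + filters[position:]


print(insert_homoglyph_norm(["lowercase", "asciifolding"]))
print(insert_homoglyph_norm(["aggressive_splitting", "lowercase"]))
```

Keeping the incompatible filters in one list means a newly discovered problem filter only needs to be added there, without specifying the exact order for each analysis chain.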
AC:
- English and Italian wikis can be reindexed with the homoglyph plugin activated