ICU folding seems to cause issues with completion
Open, HighPublic

Description

ICU folding was enabled yesterday on en, fr, el and he wikis.
It seems to have triggered an issue in elastic that causes many suggester indices to fail:

[2017-01-25 03:27:18,654][DEBUG][action.bulk              ] [elastic1044] [enwiktionary_titlesuggest_1485314643][1] failed to execute bulk item (index) index {[enwiktionary_titlesuggest_1485314643][titlesuggest][3057103t], source[{"batch_id":1485314639,"suggest":{"input":["はーい"],"output":"3057103:t:はーい","weight":160700},"suggest-stop":{"input":["はーい"],"output":"3057103:t:はーい","weight":160700}}]}
[enwiktionary_titlesuggest_1485314643][[enwiktionary_titlesuggest_1485314643][1]] IndexFailedEngineException[Index failed for [titlesuggest#3057103t]]; nested: IllegalStateException[from state (0) already had transitions added];
        at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:462)
        at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
        at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
        at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:327)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:120)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
        at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: from state (0) already had transitions added
        at org.apache.lucene.util.automaton.Automaton.addTransition(Automaton.java:182)
        at org.apache.lucene.search.suggest.analyzing.XAnalyzingSuggester.replaceSep(XAnalyzingSuggester.java:335)
        at org.apache.lucene.search.suggest.analyzing.XAnalyzingSuggester.toAutomaton(XAnalyzingSuggester.java:954)
        at org.apache.lucene.search.suggest.analyzing.XAnalyzingSuggester.toFiniteStrings(XAnalyzingSuggester.java:973)
        at org.elasticsearch.search.suggest.completion.AnalyzingCompletionLookupProvider.toFiniteStrings(AnalyzingCompletionLookupProvider.java:403)
        at org.elasticsearch.search.suggest.completion.CompletionTokenStream.incrementToken(CompletionTokenStream.java:63)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:634)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:365)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:321)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1477)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1256)
        at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:530)
        at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
        ... 16 more

It's unclear yet, but it sounds like a bug in elastic, I'll revert the patch to activate ICU and try to find a workaround.

dcausse created this task.Jan 25 2017, 9:38 AM
Restricted Application added a project: Discovery. · View Herald TranscriptJan 25 2017, 9:38 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
dcausse triaged this task as "High" priority.Jan 25 2017, 9:39 AM

Setting to high as many completion indices will be out of date rapidly.

Change 334038 had a related patch set uploaded (by DCausse):
Revert "[cirrus] properly set wgCirrusSearchUseIcuFolding"

https://gerrit.wikimedia.org/r/334038

I can reproduce by indexing a page with the title: はーい
updateSuggesterIndex will fail with elastic 2.3.5 but seems to work on the es5 branch.
It's still unclear if it's a bug in the doc format we send to es2 or simply a bug in es2 fixed in es5.

Change 334119 had a related patch set uploaded (by DCausse):
Workaround issues in completion with empty tokens

https://gerrit.wikimedia.org/r/334119

Change 334038 merged by jenkins-bot:
Revert "[cirrus] properly set wgCirrusSearchUseIcuFolding"

https://gerrit.wikimedia.org/r/334038

Mentioned in SAL (#wikimedia-operations) [2017-01-25T14:11:08Z] <dcausse@tin> Synchronized wmf-config/InitialiseSettings.php: T156234 Revert [cirrus] properly set wgCirrusSearchUseIcuFolding (duration: 00m 41s)

It's likely caused by something similar to https://github.com/elastic/elasticsearch/issues/10987
はーい will generate the following token stream with the plain_stop analyzer:

{
  "tokens": [
    {
      "position": 0,
      "type": "<HIRAGANA>",
      "end_offset": 1,
      "start_offset": 0,
      "token": "は"
    },
    {
      "position": 1,
      "type": "<KATAKANA>",
      "end_offset": 2,
      "start_offset": 1,
      "token": ""   // <== empty token
    },
    {
      "position": 2,
      "type": "<HIRAGANA>",
      "end_offset": 3,
      "start_offset": 2,
      "token": "い"
    }
  ]
}

Adding a hack to remove empty tokens from the stream seems to fix the issue, I don't know the details but I suppose that will folded to "" by the icu_folding token filter.
We should probably do the same in the fulltext index, it does not seem to cause issues but empty tokens are not really useful.

dcausse claimed this task.Jan 25 2017, 3:20 PM
dcausse moved this task from Backlog to Needs review on the Discovery-Search (Current work) board.

Change 334119 merged by jenkins-bot:
Workaround issues in completion with empty tokens

https://gerrit.wikimedia.org/r/334119

Deskana closed this task as "Resolved".Jan 30 2017, 6:18 PM
dcausse reopened this task as "Open".Feb 14 2017, 12:58 PM
dcausse moved this task from Done to In progress on the Discovery-Search (Current work) board.

I still have some issues :

[2017-02-14 06:25:34,433][DEBUG][action.bulk              ] [elastic1020] [ptwikinews_titlesuggest_1487053489][0] failed to execute bulk item (index) index {[ptwikinews_titlesuggest_1487053489][titlesuggest][50583t], source[{"batch_id":1487053474,"suggest":{"input":["Os ossos do médico nazista Josef Mengele usados <200b><200b>"],"output":"50583:t:Os ossos do médico nazista Josef Mengele usados <200b><200b>no Brasil em cursos de medicina forense","weight":456176},"suggest-stop":{"input":["Os ossos do médico nazista Josef Mengele usados <200b><200b>"],"output":"50583:t:Os ossos do médico nazista Josef Mengele usados <200b><200b>no Brasil em cursos de medicina forense","weight":456176}}]}
[ptwikinews_titlesuggest_1487053489][[ptwikinews_titlesuggest_1487053489][0]] IndexFailedEngineException[Index failed for [titlesuggest#50583t]]; nested: IllegalStateException[from state (0) already had transitions added];
        at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:462)
        at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
        at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
        at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:327)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:120)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
        at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: from state (0) already had transitions added

The list of affected indices is much smaller:
For eqiad:

curl -s elastic1020.eqiad.wmnet:9200/_cat/indices?h=i | grep titlesugg | sed -e 's/_[0-9 ]\+//' | sort | uniq -c | grep -v '1 '
      2 arwikisource_titlesuggest
      2 arwiktionary_titlesuggest
      2 cawiki_titlesuggest
      2 ckbwiki_titlesuggest
      2 incubatorwiki_titlesuggest
      2 kmwikibooks_titlesuggest
      2 mgwiktionary_titlesuggest
      2 mlwiki_titlesuggest
      2 orwikisource_titlesuggest
      2 ptwikinews_titlesuggest
      2 ptwiki_titlesuggest
      2 siwiki_titlesuggest
      2 tewiktionary_titlesuggest

For codfw:

curl -s elastic2020.codfw.wmnet:9200/_cat/indices?h=i | grep titlesugg | sed -e 's/_[0-9 ]\+//' | sort | uniq -c | grep -v '1 '
      2 cawiki_titlesuggest
      2 incubatorwiki_titlesuggest
      2 mgwiktionary_titlesuggest
      2 orwikisource_titlesuggest
      2 ptwikinews_titlesuggest
      2 ptwiki_titlesuggest

I'll try to find a fix rapidly, if I can't I'll have to revert icu_folding again...

Change 337589 had a related patch set uploaded (by DCausse):
Fold some problematic whitespaces with completion

https://gerrit.wikimedia.org/r/337589

Change 337589 merged by jenkins-bot:
Fold some problematic whitespaces with completion

https://gerrit.wikimedia.org/r/337589

Change 337631 had a related patch set uploaded (by DCausse):
Fold some problematic whitespaces with completion

https://gerrit.wikimedia.org/r/337631

Change 337631 merged by jenkins-bot:
Fold some problematic whitespaces with completion

https://gerrit.wikimedia.org/r/337631

Mentioned in SAL (#wikimedia-operations) [2017-02-15T14:04:34Z] <hashar@tin> Synchronized php-1.29.0-wmf.12/extensions/CirrusSearch/includes/Maintenance/SuggesterAnalysisConfigBuilder.php: Fold some problematic whitespaces with completion - T156234 (duration: 00m 48s)

Change 337845 had a related patch set uploaded (by DCausse):
Fold some problematic whitespaces with completion

https://gerrit.wikimedia.org/r/337845

Change 337845 merged by jenkins-bot:
Fold some problematic whitespaces with completion

https://gerrit.wikimedia.org/r/337845

Mentioned in SAL (#wikimedia-operations) [2017-02-15T16:01:47Z] <dcausse@tin> Synchronized php-1.29.0-wmf.11/extensions/CirrusSearch/: T156234: Fold some problematic whitespaces with completion (duration: 01m 01s)