
Investigate language analyzers in ElasticSearch 6
Closed, Resolved, Public

Description

Make sure that Elastic language analysis components, our internal components, and third-party components are all working as expected in Elasticsearch 6.

Event Timeline

I've pulled snapshots of Wikipedia and Wiktionary text for the languages below, and established a baseline analysis with our current config. These cover all the custom analysis chains, other custom config (like using the ICU tokenizer), and a number of different scripts.

Chinese, Dzongkha, English, Finnish, French, Gan, Greek, Hebrew, Italian, Japanese, Javanese, Mirandese, Polish, Russian, Rusyn, Serbian, Slovak, Swedish, Tibetan, Turkish, Ukrainian
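A minimal sketch of how a baseline comparison like this could be scripted, assuming the token output comes from Elasticsearch's `_analyze` endpoint. The analyzer name `text` and all helper names here are illustrative assumptions, not taken from the actual tooling used for this task.

```python
import json
from typing import Dict, List


def analyze_request(text: str, analyzer: str = "text") -> str:
    """Build the JSON body for a request to the _analyze endpoint.

    The default analyzer name "text" is an assumption for illustration.
    """
    return json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)


def tokens_from_response(response: Dict) -> List[str]:
    """Extract the token strings from a parsed _analyze response."""
    return [t["token"] for t in response.get("tokens", [])]


def changed_inputs(
    baseline: Dict[str, List[str]], candidate: Dict[str, List[str]]
) -> List[str]:
    """Return the input strings whose tokens differ between two runs.

    Inputs missing from the candidate run are also reported as changed.
    """
    return sorted(k for k in baseline if baseline[k] != candidate.get(k))
```

Running the same sample text through the old and new clusters and passing both token maps to `changed_inputs` gives a short list of inputs to review by hand, per language.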

Vvjjkkii renamed this task from Investigate language analyzers in ElasticSearch 6 to uucaaaaaaa. (Jul 1 2018, 1:10 AM)
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
TJones renamed this task from uucaaaaaaa to Investigate language analyzers in ElasticSearch 6. (Jul 2 2018, 2:53 PM)
TJones raised the priority of this task from High to Needs Triage.
TJones updated the task description.
TJones added a subscriber: Aklapper.
TJones moved this task from Language Stuff to Current work on the Discovery-Search board.
TJones updated the task description.

First draft done. Full details on MediaWiki.

Summary: Esperanto is missing, Serbian is broken. :( Chinese surrogates have been fixed! The ICU tokenizer has been updated. My sample is so old that many changes we made last year show up.

Issues:

  • Serbian: Elasticsearch reports that "extra-analysis-serbian / 6.5.4-SNAPSHOT" is installed, but when I try to reindex, I get an error: Creating index...⧼Unknown filter type [serbian_stemmer] for [scstemmer]⧽
  • Esperanto isn't in my original samples because it didn't have custom processing at the time I took them. However, I noticed that there was no Esperanto plugin in the new ES 6 batch of plugins.
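The Serbian failure above shows that a plugin being listed as installed does not guarantee its filter types are registered. A rough sketch of how that could be checked in isolation: pull the filter type and filter name out of the error text, then build a throwaway index body that uses only that filter. The helper names and the `probe` analyzer are hypothetical; the filter type `serbian_stemmer` and filter name `scstemmer` come from the error message itself.

```python
import re
from typing import Dict, Optional, Tuple

# Matches the error text quoted in the Serbian bullet above.
_UNKNOWN_FILTER = re.compile(r"Unknown filter type \[([^\]]+)\] for \[([^\]]+)\]")


def parse_unknown_filter(message: str) -> Optional[Tuple[str, str]]:
    """Return (filter_type, filter_name) from the error text, or None."""
    match = _UNKNOWN_FILTER.search(message)
    return (match.group(1), match.group(2)) if match else None


def probe_index_body(filter_name: str, filter_type: str) -> Dict:
    """Build index settings that exercise a single token filter.

    Creating an index with this body should fail with the same
    "Unknown filter type" error if the type is not registered.
    The analyzer name "probe" is a hypothetical placeholder.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {filter_name: {"type": filter_type}},
                "analyzer": {
                    "probe": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", filter_name],
                    }
                },
            }
        }
    }
```

Creating a scratch index with the body from `probe_index_body("scstemmer", "serbian_stemmer")` separates a plugin packaging problem from a problem in the full analysis config.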

Thanks for catching this, here is an updated version of the deb package which should fix the issues you discovered:
https://people.wikimedia.org/~dcausse/wmf-elasticsearch-search-plugins_6.5.4-alpha6~stretch_all.deb

Everything looks good now. Serbian (et al.) and Esperanto are working as expected. Thanks, @dcausse!