Investigate language analyzers in ElasticSearch 6
Closed, Resolved | Public

Description

Make sure that Elastic language analysis components, our internal components, and third-party components are all working as expected in Elasticsearch 6.

Event Timeline

TJones created this task. May 16 2018, 7:37 PM

I've pulled snapshots of Wikipedia and Wiktionary text for the languages below, and established a baseline analysis with our current config. These cover all the custom analysis chains, other custom config (like using the ICU tokenizer), and a number of different scripts.

Chinese, Dzongkha, English, Finnish, French, Gan, Greek, Hebrew, Italian, Japanese, Javanese, Mirandese, Polish, Russian, Rusyn, Serbian, Slovak, Swedish, Tibetan, Turkish, Ukrainian
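
A baseline comparison of this kind can be sketched roughly as follows. This is only an illustration under stated assumptions, not the tooling actually used for this task: the function name `diff_tokens` and the sample tokens are hypothetical, and it assumes the analyzer output for a given sample has already been collected as a list of token strings (for example, from the Elasticsearch `_analyze` API) once under the current config and once under the new one.

```python
from collections import Counter


def diff_tokens(baseline, candidate):
    """Compare two token lists produced by analyzing the same text.

    Returns a pair of Counters: tokens that appear less often in the
    candidate run (lost) and tokens that appear more often (gained).
    """
    before = Counter(baseline)
    after = Counter(candidate)
    # Counter subtraction keeps only positive counts, so each side
    # holds just the tokens that changed in that direction.
    return before - after, after - before


# Hypothetical token output for one sample sentence.
baseline_tokens = ["the", "quick", "fox", "fox"]
candidate_tokens = ["the", "quick", "fox", "foxes"]

lost, gained = diff_tokens(baseline_tokens, candidate_tokens)
```

An empty `lost` and `gained` for every sample would suggest that the analysis chain behaves the same under both versions; anything else points at a language worth inspecting by hand.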

Vvjjkkii renamed this task from Investigate language analyzers in ElasticSearch 6 to uucaaaaaaa. Jul 1 2018, 1:10 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
TJones renamed this task from uucaaaaaaa to Investigate language analyzers in ElasticSearch 6. Jul 2 2018, 2:53 PM
TJones raised the priority of this task from High to Needs Triage.
TJones updated the task description.
TJones added a subscriber: Aklapper.
TJones claimed this task. Feb 7 2019, 10:12 PM
TJones moved this task from Language Stuff to Current work on the Discovery-Search board.
TJones updated the task description.
TJones added a comment. Feb 8 2019, 1:24 AM

First draft done. Full details on MediaWiki.

Summary: Esperanto is missing, Serbian is broken. :( Chinese surrogates have been fixed! The ICU tokenizer has been updated. My sample is so old that many changes we made last year show up.

Issues:

  • Serbian: Elasticsearch reports that "extra-analysis-serbian / 6.5.4-SNAPSHOT" is installed, but when I try to reindex, I get an error: Creating index...⧼Unknown filter type [serbian_stemmer] for [scstemmer]⧽
  • Esperanto isn't in my original samples because it didn't have custom processing at the time I took them. However, I noticed that there was no Esperanto plugin in the new ES 6 batch of plugins.
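
The Serbian failure above is a filter definition whose `type` is not registered by any installed plugin. A minimal sketch of a pre-flight check for that condition follows. It is an assumption-laden illustration, not how Elasticsearch or CirrusSearch performs the check: the function `unknown_filter_types` and the `available_types` set are hypothetical, and only the `analysis` / `filter` / `type` nesting follows the standard index settings layout.

```python
def unknown_filter_types(index_settings, available_types):
    """Return {filter_name: filter_type} for filters whose type is unavailable."""
    filters = index_settings.get("analysis", {}).get("filter", {})
    return {
        name: spec.get("type")
        for name, spec in filters.items()
        if spec.get("type") not in available_types
    }


# Hypothetical settings fragment mirroring the error reported above.
settings = {
    "analysis": {
        "filter": {
            "scstemmer": {"type": "serbian_stemmer"},
            "lowercase": {"type": "lowercase"},
        }
    }
}

# Assumed set of filter types the cluster can actually provide.
missing = unknown_filter_types(settings, {"lowercase", "asciifolding"})

for name, filter_type in missing.items():
    print(f"Unknown filter type [{filter_type}] for [{name}]")
```

Running such a check before reindexing would surface a plugin that is listed as installed but does not actually provide its filter.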

Thanks for catching this. Here is an updated version of the deb package, which should fix the issues you discovered:
https://people.wikimedia.org/~dcausse/wmf-elasticsearch-search-plugins_6.5.4-alpha6~stretch_all.deb

TJones added a comment. Feb 8 2019, 3:54 PM

Everything looks good now. Serbian (et al.) and Esperanto are working as expected. Thanks, @dcausse!

debt closed this task as Resolved. Feb 15 2019, 7:12 PM
Shizhao moved this task from Backlog to Closed on the Chinese-Sites board. Feb 18 2019, 2:34 AM