Make sure that Elastic language analysis components, our internal components, and third-party components are all working as expected in Elasticsearch 6.
Description
Status | Assigned | Task
---|---|---
Resolved | EBernhardson | T183281 [epic] ELK upgrade to 6.x (elasticsearch, kibana, logstash)
Resolved | None | T183282 [epic] Search cluster upgrade to 6.x
Resolved | None | T194199 [Epic] Prepare for Elasticsearch 6 upgrade
Resolved | TJones | T194849 Investigate language analyzers in ElasticSearch 6
Resolved | TJones | T214439 Review Manually re-built Hebmorph plugin
Event Timeline
I've pulled snapshots of Wikipedia and Wiktionary text for the languages below, and established a baseline analysis with our current config. These cover all the custom analysis chains, other custom config (like using the ICU tokenizer), and a number of different scripts.
Chinese, Dzongkha, English, Finnish, French, Gan, Greek, Hebrew, Italian, Japanese, Javanese, Mirandese, Polish, Russian, Rusyn, Serbian, Slovak, Swedish, Tibetan, Turkish, Ukrainian
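The baseline comparison described above can be sketched roughly as follows. This is only an illustration, not the tooling actually used for this task: it assumes each analysis run has been reduced to a mapping from an input token to its analyzed form, and the function name `diff_analysis` and the sample data are made up for the example.

```python
def diff_analysis(baseline, candidate):
    """Compare two analysis runs.

    Each run is assumed to be a dict mapping an input token to the
    analyzed form produced for it (an assumption for this sketch).
    Returns (changed, lost, gained).
    """
    shared = baseline.keys() & candidate.keys()
    # Tokens present in both runs whose analyzed form differs.
    changed = {
        tok: (baseline[tok], candidate[tok])
        for tok in shared
        if baseline[tok] != candidate[tok]
    }
    # Tokens that only appear in one of the two runs.
    lost = sorted(baseline.keys() - candidate.keys())
    gained = sorted(candidate.keys() - baseline.keys())
    return changed, lost, gained


# Made-up data, not real analyzer output.
baseline = {"Running": "run", "cats": "cat", "mice": "mice"}
candidate = {"Running": "running", "cats": "cat", "dogs": "dog"}

changed, lost, gained = diff_analysis(baseline, candidate)
```

Running the same comparison once per language sample would separate real analyzer changes from tokens that the tokenizer now splits differently.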
First draft done. Full details on MediaWiki.
Summary: Esperanto is missing, Serbian is broken. :( Chinese surrogates have been fixed! The ICU tokenizer has been updated. My sample is so old that many changes we made last year show up.
Issues:
- Serbian: Elasticsearch reports that "extra-analysis-serbian / 6.5.4-SNAPSHOT" is installed, but when I try to reindex, I get an error: Creating index...⧼Unknown filter type [serbian_stemmer] for [scstemmer]⧽
- Esperanto isn't in my original samples because it didn't have custom processing at the time I took them. However, I noticed that there was no Esperanto plugin in the new ES 6 batch of plugins.
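A check for missing plugins like the Esperanto one could look something like the sketch below. It assumes the plugin listing comes from `GET _cat/plugins?format=json` and that each row has a `component` field; the helper name `missing_plugins`, the node name, and the Esperanto plugin name used here are hypothetical. As the Serbian case shows, a plugin being listed does not guarantee that its filters are registered, so this only catches plugins that are absent altogether.

```python
import json


def missing_plugins(cat_plugins_json, expected):
    """Return the expected plugin names that are absent from a listing.

    cat_plugins_json is assumed to be the JSON text returned by
    `_cat/plugins?format=json`: a list of rows with a "component" key.
    """
    rows = json.loads(cat_plugins_json)
    installed = {row["component"] for row in rows}
    return sorted(set(expected) - installed)


# Sample listing; the node name is invented for this example.
listing = json.dumps([
    {"name": "node-1", "component": "extra-analysis-serbian",
     "version": "6.5.4-SNAPSHOT"},
])

# "extra-analysis-esperanto" is a hypothetical name used for illustration.
expected = ["extra-analysis-serbian", "extra-analysis-esperanto"]

absent = missing_plugins(listing, expected)
```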
Thanks for catching this, here is an updated version of the deb package which should fix the issues you discovered:
https://people.wikimedia.org/~dcausse/wmf-elasticsearch-search-plugins_6.5.4-alpha6~stretch_all.deb
Everything looks good now. Serbian (et al.) and Esperanto are working as expected. Thanks, @dcausse!