Review the Hebrew analyzers found previously and look for others. Then we'll test the analyzers to see whether they really are better.
|Status||Assigned||Task|
|Invalid||None||T174065 [FY 2017-18 Objective] Improve support for searching in multiple languages|
|Open||None||T154511 [Tracking] Research, test, and deploy new language analyzers|
|Resolved||TJones||T162739 [Research spike, 4 hours] Research Hebrew language analyzers|
|Resolved||TJones||T162741 Test and analyze new Hebrew language analyzers|
|Resolved||Gehel||T167057 Deploy HebMorph Plugin to production|
|Resolved||dcausse||T167058 Re-index Hebrew-language wikis|
|Resolved||debt||T71361 Search should normalize Niqqud diacritics in Hebrew characters|
The short version: the HebMorph-based analyzer it is!
https://github.com/synhershko/elasticsearch-analysis-hebrew (5 days)
- Based on HebMorph, linked by Elastic, available for ES5.3
- Offers a separate lemmatizer, a Niqqud (diacritic) character filter (allowing for unpacking if needed), and several levels of analyzers.
- Commercial option includes proprietary dictionary.
- Does not have an obvious stop word list, but it'll be easy enough to tell if there is one when I do analysis later.
- https://greg.blog/2013/05/01/three-principles-for-multilingal-indexing-in-elasticsearch/ (2013)
  - blog post recommending stopword list
- http://web.archive.org/web/20120821214110/http://wiki.korotkin.co.il/Hebrew_stopwords (2012)
- https://github.com/Automattic/wpes-lib/blob/master/src/common/class.wpes-analyzer-builder.php#L493 (2016)
  - manually trimmed, current version; licensing unclear
It's pretty hard to find anything else for Hebrew...
- https://github.com/Sefaria/Sefaria-ElasticSearch : "various ElasticSearch analyzer plugins useful for analyzing ancient Hebrew"
- http://www.basistech.com/elasticsearch-documentation/ : commercial option, and I can't find anything newer than 2014
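
For anyone unfamiliar with what a Niqqud character filter does, here is a minimal Python sketch of the idea. It is not the HebMorph plugin's implementation. The function name `strip_hebrew_marks` and the choice to strip every nonspacing mark in U+0591..U+05C7 (niqqud plus cantillation marks) are assumptions made for illustration only.

```python
import unicodedata

# Assumption: every nonspacing mark (category Mn) in U+0591..U+05C7 is
# strippable. That covers niqqud and cantillation marks. Maqaf (U+05BE) and
# the other punctuation in this range are not category Mn, so they are kept.
HEBREW_MARKS_START = 0x0591
HEBREW_MARKS_END = 0x05C7


def strip_hebrew_marks(text: str) -> str:
    """Return text with Hebrew points and cantillation marks removed."""
    return "".join(
        ch
        for ch in text
        if not (
            HEBREW_MARKS_START <= ord(ch) <= HEBREW_MARKS_END
            and unicodedata.category(ch) == "Mn"
        )
    )


# "shalom" written with niqqud: shin + shin dot + qamats, lamed, vav + holam, final mem
pointed = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
print(strip_hebrew_marks(pointed))
```

With a filter like this applied at both index and query time, pointed and unpointed spellings of the same word produce the same token. Precomposed presentation forms (U+FB1D and up) are not handled here.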