Review the Hebrew analyzers previously found and look for others. Then we'll test the analyzers to see whether they really are better.
Description
Status | Assigned | Task
---|---|---
Invalid | None | T174065 [FY 2017-18 Objective] Improve support for searching in multiple languages
Open | None | T154511 [Tracking] Research, test, and deploy new language analyzers
Resolved | TJones | T162739 [Research spike, 4 hours] Research Hebrew language analyzers
Resolved | TJones | T162741 Test and analyze new Hebrew language analyzers
Resolved | Gehel | T167057 Deploy HebMorph Plugin to production
Resolved | dcausse | T167058 Re-index Hebrew-language wikis
Resolved | debt | T71361 Search should normalize Niqqud diacritics in Hebrew characters
Event Timeline
The short version: the HebMorph-based analyzer it is!
https://github.com/synhershko/elasticsearch-analysis-hebrew (5 days)
- Based on HebMorph, linked by Elastic, available for ES5.3
- Offers a separate lemmatizer, a Niqqud (diacritic) character filter (allowing for unpacking if needed), and several levels of analyzers.
- Commercial option includes proprietary dictionary.
- Does not have an obvious stop word list, but it'll be easy enough to tell if there is one when I do analysis later.
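To make concrete what a Niqqud character filter does, here is a minimal Python sketch. It is not the HebMorph implementation, just an illustration under my own assumptions: the function name `strip_niqqud` is hypothetical, and the rule used (drop nonspacing marks in the U+0591 to U+05C7 range of the Unicode Hebrew block) is my approximation of "Niqqud and cantillation marks". Precomposed presentation forms (U+FB1D and up) are not handled.

```python
import unicodedata


def strip_niqqud(text: str) -> str:
    """Remove Hebrew points and cantillation marks, keeping base letters.

    Assumption: every nonspacing mark (category "Mn") between U+0591 and
    U+05C7 is a diacritic to drop. Punctuation in that range, such as the
    maqaf (U+05BE), is category "Pd"/"Po" and is left in place.
    """
    return "".join(
        ch
        for ch in text
        if not ("\u0591" <= ch <= "\u05C7" and unicodedata.category(ch) == "Mn")
    )


# "shalom" with a shin dot, qamats, and holam, written with escapes
pointed = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"
unpointed = strip_niqqud(pointed)
```

Running the filter before tokenization means pointed and unpointed spellings of the same word index to the same string, which is the behavior T71361 asks for.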
On stopwords:
- https://greg.blog/2013/05/01/three-principles-for-multilingal-indexing-in-elasticsearch/ (2013)
- blog post recommending stopword list
- http://web.archive.org/web/20120821214110/http://wiki.korotkin.co.il/Hebrew_stopwords (2012)
- https://github.com/Automattic/wpes-lib/blob/master/src/common/class.wpes-analyzer-builder.php#L493 (2016)
- manually trimmed, current version; licensing unclear
It's pretty hard to find anything else for Hebrew...
- https://github.com/Sefaria/Sefaria-ElasticSearch : "various ElasticSearch analyzer plugins useful for analyzing ancient Hebrew"
- http://www.basistech.com/elasticsearch-documentation/ : commercial option, and I can't find anything newer than 2014