@dcausse updated and rebuilt the Hebmorph analysis plugin for ES 6 [zip file]. We should test it to make sure there aren't any analysis deficiencies before we consider deploying it for ES6.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | EBernhardson | T183281 [epic] ELK upgrade to 6.x (elasticsearch, kibana, logstash)
Resolved | | None | T183282 [epic] Search cluster upgrade to 6.x
Resolved | | None | T194199 [Epic] Prepare for Elasticsearch 6 upgrade
Resolved | | TJones | T194849 Investigate language analyzers in ElasticSearch 6
Resolved | | TJones | T214439 Review Manually re-built Hebmorph plugin
Event Timeline
I had previously extracted 500 Hebrew Wikipedia articles and 500 Hebrew Wiktionary items and analyzed them with the ES 5 Hebmorph analyzers for regression testing. Re-running them with this version built for ES 6.5.4, I see:
- There were no differences in the 58,632 tokens* from Hebrew Wiktionary.
- There were 2 differences in the 475,020 tokens* from Hebrew Wikipedia.
(* Note that many Hebrew words are analyzed with multiple tokens by the Hebrew analyzer, so the total number of original words in the text is considerably lower.)
The differences in tokens are below. The format is:
- <original token>
- <sample_count> - <multiple|stemmed|tokens>
Differences are bolded (אורנה vs. ארנה, and אותה vs. אתה). In both cases, the ES 5 version has a stem that starts with "או" while the ES 6 version has only "א".
ES 5:
- אירינה
- 2 - אורן|**אורנה**|אייר|אירינה|ארה|ארון
- איתה
- 12 - אות|**אותה**|איית|איתה|את
ES 6:
- אירינה
- 2 - אורן|אייר|אירינה|ארה|ארון|**ארנה**
- איתה
- 12 - אות|איית|איתה|את|**אתה**
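For anyone who wants to repeat this kind of comparison, here is a minimal sketch of the diff step in Python. It is not the tooling used for the analysis above: the function names and the dict-of-strings input shape are assumptions for illustration, and the data is just the two entries listed here. It treats each pipe-separated stem list as a set, so a change in sort order alone is not reported as a difference.

```python
def parse_stems(stem_list):
    """Split a pipe-separated stem list into a set (order is ignored)."""
    return set(stem_list.split("|"))


def diff_stems(old, new):
    """Return {token: (stems only in old, stems only in new)} for tokens
    whose stem sets differ between the two analyzer runs.

    `old` and `new` map an original token to its pipe-separated stems.
    This input shape is an assumption made for the example.
    """
    changed = {}
    for token in sorted(old.keys() & new.keys()):
        old_stems = parse_stems(old[token])
        new_stems = parse_stems(new[token])
        if old_stems != new_stems:
            changed[token] = (
                sorted(old_stems - new_stems),
                sorted(new_stems - old_stems),
            )
    return changed


# The two entries reported above, copied from the ES 5 and ES 6 lists.
es5 = {
    "אירינה": "אורן|אורנה|אייר|אירינה|ארה|ארון",
    "איתה": "אות|אותה|איית|איתה|את",
}
es6 = {
    "אירינה": "אורן|אייר|אירינה|ארה|ארון|ארנה",
    "איתה": "אות|איית|איתה|את|אתה",
}

changes = diff_stems(es5, es6)
for token, (removed, added) in changes.items():
    print(token, "removed:", removed, "added:", added)
```

Comparing sets rather than raw strings matters here because the replacement stem sorts to a different position in the ES 6 output, so a plain string comparison would not isolate the one stem that changed.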
Whether it's right or wrong or hard to tell, the impact is very small: 0.002%-0.003% of types or tokens are changed (depending on what you count). I'm happy to say this is close enough to consider this a successful port to ES6.
That said, if @Smalyshev, @Matanya, or anyone else has any thoughts or insight into these stemmed versions of these two tokens, I'd like to hear them.
Looking at these two tokens, both groups look a bit weird. אירינה seems to be a name, so it should not be grouped with anything. איתה does not really belong with any of the other words in its group either, but both bolded words are kinda close to it (איתה is "with her", אותה is "her", אתה is "you"), so I see no obvious way to prefer either. So it's different, but I wouldn't say it's any worse.
> I'm happy to say this is close enough to consider this a successful port to ES6.
I agree.