Review Manually re-built Hebmorph plugin
Closed, Resolved, Public

Assigned To

Authored By

	TJones
	Jan 22 2019, 9:19 PM

Description

@dcausse updated and rebuilt the Hebmorph analysis plugin for ES 6 [zip file]. We should test it to make sure there aren't any analysis deficiencies before we consider deploying it for ES6.

Related Objects

Status	Assigned	Task
Resolved	EBernhardson	T183281 [epic] ELK upgrade to 6.x (elasticsearch, kibana, logstash)
Resolved	None	T183282 [epic] Search cluster upgrade to 6.x
Resolved	None	T194199 [Epic] Prepare for Elasticsearch 6 upgrade
Resolved	TJones	T194849 Investigate language analyzers in ElasticSearch 6
Resolved	TJones	T214439 Review Manually re-built Hebmorph plugin

Event Timeline

TJones created this task. Jan 22 2019, 9:19 PM

Restricted Application added a subscriber: Aklapper. Jan 22 2019, 9:19 PM

TJones added a parent task: T194849: Investigate language analyzers in ElasticSearch 6. Jan 22 2019, 9:20 PM

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board. Jan 23 2019, 10:29 PM

I had previously extracted 500 Hebrew Wikipedia articles and 500 Hebrew Wiktionary items and analyzed them with the ES 5 Hebmorph analyzers for regression testing. Re-running them with this version built for ES 6.5.4, I see:

There were no differences in the 58,632 tokens* from Hebrew Wiktionary.
There were 2 differences in the 475,020 tokens* from Hebrew Wikipedia.

(* Note that many Hebrew words are analyzed with multiple tokens by the Hebrew analyzer, so the total number of original words in the text is considerably lower.)

The differences in tokens are below. The format is:

<original token>
- <sample_count> - <multiple|stemmed|tokens>

The differing stems are אורנה (ES 5) vs. ארנה (ES 6) for the first token, and אותה (ES 5) vs. אתה (ES 6) for the second. In both cases, the ES 5 version has a stem that starts with "או" while the ES 6 version only has "א".

ES 5:

אירינה
- 2 - אורן|אורנה|אייר|אירינה|ארה|ארון
איתה
- 12 - אות|אותה|איית|איתה|את

ES 6:

אירינה
- 2 - אורן|אייר|אירינה|ארה|ארון|ארנה
איתה
- 12 - אות|איית|איתה|את|אתה
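As a rough illustration of the comparison above, here is a minimal Python sketch that diffs the two stem groups. The stem strings are copied from the ES 5 and ES 6 listings; the `stem_diff` helper and the dictionary layout are hypothetical and are not the tooling actually used for this review.

```python
# Stem groups copied from the ES 5 and ES 6 listings above.
es5 = {
    "אירינה": "אורן|אורנה|אייר|אירינה|ארה|ארון",
    "איתה": "אות|אותה|איית|איתה|את",
}
es6 = {
    "אירינה": "אורן|אייר|אירינה|ארה|ארון|ארנה",
    "איתה": "אות|איית|איתה|את|אתה",
}


def stem_diff(old, new):
    """Return {token: (only_in_old, only_in_new)} for tokens whose stem sets differ.

    Hypothetical helper: treats each pipe-separated group as an unordered set.
    """
    diffs = {}
    for token in sorted(old.keys() | new.keys()):
        old_stems = set(old.get(token, "").split("|")) - {""}
        new_stems = set(new.get(token, "").split("|")) - {""}
        if old_stems != new_stems:
            diffs[token] = (
                sorted(old_stems - new_stems),
                sorted(new_stems - old_stems),
            )
    return diffs


diffs = stem_diff(es5, es6)
for token, (es5_only, es6_only) in diffs.items():
    print(token, "| ES 5 only:", es5_only, "| ES 6 only:", es6_only)
```

Comparing the groups as sets rather than as strings means a pure reordering of stems would not be reported as a difference, which may or may not be what you want for a regression test.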

Whether it's right or wrong or hard to tell, the impact is very small: 0.002%-0.003% of types or tokens are changed (depending on what you count). I'm happy to say this is close enough to consider this a successful port to ES6.
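For the token-level side of that estimate, the arithmetic can be sketched as below. This assumes the sample counts (2 and 12) are token occurrences and that both differences come from the Wikipedia sample; the number of distinct types is not given in this thread, so the type-level figure is not reproduced here.

```python
# Token totals reported above.
wiktionary_tokens = 58_632
wikipedia_tokens = 475_020

# Assumption: the sample counts of the two changed tokens are occurrence counts.
changed_occurrences = 2 + 12

total_tokens = wiktionary_tokens + wikipedia_tokens

# Share of token occurrences affected, as percentages.
pct_of_all = 100 * changed_occurrences / total_tokens
pct_of_wikipedia = 100 * changed_occurrences / wikipedia_tokens

print(f"all tokens:     {pct_of_all:.4f}%")
print(f"Wikipedia only: {pct_of_wikipedia:.4f}%")
```

Under these assumptions both values fall between 0.002% and 0.003%.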

That said, if @Smalyshev, @Matanya, or anyone else has any thoughts or insight into these stemmed versions of these two tokens, I'd like to hear them.

TJones moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. Jan 23 2019, 10:48 PM

Looking at these two tokens, both groups look a bit weird. אירינה seems to be a name, so it should not be grouped with anything at all. איתה does not really belong with either of the differing stems, but both are kinda close to it (איתה is "with her", אותה is "her", אתה is "you"), so I see no obvious way to prefer either. So it's different, but I wouldn't say any worse.

> I'm happy to say this is close enough to consider this a successful port to ES6.

I agree.

Cool! Thanks, Stas!

Thanks!

dcausse triaged this task as Medium priority. Jan 24 2019, 11:26 AM

debt closed this task as Resolved. Jan 26 2019, 7:25 PM
