
Review Manually re-built Hebmorph plugin
Closed, Resolved · Public

Description

@dcausse updated and rebuilt the Hebmorph analysis plugin for ES 6 [zip file]. We should test it to make sure there aren't any analysis deficiencies before we consider deploying it for ES6.

Event Timeline

TJones created this task. · Jan 22 2019, 9:19 PM
Restricted Application added a subscriber: Aklapper. · Jan 22 2019, 9:19 PM
TJones added subscribers: Smalyshev, Matanya. · Edited · Jan 23 2019, 10:48 PM

I had previously extracted 500 Hebrew Wikipedia articles and 500 Hebrew Wiktionary items and analyzed them with the ES 5 Hebmorph analyzers for regression testing. Re-running them with this version built for ES 6.5.4, I see:

  • There were no differences in the 58,632 tokens* from Hebrew Wiktionary.
  • There were 2 differences in the 475,020 tokens* from Hebrew Wikipedia.

(* Note that many Hebrew words are analyzed with multiple tokens by the Hebrew analyzer, so the total number of original words in the text is considerably lower.)

The differences in tokens are below. The format is:

  • <original token>
    • <sample_count> - <multiple|stemmed|tokens>

Differences are bolded (אורנה vs. ארנה, and אותה vs. אתה). In both cases, the ES 5 version has a stem that starts with "או" while the ES 6 version only has "א".

ES 5:

  • אירינה
    • 2 - אורן|אורנה|אייר|אירינה|ארה|ארון
  • איתה
    • 12 - אות|אותה|איית|איתה|את

ES 6:

  • אירינה
    • 2 - אורן|אייר|אירינה|ארה|ארון|ארנה
  • איתה
    • 12 - אות|איית|איתה|את|אתה
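The comparison above can be reproduced with a small set difference over the stem lists. This is only a minimal sketch, not the tooling actually used for the regression test: the two dictionaries are copied by hand from the lists above, and the `diff_stems` helper is a hypothetical name introduced for illustration.

```python
# Analyzer output copied from the lists above: original token -> "|"-joined stems.
es5 = {
    "אירינה": "אורן|אורנה|אייר|אירינה|ארה|ארון",
    "איתה": "אות|אותה|איית|איתה|את",
}
es6 = {
    "אירינה": "אורן|אייר|אירינה|ארה|ארון|ארנה",
    "איתה": "אות|איית|איתה|את|אתה",
}


def diff_stems(old, new):
    """Return {token: (stems only in old, stems only in new)} for changed tokens.

    Hypothetical helper for illustration; not part of Hebmorph or Elasticsearch.
    """
    changes = {}
    for token in sorted(old.keys() | new.keys()):
        # Split the "|"-joined stems; drop the empty string left by a missing token.
        old_stems = set(old.get(token, "").split("|")) - {""}
        new_stems = set(new.get(token, "").split("|")) - {""}
        if old_stems != new_stems:
            changes[token] = (
                sorted(old_stems - new_stems),
                sorted(new_stems - old_stems),
            )
    return changes


changes = diff_stems(es5, es6)
for token, (only_old, only_new) in changes.items():
    print(token, "| ES 5 only:", only_old, "| ES 6 only:", only_new)
```

Comparing sets rather than the raw strings ignores the ordering of stems, which matters here because the changed stem moves position in both lists.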

Whether it's right or wrong or hard to tell, the impact is very small: 0.002%-0.003% of types or tokens are changed (depending on what you count). I'm happy to say this is close enough to consider this a successful port to ES6.
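As a rough sanity check on that range, the token-based figure can be recomputed from the numbers quoted above. This sketch rests on an assumption: it treats the two sample counts (2 and 12) as directly comparable to the token totals, which may not match how the original figures were derived. The type-based figure cannot be reproduced here because the number of distinct types is not stated.

```python
# Totals quoted in the comment above.
wiktionary_tokens = 58_632
wikipedia_tokens = 475_020

# Assumption: the sample counts of the two changed tokens (2 and 12)
# are counted in the same units as the totals above.
affected_samples = 2 + 12


def pct(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


share_wikipedia = pct(affected_samples, wikipedia_tokens)
share_combined = pct(affected_samples, wikipedia_tokens + wiktionary_tokens)

print(f"Wikipedia only: {share_wikipedia:.4f}%")
print(f"Both corpora:   {share_combined:.4f}%")
```

Under that assumption both figures fall between 0.002% and 0.003%, consistent with the range stated above.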

That said, if @Smalyshev, @Matanya, or anyone else has any thoughts or insight into these stemmed versions of these two tokens, I'd like to hear them.

Looking at these two tokens, both groups look a bit odd. אירינה seems to be a name, so it should not be grouped with anything. איתה does not really belong with either of the bolded words, though both are fairly close to it (איתה is "with her", אותה is "her", אתה is "you"), so I see no obvious way to prefer either. So it's different, but I wouldn't say any worse.

I'm happy to say this is close enough to consider this a successful port to ES6.

I agree.

Cool! Thanks, Stas!

dcausse triaged this task as Normal priority. · Jan 24 2019, 11:26 AM
debt closed this task as Resolved. · Jan 26 2019, 7:25 PM