Test and analyze new Hebrew language analyzers
Closed, Resolved (Public)

Assigned To

Authored By

	TJones
	Apr 11 2017, 7:44 PM

Description

After the research in T162739 has found some analyzers for Hebrew that are potentially better, we will test them, and analyze to see if they are better or not. If they are, we will file a task to deploy one of them.

Details

	Subject	Repo	Branch	Lines +/-
	Enable Hebrew Analysis	mediawiki/extensions/CirrusSearch	master	+27 -16


Related Objects

Status	Assigned	Task
Invalid	None	T174065 [FY 2017-18 Objective] Improve support for searching in multiple languages
Open	None	T154511 [Tracking] Research, test, and deploy new language analyzers
Resolved	TJones	T162739 [Research spike, 4 hours] Research Hebrew language analyzers
Resolved	TJones	T162741 Test and analyze new Hebrew language analyzers
Resolved	Gehel	T167057 Deploy HebMorph Plugin to production
Resolved	dcausse	T167058 Re-index Hebrew-language wikis
Resolved	debt	T71361 Search should normalize Niqqud diacritics in Hebrew characters

Event Timeline

TJones created this task. Apr 11 2017, 7:44 PM

TJones mentioned this in T154511: [Tracking] Research, test, and deploy new language analyzers.

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board. Apr 11 2017, 9:14 PM

Liuxinyu970226 added subscribers: Guycn2, Amire80. May 13 2017, 3:10 PM

My write-up is available.

There's a live demo.

And there's been some discussion over on the Hebrew Village Pump.

The consensus seems to be that HebMorph is a net positive. It has some problems, but those are mostly due to the complexity of Hebrew language analysis: the lack of vowels and the regular affixing make for a lot of ambiguity.

Next steps: deploy the analyzer plugin (T167057) and re-index the Hebrew-language wikis (T167058).

I've also added a note to the ticket to update Vagrant (T164367) and the recurring index-update task (T147505).

TJones mentioned this in T167057: Deploy HebMorph Plugin to production. Jun 5 2017, 6:36 PM

TJones created subtask T167057: Deploy HebMorph Plugin to production.

TJones mentioned this in T167058: Re-index Hebrew-language wikis. Jun 5 2017, 6:41 PM

Very, very, curious.

Thanks a lot for this work.

It doesn't resolve T75862 (וטריפלקס, which means "and triplex"), but it does produce many more results for "והאבולוציה" ("and the evolution"), which is given as an example in your writeup. I guess that it's because "טריפלקס" (triplex) is not included in HebMorph's dictionary, but "אבולוציה" is. This raises two questions:

(1) Could HebMorph be smarter and at least attempt to analyze ו as a prefix, and then try searching for "טריפלקס" ("triplex")? Or would it be too hard?
(2) Could we at some point get a report of Hebrew words that HebMorph is unable to analyze and add them to HebMorph's dictionary? I realize that it could be thousands of words, but it would be nice to at least try. That would be such an awesome crowdsourcing project. (And of course something similar could be done for other languages as well.)

Neither of the above questions is a blocker for further deployment. This already looks like an improvement.

(Also, it's nice to see that HebMorph is alive. I first ran into it seven years ago, and there was even some talk about me joining that project, but I ended up not doing anything with it. It's wonderful that Itamar keeps maintaining it!)

@Amire80:

(1) I haven't dug too deeply into the HebMorph code. But, given the ambiguity of Hebrew, I could see being cautious about automatically removing possible prefixes. English Wiktionary tells me that ויקיפדיה is "Wikipedia" (I picked it at random, really!). If that wasn't in the lexicon, would we want it to also guess that יקיפדיה is a word? It's the classic trade-off between recall and precision: do you want to get every possible answer (along with a bunch of extra junk), or get only right answers (but miss a bunch of other right answers)? It's never easy to find the perfect balance.

It shouldn't be technically hard to remove plausible prefixes, but since HebMorph doesn't seem to do that, I'm guessing that either it generates a lot more junk than useful stuff, or the developer heavily favors precision over recall.
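To make the trade-off concrete, here is a minimal sketch of prefix stripping with a lexicon check. It is not HebMorph's code or API: the function name `candidate_forms`, the `guess_unknown` flag, and the plain-set lexicon are all hypothetical, and only the single prefix ו ("and") is handled.

```python
# Hypothetical sketch: NOT HebMorph's actual behavior or API.
# Only the conjunction prefix vav ("and") is considered here.
PREFIXES = ("ו",)


def candidate_forms(token, lexicon, prefixes=PREFIXES, guess_unknown=False):
    """Return the forms a search could try for `token`.

    The surface form is always kept. A prefix-stripped form is added when
    the remainder is in `lexicon` (precision-oriented), or unconditionally
    when `guess_unknown` is True (recall-oriented, but it also produces
    junk for words that merely begin with the prefix letter).
    """
    candidates = [token]
    for prefix in prefixes:
        # Require at least two letters to remain after stripping.
        if token.startswith(prefix) and len(token) > len(prefix) + 1:
            remainder = token[len(prefix):]
            if remainder in lexicon or guess_unknown:
                candidates.append(remainder)
    return candidates
```

With a toy lexicon containing only אבולוציה, the default mode leaves וטריפלקס alone, while `guess_unknown=True` also yields טריפלקס. The same setting turns ויקיפדיה into the non-word יקיפדיה, which is exactly the kind of junk a precision-leaning analyzer avoids.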

(2) I'm not quite sure how to identify potential words that are not in the lexicon.

First, there's a commercial/non-commercial wrinkle (see the GitHub page): HebMorph comes in a commercial version with a proprietary dictionary (which might very well include the word for "triplex"). The version we use is the non-commercial version, which uses the Hspell dictionary. Hspell seems to be updated every three years—so it's due for an upgrade in 2018. Maybe we could get some new words in there!

The other problem is how to identify unknown words. Hspell's word lists are available, though there is more there than just a simple list. That would take some figuring out. Another option would be to try to deduce whether a word is known or not based on how it is processed—but that gets tricky quickly.

One hacky approach would be to take a frequency-based list from Hebrew Wikipedia (or a sample thereof) and see which words Hspell recognizes. Anything it doesn't could be vetted by a Hebrew speaker or two. Frequency sorting would hopefully float better candidates to the top and typos and other junk to the bottom.
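A rough sketch of that frequency-based approach might look like the following. Everything here is an assumption for illustration: `known_words` is a plain set standing in for whatever check against Hspell turns out to be practical (a real check would also have to accept inflected and prefixed forms), the input is assumed to be unpointed text, and `min_count` is a made-up threshold for dropping one-off typos.

```python
import re
from collections import Counter

# Runs of Hebrew letters (alef through tav). Unpointed text is assumed;
# niqqud marks are outside this range and would split a word.
HEBREW_WORD = re.compile(r"[\u05D0-\u05EA]+")


def unknown_words_by_frequency(text, known_words, min_count=2):
    """List (word, count) pairs for words not in `known_words`.

    The result is sorted by descending count, then alphabetically, so the
    most frequent unknown words come first for a Hebrew speaker to vet.
    """
    counts = Counter(HEBREW_WORD.findall(text))
    unknown = [
        (word, count)
        for word, count in counts.items()
        if word not in known_words and count >= min_count
    ]
    unknown.sort(key=lambda pair: (-pair[1], pair[0]))
    return unknown
```

The output would only be a candidate list. Anything near the top still needs a human to decide whether it is a real word, a name, or a common misspelling.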

If you just want to add words that people come across ad hoc, Hspell mentions that they take submissions on their FAQ page.

To further complicate matters, it looks like HebMorph compiles the Hspell dictionary into a more convenient format, which means that even updates to Hspell are not necessarily going to get into HebMorph quickly. Older versions of HebMorph required downloading the Hspell files separately. They were up to v1.2, while Hspell is currently at v1.3. I'm not sure if the newer version of HebMorph uses the same v1.2 files, or an updated compiled version of v1.3.

The last option would be to fork the Hspell compiled files and update them independently. However, that's not future-proof, since a future version of HebMorph could use a different format, and then we'd have to scramble to figure out the format and update our fork to be compatible.

If we were only supporting Hebrew, it might be worth it to go to so much effort, but it's probably implausible to do so for multiple languages. My approach has been to try to inform the developers of shortcomings and hope that improvements come out in future versions. It didn't work for Stempel (for Polish), but it did work for STConvert (for Chinese).

For HebMorph, I'd suggest asking whether they are using v1.3 of Hspell, and if not, encouraging them to do so (and since that's easy, I've asked!), while also reporting missing words to Hspell for incorporation into future versions, which will hopefully filter into future versions of HebMorph.

In T162741#3316411, @TJones wrote:

@Amire80:

(1) I haven't dug too deeply into the HebMorph code. But, given the ambiguity of Hebrew, I could see being cautious about automatically removing possible prefixes. English Wiktionary tells me that ויקיפדיה is "Wikipedia" (I picked it at random, really!). If that wasn't in the lexicon, would we want it to also guess that יקיפדיה is a word? It's the classic trade-off between recall and precision: do you want to get every possible answer (along with a bunch of extra junk), or get only right answers (but miss a bunch of other right answers)? It's never easy to find the perfect balance.

For the particular case of this prefix, this would be OK, because there are few words where it appears at the beginning as part of the word itself, so it's likely to be a prefix. ויקיפדיה is an exception, not the rule.

There are several other prefixes where it wouldn't work, however.

(There's an amusing story about this: Wikidata is ויקינתונים [vikinetunim], and Google Translate correctly translates it as "and hyacinths". I heard that there are hyacinths in the Wikimedia Germany office because of that.)

(2) I'm not quite sure how to identify potential words that are not in the lexicon.

First, there's a commercial/non-commercial wrinkle (see the GitHub page): HebMorph comes in a commercial version with a proprietary dictionary (which might very well include the word for "triplex"). The version we use is the non-commercial version, which uses the Hspell dictionary. Hspell seems to be updated every three years—so it's due for an upgrade in 2018. Maybe we could get some new words in there!

The other problem is how to identify unknown words. Hspell's word lists are available, though there is more there than just a simple list. That would take some figuring out. Another option would be to try to deduce whether a word is known or not based on how it is processed—but that gets tricky quickly.

One hacky approach would be to take a frequency-based list from Hebrew Wikipedia (or a sample thereof) and see which words Hspell recognizes. Anything it doesn't could be vetted by a Hebrew speaker or two. Frequency sorting would hopefully float better candidates to the top and typos and other junk to the bottom.

This sounds right.

If you just want to add words that people come across ad hoc, Hspell mentions that they take submissions on their FAQ page.

Yeah, they already added a few that I sent them several years ago :)

I should do it more often. And if it can be done on a larger scale, I'm sure there will be several Hebrew Wikipedians who could help vet it.

For HebMorph, I'd suggest asking whether they are using v1.3 of Hspell, and if not, encouraging them to do so (and since that's easy, I've asked!), while also reporting missing words to Hspell for incorporation into future versions, which will hopefully filter into future versions of HebMorph.

Thanks!

I've already gotten a reply on the Hspell version. It looks like the open source version of HebMorph is sticking to Hspell v1.2, and future development is on the proprietary dictionary.

So, our options seem to be limited to forking and improving the dictionary independently, either by updating to Hspell v1.3, or manually adding new words. That is a bummer.

Change 357299 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Enable Hebrew Analysis

https://gerrit.wikimedia.org/r/357299

gerritbot added a project: Patch-For-Review. Jun 5 2017, 9:11 PM

TJones moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. Jun 6 2017, 1:43 PM

Change 357299 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Enable Hebrew Analysis

https://gerrit.wikimedia.org/r/357299

ReleaseTaggerBot added a project: MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)). Jun 8 2017, 1:00 PM

TJones moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Jun 8 2017, 2:20 PM

debt closed this task as Resolved. Jun 16 2017, 5:23 PM

TJones mentioned this in T71361: Search should normalize Niqqud diacritics in Hebrew characters. Jun 27 2017, 9:06 PM

debt mentioned this in T147505: [tracking] CirrusSearch: what is updated during re-indexing. Jul 11 2017, 5:47 PM

debt closed subtask T167057: Deploy HebMorph Plugin to production as Resolved. Sep 22 2017, 3:16 PM