
Update and/or enable custom entries for Hebmorph dictionary
Open, Stalled, Low, Public

Description

The original, specific description is below, but the required solution is much more general: we need a way to upgrade the HebMorph dictionary to a newer version of Hspell and/or to add custom entries to the dictionary.


In Hebrew, prepositions and conjunctions are usually attached as prefixes to the word that follows them. For example, "טריפלקס" means "triplex" and "וטריפלקס" means "and triplex": the letter "ו" is added at the beginning.

Ideally, the search engine should be smart enough to know about morphology and understand that "טריפלקס" should also yield results with "וטריפלקס".
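As a rough illustration of the behavior requested above, here is a minimal Python sketch that expands a token into the forms obtained by dropping leading one-letter prefixes. It is only an assumption-laden toy: the `PREFIX_LETTERS` set, the function name `candidate_forms`, and its length limits are hypothetical and are not taken from HebMorph, Hspell, or CirrusSearch.

```python
# Hypothetical illustration only; not how HebMorph or CirrusSearch work.
# Common one-letter Hebrew prefixes (conjunction, prepositions, article, relativizer).
PREFIX_LETTERS = set("ובלכמהש")


def candidate_forms(token: str, max_prefix_len: int = 2, min_stem_len: int = 2) -> list[str]:
    """Return the token plus each form obtained by dropping leading prefix letters."""
    forms = [token]
    for i in range(1, max_prefix_len + 1):
        # Stop if the remaining stem would be too short to be a plausible word.
        if len(token) - i < min_stem_len:
            break
        # Stop as soon as a leading letter cannot be a prefix.
        if token[i - 1] not in PREFIX_LETTERS:
            break
        forms.append(token[i:])
    return forms


if __name__ == "__main__":
    # The prefixed form expands to itself and the bare word.
    print(candidate_forms("וטריפלקס"))
    # The bare word does not start with a prefix letter, so it stays as-is.
    print(candidate_forms("טריפלקס"))
```

Indexing both the original token and its stripped forms would let a search for "טריפלקס" match text containing "וטריפלקס", at the cost of spurious matches when a leading letter is really part of the stem.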

The situation with Arabic is very similar, and possibly there are more such languages.

Event Timeline

Amire80 raised the priority of this task from to Needs Triage.
Amire80 updated the task description.
Amire80 added a project: MediaWiki-Search.
Amire80 changed Security from none to None.
Amire80 subscribed.

So it looks like Hebrew works differently now (see T167058), and it's probably good.

But this particular bug is still not resolved, probably because "טריפלקס" is a relatively rare loanword.

I'm wondering: Is there a way to add words to the dictionary?

For better or worse we discussed this a while back.

My summary after skimming that conversation is:

  • automatic prefix-detection is certainly possible, but likely to cause errors, which is probably why HebMorph (the new plugin) doesn't do it.
  • HebMorph isn't upgrading their open-source dictionary because they have upgraded their proprietary dictionary
  • HebMorph uses an old version of the Hspell dictionary, so the newer one would probably be better, but it is compiled in a custom way for HebMorph, so it's non-trivial to do so.
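The trade-off in the summary above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not HebMorph's algorithm or API: `naive_strip`, `analyze`, and the tiny lexicons are hypothetical. It shows why unconditional prefix stripping produces errors, and how a dictionary-gated analyzer starts handling a loanword once a custom entry is added.

```python
# Hypothetical toy model; names and data are assumptions, not HebMorph internals.
PREFIX_LETTERS = set("ובלכמהש")


def naive_strip(token: str) -> str:
    """Drop one leading prefix letter unconditionally (error-prone)."""
    if len(token) > 2 and token[0] in PREFIX_LETTERS:
        return token[1:]
    return token


def analyze(token: str, lexicon: set[str]) -> list[str]:
    """Return the lexicon entries that the token could be an instance of."""
    lemmas = []
    if token in lexicon:
        lemmas.append(token)
    # Only accept a stripped form if the dictionary knows the remaining stem.
    if len(token) > 2 and token[0] in PREFIX_LETTERS and token[1:] in lexicon:
        lemmas.append(token[1:])
    return lemmas


if __name__ == "__main__":
    base_lexicon = {"בית"}          # "house"; begins with a letter that can also be a prefix
    custom_entries = {"טריפלקס"}    # loanword missing from the base dictionary

    # Naive stripping mangles a word whose first letter belongs to the stem.
    print(naive_strip("בית"))

    # Without the custom entry the prefixed loanword gets no analysis.
    print(analyze("וטריפלקס", base_lexicon))

    # With the custom entry it resolves to the bare word.
    print(analyze("וטריפלקס", base_lexicon | custom_entries))
```

In this model the quality of the results depends entirely on the lexicon, which is why the ability to update the dictionary or add custom entries matters.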

My conclusion was:

Our options seem to be limited to forking and improving the dictionary independently, either by updating to Hspell v1.3, or manually adding new words. That is a bummer.

It might be possible to update and compile the dictionary independently of the plugin code, but that would make deployment more complex, since we'd have to deploy the code and dictionary updates separately. We've made updates just recently to make some complexity easier to handle—which is why enabling the new Hebrew plugin took several months—but I'm not sure how hard it would be to handle separate plugin/dictionary deployments. @dcausse might be able to say—but if it's possible to support, it would still require someone to take on the task of figuring out how to compile Hspell into a form HebMorph can use, and then maintaining it as needed.

Sure, it's certainly feasible. I currently don't know enough about HebMorph to say precisely how we would do it.
Deploying lexical resources in addition to the HebMorph plugin is certainly possible. We could create a new static project called 'wmf-search-dictionnaries' and add all the lexical resources there.
This project would be deployed as a Debian package on the Elasticsearch machines.
If someone is willing to make a proof of concept showing that augmenting HebMorph with a dictionary can fix some issues, I'd be happy to help make it production-ready and easy to deploy.

@Amire80, is this an issue on Hebrew MediaWiki websites in general, or only on Wikimedia sites? If the latter, then this is a problem in CirrusSearch rather than MWSearch.

The fix is only for CirrusSearch. Almost certainly, though, if you are using something like MySQL search the same problem would exist (and many more, as stemming support in SQL is exceptionally minimal).

debt triaged this task as Low priority. Oct 5 2017, 5:25 PM
debt moved this task from needs triage to search-icebox on the Discovery-Search board.
debt subscribed.

I don't think this is something we can take on right now. From a very high-level look at the tasks, forking the dictionary, fixing this one issue, and then deploying the result would be difficult and is probably not something we want to do.

Moving to later as we figure out if we want to take on this work.

We're not sure if we're going to be able to keep using HebMorph because it hasn't been released for Elasticsearch 6. @dcausse recompiled it, so we can probably move to ES6, but beyond that it's unclear, so we are unlikely to be able to put any significant effort into fixing parses for specific words.

Can this at least be marked as "stalled" or something? The difficulty in fixing is understandable, but the bug is real.

If we end up having to abandon HebMorph then either there won't be any morphological processing at all or, if we find a replacement, there will be a completely different set of specific errors. I guess we can leave it as stalled for as long as we have HebMorph. And I'll modify the description to be more generic since it isn't about this particular word, but about the ability to make additions to the HebMorph dictionary.

TJones renamed this task from Search "טריפלקס" in the Hebrew Wikipedia doesn't find an article with the word "וטריפלקס" to Update and/or enable custom entries for Hebmorph dictionary. Jan 30 2019, 9:58 PM
TJones updated the task description.
TJones updated the task description.
Aklapper subscribed.

This seems to be blocked on upstream, hence adding the Upstream tag.
https://github.com/synhershko/HebMorph/issues lists a request to update to hspell 1.4 but no request to support ES6.