Jan 9 2018
May 13 2017
JFYI: I've pushed some changes to Ukrainian analyzer in Lucene:
- dictionary moved to external dependency
- fixed problem with searching some proper nouns
- added thousands of new words
- added normalization for a couple more apostrophe symbols that occur in texts
- ignore U+00AD (soft hyphen) in words
This change should appear in Lucene 6.6 and 7. I'll follow up and make a note when it makes it into Elasticsearch.
Mar 31 2017
As for the first sample of 100 groups, it looks good; I would say it's almost perfect (if I say so myself :)). I agree with @Piramidion here that the only flaw is merging abbreviations with normal words. The reason is that the common approach in Lucene analyzers is to lowercase the text first and then do the stemming, so we can't use case as a hint. We actually experimented with lemmatizing first and then converting to lowercase, but this approach has lots of limitations and is not acceptable.
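A toy Python sketch of why the order matters. The three-entry lexicon is made up for the illustration (it is not taken from the real dictionary), and the function names are hypothetical.

```python
# Toy lexicon, invented for this illustration: surface form -> lemma.
# "СУМ" stands in for an upper-case abbreviation, "сум"/"суму" for an ordinary noun.
TOY_LEMMAS = {
    "СУМ": "СУМ",
    "сум": "сум",
    "суму": "сум",
}


def lemmatize(token: str) -> str:
    """Look the token up in the toy lexicon; unknown tokens pass through."""
    return TOY_LEMMAS.get(token, token)


def lowercase_then_lemmatize(tokens):
    """Lowercase first, then lemmatize: case information is already lost."""
    return [lemmatize(t.lower()) for t in tokens]


def lemmatize_case_sensitive(tokens):
    """Lemmatize the original tokens, so the abbreviation stays distinct."""
    return [lemmatize(t) for t in tokens]
```

In the first pipeline both `"СУМ"` and `"суму"` end up as `"сум"`, so the abbreviation lands in the same group as the ordinary word. The second pipeline keeps them apart, but only for input whose casing can be trusted, which hints at the limitations mentioned above.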
Mar 10 2017
Hi, I am the author and maintainer of the Ukrainian dictionary that is used in Lucene's Ukrainian analyzer, and I'd like to note two things:
- this analyzer is very close to the Polish one - both use a dictionary in morfologik format (and both are used for grammar checking in LanguageTool), so if Polish worked, I have high hopes for the Ukrainian one as well
- I'd like to hear about any problems that arise from using this analyzer; hopefully we can address most of them (though, as I understand it, if we fix them in Lucene it may take a while for the fixes to get here)
Oct 13 2016
As I understand it, once the next version of Lucene is released, Elasticsearch will have the Ukrainian analyzer available. Would we need to create another ticket here in Phabricator to switch to it for Ukrainian?
In general, the "recommended" apostrophe for Ukrainian should probably be U+02BC (because it is part of the word); U+02BC is also the approved apostrophe character for Ukrainian in internationalized domain names. But the majority of Ukrainian texts out there use U+0027 and (a bit less often) U+2019, and it will probably stay this way for a long time, as most users have only ' on their keyboards (and some word processors may change it to U+2019). I would say we want to support U+02BC the same way we do U+0027 and U+2019.
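For reference, a short Python sketch that prints how Unicode itself classifies the three characters. The helper name `describe` is just for this example; the names and categories come from Python's standard `unicodedata` module.

```python
import unicodedata

# The three apostrophe candidates discussed above.
CANDIDATES = ["\u0027", "\u2019", "\u02BC"]


def describe(ch: str):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


if __name__ == "__main__":
    for ch in CANDIDATES:
        print(*describe(ch), sep="\t")
```

Only U+02BC has a letter category (`Lm`, modifier letter); U+0027 and U+2019 are punctuation (`Po` and `Pf`). That is the sense in which U+02BC is "part of the word", and it is also why a tokenizer that splits on punctuation can break a word at the other two unless they are handled explicitly.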