
dalekiy_obriy (Andriy Rysin)
User

Projects

User does not belong to any projects.

User Details

User Since
Oct 13 2016, 1:39 PM (152 w, 2 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
Dalekiy obriy [ Global Accounts ]

Recent Activity

Jan 9 2018

dalekiy_obriy added a comment to T160106: Test and analyze new Ukrainian language analyzers.

@dalekiy_obriy, that's good news! I'm looking forward to eventually getting the updates. A reminder would be very helpful, too...

Jan 9 2018, 3:11 AM · MW-1.29-release-notes, Epic, Discovery-Search (Current work), Discovery

May 13 2017

dalekiy_obriy added a comment to T160106: Test and analyze new Ukrainian language analyzers.

JFYI: I've pushed some changes to the Ukrainian analyzer in Lucene:

  1. dictionary moved to an external dependency
  2. fixed a problem with searching for some proper nouns
  3. added thousands of new words
  4. added normalization for a couple more apostrophe symbols that occur in texts
  5. ignore U+00AD (soft hyphen) in words

This change should appear in Lucene 6.6 and 7. I'll follow up and make a note when it makes it into Elasticsearch.

May 13 2017, 3:08 PM · MW-1.29-release-notes, Epic, Discovery-Search (Current work), Discovery

Mar 31 2017

dalekiy_obriy added a comment to T160106: Test and analyze new Ukrainian language analyzers.

As for the first sample of 100 groups, it looks good; I would say it's almost perfect (if I say so myself :)). I agree with @Piramidion here that the only flaw is merging abbreviations with normal words. The reason is that the common approach in Lucene analyzers is to lowercase the text first and then do the stemming, so we can't use case as a hint. We actually experimented with lemmatizing first and then converting to lowercase, but this approach has lots of limitations and is not acceptable.
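
A toy Python sketch of why lowercasing first loses the distinction. The tiny dictionary `TOY_LEMMAS` and the function `analyze_lowercase_first` are made up for illustration and have nothing to do with the real morfologik dictionary or the Lucene filter chain.

```python
# Toy lemma dictionary keyed by lowercase surface form (illustrative only).
TOY_LEMMAS = {
    "ту": "той",      # accusative feminine form of the pronoun "той"
    "книгу": "книга",
}


def analyze_lowercase_first(token: str) -> str:
    """Lowercase the token, then look up its lemma (falls back to the token)."""
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)


# The uppercase abbreviation "ТУ" and the ordinary word "ту" become
# indistinguishable once case is discarded before the lookup.
abbreviation = analyze_lowercase_first("ТУ")
pronoun = analyze_lowercase_first("ту")
print(abbreviation == pronoun)
```

By the time the lookup runs, the case information that could separate the abbreviation from the ordinary word is already gone.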

Mar 31 2017, 2:50 AM · MW-1.29-release-notes, Epic, Discovery-Search (Current work), Discovery

Mar 10 2017

dalekiy_obriy added a comment to T160106: Test and analyze new Ukrainian language analyzers.

Hi, I am the author and maintainer of the Ukrainian dictionary that is used in Lucene's Ukrainian analyzer, and I'd like to note two things:

  1. this analyzer is very close to the Polish one: both use a dictionary in morfologik format (and both are used for grammar checking in LanguageTool), so if Polish worked, I have high hopes for the Ukrainian one as well
  2. I'd like to hear about any problems that may arise from using this analyzer; hopefully we can address most of them (though, as I understand it, if we fix them in Lucene it may take a while for the fixes to get here)

Mar 10 2017, 9:01 PM · MW-1.29-release-notes, Epic, Discovery-Search (Current work), Discovery

Oct 13 2016

dalekiy_obriy added a comment to T146358: Improve processing of the apostrophe by the search engine in Ukrainian.

As I understand it, once the next version of Lucene is released, Elasticsearch will have the Ukrainian analyzer available. Would we need to create another ticket here in Phabricator to switch to it for Ukrainian?

Oct 13 2016, 3:42 PM · MW-1.28-release (WMF-deploy-2016-10-25_(1.28.0-wmf.23)), Patch-For-Review, Discovery-Search (Current work), CirrusSearch, Discovery

dalekiy_obriy added a comment to T146358: Improve processing of the apostrophe by the search engine in Ukrainian.

In general, the "recommended" apostrophe for Ukrainian should probably be U+02BC (because it is part of the word); U+02BC is also the approved apostrophe character for Ukrainian in internationalized domain names. But the majority of Ukrainian texts out there use U+0027 and (a bit less often) U+2019, and it will probably stay this way for a long time, as the majority of users have only ' on their keyboards (and some word processors may change it to U+2019). I would say we do want to support U+02BC the same way we do U+0027 and U+2019.

Oct 13 2016, 2:00 PM · MW-1.28-release (WMF-deploy-2016-10-25_(1.28.0-wmf.23)), Patch-For-Review, Discovery-Search (Current work), CirrusSearch, Discovery