Investigate Tibetan Lucene Analyzer
Open, Medium, Public

Description

User Story: As a user of Tibetan-language wikis, I would like to have good tokenization and stemming applied to Tibetan text in order to improve search accuracy.

A Tibetan tokenizer and stemmer are available under the Apache 2.0 License here: https://github.com/buda-base/lucene-bo/

Acceptance Criteria:

  • Analysis of stemmer and tokenizer performance on Tibetan wiki sample text.
  • If the analysis result is positive, either an implementation for Tibetan (if it's easy) or a new ticket to do the implementation (if it's hard).

Event Timeline

Thanks! I'm the author of the analyzer, and I'm happy to run some performance tests if that would help. We're currently using it in production on a few GB of Tibetan text (about 5,000 OCRed books), so I'm quite confident it will be OK. If there's any issue, don't hesitate to contact me!

MPhamWMF triaged this task as Medium priority. Nov 29 2021, 4:27 PM
MPhamWMF moved this task from needs triage to Language Stuff on the Discovery-Search board.