
Look into improving Dagbani Search
Closed, Declined (Public)

Description

At Arctic Knot I met some people interested in potentially improving search in their languages, including Dagbani. This ticket is to track that work and to provide a central location for discussion and note taking.

If anyone is interested in helping but unfamiliar with the kinds of things we do for a language to improve search, I did a very short (~6 min) video for the 2020 Celtic Knot conference that gives an overview. You can also just describe a problem you are having and I can look into it. (Of course, not all problems are easily or quickly solvable, but we can always take a look and see what's possible.)

Event Timeline

Orthography
After a quick look, I think the characters ɛ, ɣ, ŋ, ɔ, ʒ are being processed correctly, which is good.
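A rough local sanity check of what "processed correctly" could mean is sketched below. It is not the search analyzer itself; it only uses Python's built-in Unicode handling, and it assumes "correct" means three things: the letters stay inside a token, uppercase forms lowercase to the expected letters, and diacritic stripping leaves them unchanged. The function names are made up for illustration.

```python
import re
import unicodedata

# The five letters called out in this ticket.
DAGBANI_LETTERS = "ɛɣŋɔʒ"


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def strip_diacritics(token):
    """Decompose the token (NFKD) and drop any combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # The special letters should not split a word into pieces.
    print(tokenize("Dagbaŋ viɛlim ŋuna"))
    # Stripping diacritics should leave each letter as it is.
    print([strip_diacritics(ch) == ch for ch in DAGBANI_LETTERS])
```

If any of these checks failed for a letter, that letter would be worth a closer look in the real analysis chain.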

Stopwords
We've started working on a stopword list, including the candidates below.

  • nyini, ŋuna, ba, mma, viɛlim

I've also put together a list of words from Dagbani Wikipedia sorted by frequency, for review.

Frequency list notes:

  • I extracted about 12,000 words from Dagbani Wikipedia articles and did some minor cleanup on them. The list contains all the words from that set that appeared 3 times or more. There are obviously some English words and some names (Dagbaŋ, Ghana), but a quick look in a Dagbani/English dictionary shows that the top of the list has some good candidates!
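For anyone who wants to build a similar list, here is a minimal sketch of the general idea: count words and keep those seen 3 times or more. It is not the script used for this ticket. The function name, the letter-only tokenization, and the sample text are assumptions for illustration.

```python
import re
from collections import Counter


def frequency_list(text, min_count=3):
    """Return (word, count) pairs seen at least min_count times,
    most frequent first, with ties broken alphabetically."""
    # Runs of Unicode letters only; digits and underscores are excluded.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, count) for word, count in ranked if count >= min_count]


if __name__ == "__main__":
    # Made-up sample text, not real article content.
    sample = "ba mma ba nyini ba mma Ghana mma"
    for word, count in frequency_list(sample):
        print(word, count)
```

Real article text would need the extra cleanup mentioned above (markup, numbers, stray English) before the counts mean much.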

A few things to keep in mind when reviewing the list:

  • you don't have to look at the whole list! Start at the top and work your way down, as long as it seems productive
  • feel free to add variants of listed words that are missing from the list, or any similar words. For example, in English, seeing "a" might make you think to add "an"; seeing "of" might make you think to add "to", "from", "for", etc.
  • some words are sometimes stop words and sometimes not, like "can" in English; if it is usually a stop word, include it in the stop word list; if it is usually not a stop word, don't
  • stop words are not completely ignored, but they are heavily discounted
  • a pretty good list of stop words is very helpful; it doesn't have to be perfect, and it is relatively easy to remove words if they cause problems
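To illustrate what "heavily discounted" rather than "ignored" means, here is a toy scoring sketch. It is not how the production search ranks results. The weight of 0.1, the function name, and the simple term-overlap score are all assumptions made up for this example, and the stop word set is just the candidate list from this ticket.

```python
# Candidate stop words from this ticket (not a reviewed list).
STOP_WORDS = {"nyini", "ŋuna", "ba", "mma", "viɛlim"}

# Hypothetical discount: a stop word match counts for a tenth of a normal match.
STOP_WORD_WEIGHT = 0.1


def score(query, document):
    """Sum a weight for each query term that also occurs in the document."""
    doc_terms = set(document.lower().split())
    total = 0.0
    for term in query.lower().split():
        if term in doc_terms:
            total += STOP_WORD_WEIGHT if term in STOP_WORDS else 1.0
    return total


if __name__ == "__main__":
    # A stop word match still contributes, just much less than a content word.
    print(score("ba", "ba mma"))
    print(score("Ghana", "ba Ghana"))
```

In this toy model, a mistaken stop word still matches, which is part of why an imperfect list is low risk.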
TJones triaged this task as Medium priority. Jul 29 2021, 9:54 PM
TJones moved this task from needs triage to Language Stuff on the Discovery-Search board.
TJones updated the task description.

Thank you so much for initiating this, TJones. Looking forward to the outcome.

The initial enthusiasm died down, and there has been no activity in a year, so I'm going to close this ticket.