User Story: As a Breton-language searcher, I want Breton-specific language analysis, so that I can get better search results.
Acceptance Criteria: A Breton speaker approves the final version of the changes (initial proposed changes are listed below), based on the language analysis reports.
At the Celtic Knot 2020 conference, I got a pointer to a list of Breton stopwords, so I looked into it, along with some other aspects of the Breton language that may be relevant to search.
So, as a 10% project, I plan to work on improving Breton search a little bit:
- Create a Breton-specific language analysis configuration
- Finalize a list of stop words (the linked-to list seems too aggressive; it is more a list of common words) and add it to the Breton config.
- Enable elision support for d', n', and p'. Look further into including m' and z'.
- Look at the impact of adding some support for the more common French elision (l', s', j', qu') since there is a fair amount of French text on Breton Wikipedia. (Definitely do not include c', since c'h is a letter in Breton.)
- Enable ICU folding. This very likely needs an exception for ñ, and less likely for â, ê, î, ô, û, ù, ü (all used in Breton); watch for problems with ç (commonly used in French).
- Make sure apostrophes are normalized (e.g., c’hoar & c'hoar should get the same results).
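As a rough sketch of how the items above might fit together, here is what the analysis settings could look like, assuming an Elasticsearch-style analysis chain with the ICU plugin available (the task itself does not name a config format). All the `br_*` names are hypothetical, the stop word entry is a placeholder because the list is not finalized, and the elision list covers only d', n', and p' since m', z', and the French forms are still open questions.

```python
import json

# Placeholder: the final Breton stop word list is still to be decided.
# "_none_" is the documented way to give the stop filter an empty list.
BRETON_STOPWORDS = "_none_"

breton_analysis = {
    "char_filter": {
        # Map typographic apostrophes to the plain ASCII apostrophe so that
        # c’hoar and c'hoar are indexed identically.
        "br_apostrophe_norm": {
            "type": "mapping",
            "mappings": ["\u2019=>'", "\u2018=>'", "\u02BC=>'"],
        }
    },
    "filter": {
        # Strip d', n', p'. Deliberately no "c": c'h is a Breton letter.
        "br_elision": {
            "type": "elision",
            "articles": ["d", "n", "p"],
            "articles_case": True,
        },
        "br_stop": {"type": "stop", "stopwords": BRETON_STOPWORDS},
        # Fold diacritics everywhere except ñ, the most likely exception.
        "br_folding": {
            "type": "icu_folding",
            "unicode_set_filter": "[^ñÑ]",
        },
    },
    "analyzer": {
        "br_text": {
            "type": "custom",
            "char_filter": ["br_apostrophe_norm"],
            "tokenizer": "standard",
            "filter": ["lowercase", "br_elision", "br_stop", "br_folding"],
        }
    },
}

print(json.dumps(breton_analysis, indent=2, ensure_ascii=False))
```

The filter order here (lowercase, then elision, then stop words, then folding) is one reasonable choice, not a settled one; in particular, running stop words before folding means the stop list would need to contain the unfolded forms.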
And of course, I'll do the usual language analysis reports to make sure all the changes look reasonable.
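Before running full reports, a small script can sanity-check the intended behavior on a handful of tokens. The sketch below is only a pure-Python approximation: `fold` strips combining marks via Unicode decomposition, which is a crude stand-in for real ICU folding, and the function names and the choice of apostrophe variants are my own assumptions, not part of any existing tool.

```python
import unicodedata

# Apostrophe-like characters to map to the plain ASCII apostrophe.
APOSTROPHES = {"\u2019": "'", "\u2018": "'", "\u02BC": "'"}

# Elided prefixes proposed so far; m' and z' are still under review,
# and c' must never be stripped because c'h is a Breton letter.
ELISIONS = ("d'", "n'", "p'")

# Characters exempt from folding (ñ is the likely exception).
KEEP = {"ñ"}


def normalize_apostrophes(token):
    """Replace typographic apostrophes with the ASCII apostrophe."""
    return "".join(APOSTROPHES.get(ch, ch) for ch in token)


def strip_elision(token):
    """Remove a leading elided prefix if something follows it."""
    for prefix in ELISIONS:
        if token.startswith(prefix) and len(token) > len(prefix):
            return token[len(prefix):]
    return token


def fold(token):
    """Drop diacritics except on characters listed in KEEP."""
    out = []
    # Compose first so that ñ is a single code point we can test against KEEP.
    for ch in unicodedata.normalize("NFC", token):
        if ch in KEEP:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def analyze(token):
    """Lowercase, normalize apostrophes, strip elision, then fold."""
    token = normalize_apostrophes(token.lower())
    return fold(strip_elision(token))


for sample in ["c\u2019hoar", "c'hoar", "D\u2019an", "emañ", "français"]:
    print(sample, "->", analyze(sample))
```

The point of the check is that both spellings of c'hoar come out identical and intact, d' is stripped, ñ survives, and ç is folded to c, which is the case to watch for on French text.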