This quarter we're researching new language analysers (see relevant mailing list post). We have some ideas about which languages need new analysers (e.g. Polish, Chinese, Hebrew), but we'd like to take a more structured look at the problem to see where we can best focus our efforts. For the sake of expediency, we'll still be starting with Polish (see T154516).
Sadly, this is work that's hard to involve volunteers in. Native speakers are not especially hard for us to find, but most users are unfamiliar with concepts like tokenisation and n-grams, so explaining exactly what we want from them is fairly hard.