
Address stemming issue in Polish analyzer for search
Closed, Resolved; Public

Description

Certain words (mostly non-Polish ones) get stemmed to a single specific character when searched for on Polish wikis, which then causes search results to include completely unrelated matches. The example we've used in discussion is a search for the word "Button" (https://pl.wikipedia.org/w/index.php?search=Button&title=Specjalna:Szukaj).

One way to address this would be to make the character in question a stopword, which would then force any matches to be caught by plain text matching during the relevance and ranking processing. Let's investigate.

Event Timeline

@EBjune—thanks for creating the ticket! @Smalyshev—thanks for adding me.

I'll review the Polish analyzer ("Stempel", a third-party plugin) to make sure it can be unpacked so that stop words can be added. If so, I'll review the stems that group the largest number of distinct words (like the infamous ''button'' -> ''ć'' group) and see how many of them could plausibly be added as stop words. Then it's just a matter of config and testing. I think I'll pull a thousand queries and see how many contain words affected by these stems to assess impact.

I did a quick test on "button": I was able to unpack Stempel, the Polish stemmer (there doesn't seem to be much there other than lowercasing), and add ć as a stop word, so this is plausible. Since Stempel was my first new-plugin analysis, I don't have as much detail on it as I do for more recent plugins, though I do have a list of errors I noticed.
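
For illustration only, an unpacked analyzer with ć added as a stop word might look roughly like the settings below, written as a Python dict. This is a sketch under assumptions, not the config used in the quick test: the names `polish_unpacked` and `bad_stem_stop` are hypothetical, and the `standard` tokenizer is assumed. `lowercase` and `stop` are built-in Elasticsearch token filters, and `polish_stem` is the stemmer filter provided by the Stempel plugin.

```python
# Hypothetical index settings for an "unpacked" Polish analyzer.
# Names marked hypothetical are illustrative and do not come from the ticket.
POLISH_UNPACKED_SETTINGS = {
    "analysis": {
        "filter": {
            "bad_stem_stop": {        # hypothetical filter name
                "type": "stop",       # built-in stop token filter
                "stopwords": ["ć"],   # the stem that "button" collapses to
            },
        },
        "analyzer": {
            "polish_unpacked": {      # hypothetical analyzer name
                "type": "custom",
                "tokenizer": "standard",  # assumed tokenizer
                "filter": [
                    "lowercase",
                    "polish_stem",    # Stempel stemmer filter
                    "bad_stem_stop",  # runs after stemming so it sees the stems
                ],
            },
        },
    },
}
```

The stop filter is placed after the stemmer because the token to drop is the stem (ć), not the original word.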

So my approach would be:

  • re-run the analysis of the Polish sample (or possibly a larger one) to find the largest, most incorrect groups, and review them
  • configure them as stop words if there are no problems
    • test the config locally on targeted docs with the grouped words
    • see if any other groups of errors can be easily filtered, like numbers that end in "1" and one-letter stems
  • run an analysis of words in a sample of queries to see what percentage of words are affected by the new stop words
  • depending on likely impact, possibly set up a live demo on relforge
    • run the sample queries and assess impact
    • if impact is large, get feedback from Polish wiki communities. (On the other hand, if this turns out to be a rare problem in real life—as opposed to the obvious wrongness of the specific button results—we can probably make it live immediately.)
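
As a rough sketch of the query-sample step in the list above, the share of affected words could be computed along these lines. The function `affected_share` and the toy stemmer are hypothetical; in a real run the stems would have to come from the actual analyzer rather than a lookup table. The only mapping taken from this ticket is button -> ć.

```python
from typing import Callable, Iterable, Set


def affected_share(queries: Iterable[str],
                   stem: Callable[[str], str],
                   stop_stems: Set[str]) -> float:
    """Return the fraction of query words whose stem is in stop_stems."""
    total = 0
    affected = 0
    for query in queries:
        # Naive whitespace tokenization; the real analyzer tokenizes differently.
        for word in query.lower().split():
            total += 1
            if stem(word) in stop_stems:
                affected += 1
    return affected / total if total else 0.0


# Toy stand-in for the stemmer (assumption): unknown words stem to themselves.
TOY_STEMS = {"button": "ć"}


def toy_stem(word: str) -> str:
    return TOY_STEMS.get(word, word)


sample_queries = ["button", "historia polski", "red button"]  # made-up sample
share = affected_share(sample_queries, toy_stem, {"ć"})
print(f"{share:.0%} of query words are affected")
```

A low share on a realistic sample would support making the change live without a long feedback cycle; a high share would argue for the relforge demo first.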

And it wouldn't hurt to think about whether LTR can do something smarter, though it isn't clear what that would be. Based on the Polish LTR A/B test, the obviously bad results do not come up often, and are outweighed by the better LTR results.

EBjune triaged this task as Medium priority. Feb 6 2018, 6:26 PM
EBjune moved this task from needs triage to Up Next on the Discovery-Search board.

This has been waiting for me to have time to work on it. I've moved it from Waiting to Backlog until I actually start on it again.

The write-up is available on MediaWiki.

Summary: Stempel can be unpacked and modified, but it also has a hidden stop word list, which has to be recreated. ICU normalization comes along with the unpacking, which is fine. A set of three pattern filters covers many undesirable short stems, and adding a short list of noticeably bad longer stems as stop words should improve precision a lot. The unstemmed plain field covers much of the recall deficit that filtering these tokens creates.
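
To make the precision/recall trade-off in the summary concrete, here is a small simulation of dropping bad stems while falling back to the plain field. Everything in it is an assumption for illustration: the single pattern is not one of the three real pattern filters (they are not listed here), `keep_stems` and `matches` are hypothetical helpers, and the toy stemmer invents "otherword" as a stand-in for an unrelated word that shares the ć stem.

```python
import re
from typing import Callable, Iterable, List, Pattern, Set

# Illustrative pattern only: any one-character stem, such as "ć".
SHORT_STEM = re.compile(r"^.$")


def keep_stems(stems: Iterable[str],
               patterns: List[Pattern[str]],
               stop_stems: Set[str]) -> List[str]:
    """Drop stems that are stop words or match any bad-stem pattern."""
    return [s for s in stems
            if s not in stop_stems and not any(p.search(s) for p in patterns)]


def matches(query_word: str,
            doc_words: Iterable[str],
            stem: Callable[[str], str],
            patterns: List[Pattern[str]],
            stop_stems: Set[str]) -> bool:
    """Match on the filtered stemmed field or, failing that, the plain field."""
    q_plain = query_word.lower()
    d_plain = {w.lower() for w in doc_words}
    q_stems = set(keep_stems([stem(q_plain)], patterns, stop_stems))
    d_stems = set(keep_stems([stem(w) for w in d_plain], patterns, stop_stems))
    return bool(q_stems & d_stems) or q_plain in d_plain


# Toy stemmer (assumption): "otherword" is invented to share the bad stem.
TOY_STEMS = {"button": "ć", "otherword": "ć"}


def toy_stem(word: str) -> str:
    return TOY_STEMS.get(word, word)


# Without filtering, "Button" wrongly matches the unrelated document.
unfiltered = matches("Button", ["otherword"], toy_stem, [], set())
# With filtering, the bad match disappears but the exact word still matches.
filtered_bad = matches("Button", ["otherword"], toy_stem, [SHORT_STEM], set())
filtered_good = matches("Button", ["red", "button"], toy_stem, [SHORT_STEM], set())
```

In this toy setup, filtering removes the spurious stem match, while the exact word is still found through the plain field.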

Upstream bugs will be filed.

Change 446686 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Patch Polish Analysis Chain to Remove Bad Stempel Stemmed Tokens

https://gerrit.wikimedia.org/r/446686

Change 446686 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Patch Polish Analysis Chain to Remove Bad Stempel Stemmed Tokens

https://gerrit.wikimedia.org/r/446686