Address stemming issue in Polish analyzer for search
Closed, Resolved, Public

Assigned To

Authored By

	• EBjune
	Jan 30 2018, 8:07 PM

Description

Certain words (mostly non-Polish ones), when searched for in Polish, get stemmed to a single specific character, which then causes search results to present completely unrelated matches. The example we've used in discussion is searching for the word "Button" (https://pl.wikipedia.org/w/index.php?search=Button&title=Specjalna:Szukaj)

One way to address this would be to make the character in question a stopword, which would then force any matches to be caught by plain text matching during the relevance and ranking processing. Let's investigate.
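To make the idea concrete, here is a minimal Python sketch of the effect. It is a toy model, not the actual CirrusSearch or Stempel code: the stemmer table, the two-field matching, and every word except the reported "button" -> "ć" collapse are assumptions made up for illustration.

```python
# Toy model only: TOY_STEMS stands in for Stempel's behaviour. Only the
# "button" -> "ć" mapping comes from this task; the other entries are made up.
TOY_STEMS = {
    "button": "ć",
    "garden": "ć",   # hypothetical second word in the same bad group
    "domy": "dom",   # hypothetical well-behaved stem
    "dom": "dom",
}


def stem(word):
    """Return the toy stem, falling back to the lowercased word."""
    w = word.lower()
    return TOY_STEMS.get(w, w)


def analyze_stemmed(text, stopwords=frozenset()):
    """Stemmed field: stem every token, then drop stems that are stop words."""
    stems = (stem(tok) for tok in text.split())
    return {s for s in stems if s not in stopwords}


def analyze_plain(text):
    """Plain field: lowercase only, no stemming."""
    return {tok.lower() for tok in text.split()}


def matches(query, doc, stopwords=frozenset()):
    """A document matches if the stemmed or the plain tokens overlap."""
    stemmed_hit = analyze_stemmed(query, stopwords) & analyze_stemmed(doc, stopwords)
    plain_hit = analyze_plain(query) & analyze_plain(doc)
    return bool(stemmed_hit or plain_hit)


docs = ["Garden", "Button", "Domy"]
# Without the stop word, the unrelated "Garden" matches through the shared stem.
print([d for d in docs if matches("button", d)])
# With "ć" as a stop word, only the plain-text match on "Button" remains.
print([d for d in docs if matches("button", d, stopwords={"ć"})])
```

In this model the stop word removes the degenerate stem from both the query and the documents, so the only remaining route to a match is the unstemmed text.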

Details

	Subject	Repo	Branch	Lines +/-
	Patch Polish Analysis Chain to Remove Bad Stempel Stemmed Tokens	mediawiki/extensions/CirrusSearch	master	+602 -1

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		TJones	T186046 Address stemming issue in Polish analyzer for search
		Resolved		TJones	T200037 Re-index Polish Wikis to patch Stempel stems

Event Timeline

• EBjune created this task. Jan 30 2018, 8:07 PM

Restricted Application added a subscriber: Aklapper. Jan 30 2018, 8:07 PM

Smalyshev added a project: CirrusSearch. Jan 30 2018, 11:37 PM

Restricted Application added a project: Discovery-ARCHIVED. Jan 30 2018, 11:37 PM

Relevant writeup from @TJones: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Stempel_Analyzer_Analysis

@EBjune—thanks for creating the ticket! @Smalyshev—thanks for adding me.

I'll review the Polish analyzer (Stempel, a third-party plugin) to make sure it can be unpacked so that stop words can be added. If so, I'll review the stems that have the largest number of distinct words mapped to them (like the infamous "button" -> "ć" group) and see how many of them could plausibly be added as stop words. Then it's just a matter of config and testing. I think I'll pull a thousand queries and see how many contain words affected by these stems to assess impact.

TJones updated the task description. Feb 5 2018, 5:15 PM

I did a quick test on "button" and I was able to unpack Stempel, the Polish stemmer—there doesn't seem to be much there other than lowercasing—and add ć as a stop word, so this is plausible. Since Stempel was my first new-plugin analysis, I don't have all the details I do for more recent new plugins, though I do have a list of errors I noticed.
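For reference, an unpacked chain with the extra stop word might look roughly like the settings below, written as a Python dict. This is a sketch under assumptions, not the configuration that was tested: it relies on the `polish_stem` token filter that the Elasticsearch Stempel plugin provides, and the names `polish_unpacked` and `stempel_stop` are made up here.

```python
import json

# Hypothetical names: "polish_unpacked" and "stempel_stop" are illustrative only.
settings = {
    "analysis": {
        "filter": {
            # Stop filter placed AFTER stemming, so it removes the bad stem "ć"
            # rather than any word as the user typed it.
            "stempel_stop": {"type": "stop", "stopwords": ["ć"]},
        },
        "analyzer": {
            "polish_unpacked": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "polish_stem", "stempel_stop"],
            }
        },
    }
}

print(json.dumps(settings, ensure_ascii=False, indent=2))
```

The ordering is the point of the sketch: because "ć" is a stem, not an input word, the stop filter has to run after `polish_stem`.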

So my approach would be:

1. re-run the analysis of the Polish sample (or possibly a larger one) to find the largest, most incorrect groups, and review them
2. configure them as stop words if there are no problems
	- test the config locally on targeted docs with the grouped words
	- see if any other groups of errors can be easily filtered, like numbers that end in "1" and one-letter stems
3. run an analysis of words in a sample of queries to see what percentage of words are affected by the new stop words
4. depending on likely impact, possibly set up a live demo on relforge
	- run the sample queries and assess impact
	- if impact is large, get feedback from Polish wiki communities. (On the other hand, if this turns out to be a rare problem in real life, as opposed to the obvious wrongness of the specific button results, we can probably make it live immediately.)
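The query-sample step in the plan above could be measured along these lines. This is a minimal sketch: the `toy_stem` function and the sample queries are hypothetical stand-ins (a real run would use the analyzer's output and logged queries), and whitespace tokenization is an assumption.

```python
def affected_word_share(queries, stem, bad_stems):
    """Fraction of query words whose stem is one of the proposed stop words."""
    words = [w for q in queries for w in q.lower().split()]
    if not words:
        return 0.0
    hits = sum(1 for w in words if stem(w) in bad_stems)
    return hits / len(words)


def affected_query_share(queries, stem, bad_stems):
    """Fraction of queries containing at least one affected word."""
    if not queries:
        return 0.0
    hits = sum(
        1 for q in queries
        if any(stem(w) in bad_stems for w in q.lower().split())
    )
    return hits / len(queries)


def toy_stem(word):
    """Hypothetical stemmer: only "button" -> "ć" comes from this task."""
    return {"button": "ć"}.get(word, word)


sample = ["button", "historia polski", "red button"]  # made-up queries
print(affected_word_share(sample, toy_stem, {"ć"}))
print(affected_query_share(sample, toy_stem, {"ć"}))
```

Reporting both shares separates "how many words are touched" from "how many searches are touched", which is the number that matters when deciding whether community feedback is needed.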

And it wouldn't hurt to think about whether LTR can do something smarter, though it isn't clear what that would be. Based on the Polish LTR A/B test, the obviously bad results do not come up often, and are outweighed by the better LTR results.

• EBjune triaged this task as Medium priority. Feb 6 2018, 6:26 PM

• EBjune moved this task from needs triage to Up Next on the Discovery-Search board.

TJones claimed this task. Apr 27 2018, 4:33 PM

TJones moved this task from Up Next to Current work on the Discovery-Search board.

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search.

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

Nemo_bis subscribed. May 6 2018, 7:08 AM

TJones moved this task from not in use - please delete to Waiting on the Discovery-Search (Current work) board. Jun 12 2018, 5:32 PM

This has been waiting for me to have time to work on it. I've moved it from Waiting to Backlog until I actually start on it again.

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board. Jul 10 2018, 5:29 PM

The write-up is available on MediaWiki.

Summary: Stempel can be unpacked and modified, but it also has a hidden stop word list, which has to be recreated. ICU normalization comes along with the unpacking, which is fine. A set of three pattern filters covers many undesirable short stems, and adding a short list of longer, noticeably bad stems as stop words should improve precision a lot. The unstemmed plain field covers much of the recall deficit that filtering these tokens creates.
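As a rough illustration of the filtering stage described in the summary, here is a small Python sketch. The pattern and the stop list are assumptions: the three actual patterns and the real list of bad stems are in the write-up, not in this task, so a single one-character pattern and a placeholder stem stand in for them.

```python
import re

# Assumption: one illustrative pattern (any one-character stem, such as "ć")
# stands in for the three pattern filters mentioned in the summary.
ILLUSTRATIVE_PATTERNS = [re.compile(r"\w")]

# Assumption: a placeholder for the short list of longer, noticeably bad stems.
ILLUSTRATIVE_STOP_STEMS = {"placeholderstem"}


def filter_stems(stems, patterns=ILLUSTRATIVE_PATTERNS, stop_stems=ILLUSTRATIVE_STOP_STEMS):
    """Drop stems that are listed as stop words or fully match a bad-stem pattern."""
    kept = []
    for s in stems:
        if s in stop_stems:
            continue
        if any(p.fullmatch(s) for p in patterns):
            continue
        kept.append(s)
    return kept


print(filter_stems(["ć", "dom", "placeholderstem", "wiki"]))
```

Words whose stems are dropped here are no longer findable through the stemmed field, which is why the summary leans on the unstemmed plain field to recover recall.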

Upstream bugs will be filed.

Change 446686 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Patch Polish Analysis Chain to Remove Bad Stempel Stemmed Tokens

https://gerrit.wikimedia.org/r/446686

gerritbot added a project: Patch-For-Review. Jul 18 2018, 8:21 PM

TJones moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. Jul 18 2018, 8:52 PM

Change 446686 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Patch Polish Analysis Chain to Remove Bad Stempel Stemmed Tokens

https://gerrit.wikimedia.org/r/446686

ReleaseTaggerBot added a project: MW-1.32-notes (WMF-deploy-2018-07-24 (1.32.0-wmf.14)). Jul 19 2018, 6:00 PM

TJones moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Jul 19 2018, 6:47 PM

TJones mentioned this in T200037: Re-index Polish Wikis to patch Stempel stems.

debt closed this task as Resolved. Jul 31 2018, 5:40 PM

debt closed subtask T200037: Re-index Polish Wikis to patch Stempel stems as Resolved. Aug 17 2018, 9:09 PM