After the research in T154516 has found some analysers for Polish that are potentially better, we will test them, and analyse to see if they are better or not. If they are, we will file a task to deploy one of them.
|Open||None||T174065 [FY 2017-18 Objective] Improve support for searching in multiple languages|
|Open||None||T154511 [Epic] Research, test, and deploy new language analyzers|
|Resolved||TJones||T154516 [Research spike, 4 hours] Research Polish language analysers|
|Resolved||TJones||T154517 Test and analyse new Polish language analysers|
|Resolved||EBernhardson||T158682 Deploy new Polish language analyser|
I've got Stempel set up and running on my laptop, but I'm finding some weird problems. In particular unexpected characters are being stemmed to ć, particularly but not exclusively numbers. It's been noted before but is marked as Won't Fix. I can't track down the rationale—it's alluded to in that issue, but the link is dead. I may follow up with the repo owner.
Unfortunately the problem comes down to the stemmer rules, which are compiled into one giant opaque file.
Next task is to document the extent of the problem based on my 10K article sample, and get it running on relForge so people can try it. Both are meant to show how often weird things are likely to happen.
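For anyone who wants to look for the numeric stemming oddity themselves, here is a minimal sketch, not the tooling used for the analysis. It assumes an Elasticsearch instance with the Stempel plugin installed, which registers an analyzer named `polish`, and it uses the `_analyze` API response shape (`tokens` with `token`, `start_offset`, `end_offset`). The helper names `build_analyze_body` and `digit_losing_tokens` are made up for illustration, and the HTTP call itself is left out.

```python
import json


def build_analyze_body(text: str, analyzer: str = "polish") -> str:
    """Build the JSON body for a POST to the _analyze endpoint.

    "polish" is the analyzer name registered by the Stempel plugin;
    pass another name if your index uses a custom analyzer.
    """
    return json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)


def digit_losing_tokens(text: str, response: dict) -> list:
    """Return (original, stemmed) pairs where digits vanished in stemming.

    `response` is the parsed JSON returned by _analyze. Offsets are used
    to recover the original surface form. Assumption: the text has no
    characters outside the Basic Multilingual Plane, so Elasticsearch
    offsets line up with Python string indexes.
    """
    flagged = []
    for tok in response.get("tokens", []):
        original = text[tok["start_offset"]:tok["end_offset"]]
        stemmed = tok["token"]
        had_digit = any(c.isdigit() for c in original)
        has_digit = any(c.isdigit() for c in stemmed)
        if had_digit and not has_digit:
            flagged.append((original, stemmed))
    return flagged
```

Feeding article text through `build_analyze_body`, posting it to `/_analyze`, and passing the parsed response to `digit_losing_tokens` should give a rough list of number-like tokens whose stems no longer contain any digits.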
I've finished my initial analysis, and the write up is on Mediawiki. The goal, of course, is to highlight problems, so there's a lot of looking at stemming errors, which include a consistent numerical error, and a load of random-seeming errors.
There's also a live demo of the 1.6M-article Polish Wikipedia index (no articles—so just results and snippets, and it's from the first week of Feb, so no updates) at pl-wp-stempel-relforge.wmflabs.org.
I've contacted some of the upstream developers, and I'm hoping for, but not planning on, some help from that quarter.
Next steps include reaching out to the Polish Wikipedia community to evaluate the demo with Stempel active, and running some real world queries and looking to see how common "ridiculous" results are.
After that, the only options are looking for another Polish analyzer, or skipping Polish for now.
After consulting with @Deskana, the plan is to go forward with deployment after a review period, unless there is significant pushback from the Polish wiki community.
I've documented the information I got from one of the early Stempel developers, which may allow us to re-compile new rules. Those would have to be tested, of course, but it would allow us to patch the stemming rules in the future.
Thanks for the analysis!
I like the idea to reuse Stempel as a source to regenerate the stems table :)
For a first analysis of this kind, this opens a lot of interesting questions and possibilities, but I agree with the recommendation to move forward.
Maybe another analysis of this kind will help to identify some kind of "quality threshold"? Will we have very different problems? How do we assess the quality and say "it's better than nothing", especially in the present case, where you discovered these bad stemming behaviors? Should we invest time in improving an analyzer (in this case, regenerating the stems table)?
(well... if the next one is Chinese we may encounter very different problems though)
I think it would be hard to define a hard quality threshold. It's possible to estimate the percentage of tokens that have potential problems, but that's very shallow. It varies by sample size, because the farther you go down the long tail, the more weird low-frequency errors may happen. It also matters how the errors are distributed. If it's all rare words from far down the long tail, it may not matter. If a one-of-a-kind token, T6894280432234343253267904, happens to stem to "the" in English, that's arguably a ~5% error rate, but that's not at all the same as if "and", "I", and "it" all stemmed together, which is also about 5%.
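To make the "same 5%, very different problem" point concrete, here is a small sketch with invented toy counts (1,000 tokens in total, chosen only so the arithmetic is easy to follow). It compares two measures: the share of token occurrences that fall into a problematic stemming group, and the share that are not the dominant member of that group. The second measure is just one possible way to weight errors, not an established metric, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def flagged_groups(counts, stem, flagged_stems):
    """Collect occurrence counts per flagged stem group.

    Tokens missing from `stem` are assumed to stem to themselves.
    """
    groups = defaultdict(list)
    for token, n in counts.items():
        s = stem.get(token, token)
        if s in flagged_stems:
            groups[s].append(n)
    return groups


def group_share(counts, stem, flagged_stems):
    """Share of all occurrences that land in a flagged group."""
    total = sum(counts.values())
    groups = flagged_groups(counts, stem, flagged_stems)
    return sum(sum(ns) for ns in groups.values()) / total


def minority_share(counts, stem, flagged_stems):
    """Share of all occurrences that are not the group's dominant token."""
    total = sum(counts.values())
    groups = flagged_groups(counts, stem, flagged_stems)
    return sum(sum(ns) - max(ns) for ns in groups.values()) / total


# Invented toy frequencies, not measured data.
counts = Counter({
    "the": 50, "and": 25, "i": 10, "it": 15,
    "T6894280432234343253267904": 1, "<everything else>": 899,
})

# Scenario A: one junk token stems to "the".
stem_a = {"T6894280432234343253267904": "the"}
# Scenario B: "and", "i", and "it" all stem together.
stem_b = {"i": "and", "it": "and"}

share_a = group_share(counts, stem_a, {"the"})
share_b = group_share(counts, stem_b, {"and"})
minor_a = minority_share(counts, stem_a, {"the"})
minor_b = minority_share(counts, stem_b, {"and"})
```

With these toy numbers, both scenarios put roughly 5% of occurrences in a flagged group (51 and 50 out of 1,000), but only 1 occurrence is actually mis-merged in scenario A versus 25 in scenario B. Neither number replaces speaker review; they only show why a single percentage hides the distribution.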
I think we can get some sense of potential problems from this kind of analysis, though. For reference, I ran my tool against 10K ICU-folded English articles and 10K ICU-folded French articles, and there were no bizarre problems of the sort Stempel has, just the occasional weird Unicode character doing slightly weird things (like Ȣ stemming to ou). If Stempel had had that level of performance, I'd be willing to ship it to production without further review (though I always prefer further review).
Unfortunately, the best way to assess the quality still seems to be just to let people use it, and see if the potential problems become real problems under real usage. I did simulate that by re-running the 200 real user queries on RelForge and in production. I couldn't necessarily tell if the results were better, but there weren't any drastically bad matches (though stemming "VIII" to nothing was pretty bad).
All that said, I'd love to hear ideas about ways to define a quality threshold that are tractable and don't require speaker review.
In other news, I also like the idea of rebuilding the stemmer table, so as a 10% project I might try building a tiny table to see how hard it is.
And, finally, yeah, Chinese is going to be very different. I'm slightly tempted to jump to Ukrainian just because Chinese is so different—but doing Chinese next will be "a learning experience".
More Lessons Learned
After a bit of feedback from the Polish Technical Village Pump, a few things have emerged:
- People are concerned about the ability to search for exact strings
  - it may help to make the points that exact matches score better even with stemming and that quotes are available to force exact matches
- People are concerned about changes for API users
  - note that no particular API flag is available to disable analysis
  - quotes are still available for exact matches
  - perhaps we should consider a flag to allow people to only search in the plain field (not sure how hard that would be)
If we can avoid adding a flag to the API I'd prefer to stick with quotes. Having a flag that influences how we parse the input query makes the analysis a bit harder as we will have to always carry this flag around to replay queries.
I consider not having a field with stems a bug. The concerns are valid, but I don't think we can maintain a search solution where we guarantee that search results are stable by providing more and more API flags to keep specific behaviors.
Enabling BM25 has certainly changed a lot of search results, but AFAIK we haven't received any complaints from API consumers yet.
As a maintainer of CirrusSearch I'm not strongly opposed to it, but I'd prefer not to go for this kind of solution.
I generally agree that it's probably not going to be a problem for the vast majority of users. If someone really wanted to do it, automatically applying quotes could be tricky in certain corner cases, particularly tokenizing the query correctly. For example, dog-and-pony should probably be quoted as "dog"-"and"-"pony", not "dog-and-pony". And I'd have to look it up or test to find the right approach for under_scores or colon:separated words. It's doable, but not trivial, though the corner cases are rare and might often be poor queries anyway.
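As a rough illustration of the client-side quoting described above, here is a minimal sketch. It is not CirrusSearch code, and the function name `quote_words` is made up. It assumes a "word" is whatever Python's `\w+` matches, which keeps under_scores as a single word and splits colon:separated at the colon. Whether that matches the real tokenizer or the query syntax (where a colon can introduce a keyword or namespace prefix) is exactly the untested part.

```python
import re

# Match either an already-quoted phrase or a run of word characters.
# \w is Unicode-aware for str patterns, so Polish letters are included.
_PHRASE_OR_WORD = re.compile(r'"[^"]*"|\w+')


def quote_words(query: str) -> str:
    """Wrap each bare word in double quotes, leaving quoted phrases alone."""
    def repl(match):
        text = match.group(0)
        # Phrases the user already quoted are passed through unchanged.
        return text if text.startswith('"') else '"' + text + '"'
    return _PHRASE_OR_WORD.sub(repl, query)
```

With this approach, `quote_words("dog-and-pony")` yields `"dog"-"and"-"pony"`, as suggested above; the underscore and colon behavior is simply whatever `\w+` happens to do and would need checking against real queries.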
I'll leave the discussion here but remove the note from my write up.
I would be more inclined to look at a solution at the syntax level. AFAIK this feature has not been requested by wikis where a stemmer is available, and if I understood correctly, these concerns seem premature.
If we can wait for Q4 and the advanced syntax parser we are planning to work on, then I'd prefer to introduce this kind of feature as part of the new parser's possibilities.
I agree that syntax for exact matches seems like a better solution than an API flag, especially given our current plans for working on the advanced syntax parser next quarter.
Although exact matches are a very legitimate use case, I suspect they represent a relatively small proportion of searches relative to all the API users doing more general queries.