Build and enable thesaurus / synonym list for search
Open, Medium, Public

Event Timeline

Tnegrin raised the priority of this task from to Needs Triage.
Tnegrin updated the task description. (Show Details)
Tnegrin added a project: MediaWiki-Search.
Tnegrin subscribed.
Tnegrin set Security to None.

Hi Nik -- here's the broken query I mentioned on Friday. Thanks for taking a look and let me know if you need further info.

-Toby

It's a synonym problem. "us automobile production" finds what you expect as the top hit. We don't do synonyms right now. It was something that I'd wanted to work on and would have gotten around to eventually, but it's not as high on the list as the Wikidata query service. You can manually fix this by adding a redirect from "u.s. car production" to the page, but it's a bit lame. We should be able to figure stuff like that out automatically. In all languages too, given that we could mine Wiktionary.

Aklapper added a subscriber: Manybubbles.

[ Resetting assignee as assignee account is not active anymore ]

debt triaged this task as Medium priority. Jul 13 2017, 5:29 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
debt added subscribers: TJones, EBernhardson, debt.

We'll need to get some serious research work into this. It might be interesting for @TJones to take a look first. :)

Here's the note from @EBernhardson from the merged ticket:

It might be useful to include synonyms in search to improve recall, for example by transforming a search for car into (car OR automobile). Elasticsearch supports doing this as part of the analysis phase of indexing, but we would need a list of synonyms to work with. We could consider using WordNet, or perhaps extracting data from Wiktionary.
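
As a rough illustration of the idea above, here is a minimal Python sketch. It builds index settings that use Elasticsearch's `synonym` token filter with comma-separated (Solr-style) synonym entries, and separately shows the (car OR automobile) style of query rewriting. The filter name, analyzer name, and the two-entry synonym list are assumptions made up for this example, not the CirrusSearch configuration.

```python
import json

# Hypothetical synonym list; comma-separated terms are treated as equivalent.
SYNONYMS = ["car, automobile", "movie, film"]


def build_index_settings(synonyms):
    """Return index settings with a synonym token filter in a custom analyzer."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "example_synonyms": {  # hypothetical filter name
                        "type": "synonym",
                        "synonyms": list(synonyms),
                    }
                },
                "analyzer": {
                    "example_text": {  # hypothetical analyzer name
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "example_synonyms"],
                    }
                },
            }
        }
    }


def expand_query(query, synonyms):
    """Rewrite each term that has synonyms into an OR group."""
    groups = [[t.strip() for t in entry.split(",")] for entry in synonyms]
    parts = []
    for term in query.lower().split():
        group = next((g for g in groups if term in g), None)
        if group is None:
            parts.append(term)
        else:
            # Keep the user's own term first, then its synonyms.
            ordered = [term] + [t for t in group if t != term]
            parts.append("(" + " OR ".join(ordered) + ")")
    return " ".join(parts)


print(json.dumps(build_index_settings(SYNONYMS), indent=2))
print(expand_query("us car production", SYNONYMS))
```

Expanding at index time (as in the settings) and expanding at query time (as in `expand_query`) are alternative designs; which one fits better here is an open question in this task.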

This is also being discussed - but with a slightly different usage - in this conversation: https://www.mediawiki.org/wiki/Topic:Tti9vgefpnaztmol

Issue
As a user I'd like to be presented with suggestions to improve my search.

Background
Currently search depends entirely on a word either matching the search terms, or matching the title of a page. This reduces the usefulness of the search when a word can mean so many things, for example, looking for "trunk", one may mean a proboscis ("elephant's trunk"), boot (a part of a car), a part of a tree, part of a body, and so forth.

Proposed solution
Extract these from the page with a matching title in the Wiktionary search results, much like the widget described in Cross-wiki Search Result Improvements/self-guided testing#Wiktionary. For example: https://en.wiktionary.org/wiki/trunk

Provide a search suggestion such as "you may be interested in: proboscis, boot", using words extracted from the Synonyms sub-heading.

Considering the different wiktionaries and different headings or rules in each wiki, this may not be feasible until there is some way to store these in a structured manner.

Even so, just showing the contents under the synonym (and similar ones in other wiktionaries) heading will be a good short term improvement.
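
A minimal Python sketch of the extraction step proposed above, under strong assumptions: the sample wikitext is hypothetical and much tidier than real Wiktionary entries, which vary by wiki and often use templates rather than plain links. The function name and the suggestion wording are also illustrative only.

```python
import re

# Hypothetical, simplified wikitext; real entries are less regular.
SAMPLE_WIKITEXT = """\
===Noun===
# The extended nasal organ of an elephant.

====Synonyms====
* [[proboscis]]
* [[boot]] (of a car)

====Translations====
* French: [[trompe]]
"""


def extract_synonyms(wikitext):
    """Collect link targets from bullet lines under a 'Synonyms' heading."""
    synonyms = []
    in_section = False
    for line in wikitext.splitlines():
        heading = re.fullmatch(r"\s*(=+)\s*(.*?)\s*\1\s*", line)
        if heading:
            # Any heading opens or closes the section of interest.
            in_section = heading.group(2).lower() == "synonyms"
            continue
        if in_section and line.startswith("*"):
            synonyms.extend(re.findall(r"\[\[([^\]|#]+)", line))
    return synonyms


def suggestion(wikitext):
    """Build the suggestion text, or an empty string if nothing was found."""
    words = extract_synonyms(wikitext)
    return "you may be interested in: " + ", ".join(words) if words else ""


print(suggestion(SAMPLE_WIKITEXT))
```

The heading name is hard-coded in English here, which is exactly the per-wiki variation that makes the structured-storage concern above relevant.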

This comment was removed by TJones.

[Once again, I've fat-fingered a half-written comment and had to delete it so I can finish my magnum opus! Sorry.]

Ugh, this has gotten horribly worse over time: the desired result is no longer first for "us automobile production". Two notes:

  • I'm going to blame word_break_helper which maps periods (and other things) to spaces, splitting up "U.S." in the desired title to "U" and "S", which does not match "us".
  • The desired article is the first and only suggestion from the completion suggester, which is matching a period-less redirect.

And thus my feelings about word_break_helper (yuck) and the completion suggester (yay!).
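
To illustrate the first note, here is a small Python simulation of what mapping periods to spaces does to tokens. This is a toy stand-in, not the actual word_break_helper or analysis chain; the title string and the `analyze` function are assumptions for the example, and only the period mapping is modeled.

```python
def analyze(text, break_chars="."):
    """Lowercase, map each character in break_chars to a space, split on whitespace."""
    lowered = text.lower()
    for ch in break_chars:
        lowered = lowered.replace(ch, " ")
    return lowered.split()


# Hypothetical title containing "U.S."; the query is the one from this task.
title_tokens = analyze("U.S. automobile production")
query_tokens = analyze("us automobile production")

# Query tokens that find no counterpart among the title tokens.
unmatched = [t for t in query_tokens if t not in title_tokens]

print(title_tokens)
print(query_tokens)
print(unmatched)
```

In this simplified model the title yields the separate tokens "u" and "s", so the query token "us" has nothing to match against.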

So, I think there are a few issues here:

  • using synonyms in search
  • using synonyms in suggestions
  • is word_break_helper even helping? (See T170625.)

While there is a common notion of synonyms (or a thesaurus), I think we should split up the topics of using a thesaurus for search and using one for suggestions. A thesaurus for search that is used automatically needs to be more tightly controlled than one used for suggestions, which are easier to skip over.

Enabling a thesaurus for searches is fraught with complications. Unfiltered WordNet is probably a bad idea; it is too complete and includes rare and archaic senses that are more likely to generate noise than not. Wiktionary might have the same problem, and definitely has a problem with being only semi-structured and thus hard to parse. I took a look at the pre-Cirrus/pre-Elastic search engine, and it only had one synonym entry: movie/film. (I support bringing that one back!)

I would assume we'd have some way to toggle the default thesaurus status, whether that is enabled or disabled. Unsophisticated newbie users are not going to know how to toggle it, so if it is on by default, it needs to be conservative so they don't get overrun with extra clutter they can't control. If it is off, they will probably never find it, even though they probably need it more than anyone else. Also, we probably shouldn't use quotes as the only way to disable the thesaurus, since that also disables language analysis. Just because I don't want lawyer to match attorney doesn't mean I don't want it to match lawyers.
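
A toy Python sketch of why the two switches should be independent. The one-entry thesaurus and the plural-stripping "stemmer" are deliberate oversimplifications invented for this example; real language analysis is far more involved.

```python
# Hypothetical one-entry thesaurus.
THESAURUS = {"lawyer": {"attorney"}}


def stem(token):
    """Toy stemmer: strip a trailing plural 's' from longer words."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def matches(query_term, doc_term, use_stemming=True, use_thesaurus=True):
    """Decide whether doc_term satisfies query_term under the given switches."""
    q, d = query_term.lower(), doc_term.lower()
    if use_stemming:
        q, d = stem(q), stem(d)
    if q == d:
        return True
    return use_thesaurus and d in THESAURUS.get(q, set())


# Thesaurus off, stemming on: the behavior asked for above.
print(matches("lawyer", "lawyers", use_thesaurus=False))
print(matches("lawyer", "attorney", use_thesaurus=False))
# Both off: roughly what quoting a term does today.
print(matches("lawyer", "lawyers", use_stemming=False, use_thesaurus=False))
```

With separate flags, a user can drop attorney without losing lawyers, which quoting alone cannot express.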

So, I'd recommend a small, conservative, hand-curated, on-by-default thesaurus for searches. If that's not possible (because of the hand-curated part, esp. as it relates to all the languages we support), then I'd recommend either off-by-default or only using the thesaurus for suggestions, so that the clutter is kept to a minimum.

We'd also have to think about how this interacts with Learn to Rank. (We may have to get used to saying that a lot—but in a good way!) Esp. if the thesaurus is on-by-default, "matched a synonym" is probably a good LTR feature, and the newly introduced results might require a retraining, depending on how many of them there are.

recommend a small, conservative, hand-curated, on-by-default thesaurus for searches

I've seen this approach work well. It's time-consuming to create, but priceless when it's done.

Since I merged another ticket into this one, I've updated the title to be more precise and easier to find in the future.

Nobody seems to care about this, yet in my belief this is one of the crucial points why people (e.g. me) would regularly use Google to search Wikipedia instead of Wikipedia itself.

I come across this regularly (every week maybe? every 3 days?), and I've long wanted to find a relevant task on Phab and talk about this with interested people.

Today was the last straw: I typed "osmotic load", and here's what I got in Wikipedia vs Google:

[Two screenshots: Wikipedia search results and Google search results for "osmotic load"]

What I was looking for was "osmotic pressure", of course.

I think it is obvious in 2025 that this should be mediated by machine learning, so it's strange this is not tagged. Please correct me if I didn't use the right project tag.

  1. People don't always realize the size of the dictionary at play here, so curating by hand is not an option even in the slightest. Besides, there are many languages, as @TJones correctly pointed out.
  2. You don't only have precise synonyms, but also semantically close concepts (a semantic field, perhaps); my example above is a good illustration. "Load" and "pressure" are hardly synonyms, although Merriam-Webster's thesaurus lists "load" as a synonym of "pressure" but not vice versa. I don't think you need another task for this; why not implement it as part of the same functionality?

Again, it's generally strange that this task is not in the center of attention, as this has to be one of the bottlenecks why people don't use Wikipedia search and prefer Google instead. Aren't there teams in WMF whose focus is areas like this? Maybe I'm looking the wrong way, and all the action is happening somewhere else?

@Jack_who_built_the_house, as things currently stand, I don't think this is the right ticket for what you are proposing. This ticket is for using the explicit synonym mapping feature available in Elasticsearch/OpenSearch (we're mid-migration from ES to OS right now).

It's unfortunate that it appears that no one cares about this issue. The search team does care and has spent a fair amount of time thinking about this; we've done some experimenting and it has been on our long-term radar for forever. Please keep in mind that Google employs tens of thousands of engineers, and online estimates suggest 1000 to 3000 engineers working on their core search. The Wikimedia Search Platform team is six people—four software engineers and two SREs. In addition to search we have also been responsible for the Wikidata Query Service for quite a while (that is changing over the next year, we hope). Infrastructure needs—capacity, security, upgrades—often dominate our time.

(That said, I have to hype up our team and our on-wiki search for caring about things that the big search engines don't: a serious focus on privacy, costly wiki-specific tools that would not be profitable at scale (like regex search and some of our wonkier search keywords), and our support and love for less-resourced (and less profitable?) languages.)

Synonyms of any sort would be a big project, and we haven't had the bandwidth to address it in recent years. It is on our internal short list of projects for next fiscal year (July 2025–June 2026), but that is no guarantee it will make the cut. Also, finding the best way to approach the problem is complex.

Fuzzy "synonym" approaches favored by search engines and online retailers are not universally loved by users. They can be great for recall when you don't quite know the right words to search for, but they can bring back a lot of junk (poor precision) when you do know exactly what you are looking for. We aren't aware of anything like a plug-n-play option that would "just work" on-wiki.

I think less sophisticated Wikipedia searchers could benefit from fuzzy search, but power users and editors could find any kind of synonym-type function distracting, so there would have to be a way to disable it for some queries. Would that be a keyword or other syntax, a UI element (like Google's "Verbatim" button), or a user preference? Repurposing quotes might work, but that also disables stemming, which is orthogonal to synonyms... etc., etc., etc.

Machine learning for synonyms takes it out of the hands of the wiki communities, which I personally don't like. I also have very low expectations for the results for languages outside of major world languages. Many big companies tout their multilingual capabilities, but closer inspection and/or talking to native speakers reveals that their support for less-resourced languages is often really lacking.

My current preference/strawman proposal (which certainly needs validation with product managers and eventual users) would be a hand-curated list of synonyms, possibly augmented or boot-strapped with a list of data-mined (but probably not machine-learned) suggestions from various sources. It is slow to build—but Wikipedia seemed pointless until it wasn't; Wikidata seemed useless until it wasn't—and I think even a dozen well-chosen synonyms could help a lot of users (and free editors from needing to create redirects). I also like the idea of wiki communities deciding how expansive or conservative to be in including synonyms, and I like the idea of having different lists across different projects in the same language. (And in the case of small projects and/or languages without a lot of text out there to mine, a hand-built thesaurus could be faster and better than anything you could get from machine learning.)

There's a lot of work in figuring out how to implement a community-defined list, define the update cycle for edits to the list, and build tools to help editors test possible thesaurus entries before enabling them. Some old brain-dump ideas are on my potential NLP projects list (search for "Use a Thesaurus"). And maybe a community-built thesaurus isn't feasible because of the added workload on editors, or dealing with edit wars, or other concerns; we haven't really looked too deeply into the practical aspects of it yet. Hence the need for input from users and product managers before we get serious about building anything specific.

Back to the right ticket for machine learning approaches. Using word vector embeddings has been a common approach, though retrieval-augmented generation with LLMs is also fashionable right now. We've tried out word embeddings internally over the years, often using word vectors trained by others from Wikipedia data (going back to word2vec in the early-ish days), and generally haven't seen clear improvements in results or obvious use cases for it. But it comes up again frequently, and we and others have run or planned experiments to look at word embeddings and similar vector search possibilities.
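
For readers unfamiliar with the word-embedding approach, here is a minimal Python sketch of nearest-neighbor lookup by cosine similarity. The three-dimensional vectors are invented toy values chosen only to make the arithmetic visible; they are not taken from word2vec or any trained model, and real embeddings typically have hundreds of dimensions.

```python
import math

# Hypothetical toy vectors, NOT real embeddings.
TOY_VECTORS = {
    "pressure": [0.9, 0.1, 0.0],
    "load": [0.8, 0.3, 0.1],
    "elephant": [0.0, 0.2, 0.9],
}


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(term, vectors, k=1):
    """Return the k terms whose vectors are most similar to term's vector."""
    scored = [
        (cosine(vectors[term], vec), word)
        for word, vec in vectors.items()
        if word != term
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


print(nearest("load", TOY_VECTORS))
```

Whether neighbors found this way actually improve on-wiki results is the open question; the sketch only shows the mechanism.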

Searching on Phab for those things can be hard because people talk about embedding media and the Vector skin a lot, but here are some possibly relevant tickets:

@TJones Thanks for taking the time and sharing your perspective in such detail. Thank you for the task links as well. I subscribed to them all.

I greatly appreciate the work, past and present, done by Wikimedia employees in this domain and their commitment to the Wikimedia values.

I agree with you that the approach of big search engines is adversely influenced by their

  • catering to a mass user with no sophisticated needs,
  • taking control out of the hands of users,
  • focus on profitability,

among other things. But what you argue for seems to me to be the other extreme: a hand-curated list of synonyms, highly tweakable by each wiki, with an elaborate update cycle. (This reminds me of how much of a hassle it is for volunteers to maintain article redirects, which is a somewhat related mechanism.) It's just not feasible in this day and age. Nor do I believe communities would be eager to invest their resources in it. It's too niche a thing. Far less ambitious undertakings have failed; this one would too if attempted.

I think finding a way to integrate fuzziness in search, but doing it in a way that would not frustrate users with sophisticated needs, is a way more promising path. So, this part of your comment is of the greatest interest to me:

Back to the right ticket for machine learning approaches. Using word vector embeddings has been a common approach, though retrieval-augmented generation with LLMs is also fashionable right now. We've tried out word embeddings internally over the years, often using word vectors trained by others from Wikipedia data (going back to word2vec in the early-ish days), and generally haven't seen clear improvements in results or obvious use cases for it. But it comes up again frequently, and we and others have run or planned experiments to look at word embeddings and similar vector search possibilities.

But I'd also be very concerned about not trying to reinvent the wheel here, especially in the light of your rightful comments about the size of the team of Wikimedia compared to Google's. So, this remark got my attention as well:

We aren't aware of anything like a plug-n-play option that would "just work" on-wiki.

It could be the case that looking out for such an option would be a proper investment of time and energy after all.