
Investigate irrelevant sister project search results on Wikipedia
Closed, Resolved, Public

Description

Originally reported:

https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(policy)#Obscene_(and_irrelevant)_trans-wiki_results

It looks like quotations from Wiktionary may be causing the relevancy score of results to be off.

Per discussion with Erika, one option would be to filter quotations on Wiktionary, which can be excluded with a relatively minor config change and reindex, and that may produce even more accurate results in the end. Unknown until tested.

Event Timeline

Filtering quotations isn't going to solve the more general problem, and it will decrease the value of search on Wiktionary, because sometimes the quotations are where the information you want is going to be found.

As an example of why quotations aren't the problem, the innocuous-seeming search string german attested gives back the Wiktionary result fuck, and all of the matched terms are in the etymology, not the quotations. This result shows up in the sister search for the same reason: there's only one result. It's a pretty bad result, but it's the only one, so it's the "best".

We have millions of users and millions of searches, so these weird irrelevant matches are always going to pop up for someone, and because we don't censor content, some are going to be offensive to someone, though the irrelevant matches are more likely when you search terms that are "off topic" for a given wiki (like geographical regions in a dictionary—a singular irrelevant result would be much less likely to happen on Wikivoyage for the original query).

A better approach—which may or may not be feasible—would be to have a minimum quality of search for sister search. So in these cases, the weak result that is the only result could be ignored. Unfortunately, scores are not consistent across projects for lots of reasons, so it would need to be configured per wiki—and that's a lot of wikis. You could do it automatically—say by tracking the scores of the top result of every search and then picking a cutoff, like 50th percentile, and applying that per wiki. So on English Wiktionary the 50th percentile score might be 1753.4, while on Italian Wiktionary it's 24.2. That's a fair amount of infrastructure, though.
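
To make the per-wiki cutoff idea concrete, here is a minimal sketch. It assumes top-result scores are already being logged as (wiki, score) pairs; the function names, the log format, and the sample scores are all hypothetical and are not part of CirrusSearch.

```python
import math
from collections import defaultdict


def percentile(sorted_scores, pct):
    """Nearest-rank percentile of an ascending, non-empty list (0 < pct <= 100)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_scores)))
    return sorted_scores[rank - 1]


def per_wiki_cutoffs(log, pct=50):
    """Group logged top-result scores by wiki and pick one cutoff per wiki.

    `log` is an iterable of (wiki, top_score) pairs (a hypothetical format).
    """
    by_wiki = defaultdict(list)
    for wiki, score in log:
        by_wiki[wiki].append(score)
    return {wiki: percentile(sorted(scores), pct)
            for wiki, scores in by_wiki.items()}


def keep_sister_result(wiki, score, cutoffs):
    """Keep a sister-search result only if it meets that wiki's own cutoff.

    Wikis with no logged scores have no cutoff, so their results are kept.
    """
    cutoff = cutoffs.get(wiki)
    return cutoff is None or score >= cutoff


# Made-up sample data; the medians match the illustrative numbers above.
log = [
    ("enwiktionary", 1000.0), ("enwiktionary", 1753.4), ("enwiktionary", 3000.0),
    ("itwiktionary", 10.0), ("itwiktionary", 24.2), ("itwiktionary", 50.0),
]
cutoffs = per_wiki_cutoffs(log)
```

With this data, a lone English Wiktionary result scoring 900.0 would be dropped, while an Italian Wiktionary result scoring 30.0 would be kept, because each is compared only against its own wiki's distribution.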

A community-led approach would be to tag NSFW/R-Rated/Not Family Friendly content in each wiki with a consistent category and then exclude results in that category by adding an additional clause to all sister searches. (Though I can't get negated category searches to work right now, so it might require some fiddling with the search parser, alas.) This would definitely require community involvement because we wouldn't want to search more than one or two categories.
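
As a rough illustration of the extra clause, the sketch below appends a negated category filter to a query string. It assumes the negated `-incategory:` keyword works as intended, which the note above says is not currently the case; the helper name and the category name are hypothetical.

```python
def exclude_category(query, category):
    """Append a negated category clause to a sister-search query string.

    Assumes `-incategory:"..."` is honoured by the search parser.
    """
    if '"' in category:
        raise ValueError("category name must not contain double quotes")
    return f'{query.strip()} -incategory:"{category}"'


# "Not family friendly" is a made-up category name for illustration.
sister_query = exclude_category("DC Baltimore area", "Not family friendly")
```

Keeping the exclusion to a single well-known category keeps the added clause small, which is why the tagging would need to be consistent across wikis.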

EDIT: Reading more of the comments on the Village pump, maybe the title search is the right way to go. It also acts as a quality filter. It may be too aggressive a filter for some, but for sister wiki search maybe it's okay.

</2¢>

EBjune added a project: Discovery-Search.
EBjune moved this task from needs triage to Up Next on the Discovery-Search board.
  1. Searching Wiktionary for "DC Baltimore area" returns a seemingly random word, because that is not a term you'd normally search Wiktionary for.
  2. When you search Wikipedia, you are given the top result from Wiktionary as if you'd searched Wiktionary for it. "We asked Wiktionary what it thought of your search term, and this is what it said."

1 is normal behaviour. 2 is also normal behaviour. The interaction of 1 and 2 produces something that is not just offensive to someone, it is hate speech. According to Wikimedia, the DC Baltimore area has been "ruined by black people". As I said in the village pump thread, if you were a white supremacist trying to spread your message through Wikimedia, engineering things so that a word meaning "ruined by black people" was the top result for "DC Baltimore area" is exactly the way you'd go about it.

Change 405206 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Switch wiktionary sister search on enwiki to title only

https://gerrit.wikimedia.org/r/405206

EBernhardson moved this task from Up Next to Current work on the Discovery-Search board.

For the moment, patch switches wiktionary on enwiki to use the title filter. We can ponder a bit on how to improve the relevance of the wiktionary search, or how to filter results that happen to be there but aren't particularly good.

Change 405206 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch wiktionary sister search on enwiki to title only

https://gerrit.wikimedia.org/r/405206

Mentioned in SAL (#wikimedia-operations) [2018-01-19T00:09:50Z] <ebernhardson@tin> Synchronized wmf-config/InitialiseSettings.php: T185250 Switch wiktionary sister search on enwiki to title only (step 1) (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2018-01-19T00:11:46Z] <ebernhardson@tin> Synchronized wmf-config/CirrusSearch-common.php: T185250 Switch wiktionary sister search on enwiki to title only (step 2) (duration: 00m 56s)

We chatted more about this today - this isn't the mostest perfect way to handle this issue, but it's been running for nearly 2 months now and seems to be good enough for enwiki. Thanks for everyone's help and comments on this.