
Searching for minus (-) character in fawiki takes you to the page on apostrophe (') character instead
Closed, Invalid · Public


If you type just a minus (-) character in the search box of enwiki and press Enter, it takes you to one page, which in turn redirects you to another.

But in fawiki, it takes you to the ' page (with &redirect=no), which in turn takes you to آپاستروف (Apostrophe), which makes no sense.

Why does the - character get interpreted as ' ?

Originally reported on fawiki by fa:User:Sunfyre

Event Timeline

Restricted Application added a subscriber: Aklapper.

Short answer: Add a redirect from - to the appropriate page to solve the problem. (I'd do it, but English Wikipedia distinguishes hyphen and hyphen-minus, and I don't read Farsi, so I can't figure out the right redirect to add.)

Long answer: On-wiki search isn't really optimized for single punctuation characters, and so it can do weird things. In this case, a number of different factors are interacting to get this behavior.

First, a detour to explain how we analyze text to find matches. There are several ways:

  • "text" analysis does as much language-specific normalization as possible: breaking the text into words, lowercasing words, stemming (i.e., so hope, hopes, hoped, and hoping all match), removing foreign diacritics, dropping stop words (common words, like the, that don't carry much content), etc. It's used for general full text searching. "Text" analysis generally ignores punctuation, especially when it is on its own.
  • "plain" analysis does as little as possible to the text other than breaking the text into words, lowercasing them, and doing some basic normalization of uncommon characters for most languages. In English, it also strips diacritics, because English almost always ignores them. It's used for "exact" matching, like when you search with quotes. "plain" analysis also generally ignores punctuation, especially when it is on its own.
  • "near match" analysis also does as little as possible, like "plain", but does not break the text into words. It's used for title matches. It does discount some punctuation marks by converting them to spaces, so that hyphenated-man, hyphenated_man, and hyphenated man are all equivalent.
  • "near match ASCII folding" is the same as "near match", but it also aggressively removes diacritics.
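
To make the last two analyses concrete, here is a rough Python sketch. It is not the real analysis chain: the function names, the exact list of punctuation converted to spaces, and the use of Unicode decomposition as a stand-in for ASCII folding are all assumptions made for illustration. In particular, this folding only strips diacritics that decompose into combining marks, so it is weaker than what the wikis actually do.

```python
import unicodedata

# Assumption: an illustrative list of punctuation treated like a space.
# The real analyzer's character list is not given in this thread.
PUNCT_AS_SPACE = str.maketrans({ch: " " for ch in "-_'\u2019\u02bc"})


def near_match(title: str) -> str:
    """Lowercase and convert some punctuation to spaces; never split into words."""
    return title.lower().translate(PUNCT_AS_SPACE)


def near_match_ascii_folding(title: str) -> str:
    """Like near_match, but also drop combining diacritics (a crude folding)."""
    decomposed = unicodedata.normalize("NFD", near_match(title))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The three spellings from the list above collapse to the same key.
keys = {near_match(t) for t in ("hyphenated-man", "hyphenated_man", "hyphenated man")}
```

Note that a lone hyphen-minus becomes a single space under this near_match, which matters later in the explanation.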

When you go to the search box, it looks for an exact title or redirect match, and if there is one, you are taken to it. (It's a little more complicated than this in the cases where you have entries that only differ by capitalization, like jack and Jack or ebay and eBay on English Wiktionary. If you search for jaCK or eBaY you will get sent to the one that's all lower case.)

If not, then it processes the text with "near match" and if there is exactly one title match (after deduplicating redirects), you are taken to it. Thus on English Wikipedia you can search for Albert Einstein, Albert_Einstein, or Albert-Einstein and get the expected result. On Farsi Wikipedia, آلبرت_اینشتین, آلبرت اینشتین, and آلبرت-اینشتین also all work.

If "near match" doesn't get any results, "near match ASCII folding" takes a turn and if there is only one result (ignoring redirect duplicates), you are taken to it. On English Wikipedia, you can search for Ḁłɓęȑṭ Ǝḭɲṧʈɇḯȵ and get taken directly to "Albert Einstein".

If "near match" has more than one result, or "near match ASCII folding" gets no results or more than one result, then the query gets sent to the full text search, which uses a combination of different analyses to get results. As an example, on English Wikipedia, if you search for udem you get taken right to the "UdeM" page (that's a "near match" result). If you search for üdem you get taken to the "Üdem" page (which is also a "near match" result, though it redirects to "Uedem" because German spelling is like that).

Now here's where it gets tricky. If you search for udëm there are no "near match" results, but there are two "near match ASCII folding" results: UdeM and Üdem. Since it can't choose between them, you get rolled over to full text search.
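
The cascade described so far can be sketched in Python as follows. This is a simplified model under stated assumptions, not the actual search code: the titles dictionary is a toy index, go() is a hypothetical name, the function returns the final article rather than the intermediate redirect, and the capitalization special case (jaCK, eBaY) is left out.

```python
import unicodedata


def near_match(text):
    # Lowercase and treat hyphen and underscore as spaces (illustrative subset).
    return text.lower().translate(str.maketrans("-_", "  "))


def fold(text):
    # Crude stand-in for ASCII folding: strip combining diacritics.
    decomposed = unicodedata.normalize("NFD", near_match(text))
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def go(query, titles):
    """titles maps every title or redirect to its target article.

    Returns the article to jump to, or None for "fall through to full text search".
    """
    if query in titles:  # exact title or redirect match
        return titles[query]
    for analyze in (near_match, fold):
        key = analyze(query)
        # Collecting targets in a set deduplicates redirects to the same article.
        targets = {tgt for title, tgt in titles.items() if analyze(title) == key}
        if len(targets) == 1:
            return targets.pop()
        if targets:  # more than one candidate: ambiguous, stop here
            return None
    return None  # nothing matched at either stage


# Toy index (assumed): "\u00dcdem" is "Üdem", a redirect to "Uedem".
titles = {"UdeM": "UdeM", "\u00dcdem": "Uedem", "Albert Einstein": "Albert Einstein"}
```

With this toy index, go("udem", titles) and go("Albert-Einstein", titles) each resolve to a single article at the "near match" stage, while a query of udëm finds nothing there, finds two candidates after folding, and so returns None.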

Why is all of this relevant? Isolated punctuation marks get reduced to nothing by "text" and "plain" processing, but get indexed as a space by "near match". As a result, if you do full text search on English Wikipedia for a plain single quote ('), a hyphen-minus (-), or a curly apostrophe (’), "text" and "plain" reduce them to nothing, and "near match" converts them to a space. Since there is more than one result for a space as a title, you get rolled over to the full text search results. (The modifier apostrophe ( ʼ) is also returned because "near match" converts it to a space, but searching for it directly gives lots of results because "text" and "plain" do not reduce it to nothing. As I said, on-wiki search isn't really optimized for single punctuation characters.)

Here are links to results on English Wikipedia: search ', search -, or search ’.

You get similar results on English Wiktionary: search ', search -, or search ’.

Now, on Farsi Wikipedia, there's only one full text result for these three characters: search ', search -, or search ’.

Single quote (') and curly apostrophe (’) work in the search box because both have an exact match to a redirect to the "آپاستروف" (apostrophe) article.

Hyphen has no exact title match, so it tries a "near match", gets converted to space (" "), which has only one match—the apostrophe article—so you get sent there.
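
The same behavior can be reproduced with a toy model in Python. Everything here is an assumption for illustration: the index contents, the go() helper, the punctuation list, and especially HYPHEN_ARTICLE, which is a placeholder because the right target article on Farsi Wikipedia is not named in this thread.

```python
# Assumption: illustrative list of punctuation that near match turns into a space.
PUNCT_AS_SPACE = str.maketrans({c: " " for c in "-_'\u2019\u02bc"})


def near_match(text):
    return text.lower().translate(PUNCT_AS_SPACE)


def go(query, titles):
    """Return the article to jump to, or None to fall through to full text search."""
    if query in titles:  # exact title or redirect match wins
        return titles[query]
    key = near_match(query)
    # A set of targets deduplicates redirects that point to the same article.
    targets = {tgt for title, tgt in titles.items() if near_match(title) == key}
    return targets.pop() if len(targets) == 1 else None


APOSTROPHE = "آپاستروف"
# Toy fawiki index: the article plus redirects from ' and \u2019 (curly apostrophe).
fawiki = {APOSTROPHE: APOSTROPHE, "'": APOSTROPHE, "\u2019": APOSTROPHE}

# "-" has no exact match; as a space it matches both redirects, which share
# one target, so the search box jumps to the apostrophe article.
before = go("-", fawiki)

# Adding a redirect for "-" gives it an exact match, which takes priority.
fawiki["-"] = "HYPHEN_ARTICLE"  # hypothetical placeholder target
after = go("-", fawiki)
```

In this model, before is the apostrophe article and after is the placeholder target, which mirrors why adding the redirect resolves the report.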

We could try to figure out a smarter way to process everything and handle all the special cases of punctuation and such, but the most straightforward solution is to add a redirect from "-" to the right article on Farsi Wikipedia.

Your answer deserves a medal!

I created the redirect as you suggested, and am closing this as invalid.

Vvjjkkii renamed this task from Searching for minus (-) character in fawiki takes you to the page on apostrophe (') character instead to xbbaaaaaaa. (Jul 1 2018, 1:04 AM)
Vvjjkkii reopened this task as Open.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: Huji, Aklapper.
CommunityTechBot renamed this task from xbbaaaaaaa to Searching for minus (-) character in fawiki takes you to the page on apostrophe (') character instead. (Jul 2 2018, 2:59 PM)
CommunityTechBot closed this task as Invalid.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added subscribers: Huji, Aklapper.