A Hebrew article title with an apostrophe cannot be found when searching without an apostrophe
Open, Lowest · Public

Description

There's an article in the Hebrew Wikipedia called סנדומייז'‏ (it's about the Polish city of Sandomierz).

If you search for it in the search box without the apostrophe, then the article doesn't come up as a result. In casual typing people quite often omit the apostrophe, so this is a practical problem. Sandomierz is just an example - there are many more article titles with an apostrophe in Hebrew and in some other languages.

It's just a simple and common punctuation mark, so the search engine should be smart enough to find the article.

This is comparable to T75862, but it should be much simpler, because this is just about punctuation and not morphology.

Event Timeline

Amire80 created this task. · Apr 8 2015, 10:09 AM
Amire80 raised the priority of this task from to Needs Triage.
Amire80 updated the task description.
Amire80 added projects: MediaWiki-Search, I18n.
Amire80 added a subscriber: Amire80.
Restricted Application added a subscriber: Aklapper. · Apr 8 2015, 10:09 AM

@Amire80: Did you intentionally file this against MediaWiki-Search instead of CirrusSearch ?

Not in particular. It should be filed under whatever is relevant for Wikimedia sites, but it's a simple thing and ideally it should be fixed for all MediaWiki installations.

Amire80 set Security to None.
Restricted Application added a project: Discovery. · Sep 16 2015, 5:39 PM
Deskana moved this task from Needs triage to Search on the Discovery board. · Oct 27 2015, 10:41 AM
Deskana added a subscriber: Deskana. · Dec 3 2015, 5:44 PM

@Amire80 How common is this problem? It'd likely take us a significant amount of time to dive into it, and we're unsure of the rewards for that effort right now.

Restricted Application added a subscriber: StudiesWorld. · Dec 3 2015, 5:44 PM
Deskana triaged this task as Lowest priority. · Dec 30 2015, 9:31 PM
Amire80 moved this task from Untriaged to Search on the I18n board. · Mar 12 2018, 1:30 PM
Restricted Application added a project: Discovery-Search. · Mar 12 2018, 1:30 PM
EBjune added a subscriber: EBjune. · Mar 12 2018, 7:13 PM

Since this ticket seems to be taking a circular path, I will ask Deskana's question again, because now we'll need to triage this again: @Amire80 How common is this problem?

Quite common. Just check how many articles have an apostrophe in the title.

EBjune added a comment. · Apr 5 2018, 5:31 PM

@Amire80, while there are a lot of apostrophes in titles, is the problem that searching without them still doesn't show them in the results? Can you give us a few query examples that illustrate it?

We have done a lot of work on the Hebrew analyzer for search since this ticket was created, so we want to be sure it's the same problem.

EBjune added a subscriber: TJones. · Apr 5 2018, 5:39 PM

Clarification: the completion search for סנדומייז finds the article for me, but the fulltext one (Special:Search) does not. Also, completion on Special:Search field does locate the correct article and *סנדומייז works for fulltext search.

We might want to just tokenize it without the apostrophe. I'm not sure whether that would break anything for Hebrew, and would like to hear @Amire80's opinion (and others' too, of course).

EBernhardson added a subscriber: EBernhardson. · Edited · Apr 5 2018, 6:07 PM

I got some help from @Smalyshev to test this out:

  • Autocomplete successfully finds the page, due to the fuzziness of the algorithm
  • Full text does not find the page. Our 'title' field keeps the apostrophe as part of the token. I think the near_match field should have found it, but it doesn't. Term vectors for that page on title.near_match report "סנדומייז " with an additional space that should probably not be on the end of the token. This seems like potentially a bug in the Hebrew analyzer. @TJones any thoughts?
TJones added a comment. · Apr 5 2018, 6:44 PM

Term vectors for that page on title.near_match reports "סנדומייז " with an additional space that should probably not be on the end of the token. This seems like potentially a bug in the hebrew analyzer. @TJones any thoughts?

The extra space isn't Hebrew-specific and comes from the near_match analysis chain, which is not custom for Hebrew—it's the same as for English. There's a character filter, near_space_flattener, that converts straight and curly apostrophes, underscores, and dashes to spaces. When they happen at the edge of a word, that's what you get.
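To make the trailing space concrete, here is a minimal Python sketch of what a filter like near_space_flattener appears to do. It is not the actual Elasticsearch character filter configuration: the function name, the mapping table, and the exact set of characters are assumptions taken from the description above.

```python
# Hypothetical stand-in for the near_space_flattener character filter.
# Assumed character set (from the description in this comment): straight
# and curly apostrophes, underscores, and dashes all become spaces.
FLATTEN_TO_SPACE = str.maketrans({
    "'": " ",
    "\u2018": " ",  # left curly apostrophe
    "\u2019": " ",  # right curly apostrophe
    "_": " ",
    "-": " ",
})


def flatten_near_space(text: str) -> str:
    """Replace each mapped punctuation character with a single space."""
    return text.translate(FLATTEN_TO_SPACE)


# The article title from the description, written with escapes to avoid
# bidirectional-text problems: samekh nun dalet vav mem yod yod zayin + "'"
title = "\u05e1\u05e0\u05d3\u05d5\u05de\u05d9\u05d9\u05d6'"
flattened = flatten_near_space(title)

# The apostrophe sits at the edge of the word, so the result ends in a space.
print(repr(flattened))
```

Because the replacement is one character for one character, an apostrophe at the end of a word leaves a trailing space unless a later step trims it.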

As for the more general issue, it would be possible to either always strip apostrophes or index forms with and without apostrophes. I'd want to test and see how many new indexing collisions it caused—those things can be hard to predict, especially with the Hebrew analyzer, since it usually generates multiple tokens per input word.
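A toy Python sketch of the kind of collision check meant here. It is only an illustration under simplifying assumptions: it works on a plain list of tokens rather than the output of the real Hebrew analyzer (which emits multiple tokens per input word), and the helper names are hypothetical.

```python
from collections import defaultdict


def strip_apostrophes(token: str) -> str:
    """Return the token with straight apostrophes removed."""
    return token.replace("'", "")


def find_collisions(tokens):
    """Group tokens by their apostrophe-less form and keep only the groups
    where two or more distinct original tokens would be merged."""
    groups = defaultdict(set)
    for token in tokens:
        groups[strip_apostrophes(token)].add(token)
    return {key: forms for key, forms in groups.items() if len(forms) > 1}


# English examples, since the same effect is easy to see there.
tokens = ["we'll", "well", "she'll", "shell", "isn't", "can't", "cant"]
collisions = find_collisions(tokens)

for stripped, forms in sorted(collisions.items()):
    print(stripped, sorted(forms))
```

A real test would run the production analysis chain over an actual index and count how many previously distinct terms merge; this sketch only shows the counting step.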

Also, just to clarify, is the apostrophe operating as a geresh here? If so, is this also a problem for gershayim/double quote? It looks like the analyzer converts geresh to apostrophe and gershayim to double quote, so that seems plausible.

Smalyshev added a comment. · Edited · Apr 5 2018, 7:00 PM

Also, just to clarify, is the apostrophe operating as a geresh here?

Yes.

If so, is this also a problem for gershayim/double quote?

Probably not the same one, as gershayim is rarely seen at the end of the word and, I presume, is less frequently omitted. But it may be the same one underneath.

TJones updated the task description. · Apr 5 2018, 7:08 PM
TJones added a comment. · Apr 5 2018, 7:11 PM

Are apostrophes only normally omitted at the end of a word? Would it be omitted in צ'ארלס‬, and searched as צארלס‬?

Strictly speaking it's not correct, but it can happen and is probably a relatively common typo. I'm not sure how common it is in our data set; it probably needs some data crunching to see.

TJones added a comment. · Apr 5 2018, 8:54 PM

From the Description:

It's just a simple and common punctuation mark, so the search engine should be smart enough to find the article.

Hmm... I think I want to bring this into question after reading Stas's reply above about צארלס‬ being a typo.

For those who don't read Hebrew and didn't look it up yet, צ'ארלס‬ is Hebrew for "Charles", and צארלס‬ is the same thing with the ' omitted. In this case, the apostrophe (or more Unicodey geresh) is used to modify the pronunciation of the preceding letter to indicate a sound not normally found in Hebrew (here, the "ch" in "Charles"); it has other uses, too.

Okay, so the idea of a "simple" punctuation mark causing problems isn't enough to justify jumping through lots of hoops. We don't do anything like that in French, Italian, or English (the first three languages I thought of that make heavy use of the apostrophe). We have parallel situations where lacking an apostrophe gets a match with the completion suggester (only off by one character) but not full text search (term not in the article):

  • en: No, It Isn't / No, It Isnt
  • fr: Joconde jusqu'à cent / Joconde jusquà cent
  • it: Diritto d'autore / Diritto dautore
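To illustrate "only off by one character" for the pairs above, here is a plain Levenshtein edit-distance sketch in Python. It is a textbook algorithm used for illustration only; how the completion suggester actually implements its fuzzy matching is not shown in this thread, so this should not be read as its implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


pairs = [
    ("No, It Isn't", "No, It Isnt"),
    ("Diritto d'autore", "Diritto dautore"),
]
distances = [levenshtein(x, y) for x, y in pairs]
print(distances)
```

Each pair differs by a single deleted apostrophe, which is within reach of a fuzzy title match but does nothing for full text search, where the apostrophe-less term simply is not in the index.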

For some specific examples, search does an okay job, sometimes because there's an apostrophe-less version in the article (in a URL, for example), or there's a redirect for the apostrophe-less version, or Did You Mean or the completion suggester correct it—but in general we don't merge apostrophe-ful and apostrophe-less versions of words in these other languages. The period is also a simple punctuation mark, but acronyms in English cause all sorts of trouble. N.A.S.A. is not at all the same as NASA.

So, I think there's a continuum of language/spelling/typing difficulty that we should accommodate, and it's not clear to me where this falls on that continuum.

Things we should (and do!) support (in no particular order):

  • No one has geresh < ֜> on their keyboard, so we map geresh to apostrophe <'> for Hebrew.
  • Russian speakers don't distinguish <е> and <ё> unless they are lexicographers, so we map one to another.
  • English speakers can't type most diacritics, so we strip them.
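The three mappings in this list can be sketched in Python roughly as follows. This is an assumption-laden illustration, not the production analysis chain: the function names are hypothetical, and the real filters may cover more characters or be configured per language.

```python
import unicodedata

GERESH = "\u05f3"     # HEBREW PUNCTUATION GERESH
GERSHAYIM = "\u05f4"  # HEBREW PUNCTUATION GERSHAYIM


def fold_hebrew_punctuation(text: str) -> str:
    """Map geresh to apostrophe and gershayim to double quote."""
    return text.replace(GERESH, "'").replace(GERSHAYIM, '"')


def fold_yo(text: str) -> str:
    """Map Cyrillic yo to ye, in both cases."""
    return text.replace("ё", "е").replace("Ё", "Е")


def strip_diacritics(text: str) -> str:
    """Decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# tsadi + geresh, written with escapes to avoid bidirectional-text problems
print(repr(fold_hebrew_punctuation("\u05e6\u05f3")))
print(fold_yo("ёлка"))
print(strip_diacritics("café naïve"))
```

Note that a blanket strip_diacritics like this one is too aggressive for many languages (it would also fold Cyrillic й into и, for example), which is one reason such folding is usually decided language by language.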

Things we do not support (this time in order of being increasingly less plausible to support):

  • English speakers type isnt instead of isn't (and similar in French or Italian) as an error or because they are lazy
    • (Though low frequency errors get redirects on highly visited pages to correct for this.)
  • English speakers type u for you, r for are, and ur for your. <shudder>
  • Someone has no idea what they are looking for and expects magic: that movie with that guy from the thing

Off the top of my head, a heuristic I might use is this: would a typical high school student who is familiar with what they are searching for, but not an expert in the subject, know what to fix and how to fix it if they got poor results? For the ones we do not support, I think they would (other than not being able to remember a movie title). For the ones we should (and do!) support, they might not, whether because they don't remember to use an alternate character or because they can't type it even if they know what they are supposed to do.

So, when an average user searches for צארלס‬ or סנדומייז‏ and they get obviously bad results, is it going to be clear that they should have searched for צ'ארלס‬ or סנדומייז'‏? Or are they going to be lost and unable to continue?

Are we helping people find information in a straightforward way, or are we accommodating people being a bit lazy (or a lot lazy)? For English, dropping an apostrophe is easy enough for the searcher to fix, and merging apostrophe-ful versions would cause problems. For Hebrew—I have no idea!