
Apostrophes do not work well in search on nia.wikipedia
Closed, Resolved · Public · 3 Estimated Story Points · BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

  • TBD

What happens?:
For most searches, there is a problem with diacritics and apostrophes: pages containing them show up in Special:Search only when the query is typed exactly as the page is written; otherwise the page is not shown. Contributors often write in a Word document and then copy and paste into the wiki. If someone searches with the normal apostrophe from the computer keyboard, the search won't find the word because Word turned it into another kind of apostrophe.

What should have happened instead?:
Search should also show correct results when diacritics and apostrophes are involved.

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc.: current version on nia.wikipedia.org

Event Timeline

Restricted Application added a subscriber: Aklapper.

Removing MediaWiki-Search as that's not used on Wikimedia wikis (but CirrusSearch).

Interesting results indeed when using a backtick:

Screenshot from 2022-06-29 23-00-22.png (604×902 px, 70 KB)

Screenshot from 2022-06-29 23-00-26.png (604×902 px, 66 KB)

Screenshot from 2022-06-29 23-00-29.png (604×902 px, 66 KB)

I just want to clarify the issue. When our contributors write an article first in MS Word (which happens often, because they feel more comfortable writing in Word than in Visual Editor), the apostrophe on the keyboard (', U+0027) is automatically turned by Word into another apostrophe (ʼ, U+02BC).

Unfortunately, this also happens when they write the page title.

The result: if, for example, a user searches for the keyword ndru'u (with U+0027), the search will not find any occurrences of ndruʼu (with U+02BC), and vice versa. This happens across Wikipedia, Wiktionary and Wikibooks.
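To make the mismatch concrete, here is a minimal Python sketch (not part of the original report) showing that the two spellings differ in exactly one code point, and that standard Unicode NFKC normalization does not unify them. The helper name `differing_chars` is made up for illustration, and Python's `unicodedata` module is only a stand-in for whatever normalization the search backend applies.

```python
import unicodedata

typed = "ndru'u"        # U+0027, as typed on a keyboard
pasted = "ndru\u02bcu"  # U+02BC, as produced by Word's autocorrect


def differing_chars(a, b):
    """List code point and name at each position where two equal-length strings differ."""
    return [
        (f"U+{ord(x):04X} {unicodedata.name(x)}", f"U+{ord(y):04X} {unicodedata.name(y)}")
        for x, y in zip(a, b)
        if x != y
    ]


# Neither character has a compatibility decomposition, so NFKC leaves both unchanged.
same_after_nfkc = unicodedata.normalize("NFKC", typed) == unicodedata.normalize("NFKC", pasted)

print(differing_chars(typed, pasted))
print("equal after NFKC:", same_after_nfkc)
```

The two strings look alike on screen but never compare equal, which is why an exact-match lookup for one cannot find the other.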

I ran a bot some time ago to change U+02BC into U+0027 in the page contents, but I haven't figured out how to change it in the page titles (that is, how to write a pywikibot script that changes the apostrophe in page titles).
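One possible shape for such a script, as a rough and untested sketch: it assumes a configured pywikibot installation with move rights on the wiki, and the function names `fixed_title` and `rename_pages` are invented here. It defaults to a dry run that only prints what it would do.

```python
MODIFIER_APOSTROPHE = "\u02bc"  # the character Word substitutes for U+0027


def fixed_title(title):
    """Return the title with U+02BC replaced by U+0027, or None if nothing changes."""
    new_title = title.replace(MODIFIER_APOSTROPHE, "'")
    return new_title if new_title != title else None


def rename_pages(dry_run=True):
    """Move main-namespace pages whose titles contain U+02BC (assumed workflow)."""
    # Imported here so fixed_title() can be tried without pywikibot installed.
    import pywikibot

    site = pywikibot.Site("nia", "wikipedia")
    for page in site.allpages(namespace=0):
        new_title = fixed_title(page.title())
        if new_title is None:
            continue
        if dry_run:
            print(f"would move {page.title()!r} -> {new_title!r}")
        else:
            # By default a redirect is left behind at the old title.
            page.move(new_title, reason="Normalize apostrophe (U+02BC to U+0027)")
```

A move fails when the target title already exists, so pairs of pages that differ only in the apostrophe would still need to be merged by hand.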

Unfortunately, despite warnings from me, our contributors often forget to check their apostrophes before creating a wiki page.

Nerdspeak: The technical issue here is that we have ICU normalization enabled by default everywhere that we can (though not in monolithic analyzers), but ICU folding does the mapping from curly quotes to straight quotes. ICU folding is not enabled by default because it is too aggressive and needs to be customized for each language/alphabet/writing system to keep it from causing problems.

This is in our work queue now, so I will get to it in the normal course of things (it'll be a little while, though). I may do some sort of mini hack for Nias/nia and open a new ticket for the larger task of addressing this consistently everywhere. I'm currently leaning toward a new focused apostrophe filter (apostrophe_norm, maybe) that converts curly quotes (‘ & ’) and maybe backticks/backquotes/grave accents (`) to straight apostrophes ('), but doing so widely by default will require more careful testing.

Clearly Nias Wikipedia needs this fix, though, since they have at least one pair of titles that differ only by the curliness of the quotes: Hili'adulo and Hili’adulo. You don't need to be able to understand the language to see that those are very similar and probably should be the same article/disambiguation page. I'll probably set up apostrophe_norm just for nia and open a ticket to enable it everywhere by default (and maybe look into aggressive_splitting for English at the same time). enwiki currently would index both Hili'adulo and Hili’adulo as two tokens: hili and adulo.

In the meantime, a couple of questions and comments:

  • @Sannita, you mentioned diacritics in the title of the ticket, but I don't see any discussed in the ticket. Are there any other specific diacritics that are causing trouble? If so, I can probably set up a nia-specific version of ICU folding to handle them.
  • @Aklapper the results you see from the completion suggester with a backtick aren't really processing the backtick as anything other than a punctuation character that it can ignore or replace with another character to get a good match. You get the same results if you use %, (, !, or @. (Other punctuation gets treated a bit differently. Using - or [ gets slightly different results. The point is that the backtick isn't being treated in any special way—good or bad—because it is similar to an apostrophe.)

@TJones This is a question for @Slaia; I only wrote the ticket on their behalf to help them.

Hello @TJones. The issue with diacritics seems to be gone. I can't replicate the issue I had a few weeks ago. In the past, the word u'o'ö-o'ö'ö (I'm following you) would not show the result where u'o'ö (I follow) is also a part of the word or sentence. Thanks.

TJones renamed this task from Diacritics and apostrophes do not work well in search on nia.wikipedia to Apostrophes do not work well in search on nia.wikipedia. · Aug 11 2022, 2:49 AM

(Updated title to reflect focus on apostrophes after discussion here.)

Change 822647 had a related patch set uploaded (by EJoseph; author: EJoseph):

[mediawiki/extensions/CirrusSearch@master] Add apostrophe filter to the character filter nia wiki

https://gerrit.wikimedia.org/r/822647

TJones added subscribers: EJoseph, MPhamWMF, EBernhardson.

@EJoseph and I looked at this over the last couple of days. In addition to the curly quotes, we found some additional characters being used as apostrophes. The config Emmanuel submitted a patch for converts all of the following characters to apostrophes:

  • U+0060 ` grave accent/backtick (e.g., Ya`aqob)
  • U+02BC ʼ modifier letter apostrophe (e.g., ibeʼe)
  • U+02BE ʾ modifier letter right half ring (e.g., Maʾrib—from enwiki)
  • U+02BF ʿ modifier letter left half ring (e.g., el-ʿArabeyya‎)
  • U+2018 ‘ left single quotation mark (e.g., molo‘ö)
  • U+2019 ’ right single quotation mark (e.g., Hili’alaŵa)
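The effect of the mapping listed above can be sketched in a few lines of Python. This is only an illustration of the character-level behavior, not the CirrusSearch configuration from the patch; the names `APOSTROPHE_LIKE` and `normalize_apostrophes` are made up for this example.

```python
# The six characters from the list above, all mapped to U+0027.
APOSTROPHE_LIKE = "\u0060\u02bc\u02be\u02bf\u2018\u2019"
APOSTROPHE_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_LIKE})


def normalize_apostrophes(text):
    """Replace apostrophe-like characters with a straight apostrophe (U+0027)."""
    return text.translate(APOSTROPHE_TABLE)


for word in ["Ya`aqob", "ibe\u02bce", "Ma\u02berib", "molo\u2018\u00f6", "Hili\u2019ala\u0175a"]:
    print(word, "->", normalize_apostrophes(word))
```

Because the same mapping is applied to both indexed text and queries, all variants end up matching the straight-apostrophe form.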

All examples except the right half ring come from our samples (from Nias Wikipedia and Wiktionary); there are lots of examples of right half ring from other wikis, though, and the plan is to extend this work more broadly. After talking a bit with @MPhamWMF I learned that the left half ring and right half ring can be meaningfully distinguished in the transliteration of Semitic languages:

However, both half rings (and other characters) regularly have an apostrophe substituted, and I found all seven variations on English Wikipedia (using insource:/.../ to get exact matches) of Qur'an (4933), Qurʿan (7), Qurʾan (69), Qurʼan (140), Qur`an (15), Qur’an (711), and Qur‘an (4); and mostly in English Wikipedia, but also German and Italian Wikipedia and English Wikisource, I found all seven variations of Ma'rib, Maʿrib, Maʾrib, Maʼrib, Ma`rib, Ma’rib, Ma‘rib (no counts since they are from different sources). Those were the first two examples I looked for. So while all seven variants are not equally common, they definitely can all occur.
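For anyone who wants to tally such variants in a local text dump, here is a small Python sketch. It is not how the counts above were produced (those came from insource:/.../ searches, whose regex syntax differs from Python's `re`), and the helper names `variant_pattern` and `count_variants` are hypothetical.

```python
import re
from collections import Counter

# Straight apostrophe plus the six look-alikes discussed in this task.
APOSTROPHE_CLASS = "['\u02bf\u02be\u02bc\u0060\u2019\u2018]"


def variant_pattern(before, after):
    """Build a regex matching `before` + any apostrophe-like character + `after`."""
    return re.compile(re.escape(before) + APOSTROPHE_CLASS + re.escape(after))


def count_variants(pattern, text):
    """Count each exact spelling that the pattern matches in the text."""
    return Counter(match.group(0) for match in pattern.finditer(text))


sample = "Qur'an Qur\u2019an Quran Qur\u02bcan Qur'an"  # made-up sample text
print(count_variants(variant_pattern("Qur", "an"), sample))
```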

Once the patch is committed and deployed, we need to reindex Nias-language wikis to enable the changes. As I mentioned earlier, we can look into a more general solution—which will require a lot more reindexing!—later.

Since @EBernhardson will be back next week, I've only given the patch a +1 and tagged him, since another reviewer would be a good thing.

As a side note, there were also a fair number of instances of characters with diacritics using Cyrillic homoglyphs, particularly ö/ӧ, as in ӧnӧ (bold letters are Cyrillic). It's nice to see that the homoglyph plugin is doing good things!

Change 822647 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add apostrophe filter to the character filter nia wiki

https://gerrit.wikimedia.org/r/822647

FYI: The reindexing to enable the new apostrophe handling is complete. I checked and searching for Hili'adulo, Hili’adulo, or Hili‘adulo all return the same 66 results.