Find/Replace (in-editor text search) should be diacritic-insensitive
Closed, Resolved · Public · 8 Estimated Story Points

Description

Steps to understand:

  1. Open a page with a lot of text in a diacritic-friendly language (e.g., Spanish, French, German, etc.) in VisualEditor (either mode).
  2. Search for the text you're trying to find (Command-F) using VisualEditor's built-in Find/Replace feature.
  3. Miss the typo because you searched for the string ole and someone had typed olé.

If you search for ole, then Find/Replace ought to find all simple variants on these letters, including, e.g., ółę. This is standard behavior for web browsers and search engines and so is likely expected as well as being helpful.

Event Timeline

I believe JS's localeCompare has options for this, right? Maybe use that unless exact case matching is enabled.
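As a rough illustration of that suggestion, here is a minimal sketch of a diacritic-insensitive scan built on `Intl.Collator` (the same comparison machinery that backs `localeCompare`). The helper name `findInsensitive`, its parameters, and the fixed `'en'` default locale are assumptions made for this example, not VisualEditor's actual code. It compares equal-length windows after NFC normalisation, so the returned offsets refer to the normalised text.

```javascript
// Hypothetical helper: return every offset at which `query` matches `text`
// when diacritics are ignored. Case is ignored too unless `matchCase` is set.
function findInsensitive(text, query, matchCase = false, locale = 'en') {
  const collator = new Intl.Collator(locale, {
    // 'base' ignores accents and case; 'case' ignores accents only.
    sensitivity: matchCase ? 'case' : 'base'
  });
  // Normalise to NFC so that "é" typed as e + combining accent becomes one
  // code unit and the fixed-size windows below line up with the query.
  const haystack = text.normalize('NFC');
  const needle = query.normalize('NFC');
  const offsets = [];
  if (needle.length === 0) {
    return offsets;
  }
  for (let i = 0; i + needle.length <= haystack.length; i++) {
    const candidate = haystack.slice(i, i + needle.length);
    if (collator.compare(candidate, needle) === 0) {
      offsets.push(i);
    }
  }
  return offsets;
}

console.log(findInsensitive('Olé and ole', 'ole'));        // case and accents ignored
console.log(findInsensitive('Olé and ole', 'ole', true));  // exact case, accents ignored
```

The fixed-window approach is a simplification: it will miss matches whose normalised length differs from the query's, which is one of the ways this turns into the "can of worms" mentioned later in the thread.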

Jdforrester-WMF moved this task from To Triage to TR1: Releases on the VisualEditor board.
Jdforrester-WMF subscribed.

This will make some work that I do difficult, like looking for "mis-spellings" of works that fail to put in the diacritics. Maybe it should be an option?

Jdforrester-WMF set the point value for this task to 8.

Make it only happen when regular-expression mode is enabled, maybe? Failing that, think up an icon that symbolizes "diacritic-insensitive", and we could stick it into the options section, there's certainly room.

Note that while this works natively in Chrome, it doesn't in Firefox, despite a bug being opened in 2003: https://bugzilla.mozilla.org/show_bug.cgi?id=202251

Also, as with all things Unicode, this is a can of worms. What works well for the English language and Latin scripts might not be simple elsewhere. And as others have pointed out, any such accent-insensitivity would need to be a separate option.

'e'.localeCompare('é', undefined, {sensitivity:'base'}) would appear to be helpful here, but the options param is not supported in IE<11
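To make the `sensitivity` option concrete, the sketch below runs the call quoted above under each of the four values the ECMAScript Internationalization API defines (`base`, `accent`, `case`, `variant`). The wrapper name `equalUnder` and the fixed `'en'` locale are assumptions for illustration only.

```javascript
// Hypothetical wrapper: true when localeCompare treats a and b as equal
// under the given sensitivity. The 'en' locale is fixed for the example.
function equalUnder(a, b, sensitivity) {
  return a.localeCompare(b, 'en', { sensitivity: sensitivity }) === 0;
}

// Each sensitivity is checked against an accent difference and a case difference.
['base', 'accent', 'case', 'variant'].forEach(function (sensitivity) {
  console.log(
    sensitivity,
    'e vs é:', equalUnder('e', 'é', sensitivity),
    'e vs E:', equalUnder('e', 'E', sensitivity)
  );
});
```

Only `base` and `case` treat `e` and `é` as equal, so those are the two values a diacritic-insensitive search would choose between, depending on whether case should also be matched.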

We could just disable the feature in IE9/10 then?
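One way to gate the feature is to detect support at runtime rather than sniffing browser versions. The sketch below is an assumption about how that could look, not the check VisualEditor ships: it first looks for `Intl.Collator`, then falls back to the widely documented trick of passing an invalid language tag to `localeCompare`, which throws a `RangeError` only in engines that understand the `locales` argument. The function and variable names are hypothetical.

```javascript
// Hypothetical feature test: can this engine honour collation options?
function supportsCollationOptions() {
  if (typeof Intl === 'object' && Intl !== null &&
      typeof Intl.Collator === 'function') {
    return true;
  }
  try {
    // 'i' is not a valid language tag; engines that support the locales
    // argument reject it, engines that ignore the argument do not.
    'a'.localeCompare('b', 'i');
  } catch (e) {
    return e instanceof RangeError;
  }
  return false;
}

// Hypothetical flag a dialog could consult before enabling the toggle.
var diacriticToggleAvailable = supportsCollationOptions();
console.log('diacritic-insensitive search available:', diacriticToggleAvailable);
```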

Change 339931 had a related patch set uploaded (by Esanders):
Icon for diacritic insensitive search

https://gerrit.wikimedia.org/r/339931

Change 339932 had a related patch set uploaded (by Esanders):
Diacritic insensitive search in find & replace dialog

https://gerrit.wikimedia.org/r/339932

Change 339933 had a related patch set uploaded (by Esanders):
[PULL THROUGH] Diacritic insensitive search

https://gerrit.wikimedia.org/r/339933

Do we have data on the actual usage and need for an ability to search with and without diacritics as a search option?
It seems to be a very minor use-case to justify adding another icon option, esp. an icon that is fairly obscure and hard to determine.

Going through the steps in the scenario, if a user wants to search for "olé", types in "ole" without the accented é, and gets "0 results", then it is quite simple for them to amend the search term to include the diacritic, since it is something they wish to find.

It wouldn't be so simple to enter an exact term containing accents unsupported by the user's keyboard. For example, an English Wikipedia editor may want to search for Paul Erdős or Owain Glyndŵr without even knowing (or caring) what accents their names contain, let alone knowing how to type them.

I agree usage stats would be very interesting. Certainly every language's Wikipedia contains many names with accents not normally typed in that language.

It wouldn't be so simple to enter an exact term containing accents unsupported by the user's keyboard. For example, an English Wikipedia editor may want to search for Paul Erdős or Owain Glyndŵr without even knowing (or caring) what accents their names contain, let alone knowing how to type them.

The above scenario is possible, but in practice is it actually likely? If an editor wants to search for a term whilst editing, they most likely would (and should) know the correct spelling of the term they would like to search, since the main use-case intent would be to search for a term to replace/update.

Just reiterating that adding a new icon to the UI unnecessarily adds complexity to a mode that already has 3 different search options. Looking at usage stats of these existing options would be helpful as well.

The above scenario is possible, but in practice is it actually likely? If an editor wants to search for a term whilst editing, they most likely would (and should) know the correct spelling of the term they would like to search, since the main use-case intent would be to search for a term to replace/update.

I think it is very likely. Many English articles on foreign people contain accents that most English users either don't know about or don't know how to type.

As the product owner, I've already decided to add this feature.

Though I appreciate (and share!) the curiosity, for such a minor feature ("8 points" ~= 2–3 engineer days) I don't think the high expense of coming up with reliable figures is justified, given that there are comparable features in similar products (if mostly aimed at the more 'techy' user base).

@RHo Here's a resource I find interesting: https://www.translatemedia.com/us/blog-us/the-death-of-the-accent-its-impact-on-search/

FWIW, as a French speaker, I tend to omit the accents when searching online, although I am very picky about grammar in my writings. I do expect search to be diacritic-insensitive, and by default (I don't even see why we'd make it an option).

(I don't even see why we'd make it an option).

You may have two words that differ only by accent but that have different meanings. If you wanted to use replace-all this would be difficult if you couldn't distinguish between them.

You may have two words that differ only by accent but that have different meanings. If you wanted to use replace-all this would be difficult if you couldn't distinguish between them.

Good point.

Well, to me it seems simpler and more intuitive to receive a post-search warning like "Beware the different spellings found in results.", and treat results or refine my query accordingly, rather than using a "diacritic-sensitive search" pre-filter. Just my 2 cents.

@JGirault - my point exactly is that the option is being added in an editing context, where search is invoked in order to fix some issue with characters containing diacritics. The argument that people don't know how to type the diacritics in a term is unlikely to apply in this scenario, since they are actively searching for a term to amend whilst editing.

This point is moot if the option is going to be added, but my comments were coming from a perspective of warning against adding unnecessary complexity to the UI (in the form of yet another option) for a low utility feature.

@JGirault - my point exactly is that the option is being added in an editing context, where search is invoked in order to fix some issue with characters containing diacritics. The argument that people don't know how to type the diacritics in a term is unlikely to apply in this scenario, since they are actively searching for a term to amend whilst editing.

I think in-document search can be invoked in more than this one scenario (find a specific word and replace its occurrences using the Find&Replace widget).

Personally I often use in-document search to navigate faster within the document: I remember one keyword from a section, so I press (Cmd/Ctrl)+F and type the keyword instead of scrolling down. I may or may not want to use the Replace feature. Often I don't.
I can also use it to research if a document contains one term, for faster parsing. When I search I'm used to dropping accents because I expect that non-accented search to be less-specific and potentially return more results (the words with right grammar + the words with wrong grammar).
Speaking for the French language, it contains a lot of words with accents and people are less and less inclined to type those. I have seen many people who rely entirely on auto-correction to put the accents on words; they never bother typing them.

@JGirault - I agree there is a difference between those who are not inclined to type accents versus English speakers who "don't know" about diacritics in foreign terms as an argument raised earlier for this search toggle to be introduced. Fwiw also agreed with the original intent of the ticket of diacritic insensitive search, just not the part about introducing another option.

Fwiw also agreed with the original intent of the ticket of diacritic insensitive search, just not the part about introducing another option.

This is exactly what we should be worried about. Everything is a use case. Everything. Not everything can have a tool of its own. Identifying what is implicit and what is explicit is part of the design process.

@Jdforrester-WMF I don't think we are saying the request is invalid. The solution, on the other hand, is another thing to be discussed. Every time you add something to your interface you are adding weight on people who don't use it. It's an overhead. It's another thing to think about, even if it is not meant for you. [1]

Can we revisit the implementation details here, and get a design from the designer on editing, aka @Pginer-WMF?

[1] http://lawsofsimplicity.com/los/law-1-reduce.html

Without commenting on interface design (which isn't really my fortệ), I want to point out that this is one of the world's Contentious Issues. See https://bugs.chromium.org/p/chromium/issues/detail?id=71741 for example. So while some may disagree with this implementation, I think we should recognise that no amount of process is likely to resolve that disagreement.

Note also that accent [in]sensitivity has vastly different effects depending on the language in which you're editing (think Vietnamese versus French and English, for example) - it's not purely a design issue. Just picking one behaviour to fit all our c.300 languages would have effects that might not be very obvious (and setting differing default behaviours for different languages would take more developer time).

Change 339931 merged by jenkins-bot:
icons: Add 'searchDiacritic' icon, in editing-advanced pack

https://gerrit.wikimedia.org/r/339931

Adding an explicit option for everyone is probably the most straightforward solution, but not necessarily the best. I think that requiring users to make an upfront decision has a cost and it is worth exploring if we can avoid or reduce this cost.

There are some principles that could help in these cases: using smart defaults, provide tools right when they are needed, and reducing options.

One possible solution could be to use a flexible search by default that finds results with diacritics and regardless of them being uppercase or lowercase. Then, asking whether a more strict search is needed but only when it is relevant:

  • If all matches are exact, no option is provided for the user. A user searching for "Erdős" does not need to decide whether to turn diacritic-sensitiveness on before doing the search if all matches in the page are written exactly as she did.
  • If there are some matches that are not exact, an option to "view only exact matches" is shown. If the user uses such option, the search is reduced to only the exact word. A user can search for "Erdos" and find "Erdős" but she can also select the "exact match" option to focus on the instances where the diacritic is missing in order to correct it.

This approach reduces options by combining case-sensitiveness with diacritic-sensitiveness using the concept of "exact match". It also avoids upfront decisions: a user typing "Erdős" is not required to understand what the diacritic option is about before searching. The effort is reserved to the less frequent cases, while not exposing an option to cases where such option is irrelevant.
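The proposal can be sketched in code as follows. This is an illustrative reading of the idea only: the function name `classifyMatches`, the `offerExactOnly` flag, and the `'en'` locale are all assumptions, and the loose pass reuses a simple fixed-window comparison with `Intl.Collator` at `base` sensitivity.

```javascript
// Hypothetical sketch: run one loose (case- and diacritic-insensitive) search,
// then split the hits into exact and inexact ones. The "view only exact
// matches" option would be offered only when inexact hits exist.
function classifyMatches(text, query, locale = 'en') {
  const loose = new Intl.Collator(locale, { sensitivity: 'base' });
  const haystack = text.normalize('NFC');
  const needle = query.normalize('NFC');
  const exact = [];
  const inexact = [];
  for (let i = 0; needle.length > 0 && i + needle.length <= haystack.length; i++) {
    const candidate = haystack.slice(i, i + needle.length);
    if (candidate === needle) {
      exact.push(i);            // same letters, same case, same diacritics
    } else if (loose.compare(candidate, needle) === 0) {
      inexact.push(i);          // differs only in case and/or diacritics
    }
  }
  return { exact: exact, inexact: inexact, offerExactOnly: inexact.length > 0 };
}

console.log(classifyMatches('Erdős met Erdos and erdős', 'Erdős'));
console.log(classifyMatches('Erdős', 'Erdős'));
```

In the first call the user would see all three hits plus the extra option; in the second, where every hit is exact, no option would be shown.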

This is just one possibility, I think we can explore more options before jumping into a particular solution.

@Pginer-WMF

To me it seems simpler and more intuitive to receive a post-search warning like "Beware the different spellings found in results.", and treat results or refine my query accordingly, rather than using a "diacritic-sensitive search" pre-filter.

Change 339932 merged by jenkins-bot:
Diacritic insensitive search in find & replace dialog

https://gerrit.wikimedia.org/r/339932

Change 339933 merged by jenkins-bot:
Update VE core submodule to master (8211ebc70)

https://gerrit.wikimedia.org/r/339933

Although the task has been closed, I'd still appreciate getting some feedback on the proposal on T154195#3060546 from VisualEditor team. I'm curious to hear about their concerns, interest or whether some aspects were considered in the implementation, in order to know what to do next with the proposal (capture in a separate ticket, add more details to it, discard it, etc.)

Thanks.

Although the task has been closed, I'd still appreciate getting some feedback on the proposal on T154195#3060546 from VisualEditor team. I'm curious to hear about their concerns, interest or whether some aspects were considered in the implementation, in order to know what to do next with the proposal (capture in a separate ticket, add more details to it, discard it, etc.)

Absolutely, sorry.

That's roughly what we had already implemented, specifically:

One possible solution could be to use a flexible search by default that finds results with diacritics and regardless of them being uppercase or lowercase

This is the new default search experience; the special search options (case sensitive, diacritic sensitive, whole word-only, and regex) are all disabled by default.

However, to dive into the complexity a little, "results with diacritics" is language-dependent even when you don't enable the "diacritic-sensitive search option". In languages that consider diacritics to be rare (according to your browser), like English, o and ő are the "same" character. In languages where the diacritics make letters which are considered different (again, according to your browser), like Hungarian, o and ő are "different". I believe this is decided in CLDR/Unicode and implemented consistently between browsers, but at least it will mean that the browser-native page search and the search inside VE should be consistent.
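A small sketch of how one might observe this locale dependence, assuming the runtime ships collation data for both languages (the result for `hu` depends on the ICU/CLDR data bundled with the browser or Node build, so no output is claimed here). The helper name `sameBaseLetter` is hypothetical.

```javascript
// Hypothetical helper: does the given locale treat a and b as the same
// letter once accents and case are ignored?
function sameBaseLetter(a, b, locale) {
  return new Intl.Collator(locale, { sensitivity: 'base' }).compare(a, b) === 0;
}

// Report which of the requested locales the runtime actually has data for;
// an unsupported locale silently falls back to the default collation.
console.log('supported:', Intl.Collator.supportedLocalesOf(['en', 'hu']));

console.log('en: o and ő same letter?', sameBaseLetter('o', 'ő', 'en'));
console.log('hu: o and ő same letter?', sameBaseLetter('o', 'ő', 'hu'));
```

If the two lines disagree, the search would need to be told the content language of the page rather than relying on the user's interface language, otherwise the same query could match differently for different editors.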

  • If all matches are exact, no option is provided for the user. A user searching for "Erdős" does not need to decide whether to turn diacritic-sensitiveness on before doing the search if all matches in the page are written exactly as she did.

This would violate one of the core design principles for VE. Options do not magically reveal and hide themselves, but are disabled/enabled based on availability (e.g. you can't switch to visual mode on talk pages).

We could possibly hide the advanced search options behind a cog/collapse function if they prove to be really confusing and distracting for regular searchers, but given that the current main wikitext editor has the capitalisation and regex functionality (and taking up way more space), I don't think it's a priority.

Note that users do not "need to decide whether to turn diacritic-sensitiveness on before doing the search", much the same as they do not need to decide whether to run regexes or whatever.

  • If there are some matches that are not exact, an option to "view only exact matches" is shown. If the user uses such option, the search is reduced to only the exact word. A user can search for "Erdos" and find "Erdős" but she can also select the "exact match" option to focus on the instances where the diacritic is missing in order to correct it.

In general we don't hide functionality in second-use prompts. (Power) users complain an awful lot about this design pattern, and though it does indeed serve to reduce complexity up-front, I don't think it's a great compromise in terms of inhibiting other workflows.

Hope this is helpful. I'd certainly be open to different design considerations for these kinds of workflow in general (outwith this task).

J.

This is the new default search experience; the special search options (case sensitive, diacritic sensitive, whole word-only, and regex) are all disabled by default.

Actually the button is a "diacritic insensitive" button, so the default is sensitive (unselected). I can see arguments for keeping it as is, and for changing it.
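Whichever way round the toggle is labelled, the two independent switches (match case, ignore diacritics) map cleanly onto the four `sensitivity` values of `Intl.Collator`. The sketch below is an assumption about how such a mapping could be written; the function and option names are hypothetical and both switches default to off, matching the unselected state described above.

```javascript
// Hypothetical mapping from two search toggles to a collator sensitivity.
function sensitivityFor(options) {
  const matchCase = Boolean(options && options.matchCase);
  const ignoreDiacritics = Boolean(options && options.ignoreDiacritics);
  if (ignoreDiacritics) {
    // Accents never matter; case matters only when requested.
    return matchCase ? 'case' : 'base';
  }
  // Accents always matter; case matters only when requested.
  return matchCase ? 'variant' : 'accent';
}

// Default state: both toggles unselected, so the search is case-insensitive
// but still distinguishes "ole" from "olé".
const defaultCollator = new Intl.Collator('en', { sensitivity: sensitivityFor({}) });
console.log(sensitivityFor({}));
console.log(defaultCollator.compare('OLE', 'ole') === 0);
console.log(defaultCollator.compare('ole', 'olé') === 0);
```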