Words not suggested with type-ahead because underscores are not tokenized: Use dashes/hyphens instead
Closed, Resolved, Public

Description

This is a blocker for day 1.

If I type "preferences" in https://bugzillapreview.wmflabs.org/search/query/c3Suo_oyO3VI/#R , I don't get autocompleted to "User preferences" (same with "Unknown" for "General/Unknown"). If I search "User" (because I'm not a normal user and I remember the full name of the component), I get correctly autocompleted to MediaWiki-User_preferences.

So it seems the underscores are not used by the search engine's tokenizer.

Event Timeline

Nemo_bis raised the priority of this task from to Needs Triage.
Nemo_bis updated the task description. (Show Details)
Nemo_bis changed Security from none to None.
Nemo_bis updated the task description. (Show Details)
Nemo_bis subscribed.
Qgil triaged this task as High priority. Oct 25 2014, 2:53 PM
Qgil subscribed.

Nemo, thanks for catching this problem. I agree it is important, especially in the context of creating new tasks and finding the right project for them.

The problem can be reproduced with https://phab-01.wmflabs.org/tag/mediawiki-user_preferences/ . This project is not retrieved when you type "pref" in project type-ahead fields, while typing "med" or "use" does retrieve it.
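
The behaviour described above can be sketched roughly as follows. This is only an illustration, not Phabricator's actual type-ahead code: the `tokenize` and `typeahead` helpers and the separator pattern are assumptions, chosen so that names are split on whitespace and hyphens but not on underscores.

```python
import re

# Assumption: the tokenizer splits on whitespace and hyphens only.
DEFAULT_SEPARATORS = r"[\s\-]+"


def tokenize(name, separators=DEFAULT_SEPARATORS):
    """Split a project name into lowercase tokens (hypothetical helper)."""
    return [t for t in re.split(separators, name.lower()) if t]


def typeahead(query, names, separators=DEFAULT_SEPARATORS):
    """Return the names that have at least one token starting with the query."""
    q = query.lower()
    return [
        n for n in names
        if any(t.startswith(q) for t in tokenize(n, separators))
    ]


projects = ["MediaWiki-User_preferences", "Commons-App-Android"]

print(tokenize("MediaWiki-User_preferences"))  # ['mediawiki', 'user_preferences']
print(typeahead("pref", projects))             # []
print(typeahead("use", projects))              # ['MediaWiki-User_preferences']

# With underscores replaced by hyphens, "pref" finds the project.
hyphenated = [p.replace("_", "-") for p in projects]
print(typeahead("pref", hyphenated))           # ['MediaWiki-User-preferences']
```

Under these assumptions "user_preferences" stays a single token, so only prefixes of "user..." match it, which is consistent with "use" working and "pref" not.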

I remember some discussion upstream about tokenization using dashes (and underscores?). Wasn't this fixed?

@chasemp requested upstream the tokenization of hyphens and also making hashtags case insensitive, which was another feature related to project names. Both requests were fixed pretty fast.

I could not find a request for tokenizing underscores, and now I don't remember whether this was intentional or not. If not, we can request it.

Qgil renamed this task from "Preferences" doesn't match "User preferences" component in search and task creation to Words not suggested with type-ahead because underscores are not tokenized. Oct 26 2014, 6:01 AM

For what it's worth, I have gone through https://bugzillapreview.wmflabs.org/project/ imagining that all underscores were hyphens, and I don't see any problem doing that. Search would work right now, visually it would look better, and I don't think we would have any semantic loss.

If any specific project does lose something with the change, we could always edit it manually. In most cases I think the change just works.

> Search would work right now, visually it would look better, and I don't think we would have any semantic loss.

The semantics you'd lose is that currently you're using hyphens to indicate hierarchy (macroproject-project) and underscores to separate words (this_is_a_project_name). I don't know what's better; both are rather bad, see T911.

In other contexts, as a naming convention we're using https://www.mediawiki.org/wiki/BEM

I considered the hierarchy aspect, and I think the semantics equally work without the distinction between hyphens and underscores. For instance:

  • Commons-App-Android
  • MediaWiki-File-management
  • MediaWiki-Language-Converter
  • MediaWiki-extensions-Babel
  • Parsoid-Token-Stream-Transforms

Whoever understands these labels with underscores will understand them equally with hyphens.

And yes, it is not a perfect solution, but it would be better than the current situation with non-searchable words after underscores, and it is relatively simple to implement.

Qgil added a subscriber: Aklapper.

@Aklapper to come up with a proposal that respects the tokenization of every word in a project name.

We don't need to aim for perfection here. Something that works for the migration is good enough. I volunteer to make any manual changes to project names at a later stage, if needed.

: colons, _ underscores, / slashes, > brackets don't work with current tokenization rules so we are stuck with - hyphens/dashes it seems. So let's go for that. :-/

Looks like that's already fixed in https://git.wikimedia.org/blob/phabricator%2Ftools.git/7e65baec365bbfeb88219fa78d95c817073cd100/wmfphablib%2Fbzlib.py#L11 by using hyphens/dashes for everything and replacing underscores and spaces by dashes, and slashes by -or- ? Or are my code reading skills improvable? :)

component_separator = '-'
product = re.sub('\s', '-', product)
product = product.replace('_', '-')
component = re.sub('\s', '-', component)
component = component.replace('_', '-')
component = component.replace('/', '-or-')
return  "%s%s%s" % (product,
                     component_separator,
                     component)
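
For anyone who wants to try the excerpt above on its own, here is a minimal runnable sketch of the same substitutions. The function name `build_project_name` and the constant name are hypothetical (the excerpt does not show the enclosing function), and raw strings are used for the regex patterns; the replacement rules themselves are taken from the quoted lines.

```python
import re

COMPONENT_SEPARATOR = "-"


def build_project_name(product, component):
    """Join a Bugzilla product and component into a hyphen-only name.

    Hypothetical wrapper around the substitutions quoted above.
    """
    # Whitespace and underscores become hyphens in both parts.
    product = re.sub(r"\s", "-", product).replace("_", "-")
    component = re.sub(r"\s", "-", component).replace("_", "-")
    # Slashes in the component become "-or-", as in the excerpt.
    component = component.replace("/", "-or-")
    return "%s%s%s" % (product, COMPONENT_SEPARATOR, component)


print(build_project_name("MediaWiki", "User preferences"))  # MediaWiki-User-preferences
print(build_project_name("MediaWiki", "General/Unknown"))   # MediaWiki-General-or-Unknown
```

Note that after this conversion the hyphen carries both meanings (hierarchy and word separation), which is the trade-off discussed earlier in this task.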

Aklapper renamed this task from Words not suggested with type-ahead because underscores are not tokenized to Words not suggested with type-ahead because underscores are not tokenized: Use dashes/hyphens instead. Nov 3 2014, 10:26 AM

As ugly as it is... it's consistent and works in all cases. This will all be honed over time, I suspect.

For now..

https://bugzillapreview.wmflabs.org/project/view/116/

Screen_Shot_2014-11-05_at_10.46.30_AM.png (85×797 px, 15 KB)