
Words not suggested with type-ahead because underscores are not tokenized: Use dashes/hyphens instead
Closed, Resolved (Public)

Description

This is a blocker for day 1.

If I type "preferences" in https://bugzillapreview.wmflabs.org/search/query/c3Suo_oyO3VI/#R , I don't get autocompleted to "User preferences" (same with "Unknown" for "General/Unknown"). If I search "User" (because I'm not a normal user and I remember the full name of the component), I get correctly autocompleted to MediaWiki-User_preferences.

So it seems the underscores are not used by the search engine's tokenizer.
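The behaviour described above can be modelled with a small sketch. It assumes a simplified tokenizer that splits names on whitespace and hyphens only, plus a prefix match on each token; `tokenize` and `typeahead_match` are hypothetical names for illustration and are not Phabricator's actual implementation.

```python
import re

# Assumption: a simplified tokenizer that splits on whitespace and hyphens only.
# This is NOT Phabricator's code; it only models the behaviour reported above.
def tokenize(name):
    return [tok.lower() for tok in re.split(r"[\s-]+", name) if tok]

# Hypothetical type-ahead rule: the query must be a prefix of some token.
def typeahead_match(query, name):
    q = query.lower()
    return any(tok.startswith(q) for tok in tokenize(name))

name = "MediaWiki-User_preferences"
print(tokenize(name))                        # "user_preferences" stays one token
print(typeahead_match("User", name))         # matches the start of that token
print(typeahead_match("preferences", name))  # no token starts with "preferences"
```

Under these assumptions, "preferences" only becomes findable once it starts a token of its own, for example in a name like "MediaWiki-User-preferences".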

Event Timeline

Nemo_bis raised the priority of this task from to Needs Triage.
Nemo_bis updated the task description.
Nemo_bis changed Security from none to None.
Nemo_bis updated the task description.
Nemo_bis added a subscriber: Nemo_bis.
Qgil triaged this task as High priority. Oct 25 2014, 2:53 PM
Qgil added a subscriber: Qgil.

Nemo, thanks for catching this problem. I agree it is important, especially in the context of creating new tasks and finding the right project for them.

The problem can be reproduced with https://phab-01.wmflabs.org/tag/mediawiki-user_preferences/ . This project is not retrieved when you type "pref" in project type-ahead fields, while it is retrieved when you type "med" or "use".

I remember some discussion upstream about tokenization using dashes (and underscores?). Wasn't this fixed?

Qgil added a subscriber: chasemp. Oct 26 2014, 5:16 AM

@chasemp requested upstream the tokenization of hyphens and also making hashtags case insensitive, which was also a feature related to project names. Both requests were fixed pretty fast.

I could not find a request for tokenizing underscores, and now I don't remember whether this was intentional or not. If not, we can request it.

Qgil renamed this task from "Preferences" doesn't match "User preferences" component in search and task creation to Words not suggested with type-ahead because underscores are not tokenized. Oct 26 2014, 6:01 AM
Qgil added a comment. Oct 28 2014, 9:36 AM

For what it's worth, I have gone through https://bugzillapreview.wmflabs.org/project/ imagining that all underscores were hyphens, and I don't see any problem doing that. Search would work right now, visually it would look better, and I don't think we would have any semantic loss.

If any specific project does lose something with the change, we could always edit it manually. In most cases I think the change just works.

Nemo_bis added a comment. Edited Oct 29 2014, 7:54 AM

"Search would work right now, visually it would look better, and I don't think we would have any semantic loss."

The semantic you'd lose is that currently you're using hyphens to indicate hierarchy (macroproject-project) and underscores to separate words (this_is_a_project_name). I don't know what's better; both are rather bad, see T911.
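A small sketch of that loss, assuming hierarchy is read by splitting a name on hyphens. The name "MediaWiki-File_management" and the helper `hierarchy` are illustrative assumptions, not taken from the migration code.

```python
# Illustrative name in the current scheme:
# hyphen = hierarchy level, underscore = word break inside a level.
current = "MediaWiki-File_management"
converted = current.replace("_", "-")

# Hypothetical helper: read the hierarchy levels by splitting on hyphens.
def hierarchy(name):
    return name.split("-")

print(hierarchy(current))    # two levels: the macroproject and one project
print(hierarchy(converted))  # three parts: the word break now looks like a level
```

After conversion, nothing in the string itself says whether "File-management" is one project or two nested levels; that has to come from the reader's knowledge of the project.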

In other contexts, as a naming convention we're using https://www.mediawiki.org/wiki/BEM

Qgil added a comment. Oct 29 2014, 8:37 AM

I considered the hierarchy aspect, and I think the semantics equally work without the distinction between hyphens and underscores. For instance:

  • Commons-App-Android
  • MediaWiki-File-management
  • MediaWiki-Language-Converter
  • MediaWiki-extensions-Babel
  • Parsoid-Token-Stream-Transforms

Whoever understands these labels with underscores will understand them equally with hyphens.

And yes, it is not a perfect solution, but it would be better than the current situation, where words after underscores are not searchable, and it would be relatively simple to implement.

Qgil assigned this task to Aklapper. Oct 29 2014, 9:22 PM
Qgil added a subscriber: Aklapper.

@Aklapper to come up with a proposal that respects the tokenization of every word in a project name.

Qgil added a comment. Oct 31 2014, 8:50 PM

We don't need to aim for perfection here. Something that works for the migration is good enough. I volunteer to make any manual changes to project names at a later stage, if needed.

: colons, _ underscores, / slashes, > brackets don't work with current tokenization rules so we are stuck with - hyphens/dashes it seems. So let's go for that. :-/

Qgil reassigned this task from Aklapper to chasemp. Oct 31 2014, 9:21 PM

Looks like that's already fixed in https://git.wikimedia.org/blob/phabricator%2Ftools.git/7e65baec365bbfeb88219fa78d95c817073cd100/wmfphablib%2Fbzlib.py#L11 by using hyphens/dashes for everything and replacing underscores and spaces by dashes, and slashes by -or- ? Or are my code reading skills improvable? :)

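# Excerpt; assumes `import re` at module level and that `product` and
# `component` hold the Bugzilla product and component names.
component_separator = '-'
# Whitespace and underscores become hyphens so every word is tokenized.
product = re.sub(r'\s', '-', product)
product = product.replace('_', '-')
component = re.sub(r'\s', '-', component)
component = component.replace('_', '-')
# A slash (as in "General/Unknown") is spelled out as "-or-".
component = component.replace('/', '-or-')
return "%s%s%s" % (product,
                   component_separator,
                   component)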
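
For anyone who wants to try the rule quoted above, here is a self-contained sketch. The function name `build_project_name` and its signature are assumptions made for illustration; the linked bzlib.py may wrap these lines differently.

```python
import re

# Hypothetical wrapper around the quoted lines; the name and signature are
# assumptions for illustration, not copied from bzlib.py.
def build_project_name(product, component):
    component_separator = "-"
    # Whitespace and underscores become hyphens so each word is its own token.
    product = re.sub(r"\s", "-", product).replace("_", "-")
    component = re.sub(r"\s", "-", component).replace("_", "-")
    # A slash is spelled out as "-or-".
    component = component.replace("/", "-or-")
    return "%s%s%s" % (product, component_separator, component)

print(build_project_name("MediaWiki", "User preferences"))
print(build_project_name("MediaWiki", "General/Unknown"))
```

With the two components mentioned in the description, this yields names in which "preferences" and "Unknown" each follow a hyphen, which is the separator the type-ahead tokenizes on.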
Aklapper renamed this task from Words not suggested with type-ahead because underscores are not tokenized to Words not suggested with type-ahead because underscores are not tokenized: Use dashes/hyphens instead. Nov 3 2014, 10:26 AM
Aklapper moved this task from Ready To Go to Done on the Bugzilla-Preview board. Nov 4 2014, 3:50 PM
chasemp closed this task as Resolved. Nov 5 2014, 4:47 PM

As ugly as it is... it's consistent and works in all cases. This will all be honed over time I suspect.

For now..

https://bugzillapreview.wmflabs.org/project/view/116/