
[Regression] CirrusSearch: Excerpts should not show normalised version (all lowercase and no punctuation)
Closed, Resolved (Public)

Description

I believe this change happened fairly recently. When searching for a phrase, the excerpts shown on Special:Search now seem to expose the normalised version of the text (e.g. stripped of all parentheses, character casing and other variants, presumably accents as well).

It doesn't happen consistently, though. Presumably the normalisation is more than simple character stripping or replacement and is fine-tuned per language. So the recent regression may not be that normalisation was turned on for excerpts, but that the normalisation itself changed.

User facing issue:

Search for "WisReden" on nl.wikipedia.org.

Results:

  1. Gebruiker:Chaemera/monobook.js

    'gebruiker erwin blockmsg.js ; importscript 'wikipedia wisreden' ;

    130 B (11 woorden) - 21 nov 2007 16:26
  1. Gebruiker:Emmelie/monobook.js

    'gebruiker warddr qpreview.js ; importscript 'wikipedia wisreden' ; document.write ' ' ; version 1.beta.4 zeus_head_thumb-zanaq

    4 kB (563 woorden) - 25 mrt 2008 17:18
  1. Gebruiker:Oliphaunt/monobook.js

    importscript 'wikipedia wisreden' ; importscript 'en wikipedia wikiproject user scripts scripts add

    2 kB (190 woorden) - 23 jul 2008 10:19

Actual page content:

  • Gebruiker:Chaemera/monobook.js:
    // [[Gebruiker:Erwin/blockmsg.js]]
    importScript('Gebruiker:Erwin/blockmsg.js');
    importScript('Wikipedia:WisReden');
  • excerpt:

    'gebruiker erwin blockmsg.js ; importscript 'wikipedia wisreden' ;

Rather weird that it:

  • Converted everything to lower case.
  • Added a space before the semi-colon.
  • Turned the quoted text into one quotation instead of two, but preserved the semi-colon.
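To make the observed transformation concrete, here is a minimal Python sketch of a normaliser that produces similar output for one of the source lines. The function `normalise` and its rules (lowercase, turn brackets, colons and slashes into spaces) are a guess for illustration only; this is not CirrusSearch's or Elasticsearch's actual analysis chain, and it does not reproduce every detail of the excerpts above (for example the dropped closing quote).

```python
import re


def normalise(text: str) -> str:
    """Hypothetical normaliser: NOT the real CirrusSearch analysis chain."""
    lowered = text.lower()
    # Assumption: brackets, colons and slashes are treated as separators.
    spaced = re.sub(r"[():/\[\]]", " ", lowered)
    # Collapse any runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", spaced).strip()


source_line = "importScript('Wikipedia:WisReden');"
print(normalise(source_line))
```

The point is only that an excerpt built from text like this, rather than from the page source, would show the symptoms listed above: lost casing and a space before the semicolon.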

As for inconsistency, here is a search for "addOnloadhook" on nl.wikipedia.org:

  1. Gebruiker:Aleichem/monobook.js

    addonloadhook stats ; document.write ' ' ;
  1. Wikipedia:WisReden

    location.href.indexOf("action=delete")!=-1) addOnloadHook(WisReden); //

Result #1 has a normalised excerpt, whereas result #3 preserves the original case.
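The expected behaviour, as I understand it, is that normalisation is used only to find the match and the excerpt is then cut from the original text. A rough Python sketch of that idea follows; `excerpt_from_source` and its `context` parameter are hypothetical names, and the simple `lower()` comparison assumes lowercasing does not change string length, which does not hold for every Unicode character.

```python
from typing import Optional


def excerpt_from_source(source: str, query: str, context: int = 10) -> Optional[str]:
    """Match case-insensitively, but return a slice of the ORIGINAL text."""
    # Assumption: lower() keeps offsets aligned with the original string.
    pos = source.lower().find(query.lower())
    if pos == -1:
        return None
    start = max(0, pos - context)
    end = min(len(source), pos + len(query) + context)
    return source[start:end]


page = 'location.href.indexOf("action=delete")!=-1) addOnloadHook(WisReden); //'
print(excerpt_from_source(page, "addOnloadhook", context=0))
```

With this approach the query "addOnloadhook" still matches, but the excerpt keeps the casing and punctuation of the page.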


Version: unspecified
Severity: major

Details

Reference
bz65803

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:25 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz65803.

Chad, can you look at this? I think you have more experience with the normalization stuff than I do. It looks like we're normalizing on the way in somewhere in Cirrus. I think we should let Elasticsearch do the normalization, for the most part.

This example from the bottom of the description is good:
https://nl.wikipedia.org/w/index.php?title=Speciaal%3AZoeken&profile=all&search=addOnloadhook&fulltext=Search

Chad told me earlier that he believed this was something we'd already fixed and that reindexing the scripts should have resolved it. I reindexed the whole wiki, poked around, and couldn't find any more scripts that looked bad. I'm marking this verified. If you see something still broken, reopen and we'll dig into it.
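For anyone who wants to re-check pages after a reindex, one possible heuristic is to test whether an excerpt appears verbatim in the page source. The Python sketch below is only an illustration of that check: `is_verbatim` is a hypothetical helper, it ignores highlighting markup, and it was not part of the verification described above.

```python
import re


def _squash(text: str) -> str:
    # Collapse whitespace so line breaks in the source do not cause false alarms.
    return re.sub(r"\s+", " ", text).strip()


def is_verbatim(excerpt: str, source: str) -> bool:
    """True if the excerpt occurs in the source, ignoring whitespace layout."""
    return _squash(excerpt) in _squash(source)


source = (
    "importScript('Gebruiker:Erwin/blockmsg.js');\n"
    "importScript('Wikipedia:WisReden');"
)
print(is_verbatim("importscript 'wikipedia wisreden' ;", source))
print(is_verbatim("importScript('Wikipedia:WisReden');", source))
```

A normalised excerpt fails the check because its casing and punctuation no longer match the page.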