
Lack of case sensitivity with hastemplate:
Closed, Resolved · Public

Description

Why do I get exactly the same count (≈983) for both

  1. Template:Arxiv
  2. Template:ArXiv

The first one is a redirect to the second one. The first one has fewer than 200 transclusions; the second one includes all transclusions of the first one, plus 750 more.

Event Timeline

Restricted Application added a project: Discovery-Search. · May 12 2019, 7:47 PM
Restricted Application added subscribers: Liuxinyu970226, Aklapper.

As far as CirrusSearch is concerned, a template and its redirects are the same page. I will have to find some time next week to look closer into this particular case, but in general there is no useful distinction in CirrusSearch between redirects and the pages they redirect to.

Well, the use case is that I want to edit those pages which transclude the redirect, in order to modify template name and old parameters, rather than pages transcluding the generic template name where new parameters are already used.

That works fine if generic template name and redirect name differ significantly.

Change 512198 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/core@master] Templates in search should be case sensitive

https://gerrit.wikimedia.org/r/512198

I took a closer look at this and indeed, the mapping lowercases all queries to the template field (and we only have a single analysis chain applied). We can probably simply change the analysis chain there, but that will require a quick review to confirm that template boosting uses appropriately cased template names on all wikis that have configured it.
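
To illustrate why both searches return the same count, here is a toy model in Python. It is not CirrusSearch code: the page names and the `build_index` and `search` helpers are hypothetical. It only assumes, as the description states, that a page transcluding the redirect is also counted as transcluding the target.

```python
from collections import defaultdict


def build_index(pages, normalize):
    """Map each normalized template name to the set of pages using it."""
    index = defaultdict(set)
    for page, templates in pages.items():
        for name in templates:
            index[normalize(name)].add(page)
    return index


def search(index, template, normalize):
    """Return the pages indexed under the normalized template name."""
    return index.get(normalize(template), set())


# Hypothetical pages. "Page A" transcludes the redirect, so it is listed
# under both the redirect and its target.
pages = {
    "Page A": ["Template:Arxiv", "Template:ArXiv"],
    "Page B": ["Template:ArXiv"],
    "Page C": ["Template:ArXiv"],
}

folded = build_index(pages, str.lower)         # single lowercasing chain
exact = build_index(pages, lambda name: name)  # case-sensitive alternative

# With lowercasing, both names collapse into one term and match every page.
print(len(search(folded, "Template:Arxiv", str.lower)))  # 3
print(len(search(folded, "Template:ArXiv", str.lower)))  # 3

# Without lowercasing, the redirect can be told apart from its target.
print(len(search(exact, "Template:Arxiv", lambda name: name)))  # 1
print(len(search(exact, "Template:ArXiv", lambda name: name)))  # 3
```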

I'm not sure if there might be more knock-on effects though...

EBernhardson triaged this task as Medium priority. · May 23 2019, 5:29 PM
EBernhardson moved this task from needs triage to Current work on the Discovery-Search board.
EBernhardson edited projects, added MediaWiki-Search; removed CirrusSearch.

Change 512198 abandoned by EBernhardson:
Templates in search should be case sensitive

Reason:
Discussed with Stas; we think a better way forward is to have the field indexed both ways (case-sensitive and case-insensitive). Unfortunately we are having some disk space problems, and adding new fields will have to wait until Q1, when aging servers are replaced.

https://gerrit.wikimedia.org/r/512198

Thank you for now.

BTW, linksto: and incategory: are both using page names as well.

I do not expect categories to be distinguished by letter case only, but for linksto: there might be a difference between the BIOS and Bios articles, even more so on Wiktionary, where the first letter is significant.

debt added a subscriber: debt. · Jul 23 2019, 5:35 PM

We're still waiting on new disks for more space...

Old servers have now been replaced with new ones, so we should be able to unblock this.

High level plan:

  1. Adjust cirrussearch mapping generation to add a case-sensitive multi-field to the template property
  2. Run an in-place reindex across all wikis and clusters
  3. Adjust hastemplate keyword to utilize new case-sensitive multi-field

Adjust cirrussearch mapping generation to add a case-sensitive multi-field to the template property

references

Today the template field is defined as:

$fields['template'] = $engine->makeSearchFieldMapping(
    'template',
    SearchIndexField::INDEX_TYPE_KEYWORD
);
$fields['template']->setFlag( SearchIndexField::FLAG_CASEFOLD );

FLAG_CASEFOLD is used to tell the search engine that it should ignore case for this field. It seems like what we actually want to tell the search engine is that casefolding is convenient for default searches, but that identifying a specific template requires case-sensitive matching. Whatever name is chosen to indicate this, KeywordIndexField::getMapping will need to be adjusted to recognize the flag and generate an appropriate multi-field.
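
As a rough sketch of what such a multi-field could look like, the following Python function builds an Elasticsearch-style mapping for the template property. The `fields` key is the standard Elasticsearch multi-field mechanism, but the function name, the `lowercase_keyword` analyzer name, and the `keyword` subfield name are assumptions made for illustration, not the values KeywordIndexField::getMapping actually emits.

```python
def template_mapping(add_case_sensitive_subfield=True):
    """Build an illustrative mapping for the 'template' property.

    The main field stays case-folded for default searches. The optional
    subfield keeps the original casing for exact matching.
    """
    mapping = {
        "type": "text",
        # Assumed analyzer name: keeps the whole value as a single
        # token and lowercases it.
        "analyzer": "lowercase_keyword",
    }
    if add_case_sensitive_subfield:
        # Multi-field: the same source value indexed a second way.
        # It would be addressed in queries as "template.keyword".
        mapping["fields"] = {"keyword": {"type": "keyword"}}
    return mapping


print(template_mapping())
```

Because a multi-field indexes the same source value twice, existing case-insensitive queries against the main field would keep working unchanged, at the cost of extra disk space.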

Run an in-place reindex across all wikis and clusters

https://wikitech.wikimedia.org/wiki/Search#In_place_reindex

Adjust hastemplate keyword to utilize new case-sensitive multi-field

Adjust CirrusSearch\Query\HasTemplateFeature::parseValue to recognize whatever syntax is agreed on to trigger case-sensitive matching, returning a 'case-sensitive' property along with the current templates. Use this value in HasTemplateFeature::doGetFilterQuery to decide the appropriate field to filter on.
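
A minimal sketch of that two-step flow, written in Python rather than PHP. The trigger syntax was still undecided at this point, so the sketch simply assumes that wrapping the value in double quotes requests case-sensitive matching. The function names, the `template.keyword` subfield name, and the shape of the returned filter are likewise hypothetical.

```python
def parse_value(raw):
    """Split a hastemplate: value into template names and a case flag.

    Assumption: a value wrapped in double quotes asks for case-sensitive
    matching. Any other agreed syntax would slot in here.
    """
    case_sensitive = len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"'
    name = raw[1:-1] if case_sensitive else raw
    return {"templates": [name], "case_sensitive": case_sensitive}


def filter_query(parsed):
    """Pick the field to filter on, based on the parsed flag."""
    if parsed["case_sensitive"]:
        field, clause = "template.keyword", "term"  # exact subfield
    else:
        field, clause = "template", "match"         # case-folded field
    return {
        "bool": {
            "should": [{clause: {field: name}} for name in parsed["templates"]]
        }
    }


print(filter_query(parse_value("Template:ArXiv")))
print(filter_query(parse_value('"Template:Arxiv"')))
```

Keeping the flag in the parsed value, rather than deciding the field during parsing, mirrors the split described above between parseValue and doGetFilterQuery.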

Just a reminder:

  • The Flag needs three states:
    • IGNORE
    • SENSITIVE_ALL
    • SENSITIVE_2ND or IGNORE1_SENSITIVE
  • For hastemplate: and incategory: IGNORE1_SENSITIVE is appropriate.
  • On a Wiktionary, linksto: is SENSITIVE_ALL for main namespace, but any other linksto: is IGNORE1_SENSITIVE.
  • Common text search is IGNORE.
  • When accessing the database tables this is no problem at all: capitalizing the first character of the title part (not the namespace) will deliver IGNORE1_SENSITIVE entries.
  • There are config variables indicating which page names might need SENSITIVE_ALL.
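
The three states listed above can be sketched as a small Python model. The `CaseMode` enum and the `normalize_title` helper are hypothetical names. Splitting on the first colon is a simplification that would misread a main-namespace title containing a colon.

```python
from enum import Enum


class CaseMode(Enum):
    IGNORE = "ignore"                        # common text search
    SENSITIVE_ALL = "sensitive_all"          # every character significant
    IGNORE1_SENSITIVE = "ignore1_sensitive"  # only the first letter folded


def normalize_title(title, mode):
    """Return the key under which a title would be compared."""
    if mode is CaseMode.IGNORE:
        return title.casefold()
    if mode is CaseMode.SENSITIVE_ALL:
        return title
    # IGNORE1_SENSITIVE: capitalize the first character of the title
    # part and leave the namespace prefix untouched.
    namespace, sep, rest = title.partition(":")
    if not sep:
        namespace, rest = "", title
    return namespace + sep + rest[:1].upper() + rest[1:]


print(normalize_title("Template:arXiv", CaseMode.IGNORE1_SENSITIVE))  # Template:ArXiv
print(normalize_title("bios", CaseMode.IGNORE1_SENSITIVE))            # Bios
print(normalize_title("bios", CaseMode.SENSITIVE_ALL))                # bios
print(normalize_title("BIOS", CaseMode.IGNORE))                       # bios
```
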
dcausse added a subscriber: dcausse. · Edited · Jan 16 2020, 9:43 AM

I suppose that the last remark refers to the $wgCapitalLinks and $wgCapitalLinkOverrides configuration variables.
When querying, CirrusSearch properly honors these settings, so that searching for hastemplate:foo will actually search for Template:Foo on English Wikipedia but Template:foo on English Wiktionary.

For the indexed value, I think a single flag is enough, because the wiki configuration will be taken into account by CirrusSearch when searching.

Well, on a Wiktionary only pages in the main namespace are SENSITIVE_ALL; templates and categories behave as on every other wiki.

On any non-Wiktionary wiki, main namespace pages and all others are IGNORE1_SENSITIVE, afaik.

Change 565370 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[mediawiki/extensions/CirrusSearch@master] Allow template keyword to be case sensitive

https://gerrit.wikimedia.org/r/565370

Change 566389 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[mediawiki/extensions/CirrusSearch@master] Allow search for case sensitive template keyword

https://gerrit.wikimedia.org/r/566389

Change 565370 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add case sensitive subfield for template keyword

https://gerrit.wikimedia.org/r/565370

The reindex for this is in progress; it will be another week or more before it's complete.

Mstyles claimed this task. · Mar 10 2020, 5:27 PM

Change 566389 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Allow search for case sensitive template keyword

https://gerrit.wikimedia.org/r/566389

The patch will go out with the train in the last week of April (no train is running next week). The reindex that allows this to work has mostly completed; a few wikis have to be re-run, but that will hopefully be finished at or soon after the time this train rolls forward.

Deployed, and the example links work. Marking this as done.

Mstyles closed this task as Resolved. · May 5 2020, 4:58 PM