Search of wikidata string property values using haswbstatement is case sensitive
Closed, ResolvedPublic

Event Timeline

Mvolz triaged this task as Medium priority. Oct 10 2018, 10:40 AM
Mvolz created this task.

@Smalyshev what do you think? I haven't run into this myself. My feeling is that case insensitive is probably better, but would that require a lot of work?

I don't think removing case sensitivity would be a lot of manual work, but changing the index will require a reindex. I'm not sure why we decided to make it case-sensitive; I'll try to find out, and if there's no reason, we can change it. Note that this will apply to all fields, so for properties where case does matter it may get things wrong.

As long as it's possible to get the original case from the API, you can remove false positives where case matters by making another call for each result and comparing the values for equality. Whereas if there are no results, then it's very hard to get one, as you have to try every case permutation: all lower, all upper, camel case, sentence case, title case or completely random :). That said, string values are now available from the general search, which does work, so maybe there isn't a need? E.g. https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=10.1371/journal.PCBI.1002947 works.
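The post-filtering idea can be sketched in Python. This is only a minimal sketch under assumptions: `candidates` is a hypothetical mapping from item ID to the string values already fetched from the API (the fetching itself is not shown), and both function names are illustrative, not part of any Wikibase API.

```python
def filter_exact_case(candidates, wanted):
    """Keep only items that hold `wanted` with exactly the same case.

    `candidates` maps an item ID to the string values found on that item
    (hypothetical shape; fetching them from the API is not shown here).
    """
    return [item_id for item_id, values in candidates.items() if wanted in values]


def filter_ignoring_case(candidates, wanted):
    """What a case-insensitive match would return for the same data."""
    folded = wanted.casefold()
    return [
        item_id
        for item_id, values in candidates.items()
        if any(value.casefold() == folded for value in values)
    ]


# Made-up item IDs and values, for illustration only.
candidates = {
    "Q1": ["10.1371/JOURNAL.PCBI.1002947"],
    "Q2": ["10.1371/journal.pcbi.1002947"],
}
loose = filter_ignoring_case(candidates, "10.1371/journal.PCBI.1002947")
exact = filter_exact_case(candidates, "10.1371/journal.pcbi.1002947")
```

With case-insensitive matching both items come back and the caller narrows them down afterwards; with case-sensitive matching a query in the wrong case finds nothing to narrow down.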

What is the status of this?

It is *kind* of a blocker for T196353. Which is to say, using haswbstatement is the most expedient way to accomplish it, and it avoids having to add support for SPARQL queries, at least temporarily! If this is relatively simple to achieve, this might be the easiest solution?

But as noted above, just doing a regular search and then checking the properties might work as a workaround.
cc: @WMDE-leszek

Switching the statement data to lowercase_keyword would be easy, but that'd break all searches till full reindexing. Alternatively, we could add another field that would index as lowercase_keyword and once it is deployed and indexed, switch haswbstatement to use that field, then possibly switch the main field and switch haswbstatement back (or just leave it as is?). That would be much slower, but would not break search in the meantime.
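To make the two options concrete, here is a rough sketch of what an additional lowercase subfield could look like in Elasticsearch terms, written as Python dicts. Everything here is an assumption for illustration: the field name `statement_field`, the subfield name `lowercase`, and the analyzer definition are hypothetical, and the real `lowercase_keyword` definition in CirrusSearch may differ.

```python
# Hypothetical analyzer: keep the whole value as one token, then lowercase it.
settings = {
    "analysis": {
        "analyzer": {
            "lowercase_keyword": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase"],
            }
        }
    }
}

# Hypothetical mapping: the main field stays case-sensitive, and a
# multi-field (subfield) indexes the same value through the analyzer above.
mappings = {
    "properties": {
        "statement_field": {
            "type": "keyword",
            "fields": {
                "lowercase": {"type": "text", "analyzer": "lowercase_keyword"}
            },
        }
    }
}


def lowercase_keyword_tokens(value):
    """Approximate the analyzer in plain Python: one token, lowercased."""
    return [value.lower()]


tokens = lowercase_keyword_tokens("10.1371/journal.PCBI.1002947")
```

Because the analyzer is applied at index time, documents indexed before the change carry no lowercased tokens, which is why either option needs a reindex before it is useful.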

@dcausse @EBernhardson what do you think?

Change 514198 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/WikibaseCirrusSearch@master] Add case-insenstitive subfield for statement field

https://gerrit.wikimedia.org/r/514198

@Smalyshev switching the main field for statements to lowercase_keyword won't break anything: it's like a new field, and it'll be taken into account just after the next reindex. I would advise against a new field here, as the cardinality would nearly double.

We should also note that we index this data in the main filter field, which means that for searches that are unlikely to be ambiguous (IDs and such) one could simply search for 10.1371/journal.pcbi.1002947. The benefit is that it's tolerant of small variations in punctuation and also accepts partial searches like journal.pcbi.1002947, or even small variations such as journal pcbi 1002947.

So instead of giving up when there are no results, this kind of search could be tried if there is a human on the other end to select, accept, or validate a result.

@dcausse Agreed, but I think it still makes sense to make it case-insensitive, since most data there is either case-insensitive or at least case is rarely used there to distinguish between things (e.g. having pcbi.100123 in one item and PCBI.100123 in another is unlikely). So case-insensitive makes more sense to me, I guess.

@Smalyshev I totally agree, I was suggesting a UX where a first attempt search would try to match using the haswbstatement keyword (switched to case insensitive) and then a second try could be made using the fulltext mode if the first attempt is unsuccessful.
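A minimal sketch of that two-step flow, in Python. The assumptions are stated plainly: `run_search` is a hypothetical callable that takes a URL and returns a list of hits (HTTP and JSON handling are left out), `find_items` and `search_url` are illustrative names, and the property ID `P356` in the usage below is only an example.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def search_url(srsearch):
    """Build a list=search query URL for the given search string."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": srsearch,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def find_items(prop, value, run_search):
    """Try haswbstatement first, then fall back to a plain fulltext search.

    Returns a (mode, hits) pair so the caller knows whether the hits came
    from the exact statement match or from the looser fallback, which a
    human should validate.
    """
    hits = run_search(search_url(f"haswbstatement:{prop}={value}"))
    if hits:
        return "haswbstatement", hits
    return "fulltext", run_search(search_url(value))


# Stand-in for a real HTTP call: pretends the statement search finds nothing.
def fake_search(url):
    return [] if "haswbstatement" in url else ["Q1"]


mode, hits = find_items("P356", "10.1371/journal.PCBI.1002947", fake_search)
```

Returning the mode alongside the hits keeps the distinction between a precise match and a fallback result visible to whatever UI sits on top.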

Change 514198 merged by jenkins-bot:
[mediawiki/extensions/WikibaseCirrusSearch@master] Add case-insenstitive subfield for statement field

https://gerrit.wikimedia.org/r/514198

Also, I should point out that using an indexed search is *much* better performance-wise!

Now, after it's deployed, a reindex will be needed.

Yay! Thank you!

Do you know when the reindex will happen?

Since wmf.10 hasn't been deployed yet (T220735) and probably won't be this week, it will be sometime after that happens. I'll update the ticket then. It will probably take several days (after the train has been resumed), so realistically count on being able to use it sometime in July. I understand it's taking a long time, but the combination of no deployments followed by failed deployments is an unfortunate circumstance we have to adjust to.

No worries; I wasn't aware the train was broken and wondered if the reindexing had to be triggered manually or something. Thanks for the update!

We're going to try to deploy this first on test.wikidata, can this be reindexed as well? Or do they both get re-indexed at the same time?

I reindexed testwikidata last week; the patch should already be there.

debt claimed this task.