Maniphest T206613

Search of wikidata string property values using haswbstatement is case sensitive
Closed, Resolved (Public)

Authored By

	Mvolz
	Oct 10 2018, 10:40 AM

Description

For example, https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=haswbstatement:P356=10.1371/JOURNAL.PCBI.1002947 works, but

https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=haswbstatement:P356=10.1371/journal.pcbi.1002947 does not.

Maybe this is as desired?
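
For anyone reproducing this from a script, the two queries above could be built roughly as follows. This is only a sketch: the helper names `haswbstatement_url` and `srsearch_of` are made up for illustration, and only the `action`, `list` and `srsearch` parameters shown in the URLs above are used. Note that `urlencode` percent-encodes `:`, `=` and `/`, which the API decodes back to the same search string.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def haswbstatement_url(property_id: str, value: str,
                       endpoint: str = API_ENDPOINT) -> str:
    """Build a search URL for items with the statement property_id=value."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"haswbstatement:{property_id}={value}",
    }
    return f"{endpoint}?{urlencode(params)}"


def srsearch_of(url: str) -> str:
    """Recover the decoded srsearch parameter from a URL (for checking)."""
    return parse_qs(urlsplit(url).query)["srsearch"][0]


# The two queries from the description differ only in letter case.
upper_url = haswbstatement_url("P356", "10.1371/JOURNAL.PCBI.1002947")
lower_url = haswbstatement_url("P356", "10.1371/journal.pcbi.1002947")
```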

Details

	Subject	Repo	Branch	Lines +/-
	Add case-insenstitive subfield for statement field	mediawiki/extensions/WikibaseCirrusSearch	master	+4 -2


Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T46529 Wikidata search problems (tracking)
		Resolved		debt	T206613 Search of wikidata string property values using haswbstatement is case sensitive

Event Timeline

Mvolz triaged this task as Medium priority. Oct 10 2018, 10:40 AM

Mvolz created this task.

Restricted Application edited projects, added Discovery-Search; removed Discovery-Search (Current work). Oct 10 2018, 10:40 AM

Mvolz edited parent tasks, added: T46529: Wikidata search problems (tracking); removed: T163642: Index Wikidata strings in statements for fulltext search. Oct 10 2018, 10:40 AM

EBjune moved this task from needs triage to Up Next on the Discovery-Search board. Oct 11 2018, 5:04 PM

@Smalyshev what do you think? I haven't run into this myself. My feeling is that case insensitive is probably better, but would that require a lot of work?

I don't think removing case sensitivity would be a lot of manual work, but it will require a reindex to change the index. I'm not sure why we decided on it being case-sensitive, I'll try to figure it out and if there's no reason we can change it. Note that this will apply for all fields, so if there are properties where case does matter it may get things wrong.

In T206613#4664719, @Smalyshev wrote:

I don't think removing case sensitivity would be a lot of manual work, but it will require a reindex to change the index. I'm not sure why we decided on it being case-sensitive, I'll try to figure it out and if there's no reason we can change it. Note that this will apply for all fields, so if there are properties where case does matter it may get things wrong.

As long as it's possible to get the original case from the API, you can remove false positives when case does matter by making another call for each result and then comparing for equality. Whereas if there are no results, then it's very hard to get a result, as you have to try every case permutation: all lower, all upper, camel case, sentence case, title case or completely random :). That said, string values are now available from the general search, which does work, so maybe there isn't a need? E.g. https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=10.1371/journal.PCBI.1002947 works.
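
To make the asymmetry in the comment above concrete, here is a rough sketch. It assumes the string values of each candidate item have already been fetched into a dict. The item IDs `Q1`/`Q2` and both function names are hypothetical, not real data or API calls.

```python
from typing import Dict, List


def exact_case_matches(candidates: Dict[str, List[str]], wanted: str) -> List[str]:
    """Post-filter case-insensitive hits: keep items storing `wanted` exactly."""
    return [item_id for item_id, values in candidates.items() if wanted in values]


def case_variant_count(value: str) -> int:
    """Count the upper/lower spellings a client would have to try otherwise."""
    cased = sum(1 for ch in value if ch.lower() != ch.upper())
    return 2 ** cased


# Hypothetical results of a case-insensitive search, with their stored values.
candidates = {
    "Q1": ["10.1371/journal.pcbi.1002947"],
    "Q2": ["10.1371/JOURNAL.PCBI.1002947"],
}
strict_hits = exact_case_matches(candidates, "10.1371/JOURNAL.PCBI.1002947")
variants = case_variant_count("10.1371/journal.pcbi.1002947")
```

Filtering is one comparison per candidate, while guessing the stored case of this DOI means up to 2^11 spellings, since it contains eleven letters.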

Smalyshev moved this task from Up Next to Current work on the Discovery-Search board. Nov 13 2018, 6:30 PM

Smalyshev edited projects, added Discovery-Search (Current work); removed Discovery-Search.

EBernhardson moved this task from Incoming to Waiting on the Discovery-Search (Current work) board. Jan 15 2019, 11:22 PM

Smalyshev moved this task from Waiting to Incoming on the Discovery-Search (Current work) board. Jan 29 2019, 10:43 PM

Smalyshev edited projects, added Discovery-Search; removed Discovery-Search (Current work). Jan 30 2019, 8:47 PM

Smalyshev moved this task from needs triage to Wikibase Search on the Discovery-Search board.

What is the status of this?

It is *kind* of a blocker for T196353. Which is to say, using haswbstatement is the most expedient way to accomplish it, which avoids having to add support for SPARQL queries, at least temporarily! If this is relatively simple to achieve, this might be the easiest solution?

But as noted above, just doing a regular search and then checking the properties might work as a workaround.
cc: @WMDE-leszek

Smalyshev moved this task from Backlog to Doing on the User-Smalyshev board. Jun 3 2019, 8:36 PM

Switching the statement data to lowercase_keyword would be easy, but that'd break all searches till full reindexing. Alternatively, we could add another field that would index as lowercase_keyword and once it is deployed and indexed, switch haswbstatement to use that field, then possibly switch the main field and switch haswbstatement back (or just leave it as is?). That would be much slower, but would not break search in the meantime.

@dcausse @EBernhardson what do you think?
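
As a rough illustration of the difference between an exact keyword field and a lowercased one, here is a toy in-memory model. It is not Elasticsearch or WikibaseCirrusSearch code: `ToyIndex` and the item ID `Q1` are invented for the sketch. The point it shows is that the normalizer is applied to stored terms at index time as well as to the query, which is why already-indexed documents need a reindex before a new normalizer applies to them.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


class ToyIndex:
    """Maps normalized statement strings to the items that contain them."""

    def __init__(self, normalize: Callable[[str], str]) -> None:
        self._normalize = normalize
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, item_id: str, statement: str) -> None:
        # The normalizer runs at index time, so stored terms depend on it.
        self._postings[self._normalize(statement)].add(item_id)

    def search(self, statement: str) -> List[str]:
        # The same normalizer runs on the query before the exact lookup.
        return sorted(self._postings.get(self._normalize(statement), set()))


exact = ToyIndex(lambda s: s)   # behaves like a plain keyword field
lowered = ToyIndex(str.lower)   # behaves like a lowercased keyword field

for index in (exact, lowered):
    index.add("Q1", "P356=10.1371/JOURNAL.PCBI.1002947")

query = "P356=10.1371/journal.pcbi.1002947"
exact_hits = exact.search(query)
lowered_hits = lowered.search(query)
```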

Change 514198 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/WikibaseCirrusSearch@master] Add case-insenstitive subfield for statement field

https://gerrit.wikimedia.org/r/514198

gerritbot added a project: Patch-For-Review. Jun 4 2019, 12:04 AM

Diffusion mentioned this in rEWCSa911da151614: Add case-insenstitive subfield for statement field. Jun 4 2019, 12:05 AM

@Smalyshev switching the main field for statements to lowercase_keyword won't break anything: it's like a new field, and it'll be taken into account just after the next reindex. I would advise against a new field here, as the cardinality would nearly double.

We should also note that we index this data in the main filter field, which means that for searches that are unlikely to be ambiguous (IDs and such) one could simply search for 10.1371/journal.pcbi.1002947. The benefit is that it's tolerant of small variations in punctuation and also accepts partial searches like journal.pcbi.1002947, or even with small variations: journal pcbi 1002947.

So instead of giving up with no results, this kind of search could be tried if there is a human behind it to select/accept/validate a result.

Smalyshev moved this task from Wikibase Search to Current work on the Discovery-Search board. Jun 4 2019, 6:56 AM

Smalyshev edited projects, added Discovery-Search (Current work); removed Discovery-Search.

@dcausse agreed, but I think it still makes sense to make it case insensitive, since most data there is either case-insensitive or at least case is rarely used there to distinguish between things (e.g. having pcbi.100123 in one item and PCBI.100123 in another is not likely). So insensitive makes more sense to me, I guess.

@Smalyshev I totally agree, I was suggesting a UX where a first attempt search would try to match using the haswbstatement keyword (switched to case insensitive) and then a second try could be made using the fulltext mode if the first attempt is unsuccessful.
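
The two-step lookup suggested above might look something like this on the client side. It is a sketch only: `search_with_fallback` and the stub `fake_search` are hypothetical, and `search` stands in for whatever function actually sends an `srsearch` string to the API and returns matching item IDs.

```python
from typing import Callable, List, Tuple

SearchFn = Callable[[str], List[str]]


def search_with_fallback(search: SearchFn, property_id: str,
                         value: str) -> Tuple[str, List[str]]:
    """Try the precise statement search first, then plain fulltext."""
    hits = search(f"haswbstatement:{property_id}={value}")
    if hits:
        return "statement", hits
    # Fulltext hits are fuzzier, so a human should validate them.
    return "fulltext", search(value)


def fake_search(srsearch: str) -> List[str]:
    """Stub backend: only the fulltext form of one query finds anything."""
    return ["Q1"] if srsearch == "journal pcbi 1002947" else []


mode, results = search_with_fallback(fake_search, "P356", "journal pcbi 1002947")
```

Returning the mode alongside the results lets the caller decide whether a fulltext hit needs to be confirmed by a person before it is accepted.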

Diffusion mentioned this in rEWCS8e1a9238c49d: Add case-insenstitive subfield for statement field. Jun 4 2019, 4:07 PM

Diffusion mentioned this in rEWCSbaae3a57faf6: Add case-insenstitive subfield for statement field. Jun 4 2019, 6:35 PM

Smalyshev moved this task from Incoming to Needs review on the Discovery-Search (Current work) board. Jun 4 2019, 8:03 PM

Change 514198 merged by jenkins-bot:
[mediawiki/extensions/WikibaseCirrusSearch@master] Add case-insenstitive subfield for statement field

https://gerrit.wikimedia.org/r/514198

ReleaseTaggerBot added a project: MW-1.34-notes (1.34.0-wmf.10; 2019-06-18). Jun 5 2019, 12:00 AM

Now after it's deployed reindex will be needed.

Smalyshev moved this task from Doing to Waiting/Blocked on the User-Smalyshev board. Jun 5 2019, 4:43 AM

Also I should point out that using an indexed search is *much* better performance wise!

In T206613#5235476, @Smalyshev wrote:

Now after it's deployed reindex will be needed.

Yay! Thank you!

In T206613#5235476, @Smalyshev wrote:

Now after it's deployed reindex will be needed.

Do you know when this will be?

Since wmf.10 hasn't been deployed yet (T220735) and probably won't be this week, it will be sometime after that happens. I'll update the ticket then. It will probably take several days (after the train has been resumed), so realistically count on starting to use it sometime in July. I understand it's taking long, but the combination of no deployments and then failed deployments is an unfortunate circumstance we have to adjust for.

No worries; I wasn't aware the train was broken and wondered if the reindexing had to be triggered manually or something, thanks for the update!

Mvolz mentioned this in T199197: [2.11] Integrate Citoid in Wikidata. Jun 27 2019, 9:13 AM

We're going to try to deploy this first on test.wikidata, can this be reindexed as well? Or do they both get re-indexed at the same time?

I've reindexed testwikidata last week, the patch should already be there.

In T206613#5311274, @Smalyshev wrote:

I've reindexed testwikidata last week, the patch should already be there.

Oh great, looks like it works!

https://test.wikidata.org/w/api.php?action=query&list=search&srsearch=haswbstatement:P168=10.1371/journal.pcbi.1002947
https://test.wikidata.org/w/api.php?action=query&list=search&srsearch=haswbstatement:P168=10.1371/JOURNAL.pcbi.1002947

Reindexing on eqiad is currently blocked by T227136: Reindexing search index wikidatawiki for eqiad fails.

dcausse moved this task from Waiting to Needs Reporting on the Discovery-Search (Current work) board. Jul 18 2019, 1:48 PM

Smalyshev moved this task from Waiting/Blocked to Done on the User-Smalyshev board. Jul 18 2019, 5:43 PM

Mvolz awarded a token. Jul 19 2019, 10:54 AM

debt closed this task as Resolved. Jul 19 2019, 3:44 PM

debt claimed this task.

Mvolz mentioned this in T232565: case-sensitive equivalent of haswbstatement. Nov 4 2019, 2:52 PM
