|Open||None||T46529 Wikidata search problems (tracking)|
|Resolved||debt||T206613 Search of wikidata string property values using haswbstatement is case sensitive|
- Mentioned In
  - T199197: [2.11] Integrate Citoid in Wikidata
  - T147505: [Recurring task] CirrusSearch: what is updated during re-indexing
  - rEWCSbaae3a57faf6: Add case-insenstitive subfield for statement field
  - rEWCS8e1a9238c49d: Add case-insenstitive subfield for statement field
  - rEWCSa911da151614: Add case-insenstitive subfield for statement field
- Mentioned Here
  - T227136: Reindexing search index wikidatawiki for eqiad fails
  - T220735: 1.34.0-wmf.10 deployment blockers
  - T196353: Add citoid support for WikiBase to the Citoid extension
I don't think removing case sensitivity would be a lot of manual work, but changing the index will require a reindex. I'm not sure why we decided to make it case-sensitive. I'll try to figure it out, and if there's no reason, we can change it. Note that this will apply to all fields, so if there are properties where case does matter it may get things wrong.
As long as it's possible to get the original case from the API, you can remove false positives for properties where case does matter by doing another call for each result and comparing for equality. Whereas if there are no results, then it's very hard to get a result, as you have to try every case permutation: all lower, all upper, camel case, sentence case, title case or completely random :). That said, string values are now available from the general search, which does work, so maybe there isn't a need? E.g. https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=10.1371/journal.PCBI.1002947 works.
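A rough sketch of that workaround, in case it helps: build the general search URL, fetch the candidate entities separately, then compare the stored string values yourself. It assumes the entities arrive in the usual `wbgetentities` JSON shape (`claims` → property → `mainsnak` → `datavalue` → `value`). The function names are made up for illustration, and no HTTP call is made here.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def search_url(value, limit=10):
    """Build a general full-text search URL for a string value."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": value,
        "srlimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def string_values(entity, property_id):
    """Yield the string values of one property from a wbgetentities-style entity."""
    for claim in entity.get("claims", {}).get(property_id, []):
        # novalue/somevalue snaks carry no datavalue, so skip them.
        datavalue = claim.get("mainsnak", {}).get("datavalue")
        if datavalue and isinstance(datavalue.get("value"), str):
            yield datavalue["value"]


def matching_ids(entities, property_id, wanted, case_sensitive=False):
    """Return the ids of entities whose property holds `wanted`.

    `entities` is a mapping of entity id to entity JSON. With
    case_sensitive=True this is the post-filter that drops false
    positives for properties where case matters.
    """
    fold = (lambda s: s) if case_sensitive else str.casefold
    target = fold(wanted)
    return [
        entity_id
        for entity_id, entity in entities.items()
        if any(fold(v) == target for v in string_values(entity, property_id))
    ]
```

Calling `matching_ids(entities, "P356", doi)` keeps every candidate whose DOI matches regardless of case; passing `case_sensitive=True` keeps only exact matches.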
What is the status of this?
It is *kind* of a blocker for T196353. Which is to say, using haswbstatement is the most expedient way to accomplish it, since it avoids having to add support for SPARQL queries, at least temporarily! If this is relatively simple to achieve, it might be the easiest solution?
But as noted above, just doing a regular search and then checking the properties might work as a workaround.
Switching the statement data to lowercase_keyword would be easy, but that'd break all searches until a full reindex has completed. Alternatively, we could add another field that would be indexed as lowercase_keyword and, once it is deployed and indexed, switch haswbstatement to use that field, then possibly switch the main field and switch haswbstatement back (or just leave it as is?). That would be much slower, but would not break search in the meantime.
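To make the second option concrete, here is a sketch of what the extra subfield and the query switch might look like, written as plain Python dicts in Elasticsearch request-body form. The field name `statement_keywords`, the subfield name `lowercase`, and the `P356=value` token format are assumptions for illustration, not taken from the actual patch, and the mapping uses the typeless layout of recent Elasticsearch versions.

```python
import json

# Assumed names, for illustration only.
STATEMENT_FIELD = "statement_keywords"
SUBFIELD = "lowercase"


def index_body():
    """Settings and mapping with an added case-insensitive subfield."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Keep the whole value as a single token, lowercased.
                    "lowercase_keyword": {
                        "type": "custom",
                        "tokenizer": "keyword",
                        "filter": ["lowercase"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                STATEMENT_FIELD: {
                    # The existing exact-match field stays untouched.
                    "type": "keyword",
                    "fields": {
                        SUBFIELD: {
                            "type": "text",
                            "analyzer": "lowercase_keyword",
                        }
                    },
                }
            }
        },
    }


def statement_query(property_id, value, use_subfield):
    """Query the exact field, or the new subfield once it is populated."""
    statement = f"{property_id}={value}"
    if use_subfield:
        # match runs the query text through the subfield's analyzer,
        # so the input is lowercased the same way as the indexed value.
        return {"match": {f"{STATEMENT_FIELD}.{SUBFIELD}": {"query": statement}}}
    return {"term": {STATEMENT_FIELD: statement}}


if __name__ == "__main__":
    print(json.dumps(index_body(), indent=2))
```

Keeping `use_subfield` off until the reindex has finished is what keeps existing searches working in the meantime.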
We should also note that we index this data in the main filter field, which means that for searches that are unlikely to be ambiguous (IDs and such) one could simply search for 10.1371/journal.pcbi.1002947. The benefit is that it's tolerant of small variations in punctuation and also accepts partial searches like:
journal.pcbi.1002947 or even with small variations: journal pcbi 1002947.
So instead of giving up with no results, these kinds of searches could be tried if there is a human behind it to select/accept/validate a result.
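A small sketch of that fallback order: try the strict haswbstatement query first, then the plain value, then a variant with punctuation replaced by spaces. The helper name is hypothetical, and the results of the looser queries would still need a human (or a property check) to confirm them.

```python
import re


def fallback_queries(property_id, value):
    """Return search strings ordered from strictest to loosest, no duplicates."""
    # Replace every run of punctuation with a single space.
    loose = " ".join(re.findall(r"[A-Za-z0-9]+", value))
    candidates = [
        f"haswbstatement:{property_id}={value}",  # exact statement match
        value,                                    # plain full-text search
        loose,                                    # punctuation-free variant
    ]
    # dict.fromkeys keeps the first occurrence of each query, in order.
    return list(dict.fromkeys(query for query in candidates if query))


if __name__ == "__main__":
    for query in fallback_queries("P356", "journal.pcbi.1002947"):
        print(query)
```

Each string can be passed as `srsearch` to the same `action=query&list=search` API call, stopping at the first query that returns results.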
@dcausse agreed, but I think it still makes sense to make it case-insensitive, since most data there is either case-insensitive or, at least, case is rarely used to distinguish between things (i.e. having pcbi.100123 in one item and PCBI.100123 in another is not likely). So insensitive makes more sense to me, I guess.
Since wmf.10 hasn't been deployed yet (T220735) and probably won't be this week - sometime after that happens. I'll update the ticket then. It will probably take several days (after the train has resumed), so realistically count on starting to use it sometime in July. I understand it's taking long, but the combination of no deployments and then failed deployments is an unfortunate circumstance we have to adjust for.