Page MenuHomePhabricator

New term store doesn't provide case insensitive search (for when elastic is not used)
Open, LowPublic

Description

The old term store (the beautiful wb_terms table) is used to empower search when elastic is not used (some third parties, etc.). In order to support that, the table has two extra columns:

MariaDB [wikidatawiki_p]> select * from wb_terms where term_language = 'en' and term_search_key != '' and term_text = 'Berlin' and term_type = 'label' and term_full_entity_id = 'Q64' limit 5;
+-------------+----------------+---------------------+------------------+---------------+-----------+-----------+-----------------+-------------+
| term_row_id | term_entity_id | term_full_entity_id | term_entity_type | term_language | term_type | term_text | term_search_key | term_weight |
+-------------+----------------+---------------------+------------------+---------------+-----------+-----------+-----------------+-------------+
|  1977063762 |             64 | Q64                 | item             | en            | label     | Berlin    | berlin          |        0.29 |
+-------------+----------------+---------------------+------------------+---------------+-----------+-----------+-----------------+-------------+
1 row in set (0.10 sec)

Note the difference between term_text and term_search_key plus a random-looking number called term_weight

While implementing read usecases, I realized the old term store (if you set the "use this table for search" flag to true) turn the requested search key to lower case and normalizes them (like removing trailing space, etc.) and then does an exact match query but we don't have such functionality built in the new term store meaning Wikibase installations without elastic would not have proper search, for example if you search for "berlin" instead of "Berlin", you would not get a result. Another problem also arises for lack of term_weight column in the new term store meaning if you search for "Berlin" and if there are multiple items with English label of "Berlin", you would get the result on random order by in the old term store, you would get something like this:

MariaDB [wikidatawiki_p]> select * from wb_terms where term_language = 'en' and term_search_key != '' and term_text = 'Berlin' and term_type = 'label' order by term_weight desc;
+-------------+----------------+---------------------+------------------+---------------+-----------+-----------+-----------------+-------------+
| term_row_id | term_entity_id | term_full_entity_id | term_entity_type | term_language | term_type | term_text | term_search_key | term_weight |
+-------------+----------------+---------------------+------------------+---------------+-----------+-----------+-----------------+-------------+
|  1977063762 |             64 | Q64                 | item             | en            | label     | Berlin    | berlin          |        0.29 |
|  1092206185 |       31910788 | Q31910788           | item             | en            | label     | Berlin    | berlin          |       0.002 |
|  1204679787 |       19141260 | Q19141260           | item             | en            | label     | Berlin    | berlin          |       0.002 |
|  1204683603 |       19144153 | Q19144153           | item             | en            | label     | Berlin    | berlin          |       0.002 |
|  1204686119 |       19141273 | Q19141273           | item             | en            | label     | Berlin    | berlin          |       0.002 |
|  1704045728 |       42187298 | Q42187298           | item             | en            | label     | Berlin    | berlin          |       0.001 |
+-------------+----------------+---------------------+------------------+---------------+-----------+-----------+-----------------+-------------+
6 rows in set (0.01 sec)

(Q64 first)

The weight is calculated using this code:

	private function getWeight( EntityDocument $entity ) {
		$weight = 0.0;

		if ( $entity instanceof LabelsProvider ) {
			$weight = max( $weight, $entity->getLabels()->count() / 1000.0 );
		}

		if ( $entity instanceof Item ) {
			$weight = max( $weight, $entity->getSiteLinkList()->count() / 1000.0 );
		}

		return $weight;
	}

Possible solutions:

  • Leave it as it is, the search would not work properly in third party installations but given that we recommend docker for that and docker has elastic, this should not be super big.
  • Add another column for the search key in wbt_text_in_lang and reuse wbt_text table for the search key values. Lots of work, it doesn't solve the term_weight problem, the clean up logic will be fun.
  • Add another table, practically wb_terms_for_search_:( but keep it for third parties only.
  • Keep wb_terms forever but just for third parties. Maintaining it would be fun.

Event Timeline

So, I remember the decision being made about repos without elastic not having term weight, or any fancy sorting, and that was agreed on by product (at some point, no idea where that is documented).
As for the case sensitivity, I believe that is something that has been overlooked / not discussed before.

Lucas_Werkmeister_WMDE renamed this task from New term store doesn't properly work with search when elasitc is not used to New term store doesn't properly work with search when elastic is not used.Jan 13 2020, 6:43 PM
Addshore renamed this task from New term store doesn't properly work with search when elastic is not used to New term store doesn't provide case insensitive search (for when elastic is not used).Jan 14 2020, 9:03 AM

This probably doesn't need to happen urgently.
This will only actually hit people once we get to doing T242723: Use new terms store by default on Wikibase installs

This probably doesn't need to happen urgently.
This will only actually hit people once we get to doing T242723: Use new terms store by default on Wikibase installs

Unless the paths that still use this little bit of search on wikdiata.org need the case insensitivity...

This probably doesn't need to happen urgently.
This will only actually hit people once we get to doing T242723: Use new terms store by default on Wikibase installs

Unless the paths that still use this little bit of search on wikdiata.org need the case insensitivity...

That's not the case, the flag to use the term_search_key is set to false for Wikidata so all searches are case sensitive, the value of the field in lots of cases is set to empty string, it would have exploded majestically otherwise.

WMDE-leszek added a subscriber: Samantha_Alipio_WMDE.

Having briefly discussed the topic with @Samantha_Alipio_WMDE and @Lydia_Pintscher: For the time being let's continue with the limited capabilities of the non-ElasticSearch-powered search, i.e. only case sensitive exact-matching search will be provided in the environment without ElasticSearch configured.
Product Management will revisit the topic when circumstances require.

Addshore moved this task from Incoming to Needs Work on the Wikidata-Campsite board.

As this is currently on the campsite for now I'll add it to the needs product column to determine what if anything we want to do and when.
Regarding a timeline for Wikibase and support of insensitive search, as it stands when 1.35 is released it will still support insensitive search by default out of the box.