Replace term_search_key and term_weight with empty values when wb_terms is not used for search
Closed, ResolvedPublic
Actions

Description

This part is pretty much useless when elastic is being used for search.

Details

	Subject	Repo	Branch	Lines +/-
	Add options to control term_search_key+term_weight use	mediawiki/extensions/Wikibase	master	+287 -33

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	Addshore	T188992 Make wb_terms table fancy
Resolved	Lucas_Werkmeister_WMDE	T188993 Replace term_search_key and term_weight with empty values when wb_terms is not used for search
Resolved	Ladsgroup	T189776 Disable reading from term_search_key from wb_terms table in beta cluster and testwikidata
Resolved	Lucas_Werkmeister_WMDE	T189777 Disable reading from term_search_key from wb_terms table in wikidata
Declined	Ladsgroup	T189779 Run clearTermSqlIndexSearchFields on Wikidata
Resolved	Lucas_Werkmeister_WMDE	T191631 Add maintenance script to wipe term_search_key and term_weight columns

Event Timeline

Ladsgroup created this task.Mar 6 2018, 12:50 PM

Ladsgroup mentioned this in T188992: Make wb_terms table fancy.Mar 6 2018, 12:55 PM

Lucas_Werkmeister_WMDE renamed this task from Replace term_search_key and term_weight with empty values when wb_term is not used for search to Replace term_search_key and term_weight with empty values when wb_terms is not used for search.Mar 6 2018, 12:56 PM

Currently, term_search_key is still used for TermSqlIndex’ implementation of the LabelConflictFinder interface – to find entities with the same label and description (you’re not allowed to produce such conflicts when editing or creating entities). Before we can wipe the column, we need to decide what to do with this – we can either add another LabelConflictFinder implementation based on Cirrus/Elastic, or use term_text instead of term_search_key in conflict detection (and accept that “foo” and “foO” will no longer be detected as a conflict).

After that:

write empty values if search uses Cirrus/Elastic
add maintenance script to wipe the columns (in batches, etc.) – not yet sure if this should be part of update.php
add maintenance script to re-populate the columns and note in docs/options.wiki that admins who disable Cirrus/Elastic again must make sure to run this script

Before we can wipe the column, we need to decide what to do with this – we can either add another LabelConflictFinder implementation based on Cirrus/Elastic, or use term_text instead of term_search_key in conflict detection (and accept that “foo” and “foO” will no longer be detected as a conflict).

Some more information – using term_text doesn’t only mean case sensitivity in the search, it also means that «thé» and «thé» will be considered different (one uses a precomposed character, one a combining diacritic), or “foo” and “‎foo” (the second one starts with an LTR mark), or “foo” and “foo ”. The term_search_key normalization takes care of all that for us (Unicode normalization, removal of control characters, stripping of leading and trailing whitespace, and finally case conversion).

@Lydia_Pintscher any opinion on this? Do you think it’s acceptable to detect conflicts based only on the pure, non-normalized term text?

Yet more information :) @Ladsgroup pointed out that some of these normalizations are also performed by Wikibase on any term (not just for the search key), specifically Unicode normalization and whitespace stripping. However, removal of control characters doesn’t seem to be part of that – I can add the same alias twice, once with a leading LTR character, and the database will have two rows with the same term_search_key but different term_texts. But I guess that’s still acceptable.

Okay, that’s because StringNormalizer only removes category Cc (other, control) whereas TermSqlIndex also removes Cf (other, format), Cn (other, not assigned) and Cs (other, surrogate). I doubt that’s intentional… perhaps we should clean that up anyways.

Looking at list of Cf and searching for them in Wikidata, doesn't return anything. We are not using them and I think they should be dropped too.

In T188993#4036412, @Ladsgroup wrote:

Looking at list of Cf and searching for them in Wikidata, doesn't return anything. We are not using them and I think they should be dropped too.

I couldn't been more wrong, ZWNJ is one of them

Yeah, at least ZWNJ, ZWJ and ZWSP are in Cf and widely used. Also, due to our outdated PHP version some recently added codepoints for regular characters are misclassified as Cn (unassigned), we shouldn’t remove those either.

It’s probably acceptable to use the less normalized term_text for conflict search, though, so I think we can leave the normalization situation as it is for now.

TermPropertyLabelResolver also uses a TermIndex, so if I understand correctly that means there’s a small chance that a Lua module referring to a property by its non-normalized label will stop working…

I’d feel a lot more comfortable with this change if we split it into two phases – first stop using the columns, then stop writing them. We don’t have to maintain this separation forever, but for the initial deployment I’d like to be able to roll back this change without having to worry about database rows where the search key was removed while the change was in effect.

In T188993#4043154, @Lucas_Werkmeister_WMDE wrote:

I’d feel a lot more comfortable with this change if we split it into two phases – first stop using the columns, then stop writing them. We don’t have to maintain this separation forever, but for the initial deployment I’d like to be able to roll back this change without having to worry about database rows where the search key was removed while the change was in effect.

That is definitely part of the plan.

Also, I have no idea what to do in the client, where we also use TermSqlIndex but don’t have a useCirrus setting…

Perhaps this should be a separate setting and not automatically inferred from useCirrus, so that we can have it in both repo and client?

Lucas_Werkmeister_WMDE moved this task from Tasks to In Progress on the Wikidata-Ministry-Of-Magic board.Mar 12 2018, 2:50 PM

Change 418968 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] Add options to control term_search_key+term_weight use

https://gerrit.wikimedia.org/r/418968

gerritbot added a project: Patch-For-Review.Mar 12 2018, 7:05 PM

Lucas_Werkmeister_WMDE moved this task from In Progress to Needs Review on the Wikidata-Ministry-Of-Magic board.Mar 12 2018, 7:06 PM

Change 418968 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add options to control term_search_key+term_weight use

https://gerrit.wikimedia.org/r/418968

ReleaseTaggerBot added a project: MW-1.31-release-notes (WMF-deploy-2018-03-20 (1.31.0-wmf.26)).Mar 15 2018, 10:00 AM

Ladsgroup moved this task from Needs Review to Done on the Wikidata-Ministry-Of-Magic board.Mar 15 2018, 10:20 AM

Still to be done:

enable those config settings on wikidata (and testwikidata, beta, etc.)
if everything goes okay, disable the “force writing” setting again
run maintenance script to kill term_search_key and term_weight. I think rebuildTermSqlIndex.php might do the job, but that should be tested manually. rebuildTermsSearchKey always writes the full search key afaict, so that’s not what we want (perhaps it should be updated)

but those should probably be separate tasks.

Ladsgroup mentioned this in T189776: Disable reading from term_search_key from wb_terms table in beta cluster and testwikidata.Mar 15 2018, 12:47 PM

Ladsgroup mentioned this in T189777: Disable reading from term_search_key from wb_terms table in wikidata.Mar 15 2018, 12:50 PM