
Wikidata search suggestions do not display on screen if character whose decomposition contains nukta is present in search query
Closed, ResolvedPublic

Description

Most Indic-language sites and Commons, which use opensearch for search queries, appear to handle characters such as ढ़, য়, ਖ਼, and ଡ଼ (note that these are combined, i.e. not already decomposed into a consonant and a nukta) correctly when they are present in search queries, returning appropriate suggestions. (The bolding of the portion of each search suggestion corresponding to what was typed does not appear, but that is not as troublesome a matter.)

Wikidata's search functionality, wbsearchentities, returns the proper JSON response given a search containing the aforementioned characters, but the results are not rendered properly at all. In particular, the warning "The value passed for "search" contains invalid or non-normalized data. Textual data should be valid, NFC-normalized Unicode without C0 control characters other than HT (\t), LF (\n), and CR (\r)." is attached to the results. This causes either 1) the waiting icon to remain indefinitely, if it is a new search query, or 2) the previous results to remain, if it is a modification to another search query. A screenshot of 1) is presented below.

To see 1) for yourself, you can change your interface language to Bengali, copy the text "বিষয়শ্রেণী:" ("Category:" in Bengali) and paste it into Wikidata's search box, and see no category pages pop up. Change the "য়" in that word to "য + ়", after removing the two spaces and the plus from that quotation, and such category pages will appear. To see 2) for yourself, change the "য + ়" back to "য়" and add "উইকিপিডিয়া" (Wikipedia) to the end, and the results shown will not change.

This does not appear to be an issue with all characters for which a decomposition exists in the Unicode standard, as searches such as "Cañada" (where the "ñ" decomposes into "n + U+0303") do return suggestions properly, without any warning attached to the JSON response.

It appears that all of the characters I have mentioned are part of this list of characters excluded from composition per the Unicode Standard, and that they cannot ever occur in their respective normalization forms—if that information is in any way helpful.

If wbsearchentities does not already normalize the Unicode data passed to it in the "search" parameter, this is a real problem, since input methods for languages such as Bengali and Hindi do not always output letters containing nukta as separate characters.
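The composition-exclusion behavior described above is easy to check with JavaScript's built-in String.prototype.normalize. A minimal sketch using the characters from this report (nothing here is Wikidata-specific):

```javascript
// য় (U+09DF) is on Unicode's composition-exclusion list, so NFC leaves the
// decomposed sequence য + nukta (U+09AF U+09BC) as two code points, and even
// turns the precomposed character into the decomposed sequence.
const decomposed = '\u09AF\u09BC'; // য + ়
const precomposed = '\u09DF';      // য়

console.log(decomposed.normalize('NFC').length);      // 2 (unchanged)
console.log(precomposed.normalize('NFC').length);     // 2 (decomposed by NFC)

// By contrast, ñ is not composition-excluded: NFC composes n + U+0303
// into the single code point U+00F1.
console.log('n\u0303'.normalize('NFC') === '\u00F1'); // true
```

This means that for these nukta letters, the NFC-normal form is the decomposed one, which is why the warning about non-normalized data appears only for the precomposed input.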

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jul 16 2017, 10:20 PM
Mahir256 renamed this task from "Wikidata search suggestions do not return anything if a character containing nukta is present" to "Wikidata search suggestions do not return anything if a character whose decomposition contains nukta is present". · Jul 16 2017, 10:40 PM
Mahir256 triaged this task as Normal priority.
Mahir256 updated the task description. (Show Details)
Mahir256 raised the priority of this task from Normal to Needs Triage. · Jul 16 2017, 11:46 PM
Mahir256 renamed this task from "Wikidata search suggestions do not return anything if a character whose decomposition contains nukta is present" to "Wikidata search suggestions do not display on screen if character whose decomposition contains nukta is present in search query". · Jul 23 2017, 4:57 PM
Mahir256 updated the task description. (Show Details)
Restricted Application added a subscriber: PokestarFan. · View Herald Transcript · Jul 23 2017, 4:57 PM
Mahir256 updated the task description. (Show Details) · Jul 24 2017, 8:33 PM
debt triaged this task as Normal priority. · Oct 26 2017, 5:06 PM
debt added a subscriber: debt.

This should be automagically resolved with the latest switch to elasticsearch that was done earlier this year. @Mahir256 can you take another look and close if it's been resolved? Thanks!

Mahir256 added a comment (edited). · Oct 26 2017, 5:58 PM

@debt No, the results are simply not appearing. The JSON response is still normal, up to the warning I mentioned above, but the results do not render on screen. The symptoms as listed in the task description still persist. See the screenshot in the task description.

Perhaps this is a purely front-end issue?

TJones added a comment (edited). · Oct 27 2017, 4:30 PM

Note that you don’t need to change your interface to Bengali to see these effects, and the fact that it is the Bengali keyword for “category” doesn’t seem to matter either. You can search for single characters and get the described behavior.

(Be sure to clear the search box between examples; otherwise you will see the old results, as @Mahir256 noted above.)

For Bengali, Devanagari, and Gurmukhi nukta, the precomposed versions hang, and the decomposed versions get suggestions:

Script      Character   Codepoint(s)    Form         Result
Bengali     য়           U+09DF          precomposed  hangs
Bengali     য়           U+09AF U+09BC   decomposed   works
Devanagari  ग़           U+095A          precomposed  hangs
Devanagari  ग़           U+0917 U+093C   decomposed   works
Gurmukhi    ਗ਼           U+0A5A          precomposed  hangs
Gurmukhi    ਗ਼           U+0A17 U+0A3C   decomposed   works

Oddly, the opposite behavior happens for Latin, Cyrillic, and Greek characters—the precomposed versions work and the decomposed versions hang:

Script     Character  Codepoint(s)    Form         Result
Latin      ñ          U+00F1          precomposed  works
Latin      ñ          U+006E U+0303   decomposed   hangs
Latin      é          U+00E9          precomposed  works
Latin      é          U+0065 U+0301   decomposed   hangs
Latin      ở          U+1EDF          precomposed  works
Latin      ở          U+01A1 U+0309   decomposed   hangs
Cyrillic   Ѓ          U+0403          precomposed  works
Cyrillic   Ѓ          U+0413 U+0301   decomposed   hangs
Cyrillic   Ѐ          U+0400          precomposed  works
Cyrillic   Ѐ          U+0415 U+0300   decomposed   hangs
Cyrillic   Ѝ          U+040D          precomposed  works
Cyrillic   Ѝ          U+0418 U+0300   decomposed   hangs
Greek      ἆ          U+1F06          precomposed  works
Greek      ἆ          U+1F00 U+0342   decomposed   hangs

However, when there is no precomposed alternative, the decomposed version works fine (depending on your fonts, the mixed script versions may or may not look right):

Script              Character  Codepoint(s)    Form        Result
Latin               q́          U+0071 U+0301   decomposed  works
Latin               q̀          U+0071 U+0300   decomposed  works
Latin               q̃          U+0071 U+0303   decomposed  works
Latin               q̉          U+0071 U+0309   decomposed  works
Latin               q͂          U+0071 U+0342   decomposed  works
Latin + Bengali     q়          U+0071 U+09BC   decomposed  works
Latin + Devanagari  q़          U+0071 U+093C   decomposed  works
Latin + Gurmukhi    q਼          U+0071 U+0A3C   decomposed  works

So, I’m really not sure what’s going on here, but it looks like it is more than just Indic languages that have the problem, and there seems to be an “expected” form which works, and an “unexpected” form that doesn’t—and the (pre|de)composition difference can break in either direction for a given script.

Mahir256 updated the task description. (Show Details) · Oct 27 2017, 4:39 PM
Mahir256 removed a subscriber: PokestarFan.
debt moved this task from needs triage to Up Next on the Discovery-Search board. · Oct 27 2017, 9:31 PM

Gotcha, thanks for the verification, @Mahir256, and the investigation, @TJones; together those should give us the information we need to get this fixed.

Looking at the network trace when searching for ଡ଼, the result is returned (Q31441900) but for some reason is not displayed. The warning is there, but does not prevent the display of the result. I suspect the frontend, but will dig deeper.

Smalyshev added a comment (edited). · Nov 1 2017, 10:26 PM

In jquery.ui.suggester.js we've got this:

			if ( typeof requestTerm === 'string' && requestTerm !== self._term ) {
				// Skip request since it does not correspond to the current search term.
				return;
			}

So it looks like the term we get back in the result may not be the original one but a normalized one; this check then fails for the search above, so no results are ever shown.

Confirming, in the console:

> requestTerm.length
2
> self._term.length
1
> requestTerm == self._term
false
> requestTerm.normalize() == self._term.normalize()
true

So I guess the solution is to use normalization when comparing Unicode strings.

Smalyshev added a subscriber: hoo. · Nov 1 2017, 10:36 PM

Change 387959 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[data-values/value-view@master] Normalize search term before matching against result

https://gerrit.wikimedia.org/r/387959

TJones added a comment. · Nov 2 2017, 1:05 PM

@Smalyshev, thanks for tracking this one down! That was some weird behavior, but things getting normalized and not matching makes sense.

@Snaterlicious, @hoo, @thiemowmde Do you know why the check is there and what it is meant to be doing? @tstarling raised the following concern:

The search term is normalized by the server using $wgContLang->normalize(), which potentially includes transformations beyond NFC, especially if the content language is Arabic or Malayalam. So even if you do client-side NFC using the same version of Unicode as the server, there is at least a hypothetical possibility of a hang.
Smalyshev moved this task from Backlog to Waiting/Blocked on the User-Smalyshev board.
hoo added a comment. · Nov 5 2017, 7:01 PM

@Snaterlicious, @hoo, @thiemowmde Do you know why the check is there and what it is meant to be doing? @tstarling raised the following concern:

The search term is normalized by the server using $wgContLang->normalize(), which potentially includes transformations beyond NFC, especially if the content language is Arabic or Malayalam. So even if you do client-side NFC using the same version of Unicode as the server, there is at least a hypothetical possibility of a hang.

Replied on gerrit.

thiemowmde moved this task from incoming to in progress on the Wikidata board.
thiemowmde moved this task from Proposed to Monitoring on the Wikidata-Former-Sprint-Board board.

Change 390365 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/Wikibase@master] To identify superseded requests, use the requested search term instead of the returned search term

https://gerrit.wikimedia.org/r/390365

Change 390365 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] To identify superseded requests, use the requested search term instead of the returned search term

https://gerrit.wikimedia.org/r/390365

Change 387959 abandoned by Smalyshev:
Normalize search term before matching against result

Reason:
looks like a better solution has been found

https://gerrit.wikimedia.org/r/387959

@thiemowmde I think it should fix it but can't check since it's not deployed yet (not even on test). So I'd like to keep it open until we can verify the problem is indeed gone.

Seems to be working ok on test.wikidata.org, so closing. If it fails on wikidata when the train moves on, please reopen.

Smalyshev closed this task as Resolved. · Nov 15 2017, 8:56 PM
Mahir256 reopened this task as Open. · Nov 17 2017, 8:45 PM
Mahir256 updated the task description. (Show Details)

@Mahir256 Please note wmf.8 is still not deployed on www.wikidata.org - see https://phabricator.wikimedia.org/T178635.

Wikidata is still on MediaWiki 1.31.0-wmf.7, while the change is included in 1.31.0-wmf.8.

Mahir256 added a subscriber: greg. · Nov 17 2017, 9:07 PM

@Sjoerddebruin @Smalyshev @greg What’s up with https://tools.wmflabs.org/versions/ then? Are dewiki and wikidatawiki versions just not reported correctly there?

greg added a comment. · Nov 17 2017, 9:30 PM

@greg What’s up with https://tools.wmflabs.org/versions/ then? Are dewiki and wikidatawiki versions just not reported correctly there?

See T178635, tl;dr: database crash halted the train. We'll get wmf.8 everywhere on Monday.

wikidata is special right now with some stuff @Addshore is doing to kill the wikidata build. That should be up-to-date on Monday as well.

@greg Okay, thank you for the clarification. I will close this task after Monday, then. (Probably should fix the URL I linked to, though.)

Mahir256 closed this task as Resolved. · Nov 21 2017, 4:43 AM
Mahir256 updated the task description. (Show Details)

This problem is present on other wikis too. I have noticed it on enWS and bnWS: whenever a link containing these characters (like য়) points to a non-wiki site, it does not work, but it works fine when the link points to another wiki.

@Hrishikes this task is about search, and it has already been fixed. Please submit the linking issue as a separate task.