
Wikidata search suggestions do not display on screen if character whose decomposition contains nukta is present in search query
Closed, ResolvedPublic

Description

Most Indic-language sites and Commons, which use opensearch for search queries, appear to handle characters such as ढ़, য়, ਖ਼, and ଡ଼ (note that these are combined, i.e. not already decomposed into a consonant and a nukta) correctly when they are present in search queries, returning appropriate suggestions. (The bolding of the portion of each search suggestion corresponding to what was typed does not appear, but that is not as troublesome a matter.)

Wikidata's search functionality, wbsearchentities, returns the proper JSON response given a search containing the aforementioned characters, but the results are not rendered properly at all. In particular, the warning "The value passed for "search" contains invalid or non-normalized data. Textual data should be valid, NFC-normalized Unicode without C0 control characters other than HT (\t), LF (\n), and CR (\r)." is attached to the results. This causes either 1) the waiting icon to remain indefinitely, if it is a new search query, or 2) the previous results to remain, if it is a modification to another search query. A screenshot of 1) is presented below.

To see 1) for yourself, you can change your interface language to Bengali, copy the text "বিষয়শ্রেণী:" ("Category:" in Bengali) and paste it into Wikidata's search box, and see no category pages pop up. Change the "য়" in that word to "য + ়", after removing the two spaces and the plus from that quotation, and such category pages will appear. To see 2) for yourself, change the "য + ়" back to "য়" and add "উইকিপিডিয়া" (Wikipedia) to the end, and the results shown will not change.

This does not appear to be an issue with all characters for which a decomposition exists in the Unicode standard, as searches such as "Cañada" (where the "ñ" decomposes into "n + U+0303") do return suggestions properly, without any warning attached to the JSON response.

It appears that all of the characters I have mentioned are part of this list of characters excluded from composition per the Unicode Standard, and that they cannot ever occur in their respective normalization forms—if that information is in any way helpful.

If wbsearchentities does not already normalize the Unicode data passed to it in the "search" parameter, this is a real problem, since input methods for languages such as Bengali and Hindi do not always output letters containing nukta as separate characters.
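The composition-exclusion behavior described above is easy to check with JavaScript's built-in String.prototype.normalize. A minimal sketch using the characters from this report (nothing here is Wikidata-specific):

```javascript
// য় (U+09DF) is on Unicode's composition-exclusion list, so NFC leaves the
// decomposed sequence য + nukta (U+09AF U+09BC) as two code points, and even
// turns the precomposed character into the decomposed sequence.
const decomposed = '\u09AF\u09BC'; // য + ়
const precomposed = '\u09DF';      // য়

console.log(decomposed.normalize('NFC').length);      // 2 (unchanged)
console.log(precomposed.normalize('NFC').length);     // 2 (decomposed by NFC)

// By contrast, ñ is not composition-excluded: NFC composes n + U+0303
// into the single code point U+00F1.
console.log('n\u0303'.normalize('NFC') === '\u00F1'); // true
```

This means that for these nukta letters, the NFC-normal form is the decomposed one, which is why the warning about non-normalized data appears only for the precomposed input.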

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jul 16 2017, 10:20 PM
Mahir256 renamed this task from "Wikidata search suggestions do not return anything if a character containing nukta is present" to "Wikidata search suggestions do not return anything if a character whose decomposition contains nukta is present". · Jul 16 2017, 10:40 PM
Mahir256 triaged this task as Normal priority.
Mahir256 updated the task description. (Show Details)
Mahir256 raised the priority of this task from Normal to Needs Triage. · Jul 16 2017, 11:46 PM
Mahir256 renamed this task from "Wikidata search suggestions do not return anything if a character whose decomposition contains nukta is present" to "Wikidata search suggestions do not display on screen if character whose decomposition contains nukta is present in search query". · Jul 23 2017, 4:57 PM
Mahir256 updated the task description. (Show Details)
Restricted Application added a subscriber: PokestarFan. · View Herald Transcript · Jul 23 2017, 4:57 PM
Mahir256 updated the task description. (Show Details) · Jul 24 2017, 8:33 PM
debt triaged this task as Normal priority. · Oct 26 2017, 5:06 PM
debt added a subscriber: debt.

This should be automagically resolved with the latest switch to elasticsearch that was done earlier this year. @Mahir256 can you take another look and close if it's been resolved? Thanks!

Mahir256 added a comment (edited). · Oct 26 2017, 5:58 PM

@debt No, the results are simply not appearing. The JSON response is still normal, up to the warning I mentioned above, but the results do not render on screen. The symptoms as listed in the task description still persist. See the screenshot in the task description.

Perhaps this is a purely front-end issue?

TJones added a comment (edited). · Oct 27 2017, 4:30 PM

Note that you don’t need to change your interface to Bengali to see these effects, and the fact that it is the Bengali keyword for “category” doesn’t seem to matter either. You can search for single characters and get the described behavior.

(Be sure to clear the search box between examples; otherwise you will see the old results, as @Mahir256 noted above.)

For Bengali, Devanagari, and Gurmukhi nukta, the precomposed versions hang, and the decomposed versions get suggestions:

Script      Character   Codepoint(s)    Form         Result
Bengali     য়           U+09DF          precomposed  hangs
Bengali     য়           U+09AF U+09BC   decomposed   works
Devanagari  ग़           U+095A          precomposed  hangs
Devanagari  ग़           U+0917 U+093C   decomposed   works
Gurmukhi    ਗ਼           U+0A5A          precomposed  hangs
Gurmukhi    ਗ਼           U+0A17 U+0A3C   decomposed   works

Oddly, the opposite behavior happens for Latin, Cyrillic, and Greek characters—the precomposed versions work and the decomposed versions hang:

Script     Character  Codepoint(s)    Form         Result
Latin      ñ          U+00F1          precomposed  works
Latin      ñ          U+006E U+0303   decomposed   hangs
Latin      é          U+00E9          precomposed  works
Latin      é          U+0065 U+0301   decomposed   hangs
Latin      ở          U+1EDF          precomposed  works
Latin      ở          U+01A1 U+0309   decomposed   hangs
Cyrillic   Ѓ          U+0403          precomposed  works
Cyrillic   Ѓ          U+0413 U+0301   decomposed   hangs
Cyrillic   Ѐ          U+0400          precomposed  works
Cyrillic   Ѐ          U+0415 U+0300   decomposed   hangs
Cyrillic   Ѝ          U+040D          precomposed  works
Cyrillic   Ѝ          U+0418 U+0300   decomposed   hangs
Greek      ἆ          U+1F06          precomposed  works
Greek      ἆ          U+1F00 U+0342   decomposed   hangs

However, when there is no precomposed alternative, the decomposed version works fine (depending on your fonts, the mixed script versions may or may not look right):

Script              Character  Codepoint(s)    Form        Result
Latin               q́          U+0071 U+0301   decomposed  works
Latin               q̀          U+0071 U+0300   decomposed  works
Latin               q̃          U+0071 U+0303   decomposed  works
Latin               q̉          U+0071 U+0309   decomposed  works
Latin               q͂          U+0071 U+0342   decomposed  works
Latin + Bengali     q়          U+0071 U+09BC   decomposed  works
Latin + Devanagari  q़          U+0071 U+093C   decomposed  works
Latin + Gurmukhi    q਼          U+0071 U+0A3C   decomposed  works

So, I’m really not sure what’s going on here, but it looks like it is more than just Indic languages that have the problem, and there seems to be an “expected” form which works, and an “unexpected” form that doesn’t—and the (pre|de)composition difference can break in either direction for a given script.

Mahir256 updated the task description. (Show Details) · Oct 27 2017, 4:39 PM
Mahir256 removed a subscriber: PokestarFan.
debt moved this task from needs triage to Up Next on the Discovery-Search board. · Oct 27 2017, 9:31 PM

Gotcha, thanks for the verification, @Mahir256, and the investigation, @TJones; together those should give us the information we need to get this fixed.

Looking at the network trace when searching for ଡ଼, the result is returned (Q31441900) but for some reason is not displayed. The warning is there, but does not prevent the display of the result. I suspect the frontend, but will dig deeper.

Smalyshev added a comment (edited). · Nov 1 2017, 10:26 PM

In jquery.ui.suggester.js we've got this:

			if ( typeof requestTerm === 'string' && requestTerm !== self._term ) {
				// Skip request since it does not correspond to the current search term.
				return;
			}

So it looks like the term we get back in the result may not be the original one but a normalized one; this check then fails for the search above, so no results are ever shown.

Confirming, in the console:

> requestTerm.length
2
> self._term.length
1
> requestTerm == self._term
false
> requestTerm.normalize() == self._term.normalize()
true

So I guess the solution is to use normalization when comparing Unicode strings.

Smalyshev added a subscriber: hoo. · Nov 1 2017, 10:36 PM

Change 387959 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[data-values/value-view@master] Normalize search term before matching against result

https://gerrit.wikimedia.org/r/387959

TJones added a comment. · Nov 2 2017, 1:05 PM

@Smalyshev, thanks for tracking this one down! That was some weird behavior, but things getting normalized and not matching makes sense.

@Snaterlicious, @hoo, @thiemowmde Do you know why the check is there and what it is meant to be doing? @tstarling raised the following concern:

The search term is normalized by the server using $wgContLang->normalize(), which potentially includes transformations beyond NFC, especially if the content language is Arabic or Malayalam. So even if you do client-side NFC using the same version of Unicode as the server, there is at least a hypothetical possibility of a hang.
Smalyshev moved this task from Backlog to Waiting/Blocked on the User-Smalyshev board.
hoo added a comment. · Nov 5 2017, 7:01 PM

@Snaterlicious, @hoo, @thiemowmde Do you know why the check is there and what it is meant to be doing? @tstarling raised the following concern:

The search term is normalized by the server using $wgContLang->normalize(), which potentially includes transformations beyond NFC, especially if the content language is Arabic or Malayalam. So even if you do client-side NFC using the same version of Unicode as the server, there is at least a hypothetical possibility of a hang.

Replied on gerrit.

thiemowmde moved this task from incoming to in progress on the Wikidata board.
thiemowmde moved this task from Proposed to Monitoring on the Wikidata-Former-Sprint-Board board.

Change 390365 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/Wikibase@master] To identify superseded requests, use the requested search term instead of the returned search term

https://gerrit.wikimedia.org/r/390365

Change 390365 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] To identify superseded requests, use the requested search term instead of the returned search term

https://gerrit.wikimedia.org/r/390365

Change 387959 abandoned by Smalyshev:
Normalize search term before matching against result

Reason:
looks like a better solution has been found

https://gerrit.wikimedia.org/r/387959

@thiemowmde I think it should fix it but can't check since it's not deployed yet (not even on test). So I'd like to keep it open until we can verify the problem is indeed gone.

Seems to be working ok on test.wikidata.org, so closing. If it fails on wikidata when the train moves on, please reopen.

Smalyshev closed this task as Resolved. · Nov 15 2017, 8:56 PM
Mahir256 reopened this task as Open. · Nov 17 2017, 8:45 PM
Mahir256 updated the task description. (Show Details)

@Mahir256 Please note wmf.8 is still not deployed on www.wikidata.org - see https://phabricator.wikimedia.org/T178635.

Wikidata is still on MediaWiki 1.31.0-wmf.7, while the change is included in 1.31.0-wmf.8.

Mahir256 added a subscriber: greg. · Nov 17 2017, 9:07 PM

@Sjoerddebruin @Smalyshev @greg What’s up with https://tools.wmflabs.org/versions/ then? Are dewiki and wikidatawiki versions just not reported correctly there?

greg added a comment. · Nov 17 2017, 9:30 PM

@greg What’s up with https://tools.wmflabs.org/versions/ then? Are dewiki and wikidatawiki versions just not reported correctly there?

See T178635, tl;dr: database crash halted the train. We'll get wmf.8 everywhere on Monday.

wikidata is special right now with some stuff @Addshore is doing to kill the wikidata build. That should be up-to-date on Monday as well.

@greg Okay, thank you for the clarification. I will close this task after Monday, then. (Probably should fix the URL I linked to, though.)

Mahir256 closed this task as Resolved. · Nov 21 2017, 4:43 AM
Mahir256 updated the task description. (Show Details)

This problem is present on other wikis too. I have noticed it on enWS and bnWS: whenever a link containing these characters (like য়) points to a non-wiki site, it does not work, but it works fine when the link points to another wiki.

@Hrishikes this task is about search, and it has already been fixed. Please submit the linking issue as a separate task.