incorrect UTF-8 processing on output of page and section titles
Closed, Declined (Public)


Author: Innocenti.Maresin

The search system used in most Wikimedia projects produces errors on the search results page. There is no apparent flaw in the matching algorithm, but the <span class="searchmatch"> tags are placed incorrectly when the search term contains multibyte characters and appears in the title of a wiki page or of one of its sections. Probably the matching algorithm provides substring lengths and offsets in characters (code points), which the HTML-generating engine incorrectly interprets as byte offsets.
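The suspected failure mode can be reproduced outside the search engine. The following is a minimal sketch in Python, not the actual highlighter code: the function names are hypothetical, and the title "А.Крымов" is taken from the bytes visible in the hexdump later in this report. It applies the same (offset, length) pair once as code points and once as bytes.

```python
OPEN = "<span class='searchmatch'>"
CLOSE = "</span>"


def highlight_by_chars(text: str, start: int, length: int) -> str:
    """Wrap the match in a span, treating start/length as code points."""
    end = start + length
    return text[:start] + OPEN + text[start:end] + CLOSE + text[end:]


def highlight_by_bytes(text: str, start: int, length: int) -> bytes:
    """Wrap the match in a span, wrongly treating start/length as byte offsets
    into the UTF-8 encoding (the suspected bug)."""
    raw = text.encode("utf-8")
    end = start + length
    return (raw[:start] + OPEN.encode("ascii") + raw[start:end]
            + CLOSE.encode("ascii") + raw[end:])


title = "А.Крымов"          # starts with U+0410, encoded as d0 90
good = highlight_by_chars(title, 0, 1)
bad = highlight_by_bytes(title, 0, 1)

print(good)
print(bad[:40])

try:
    bad.decode("utf-8")
except UnicodeDecodeError as err:
    # The lone lead byte d0 inside the span is not valid UTF-8.
    print("invalid output:", err.reason)
```

If the hypothesis is right, the byte-based variant puts only the lead byte 0xd0 inside the span and leaves the continuation byte 0x90 after the closing tag.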

Version: unspecified
Severity: normal



Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:09 PM
bzimport set Reference to bz23629.
bzimport added a subscriber: Unknown Object (MLST).

orenbochman wrote:

Please attach an example query that causes this error.

Innocenti.Maresin wrote:

Let us fetch exactly the query I gave in Bugzilla's "URL" field and examine the resulting document.

% wget ''

=> `index.php?title=Special:Search&fulltext=1&search=а&ns4=1&uselang=en'

20:09:30 (124.34 KB/s) - `index.php?title=Special:Search&fulltext=1&search=а&ns4=1&uselang=en' stored [41804/41804]

% hexdump -C -s 0x5d90 -n 128 index.php\?title=Special:Search\&fulltext=1\&search=а\&ns4=1\&uselang=en
00005d90 d0 be d0 b2 20 7c 20 3c 73 70 61 6e 20 63 6c 61 |.... | <span cla|
00005da0 73 73 3d 27 73 65 61 72 63 68 6d 61 74 63 68 27 |ss='searchmatch'|
00005db0 3e d0 3c 2f 73 70 61 6e 3e 90 2e d0 9a d1 80 d1 |>.</span>.......|
00005dc0 8b d0 bc d0 be d0 b2 20 7c 20 32 30 30 38 2d 31 |....... | 2008-1|
00005dd0 31 2d 30 39 20 7c 20 39 37 34 35 20 7c 20 d0 9f |1-09 | 9745 | ..|
00005de0 d0 b0 d1 82 d1 80 d1 83 d0 bb d0 b8 d1 80 d1 83 |................|
00005df0 d1 8e d1 89 d0 b8 d0 b9 2c 20 d0 be d1 82 d0 ba |........, ......|
00005e00 d0 b0 d1 82 d1 8b d0 b2 d0 b0 d1 8e d1 89 d0 b8 |................|

Here you can see an invalid byte sequence, a lone lead byte 0xd0 without its continuation byte, at offset 0x00005db1, and a misplaced continuation byte 0x90 at offset 0x00005db9.
Together they are U+0410, the Cyrillic letter "А", split into two parts. This is clearly visible in a browser too, as replacement characters. Is this exercise really so complicated or boring for MediaWiki programmers?
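The same check can be done mechanically instead of by reading the dump. The sketch below is an illustration only: find_invalid_utf8 is a hypothetical helper, and the input is the 32 bytes shown at 0x5db0 and 0x5dc0 in the hexdump above. It relies on Python's strict UTF-8 decoder to report each offending byte.

```python
def find_invalid_utf8(data: bytes, base: int = 0):
    """Return (file_offset, byte_value) for every byte that the strict
    UTF-8 decoder rejects. `base` is the file offset of data[0]."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break                      # the rest of the buffer is valid
        except UnicodeDecodeError as err:
            # err.start/err.end are relative to the slice we decoded.
            for i in range(err.start, err.end):
                bad.append((base + pos + i, data[pos + i]))
            pos += err.end             # resume after the rejected bytes
    return bad


# Bytes copied from the dump lines at 0x5db0 and 0x5dc0.
chunk = bytes.fromhex(
    "3e d0 3c 2f 73 70 61 6e 3e 90 2e d0 9a d1 80 d1 "
    "8b d0 bc d0 be d0 b2 20 7c 20 32 30 30 38 2d 31"
)

for offset, value in find_invalid_utf8(chunk, base=0x5DB0):
    print(f"0x{offset:08x}: 0x{value:02x}")

# Rejoining the two stray bytes gives the original letter back.
print(bytes([0xD0, 0x90]).decode("utf-8") == "\u0410")
```

A buffer that ends in the middle of a character is also reported, so the chunk passed in should end on a character boundary.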

orenbochman wrote:

Thanks for the prompt response.

I'm fairly new to Bugzilla and missed the URL you gave. Also, your second response is very helpful, since I have not previously had to fix problems involving multibyte Unicode characters.

Your original bug report points to the Result Rendering Stage of search.

I'm now trying to narrow down the source of the bug.

I have found that there are bugs in Java's (multibyte) Unicode implementation which carried over to the version of Lucene, Wikipedia's search library, that we use. While Lucene has fixed these, we are still working with the old version.
A second possibility is the highlighter code, which is being upgraded.

Anyhow, I'll also be adding some unit tests to make sure this issue does not recur once it is fixed.
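As an illustration of what such a regression test might look like, here is a sketch in Python rather than in the Java of the search backend. Both helpers are hypothetical and are not part of lucene-search2 or MediaWiki. The sketch assumes the matcher reports offsets in code points and converts them to UTF-8 byte offsets before the tags are inserted.

```python
import unittest

OPEN = b"<span class='searchmatch'>"
CLOSE = b"</span>"


def char_span_to_byte_span(text: str, start: int, length: int):
    """Convert a (start, length) pair in code points to UTF-8 bytes."""
    byte_start = len(text[:start].encode("utf-8"))
    byte_len = len(text[start:start + length].encode("utf-8"))
    return byte_start, byte_len


def highlight(text: str, start: int, length: int) -> bytes:
    """Insert the span tags into the UTF-8 encoding at converted offsets."""
    raw = text.encode("utf-8")
    b_start, b_len = char_span_to_byte_span(text, start, length)
    b_end = b_start + b_len
    return raw[:b_start] + OPEN + raw[b_start:b_end] + CLOSE + raw[b_end:]


class HighlightRegressionTest(unittest.TestCase):
    CASES = [
        ("А.Крымов", 0, 1),        # 2-byte character at the start
        ("ов | А.Крымов", 5, 1),   # 2-byte characters before the match
        ("price €5", 6, 1),        # 3-byte character
    ]

    def test_output_is_valid_utf8_and_wraps_whole_match(self):
        for text, start, length in self.CASES:
            with self.subTest(text=text):
                out = highlight(text, start, length).decode("utf-8")
                match = text[start:start + length]
                expected = (text[:start] + OPEN.decode("ascii") + match
                            + CLOSE.decode("ascii") + text[start + length:])
                self.assertEqual(out, expected)
```

Saved in a file, the test can be run with python -m unittest. The decode call fails if any tag lands inside a multibyte character.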

I'll update here as soon as I find out more.

[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]

Don't have this problem with the new search engine.

Example query:

Closing WONTFIX as lsearchd has been end of life'd.