The experimental highlighter may break surrogate pairs
Closed, Resolved (Public)

Event Timeline

dcausse created this task. Oct 4 2019, 12:02 PM
Restricted Application added a project: Discovery-Search. Oct 4 2019, 12:02 PM
Restricted Application added subscribers: Cosine02, Aklapper.
dcausse added a comment (edited). Oct 4 2019, 1:14 PM

This is a bit puzzling: the response from the backend is sometimes "correct", in that it contains a tofu char (�), but sometimes it contains the broken surrogate pair, which in turn causes MediaWiki to fail and is almost certainly the cause of T231023.

I believe the tofu char is added when the response is serialized through the transport protocol, since I always see it when targeting a node that does not have the shard:

"text":["注意:本页面含有Unihan新版用字。有关字符可能會错误显示,詳见Unicode扩展汉字。 六疊字是指漢字中一類由六個完全相同的部分所组成的疊字。大多數的六疊字亦可以被視為兩個三疊字或三個二疊字的組合。 現將已知的此類字列於下表: 疊字 二疊字 三疊字 四疊字 五疊字 八疊字 ZDIC.NET. �"]

versus the broken surrogate when I target the node hosting the shard:

"text":["注意:本页面含有Unihan新版用字。有关字符可能會错误显示,詳见Unicode扩展汉字。 六疊字是指漢字中一類由六個完全相同的部分所组成的疊字。大多數的六疊字亦可以被視為兩個三疊字或三個二疊字的組合。 現將已知的此類字列於下表: 疊字 二疊字 三疊字 四疊字 五疊字 八疊字 ZDIC.NET. \uD847"]

The root cause needs to be fixed. I don't know yet whether it is worth spending effort to make sure that the transport protocol does not lose our broken surrogate pair, and to file a bug/patch upstream.

dcausse claimed this task. Oct 7 2019, 12:48 PM
dcausse triaged this task as Medium priority.
dcausse moved this task from needs triage to Current work on the Discovery-Search board.

Change 541345 had a related patch set uploaded (by DCausse; owner: DCausse):
[search/highlighter@master] Do not break surrogate pairs when returning the no match snippet

https://gerrit.wikimedia.org/r/541345
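
For readers unfamiliar with the failure mode, here is a minimal sketch of the general idea. It is not the code from change 541345. Java strings are sequences of UTF-16 code units, so cutting a snippet at a fixed offset can land between the high and low halves of a surrogate pair and leave a lone `\uD847`-style unit at the end. The class and method names below (`SnippetTruncation`, `safeEnd`, `truncate`) are hypothetical and chosen only for illustration.

```java
// Hypothetical helper, for illustration only; not taken from the highlighter source.
final class SnippetTruncation {
    private SnippetTruncation() {}

    /**
     * Returns an end offset no greater than requestedEnd that does not fall
     * between the two halves of a surrogate pair.
     */
    static int safeEnd(CharSequence source, int requestedEnd) {
        // Clamp the requested offset into [0, source.length()].
        int end = Math.min(Math.max(requestedEnd, 0), source.length());
        // If the cut would separate a high surrogate from its low surrogate,
        // step back one code unit so the whole pair is dropped instead.
        if (end > 0 && end < source.length()
                && Character.isHighSurrogate(source.charAt(end - 1))
                && Character.isLowSurrogate(source.charAt(end))) {
            return end - 1;
        }
        return end;
    }

    /** Truncates to at most maxChars UTF-16 code units without splitting a pair. */
    static String truncate(String source, int maxChars) {
        return source.substring(0, safeEnd(source, maxChars));
    }
}
```

With this sketch, truncating a string that ends in a supplementary-plane character one code unit short drops the whole character rather than emitting half of it. Whether to step back (dropping the character) or forward (exceeding the requested length by one unit) is a design choice; stepping back keeps the length bound intact.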

Change 541345 merged by jenkins-bot:
[search/highlighter@master] Do not break surrogate pairs when returning the no match snippet

https://gerrit.wikimedia.org/r/541345

Gehel closed this task as Resolved. Oct 29 2019, 5:51 PM