Characters in CJK extension C treated as U+FFFD when searching on zhWP [EPIC-ish]
Closed, Resolved · Public

Description

If one searches for a character from CJK Extension C on zh.wikipedia.org, they will get all pages containing any character from CJK Extension C, and those characters will be displayed as U+FFFD. (For instance: https://zh.wikipedia.org/w/index.php?search=𨨏&title=Special:搜索&fulltext=1 ) It seems that all of these characters are wrongly treated/indexed as U+FFFD.

Note: This definitely needs fixing, but a partial workaround is to search for the 32-bit characters with quotes around them. Then the high and low surrogates are treated as a phrase. (For instance: https://zh.wikipedia.org/w/index.php?search=%22%F0%A8%A8%8F%22&title=Special:%E6%90%9C%E7%B4%A2&fulltext=1 ) This doesn't solve the problem for all use cases, but it is helpful for some.

Event Timeline

Antigng created this task. · Jun 20 2017, 5:17 PM
Restricted Application added subscribers: Cosine02, Aklapper. · View Herald Transcript · Jun 20 2017, 5:17 PM

Unfortunately, this is a side effect of the new Chinese language analysis (T158203, etc.).

The example, 𨨏, is actually from CJK Ideographs Extension B, which also exhibits the problem.

The Ideograph Extensions have code points beyond the 16-bit range (loosely, "32-bit" characters). So, 𨨏 is U+28A0F, but in UTF-16 it is represented using 16-bit high and low surrogates as U+D862 and U+DE0F together. A nearby character, 𨨄 for example, is U+28A04, which is represented as U+D862 and U+DE04. Note that both share the high surrogate U+D862 as the first component.
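For illustration, here's how the split works in Java, using the standard Character API (this is just a demonstration of the encoding, not code from the analysis chain):

```
public class SurrogateDemo {
    public static void main(String[] args) {
        int codePoint = 0x28A0F;                    // 𨨏
        char[] pair = Character.toChars(codePoint); // UTF-16 encoding

        System.out.printf("U+%04X -> U+%04X U+%04X%n",
                codePoint, (int) pair[0], (int) pair[1]); // U+28A0F -> U+D862 U+DE0F

        // The same math by hand: subtract 0x10000, then split into a
        // 10-bit high half (offset 0xD800) and low half (offset 0xDC00).
        int v = codePoint - 0x10000;
        char high = (char) (0xD800 + (v >> 10));   // U+D862
        char low  = (char) (0xDC00 + (v & 0x3FF)); // U+DE0F
        // The nearby 𨨄 (U+28A04) yields U+D862 U+DE04: same high surrogate.
        System.out.println(Character.toCodePoint(high, low) == codePoint); // true
    }
}
```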

So, searching on 𨨏 gets split into searching on U+D862 and U+DE0F. The U+D862 part can match the first half of 𨨄 (and others). When only the first half of the character (the high surrogate) is highlighted, I think both halves get converted to � because surrogates are only supposed to occur in matched adjacent pairs.

The problem is that the current analysis chain splits these characters into their separate 16-bit pieces. I'm not quite sure where in the process it's going wrong. Searching with quotes, such as "𨨏", doesn't have the problem, so the full character is being stored internally somewhere; it's just being tokenized incorrectly.

It doesn't really matter where the problem that leads to � being displayed happens if we can get the index built correctly (with U+D862 and U+DE0F indexed together as one token, for example). That would take some digging, since there are several plugins working in the analysis chain, and we'd have to figure out which one is doing it and how to fix it. Best case, there's a clever config for a plugin or a hacky filter that can solve the problem. Worst case, we need to write a plugin that puts sequential high and low surrogates back together into one token. (I don't yet know how to do such a thing—but it's less complicated than some of the stuff David has done, I think.)
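As a rough idea of what such a plugin would involve, here's a minimal sketch of a Lucene TokenFilter that rejoins a lone high-surrogate token with an immediately following lone low-surrogate token. This is only an illustration of the approach, not the eventual search/extra implementation, and it glosses over offset and position-increment bookkeeping:

```
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class SurrogateMergeSketch extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private State pending; // token buffered after a failed merge attempt

    public SurrogateMergeSketch(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pending != null) {          // emit the token we looked ahead at earlier
            restoreState(pending);
            pending = null;
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        if (termAtt.length() == 1 && Character.isHighSurrogate(termAtt.charAt(0))) {
            char high = termAtt.charAt(0);
            if (input.incrementToken()) {
                if (termAtt.length() == 1 && Character.isLowSurrogate(termAtt.charAt(0))) {
                    // Found the matching half: emit both halves as one token.
                    char low = termAtt.charAt(0);
                    termAtt.setEmpty();
                    termAtt.append(high).append(low);
                } else {
                    // Not the matching half: buffer it, re-emit the lone high surrogate.
                    pending = captureState();
                    termAtt.setEmpty();
                    termAtt.append(high);
                }
            }
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending = null;
    }
}
```

A production version would also need to merge the offset attributes of the two halves so that highlighting lines up with the original text.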

@Antigng, can you give us some idea about how common the problem is? I had some trouble with the 32-bit code points with the tools I used for my original analysis, and they weren't extremely common in the article text. Are they common in search queries?

@debt, this is definitely something we should try to fix, but my initial guess is that it is at least somewhat rare, though @Antigng or other users could give us better information about how often it happens.

debt moved this task from needs triage to Up Next on the Discovery-Search board. · Jun 22 2017, 5:13 PM
debt triaged this task as Low priority.

Sure, let's take a look and see if we can fix this.

I took a look at this today and, to my surprise, the culprit is the smartcn_tokenizer from Elastic (though I think the problem really comes from Lucene, farther down the stack). The tokenizer uses a hidden Markov model. I glanced at the code but didn't dig into it enough to find the source of the problem; it's possible that there's a 16-bit character assumption hidden down in there somewhere. I opened a bug with Elasticsearch, since that's the level at which I can replicate the problem.

Anyway, my current plan for Wikimedia-Hackathon-2018 is to try to make a filter that reassembles broken-up high/low surrogate pairs.

During the hackathon I wrote a filter that re-combines broken high/low surrogate pairs. It leaves behind some empty tokens, which need to be cleaned up, and I'm still looking at how best to do that.
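For the cleanup step, one plausible approach (my assumption, not necessarily what the final plugin does) is a small filter based on Lucene's FilteringTokenFilter that simply drops zero-length tokens:

```
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Sketch: drop the zero-length tokens left behind by the surrogate merge. */
public final class DropEmptyTokensSketch extends FilteringTokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public DropEmptyTokensSketch(TokenStream in) {
        super(in);
    }

    @Override
    protected boolean accept() {
        return termAtt.length() > 0; // keep only non-empty tokens
    }
}
```

Lucene's built-in LengthFilter (with a minimum length of 1) should do much the same thing.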

I opened a ticket with Elastic about this, and it turns out that someone else had already tracked the problem to Lucene and submitted a patch to fix it; the fix should be out in Elastic 6.4.

I'll try to get an estimate of when we'll get to Elastic 6.4, and decide whether it's worth pursuing my repair plugin—which would also fix other much rarer high/low pair splits, for example those split by a space—or waiting for ES 6.4.

It's going to be at least three months until we go to ES 6, and then we will go to the current latest version, unless there's a problem of some sort. I'm not sure if 6.4 will be available when we go to 6.x. I'll take a quick survey of Chinese Wikipedia queries and see how often 32-bit characters come up.

TJones updated the task description. (Show Details) · Oct 31 2018, 7:03 PM
TJones added a subscriber: dcausse. (Edited) · Oct 31 2018, 7:27 PM

Things have become a bit more complicated—as they are wont to do!

While testing my fix for this, I discovered that the original problem—UTF-32 characters being split into high and low surrogate characters—was causing an error on my laptop, rather than just poor results and abundant Unicode replacement characters (U+FFFD, �) on the search results page like we see in production.

I tracked down the error to an odd interaction between the Elasticsearch highlighter and PHP's JSON::parse deep in the bowels of a utility function. Elasticsearch stores the surrogate characters in Java Unicode (\u) format, so something like "\uD862\uDF46". The highlighter inserts U+E000 and U+E001 around snippets to be highlighted, so after highlighting the string becomes something like "\uE000\uD862\uE001\uDF46", with a marker landing between the high and low surrogates. JSON::parse converts the \u-formatted characters into actual Unicode, at which point the now-unpaired high and low surrogates are invalid.
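To make the breakage concrete, here's a small Java illustration of why the marker breaks things (my own example, not code from Elasticsearch or CirrusSearch):

```
public class HighlightSplitDemo {
    public static void main(String[] args) {
        String stored      = "\uD862\uDF46";             // one valid surrogate pair
        String highlighted = "\uE000\uD862\uE001\uDF46"; // a marker splits the pair

        System.out.println(stored.codePointCount(0, stored.length()));        // 1: one character
        System.out.println(Character.isHighSurrogate(highlighted.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(highlighted.charAt(2)));  // false (U+E001)
        // The high surrogate at index 1 is now unpaired, so a strict Unicode
        // decoder must either replace it (e.g., with U+FFFD) or reject the
        // string outright; that is exactly the HHVM vs. PHP 7 difference
        // described below.
    }
}
```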

At this point I was stuck, but @dcausse figured out that there is an important difference in JSON::parse between HHVM (production) and PHP 7 (our vagrant dev environment). HHVM converts the invalid characters to � (hence our results in production), while PHP 7 throws an error, which cascades upward in unpredictable ways (this utility function catches the error and just keeps the unparsed JSON as the value it returns, which eventually leads to the failure I saw).

This is important because I was considering letting this linger until we upgrade to ES6, which should fix the problem at the root (in the smartcn plugin), but T176370: Migrate to PHP 7 in WMF production may be deployed before we upgrade to ES 6, in which case searching for 32-bit characters on Chinese-language wikis will start throwing errors.

Our plugin projects have already been updated to ES6, so the current (and slightly amended) plan is as follows:

  • @dcausse created an ES 5.5 branch of the extra plugin repo.
  • I'll merge my fix there, and then we can deploy it to production.
  • I'll update Analysis Config Builder to look for the new plugin and enable the fix when it is available.
  • Once everything is deployed, we can re-index the Chinese-language wikis.

(We abandoned the version of the plan that went through the ES6 version of the plugin repo because the problem doesn't exist in ES6, so all my tests fail there. We only need an ES 5.x version of this plugin.)

Hopefully we can get this done before these searches start throwing errors in production, which will be even more confusing for users than the current � mess.

TJones moved this task from Up Next to Current work on the Discovery-Search board. · Oct 31 2018, 7:27 PM
TJones claimed this task.

Change 471201 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/extra@5.5] Provide Surrogate Merger Plugin for ES 5.5

https://gerrit.wikimedia.org/r/471201

Change 471204 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Enable Chinese Surrogate Fix

https://gerrit.wikimedia.org/r/471204

TJones added a comment. · Nov 6 2018, 7:45 PM

I ran a quick analysis of the effect of the analysis chain change on the indexing results. There isn't much to report, and nothing surprising, so I'm not going to do a full write-up. I compared the indexed tokens before and after the surrogate merging on 10,000 Wikipedia articles (out of ~1M) and 10,000 Wiktionary articles (out of ~800K).

  • Only two 32-bit characters were affected in the Wikipedia corpus (out of ~3.1M).
  • But 632 32-bit characters were affected in the Wiktionary corpus (out of ~139K), which is both a much larger number and a significantly higher rate of occurrence.

All of the changes were what you'd expect—unpaired high and low surrogates lost from the index, paired surrogates added to the index.

Change 471204 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Enable Chinese Surrogate Fix

https://gerrit.wikimedia.org/r/471204

TJones added a comment. · Nov 7 2018, 8:02 PM

Change 471204 merged by jenkins-bot:
https://gerrit.wikimedia.org/r/471204

That's the configuration that will allow the surrogate merger to be enabled for Chinese when the plugin is present.

The other Gerrit patch (below) is the actual update to the 5.5 version of the plugin. Once that's merged, we still need to deploy the plugin and re-index, but we are moving in the right direction.

https://gerrit.wikimedia.org/r/471201

Change 471201 merged by jenkins-bot:
[search/extra@5.5] Provide Surrogate Merger Plugin for ES 5.5

https://gerrit.wikimedia.org/r/471201

I should have treated this ticket as an epic and created sub-tasks for it. The first part of the work—creating the plugin to re-merge surrogate pairs and setting up the config to use the new plugin—was done on this ticket and is complete, but there is more to do. I don't want to close this ticket because the problem isn't solved yet, but the work I was doing here is done. So, after flailing around on the workboard a bit, I've moved it to Waiting, and I'll open sub-tasks for the remaining related work.

TJones renamed this task from Characters in CJK extension C treated as U+FFFD when searching on zhWP to Characters in CJK extension C treated as U+FFFD when searching on zhWP [EPIC-ish]. · Nov 9 2018, 3:01 PM
TJones added a project: Epic.

Almost done. Reindexing (T209156) is still in progress. The smaller wikis (Wikiversity, Wikiquote, Wikibooks, Wikivoyage, Wikinews) are done and checked and everything looks good so far. Wikisource, Wiktionary, and Wikipedia are still processing.

The highlighter fix has already fixed the replacement character problem (though the results are still wrong—at least now they are readable). Reindexing will get this fix in place and then we should be done—though it will probably run into the weekend.

Reindexing for the live search cluster (eqiad) is complete, and the example link now gives 46 results instead of ~94K. The spare cluster (codfw) is still running, so I won't move the re-indexing task to Done until it finishes.

Shizhao moved this task from Backlog to Closed on the Chinese-Sites board. · Nov 30 2018, 3:24 AM