Unicode replacement character \ufffd appear in text extract if exchar cuts at section boundary
Open, Low · Public · 3 Story Points

Event Timeline

lilydjwg created this task. Mar 13 2015, 3:13 PM
lilydjwg raised the priority of this task from to Needs Triage.
lilydjwg updated the task description.
lilydjwg added a project: TextExtracts.
lilydjwg added a subscriber: lilydjwg.
Restricted Application added a subscriber: Aklapper. Mar 13 2015, 3:13 PM
Aklapper renamed this task from Bad characters in TextExtracts API results for Chinese Wikipedia to \ufffd characters appended in TextExtracts API results for Chinese Wikipedia. Mar 17 2015, 11:20 AM
Aklapper set Security to None.
Jdlrobson renamed this task from \ufffd characters appended in TextExtracts API results for Chinese Wikipedia to Unicode replacement character \ufffd should be scrubbed from text extract. Sep 18 2015, 6:47 PM
Jdlrobson triaged this task as Low priority.
Jdlrobson lowered the priority of this task from Low to Lowest.
Jdlrobson added a subscriber: Jdlrobson.

How many articles on Chinese Wikipedia does this impact? What kind of problems is it causing? Shouldn't these characters render empty (I'm not too familiar with this character)?

I have only encountered one such article. I guess the only problem is that whoever sees these characters knows there is a problem with text processing. That character renders as an (ugly) symbol indicating that something is wrong, usually a diamond with a question mark. See here: https://en.wikipedia.org/wiki/Specials_%28Unicode_block%29#Replacement_character

Jdlrobson raised the priority of this task from Lowest to Low. Sep 21 2015, 10:07 PM

@lilydjwg thanks for your feedback. Sounds like something worth investigating, but I'm not sure of the impact right now.

Restricted Application added a subscriber: Cosine02. Dec 21 2016, 9:58 AM
Arthur2e5 added a subscriber: Arthur2e5. Edited · Mar 16 2017, 9:57 PM

With a longer extract length (256) set, and using &utf8 for readability, \n\n\n1 历史 appeared in place of the \n\n��2� markers. Using &exintro kills the FFFD as expected, since that's where the lead section ends. Using a length of 129 gives `\n\n\n1`. A length of 127 or 126 takes the last FFFD into the ellipsis but keeps the 2, and 125 takes away �2, leaving a single FFFD.

From the issue sample, I am able to generate two mini-reproductions:

It actually sounds quite easy to reproduce now.
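For anyone who wants to repeat the length experiments, here is a minimal sketch that only builds the request URLs. It assumes the standard `api.php` entry point on zh.wikipedia.org and the usual TextExtracts parameters (`exchars`, `exintro`, `explaintext`); the helper name `extract_query_url` and the title `Example` are placeholders, since the affected article is not named in this thread.

```python
from urllib.parse import urlencode

# Assumed entry point; swap in the wiki you are testing against.
API_URL = "https://zh.wikipedia.org/w/api.php"


def extract_query_url(title, exchars=None, exintro=False):
    """Build a plain-text extract query URL (hypothetical helper)."""
    params = {
        "action": "query",
        "prop": "extracts",
        "format": "json",
        "utf8": 1,          # readable output instead of \uXXXX escapes
        "explaintext": 1,   # plain text, where the section markers show up
        "titles": title,
    }
    if exchars is not None:
        params["exchars"] = exchars
    if exintro:
        params["exintro"] = 1
    return API_URL + "?" + urlencode(params)


# "Example" is a placeholder title, not the article from the report.
for length in (256, 129, 127, 126, 125):
    print(extract_query_url("Example", exchars=length))
```

Passing `exintro=True` adds the flag that made the FFFD disappear in the experiment above.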

Arthur2e5 added a comment. Edited · Mar 16 2017, 10:21 PM

Since the 2 thing is nevertheless weird (UTF-8 won't let ASCII characters appear from nowhere), I modified my second mini-repro to use an h3 heading. Very curiously, it now says ��3 near the end.

ApiQueryExtracts::execute() currently truncates the text before deciding to doSections(). Can this behavior corrupt regex matching and lead to weird FFFDs? Switching the truncate/sections sequence would be uneconomical, but it might be possible to let tidy run a tag check before doSections. (Currently the tidy() wrapper gives up trying tidy when plain text is specified.)
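To illustrate the suspected ordering problem, here is a rough Python model, not the extension's PHP. It assumes headings are encoded as control-character markers of the form `\1\2<level>\2\1` (the pattern mentioned later in this thread) and that leftover control characters end up as U+FFFD in the API output; `do_sections`, `sanitize`, and `extract` are hypothetical stand-ins.

```python
import re

# Assumed marker layout: \1\2 <heading level> \2\1, followed by the heading text.
MARKER_START = "\x01\x02"
MARKER_END = "\x02\x01"
SECTION_RE = re.compile(r"\x01\x02(\d)\x02\x01(.*?)$", re.M)
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def do_sections(text):
    """Turn complete markers into wiki-style headings."""
    def heading(match):
        bars = "=" * int(match.group(1))
        return f"\n\n{bars} {match.group(2).strip()} {bars}"
    return SECTION_RE.sub(heading, text)


def sanitize(text):
    """Model of output cleanup: stray control characters become U+FFFD."""
    return CONTROL_RE.sub("\ufffd", text)


def extract(text, exchars):
    """Truncate first, then handle sections (the order questioned above)."""
    return sanitize(do_sections(text[:exchars]))


page = "Lead." + MARKER_START + "2" + MARKER_END + "History\nBody"
print(repr(extract(page, 100)))  # marker is complete, heading is rendered
print(repr(extract(page, 9)))    # cut lands inside the marker
```

In this model, cutting at 9 characters leaves `\1\2` plus `2` plus `\2` behind, the regex no longer matches, and the result ends in two replacement characters, the digit, and one more replacement character, which resembles the ��2� seen in the report.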

Arthur2e5 renamed this task from Unicode replacement character \ufffd should be scrubbed from text extract to Unicode replacement character \ufffd appear in text extract if exchar cuts at section boundary. Mar 16 2017, 10:25 PM

Change 343216 had a related patch set uploaded (by Gerrit Patch Uploader; owner: Artoria2e5):
[mediawiki/extensions/TextExtracts] Complete plaintext section markers

https://gerrit.wikimedia.org/r/343216

Arthur2e5 added a comment. Edited · Mar 16 2017, 11:54 PM

Patch for testing submitted. Use it as a proof of concept, and don't expect it to pass human review. (Ugh, PHP bites harder than snakespeak!)

(Gosh, forgot to regenerate the patch.)

Am I supposed to write a test case for ApiQueryExtracts now? I am too lazy to do that...

Gee, I forgot to consider the "\1\2\d at the end", (also \1\2 and \1) cases.
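The trailing cases listed here can be covered by a single pattern. The following is only a Python sketch of the idea, not the code in the Gerrit change; it assumes the marker layout `\1\2<digit>\2\1`, and `strip_dangling_marker` is a hypothetical name.

```python
import re

# Matches a marker prefix left at the very end of the text:
# \1, \1\2, \1\2<digit>, \1\2<digit>\2, or a full marker with no heading after it.
DANGLING_RE = re.compile(r"\x01(?:\x02(?:\d(?:\x02\x01?)?)?)?\Z")


def strip_dangling_marker(text):
    """Remove an incomplete (or heading-less) section marker at the end of text."""
    return DANGLING_RE.sub("", text)


samples = [
    "abc\x01",
    "abc\x01\x02",
    "abc\x01\x022",
    "abc\x01\x022\x02",
    "abc\x01\x022\x02\x01History",  # complete marker with heading text
]
for sample in samples:
    print(repr(strip_dangling_marker(sample)))
```

A complete marker that is followed by heading text is left untouched, so the normal section replacement can still process it.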

Arthur2e5 added a subscriber: zhuyifei1999. Edited · Mar 17 2017, 4:45 AM

@zhuyifei1999, the new revision (6) should have fixed the problem. Would you like to give it a try? I wrote a function that probably nobody will want to reuse, but got rid of the XXX: note in the process.

MBinder_WMF set the point value for this task to 3.

Please remove Patch-Needs-Improvement when you've had a chance to review the feedback. Thank you!

bearND added a subscriber: bearND. Nov 9 2017, 12:01 AM

FWIW, a new summary endpoint implementation (not deployed yet) avoids this issue by only considering the characters in the first paragraph. Maybe this is preferable to use? Not sure what your use case is. Just want to give a heads-up about potential alternate solutions.

Current implementation at https://zh.wikipedia.org/api/rest_v1/page/summary/ECMAScript:

"extract": "ECMAScript是一种由Ecma国际(前身为欧洲计算机制造商协会)通过ECMA-262标准化的脚本程序设计语言。这种语言在万维网上应用广泛,它往往被称为JavaScript或JScript,但实际上后两者是ECMA-262标准的实现和扩展。",
"extract_html": "<p><span lang=\"en\" xml:lang=\"en\"><b>ECMAScript</b></span>是一种由Ecma国际(前身为欧洲计算机制造商协会)通过ECMA-262标准化的脚本程序设计语言。这种语言在万维网上应用广泛,它往往被称为JavaScript或JScript,但实际上后两者是ECMA-262标准的实现和扩展。",

Soon, it will be like this (once the MCS summary endpoint is exposed via RESTBase, preview):

"extract": "是一种由Ecma国际 通过ECMA-262标准化的脚本程序设计语言。这种语言在万维网上应用广泛,它往往被称为JavaScript或JScript,但实际上后两者是ECMA-262标准的实现和扩展。",
"extract_html": "<p>是一种由<span>Ecma国际</span> 通过ECMA-262标准化的<span>脚本</span><span class=\"mw-redirect\">程序设计语言</span>。这种语言在<span>万维网</span>上应用广泛,它往往被称为<span>JavaScript</span>或<span>JScript</span>,但实际上后两者是ECMA-262标准的实现和扩展。</p>"

Note the new implementation is not based on TextExtracts anymore but does some DOM transformations on top of Parsoid output. I hope the stripping of parentheses is OK here.

Besides the removed parentheses, the other significant difference I notice is that in the TextExtracts implementation the title ("ECMAScript") is repeated, but that's not the case in the new implementation based on Parsoid, since Parsoid doesn't show this. (The wiki page does also show the title at the beginning of the content.) I'm wondering if that's a bug in Parsoid or if it may be due to the missing handling of language variants in Parsoid. Going to check with the Parsoid folks and file a separate ticket for that if needed. Zooming in on this difference:

<p><span lang=\"en\" xml:lang=\"en\"><b>ECMAScript</b></span>是一种由Ecma国际 [...]

vs.

<p>是一种由<span>Ecma国际</span> [...]

Looks like a template ({{lang|en|'''ECMAScript'''}}) does not get transcluded as expected.

cscott added a subscriber: cscott. Nov 9 2017, 12:04 AM

Ok, just chatted with @cscott. The missing string (ECMAScript) should come later, once phase 2 of T43716 is implemented (and RESTBase handles lang variants). See also T122942.