Try the following API request:
The extracted text has three U+fffd characters near the end.
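The original request was not preserved here. As an illustration only, a reproduction request along these lines should exercise the same code path (the title `ECMAScript` and the `exchars` value are assumptions based on later comments in this thread; the parameters themselves are the standard TextExtracts ones):

```python
from urllib.parse import urlencode

# Hypothetical reproduction request for the TextExtracts API on zhwiki.
# The title and the exact exchars value are assumptions; action=query,
# prop=extracts, exchars, explaintext and utf8 are the real parameters
# discussed in this thread.
params = {
    "action": "query",
    "prop": "extracts",
    "titles": "ECMAScript",  # assumed affected article (see below)
    "exchars": 128,          # truncate near a section boundary
    "explaintext": 1,        # plain-text output, where the markers leak
    "utf8": 1,               # readable UTF-8 instead of \uXXXX escapes
    "format": "json",
}
url = "https://zh.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```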
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| Remove broken plaintext section markers | mediawiki/extensions/TextExtracts | master | +90 -2 |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Declined | None | | T166272 HTML version of text extracts is not balanced/well formed and naive |
| Open | None | | T92628 Unicode replacement character \ufffd appear in text extract if exchar cuts at section boundary |
How many articles on Chinese Wikipedia does this impact? What kind of problems is it causing? Shouldn't these characters render empty (I'm not too familiar with this character)?
I have only encountered one such article. I guess the only problem is that whoever sees them knows there is a problem with the text processing. That character renders as an (ugly) symbol indicating something is wrong, usually a diamond with a question mark. See here: https://en.wikipedia.org/wiki/Specials_%28Unicode_block%29#Replacement_character
@lilydjwg thanks for your feedback. Sounds like something worth investigating, but I'm not sure of the impact right now.
With a longer extract length (256) set, and using &utf8 for readability, `\n\n\n1 历史` ("History") appeared in place of the `\n\n��2�` markers. Using &exintro kills the FFFD as expected, since that is where the lead section ends. A length of 129 gives `\n\n\n1`. A length of 127 or 126 takes the last FFFD into the ellipsis but keeps the 2, and 125 takes away `�2`, leaving a single FFFD.
From the issue sample, I am able to generate two mini-reproductions:
It actually sounds quite easy to reproduce now.
Since the stray 2 is weird in any case (UTF-8 decoding errors cannot produce ASCII characters out of nowhere), I modified my second mini-repro to use an h3 heading. Very curiously, it now says `��3` near the end.
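The pattern above is consistent with internal section markers being cut in half. A minimal sketch of the suspected mechanism, in Python rather than the extension's PHP, and assuming markers of the form `\x01\x02<level>\x02\x01` (an assumption based on the "\1\2\d at the end" cases mentioned later in this thread):

```python
import re

# Suspected mechanism (marker format is an assumption, see lead-in):
# TextExtracts-style section markers are built from the control bytes
# \1 and \2 plus the heading level digit. If truncation cuts through a
# marker, the leftover control bytes are later rendered as U+FFFD while
# the plain ASCII digit survives, producing output like "\n\n<?><?>2<?>".
marker = "\x01\x02" + "2" + "\x02\x01"        # level-2 heading marker
text = "intro text\n\n" + marker + "历史"      # "历史" = "History"

truncated = text[:len("intro text\n\n") + 4]  # cut inside the marker
visible = re.sub(r"[\x01\x02]", "\ufffd", truncated)
print(repr(visible))  # ends with three U+FFFD surrounding a bare "2"
```

This reproduces the reported shape exactly: three replacement characters with the heading-level digit stranded between them, and an h3 heading would leave a 3 instead.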
ApiQueryExtracts::execute() currently truncates the text before deciding to doSections(). Can this behavior corrupt the regex matching and lead to the weird FFFDs? Swapping the truncate/sections order would be costly, but it might be possible to have tidy run a tag check before doSections(). (Currently the tidy() wrapper skips tidying when plain text is requested.)
Change 343216 had a related patch set uploaded (by Gerrit Patch Uploader; owner: Artoria2e5):
[mediawiki/extensions/TextExtracts] Complete plaintext section markers
Patch for testing submitted. Use it as a proof of concept, and don't expect it to pass human review. (Ugh, PHP bites harder than snakespeak!)
(Gosh, forgot to regenerate the patch.)
Am I supposed to write a test case for ApiQueryExtracts now? I am too lazy to do that...
Gee, I forgot to consider the "\1\2\d at the end" cases (also \1\2 and \1).
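Those trailing fragments are all prefixes of a full `\x01\x02<digit>\x02\x01` marker, so a single anchored regex can clean them up. A sketch in Python, not the actual patch, with a hypothetical helper name:

```python
import re

# Hypothetical helper covering the "\1", "\1\2", "\1\2\d" (and "\1\2\d\2")
# end-of-string cases listed above. It assumes full markers look like
# \x01\x02<digit>\x02\x01; any proper prefix left dangling at the end of a
# truncated extract is removed. The lookbehind keeps the closing \x01 of a
# complete marker (which is preceded by \x02) from being stripped.
_PARTIAL_MARKER = re.compile(r"(?<!\x02)\x01(?:\x02(?:\d(?:\x02)?)?)?$")

def strip_partial_marker(extract: str) -> str:
    """Drop an incomplete section marker left at the end by truncation."""
    return _PARTIAL_MARKER.sub("", extract)

# Every partial-marker tail is stripped:
for tail in ("\x01", "\x01\x02", "\x01\x022", "\x01\x022\x02"):
    assert strip_partial_marker("text" + tail) == "text"
# A complete marker is left alone:
assert strip_partial_marker("text\x01\x022\x02\x01") == "text\x01\x022\x02\x01"
```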
@zhuyifei1999, the new revision (6) should have fixed the problem. Would you like to give it a try? I wrote a function that probably nobody will want to reuse, but got rid of the XXX: note in the process.
Please remove Patch-Needs-Improvement when you've had a chance to review the feedback. Thank you!
FWIW, a new summary endpoint implementation (not yet deployed) avoids this issue by only considering the characters in the first paragraph. Maybe that is preferable to use? I'm not sure what your use case is; just want to give a heads-up about potential alternative solutions.
Current implementation at https://zh.wikipedia.org/api/rest_v1/page/summary/ECMAScript:
"extract": "ECMAScript是一种由Ecma国际(前身为欧洲计算机制造商协会)通过ECMA-262标准化的脚本程序设计语言。这种语言在万维网上应用广泛,它往往被称为JavaScript或JScript,但实际上后两者是ECMA-262标准的实现和扩展。", "extract_html": "<p><span lang=\"en\" xml:lang=\"en\"><b>ECMAScript</b></span>是一种由Ecma国际(前身为欧洲计算机制造商协会)通过ECMA-262标准化的脚本程序设计语言。这种语言在万维网上应用广泛,它往往被称为JavaScript或JScript,但实际上后两者是ECMA-262标准的实现和扩展。",
Soon, it will be like this (once the MCS summary endpoint is exposed via RESTBase, preview):
"extract": "是一种由Ecma国际 通过ECMA-262标准化的脚本程序设计语言。这种语言在万维网上应用广泛,它往往被称为JavaScript或JScript,但实际上后两者是ECMA-262标准的实现和扩展。", "extract_html": "<p>是一种由<span>Ecma国际</span> 通过ECMA-262标准化的<span>脚本</span><span class=\"mw-redirect\">程序设计语言</span>。这种语言在<span>万维网</span>上应用广泛,它往往被称为<span>JavaScript</span>或<span>JScript</span>,但实际上后两者是ECMA-262标准的实现和扩展。</p>"
Note the new implementation is not based on TextExtracts anymore but does some DOM transformations on top of Parsoid output. I hope the stripping of parentheses is OK here.
Besides the removed parentheses, the other significant difference I notice is that in the TextExtracts implementation the title ("ECMAScript") is repeated, but that's not the case in the new implementation based on Parsoid, since Parsoid doesn't emit it. (The wiki page does also show the title at the beginning of the content.) I'm wondering whether that's a bug in Parsoid or may be due to the missing handling of language variants in Parsoid. Going to check with the Parsoid folks and file a separate ticket for that if needed. Zooming in on this difference:
<p><span lang=\"en\" xml:lang=\"en\"><b>ECMAScript</b></span>是一种由Ecma国际 [...]
vs.
<p>是一种由<span>Ecma国际</span> [...]
Looks like a template ({{lang|en|'''ECMAScript'''}}) does not get transcluded as expected.
Change 343216 abandoned by Jdlrobson:
Remove broken plaintext section markers
Reason:
Assuming work here is abandoned given the timeframe and inactivity. Please comment on ticket before submitting a new patch. Note that TextExtracts is currently in maintenance mode, only accepting security patches and not actively supporting feature requests.