
Space between final 2 words in a page with ≥2 category tags is removed in arabic mediawiki
Closed, Resolved · Public

Description

This problem occurs on the Arabic MediaWiki. When a page ends with two or more category tags and the text has no full stop at the end, the rendered text drops the space between the final two words, which makes them appear as a single word.

Tracing the problem shows that it occurs at this line in Parser.php:
$s = rtrim( $s . "\n" ); # bug 87

It appears that the error occurs when rtrim removes the whitespace produced by the tags.

You can reproduce the bug by creating a page on the Arabic wiki that meets the conditions above. I tried the following on the Arabic sandbox page:

"
testing space bug

testing space bug

[[تصنيف:الآداب]]
[[تصنيف:شعر عربي]]
"
outputs as:
"
testing space bug

testing spacebug
"

And

"
معاركنا انتهت أفلا تراني

رميت مهنّدي وكسرت رمحي

[[تصنيف:الآداب]]
[[تصنيف:شعر عربي]]
"

outputs as:
"
معاركنا انتهت أفلا تراني

رميت مهنّدي وكسرترمحي
"

Event Timeline

Husseinhilmi raised the priority of this task from to Needs Triage.
Husseinhilmi updated the task description.
Husseinhilmi added a subscriber: Husseinhilmi.
Restricted Application added a subscriber: Aklapper. Jan 28 2015, 12:40 PM
Husseinhilmi renamed this task from Space between the final two words in a page with two or more categories tags is removed in the arabic mediawiki to Space between the final two words in a page with two or more category tags is removed in the arabic mediawiki. Jan 28 2015, 12:42 PM
Husseinhilmi set Security to None.
Aklapper renamed this task from Space between the final two words in a page with two or more category tags is removed in the arabic mediawiki to Space between final 2 words in a page with ≥2 category tags is removed in arabic mediawiki. Jan 28 2015, 8:06 PM
Aklapper triaged this task as Low priority.
Louperivois raised the priority of this task from Low to High. Edited May 29 2017, 4:34 AM
Louperivois added a subscriber: Louperivois.

In fact, the bad handling happens earlier than this line. Arabic has $useLinkPrefixExtension = true because determiners may not be separated from their nouns. As soon as the regex $e2 is used in the Parser to redefine $s, the last word of the article is treated as a prefix for the category "link" and therefore disappears along with that link (print($s) just after the regex shows that the last word is already gone at this point).
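
The interaction described above can be illustrated with a small Python toy model. This is only a sketch under stated assumptions, not MediaWiki's actual Parser.php code: the names `render`, `LINK_PREFIX`, and `CATEGORY` are hypothetical, the prefix regex is a simplified stand-in for `$e2`, and the rule for dropping whitespace-only trails after a category tag is assumed. With those assumptions, the model reproduces the reported symptom, including the need for at least two category tags.

```python
import re

# Hypothetical stand-in for the link-prefix regex ($e2): split the text
# into (everything before, trailing run of word characters).
LINK_PREFIX = re.compile(r"^(.*?)(\w+)\Z", re.DOTALL)

# Simplified category tag matcher (English and Arabic namespace names only).
CATEGORY = re.compile(r"\[\[(?:Category|تصنيف):[^\]]*\]\]")


def render(wikitext: str, use_link_prefix: bool) -> str:
    """Toy model of category handling; not the real parser."""
    pieces = CATEGORY.split(wikitext)  # text segments around category tags
    out = pieces[0]
    for trail in pieces[1:]:
        prefix = ""
        if use_link_prefix:
            m = LINK_PREFIX.match(out)
            if m:
                # The last word is split off as a "prefix" of the category link.
                out, prefix = m.group(1), m.group(2)
        # Modeled on: $s = rtrim( $s . "\n" );
        out = (out + "\n").rstrip()
        # Assumed rule: a trail consisting only of newlines is dropped.
        if (prefix + trail).strip("\n") != "":
            out += prefix + trail
    return out


sample = (
    "testing space bug\n\n"
    "testing space bug\n\n"
    "[[تصنيف:الآداب]]\n"
    "[[تصنيف:شعر عربي]]\n"
)
print(repr(render(sample, use_link_prefix=True)))
# -> 'testing space bug\n\ntesting spacebug\n'
```

In this model the first category tag strips the trailing newlines, so the text now ends in a word. At the second tag that word is split off as a prefix, rtrim removes the space that preceded it, and the prefix is appended again without the space. With `use_link_prefix=False`, or with only one category tag, the space survives.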

cscott added a comment. Edited Aug 13 2017, 4:05 PM

Test page at https://ar.wikipedia.org/wiki/%D9%85%D8%B3%D8%AA%D8%AE%D8%AF%D9%85:Cscott/T87753

This seems to be an over-aggressive link prefix match, as it was explained to me.

Something like:
" foo\n[[category:bar]]" -> "[[category:bar| foo\n]]"
?

This seems to be an unintended consequence of a fix for T2087 in ancient times.

@tstarling @ssastry This might be relevant to the Tidy cleanup as well, as the whitespace removal around [[Category]] tags performed in T2087 might interact with tidy's whitespace cleanup?

This was brought to my attention at Wikimania 2017 by a user of Arabic Wikipedia.

cscott added a comment. Edited Aug 13 2017, 4:18 PM

Parsoid does not seem to have this bug; it appears to implement the "feature" of T2087 differently.

$ (echo "foo bar"; echo "foo bar" ; echo "[[Category:foo]]" ; echo "[[Category:bar]]" ) | bin/parse.js --domain ar.wikipedia.org
<!DOCTYPE html>
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/"><head prefix="mwr: http://ar.wikipedia.org/wiki/Special:Redirect/"><meta charset="utf-8"/><meta property="mw:pageNamespace" content="0"/><meta property="mw:html:version" content="1.5.0"/><link rel="dc:isVersionOf" href="//ar.wikipedia.org/wiki/Main%20Page"/><title></title><base href="//ar.wikipedia.org/wiki/"/><link rel="stylesheet" href="//ar.wikipedia.org/w/load.php?modules=mediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Cskins.vector.styles%7Csite.styles%7Cext.cite.style%7Cmediawiki.page.gallery.styles&amp;only=styles&amp;skin=vector"/></head><body data-parsoid='{"dsr":[0,50,0,0]}' lang="ar" class="mw-content-rtl sitedir-rtl rtl mw-body-content parsoid-body mediawiki mw-parser-output" dir="rtl"><p data-parsoid='{"dsr":[0,15,0,0]}'>foo bar
foo bar</p>
<link rel="mw:PageProp/Category" href="./تصنيف:Foo" data-parsoid='{"stx":"simple","a":{"href":"./تصنيف:Foo"},"sa":{"href":"Category:foo"},"dsr":[16,32,null,null]}'/>
<link rel="mw:PageProp/Category" href="./تصنيف:Bar" data-parsoid='{"stx":"simple","a":{"href":"./تصنيف:Bar"},"sa":{"href":"Category:bar"},"dsr":[33,49,null,null]}'/>
</body></html>

Note that there are newlines surrounding the <link> tags in Parsoid output. Could that show up as whitespace in the output in certain situations?

Change 371735 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Fix link prefix/suffixes around Category and Language links.

https://gerrit.wikimedia.org/r/371735

Parsoid did this correctly, but it turns out that the PHP code for both language links and categories had bugs. <sigh>

Change 371735 merged by jenkins-bot:
[mediawiki/core@master] Fix link prefix/suffixes around Category and Language links.

https://gerrit.wikimedia.org/r/371735

I reverted that patch on account of it having a serious error in it, as described in T174639, and there was no response from the developer after 1 day.

Arlolra added a subscriber: Arlolra. Sep 1 2017, 1:13 AM

Sorry, @cscott is on vacation this week and I didn't realize it was an "Unbreak Now" situation. Thanks for reverting.

Change 376441 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] [WIP] Fix link prefix/suffixes around Category and Language links (take 2).

https://gerrit.wikimedia.org/r/376441

Change 376441 merged by jenkins-bot:
[mediawiki/core@master] Fix link prefix/suffixes around Category and Language links (take 2).

https://gerrit.wikimedia.org/r/376441

Jdforrester-WMF closed this task as Resolved. May 2 2018, 11:23 PM
Jdforrester-WMF assigned this task to cscott.
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

Presumably the above fixed this?

Restricted Application added a subscriber: alanajjar. May 2 2018, 11:23 PM