Space between final 2 words in a page with ≥2 category tags is removed in arabic mediawiki
Closed, Resolved (Public)

Description

This problem occurs on the Arabic MediaWiki. When a page's text ends with two or more category tags and has no full stop at the end of the text, the rendered page loses the space between the final two words, which makes them appear as a single word.

Tracing the problem shows that it occurs at this line in Parser.php:
$s = rtrim( $s . "\n" ); # bug 87

It appears that the error occurs when rtrim removes the whitespace produced by the tags.

You can reproduce the bug by creating a page on the Arabic wiki that meets the conditions above. I tried the following on the Arabic sandbox page.

"
testing space bug

testing space bug

[[تصنيف:الآداب]]
[[تصنيف:شعر عربي]]
"
outputs as:
"
testing space bug

testing spacebug
"

And

"
معاركنا انتهت أفلا تراني

رميت مهنّدي وكسرت رمحي

[[تصنيف:الآداب]]
[[تصنيف:شعر عربي]]
"

outputs as:
"
معاركنا انتهت أفلا تراني

رميت مهنّدي وكسرترمحي
"

Event Timeline

Husseinhilmi raised the priority of this task from to Needs Triage.
Husseinhilmi updated the task description.
Husseinhilmi subscribed.
Husseinhilmi renamed this task from Space between the final two words in a page with two or more categories tags is removed in the arabic mediawiki to Space between the final two words in a page with two or more category tags is removed in the arabic mediawiki. Jan 28 2015, 12:42 PM
Husseinhilmi set Security to None.
Aklapper renamed this task from Space between the final two words in a page with two or more category tags is removed in the arabic mediawiki to Space between final 2 words in a page with ≥2 category tags is removed in arabic mediawiki. Jan 28 2015, 8:06 PM
Aklapper triaged this task as Low priority.
Louperivois raised the priority of this task from Low to High. Edited May 29 2017, 4:34 AM
Louperivois subscribed.

In fact, the bad handling happens earlier than this line. Arabic has $useLinkPrefixExtension = true because determiners may not be separated from their nouns. As soon as the regex $e2 is used in the Parser to redefine $s, the last word of the article is treated as a prefix for the category "link" and therefore disappears along with it (print($s) just after the regex shows that the last word has already disappeared at this point).
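
To make the interaction concrete, here is a rough Python model of how a link-prefix split combined with the rtrim could produce "spacebug". It is a sketch, not MediaWiki's PHP: the character class, both regexes and the function name render_legacy are assumptions made for illustration, and only category links are handled.

```python
import re

# Assumed link-prefix character class (Latin letters plus the Arabic block).
# The real value comes from the Arabic language configuration.
PREFIX_CHARS = r"a-zA-Z\u0600-\u06FF"

# Hypothetical stand-in for the prefix regex: split the text into
# (everything before, trailing run of prefix characters).
PREFIX_RE = re.compile(rf"^(.*?)([{PREFIX_CHARS}]+)\Z", re.DOTALL)

# Simplified category-link matcher, for this sketch only.
CATEGORY_RE = re.compile(r"\[\[(?:Category|تصنيف):[^\]]*\]\]")


def render_legacy(text: str) -> str:
    """Model the text that survives around category links."""
    matches = list(CATEGORY_RE.finditer(text))
    if not matches:
        return text
    out = text[: matches[0].start()]
    for i, m in enumerate(matches):
        # The "trail" is whatever follows this link up to the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        trail = text[m.end():end]
        # Link-prefix step: a word touching "[[" is pulled out of the text.
        prefix = ""
        pm = PREFIX_RE.match(out)
        if pm:
            out, prefix = pm.group(1), pm.group(2)
        # Counterpart of rtrim( $s . "\n" ): drop trailing whitespace.
        out = out.rstrip()
        # Re-append prefix and trail unless they are only newlines.
        if (prefix + trail).strip("\n") != "":
            out += prefix + trail
    return out


page = (
    "testing space bug\n\n"
    "testing space bug\n\n"
    "[[تصنيف:الآداب]]\n"
    "[[تصنيف:شعر عربي]]"
)
print(repr(render_legacy(page)))  # 'testing space bug\n\ntesting spacebug'
```

In this model the first category link trims the trailing newlines, so the last word ends up directly in front of the second "[[". It is then taken as a link prefix, the space before it is trimmed, and the word is glued back on. A trailing full stop stops the prefix match, and a single category link never puts a word next to "[[", which is consistent with the conditions in the description.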

Test page at https://ar.wikipedia.org/wiki/%D9%85%D8%B3%D8%AA%D8%AE%D8%AF%D9%85:Cscott/T87753

This seems to be an over-aggressive link prefix match, as it was explained to me.

Something like:
" foo\n[[category:bar]]" -> "[[category:bar| foo\n]]"
?

This seems to be an unintended consequence of a fix for T2087 in ancient times.

@tstarling @ssastry This might be relevant to the Tidy cleanup as well, as the whitespace removal around [[Category]] tags performed in T2087 might interact with tidy's whitespace cleanup?

This was brought to my attention at Wikimania 2017 by a user of Arabic Wikipedia.

Parsoid does not seem to have this bug; it appears to implement the "feature" of T2087 differently.

$ (echo "foo bar"; echo "foo bar" ; echo "[[Category:foo]]" ; echo "[[Category:bar]]" ) | bin/parse.js --domain ar.wikipedia.org
<!DOCTYPE html>
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/"><head prefix="mwr: http://ar.wikipedia.org/wiki/Special:Redirect/"><meta charset="utf-8"/><meta property="mw:pageNamespace" content="0"/><meta property="mw:html:version" content="1.5.0"/><link rel="dc:isVersionOf" href="//ar.wikipedia.org/wiki/Main%20Page"/><title></title><base href="//ar.wikipedia.org/wiki/"/><link rel="stylesheet" href="//ar.wikipedia.org/w/load.php?modules=mediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Cskins.vector.styles%7Csite.styles%7Cext.cite.style%7Cmediawiki.page.gallery.styles&amp;only=styles&amp;skin=vector"/></head><body data-parsoid='{"dsr":[0,50,0,0]}' lang="ar" class="mw-content-rtl sitedir-rtl rtl mw-body-content parsoid-body mediawiki mw-parser-output" dir="rtl"><p data-parsoid='{"dsr":[0,15,0,0]}'>foo bar
foo bar</p>
<link rel="mw:PageProp/Category" href="./تصنيف:Foo" data-parsoid='{"stx":"simple","a":{"href":"./تصنيف:Foo"},"sa":{"href":"Category:foo"},"dsr":[16,32,null,null]}'/>
<link rel="mw:PageProp/Category" href="./تصنيف:Bar" data-parsoid='{"stx":"simple","a":{"href":"./تصنيف:Bar"},"sa":{"href":"Category:bar"},"dsr":[33,49,null,null]}'/>
</body></html>

Note that there are newlines surrounding the <link> tags in Parsoid output. Could that show up as whitespace in the output in certain situations?

Change 371735 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Fix link prefix/suffixes around Category and Language links.

https://gerrit.wikimedia.org/r/371735

Parsoid did this correctly, but it turns out that the PHP code for both language links and categories had bugs. <sigh>

Change 371735 merged by jenkins-bot:
[mediawiki/core@master] Fix link prefix/suffixes around Category and Language links.

https://gerrit.wikimedia.org/r/371735

I reverted that patch because it contained a serious error, as described in T174639, and there was no response from the developer after 1 day.

Sorry, @cscott is on vacation this week and I didn't realize it was an unbreak now situation. Thanks for reverting.

Change 376441 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] [WIP] Fix link prefix/suffixes around Category and Language links (take 2).

https://gerrit.wikimedia.org/r/376441

Change 376441 merged by jenkins-bot:
[mediawiki/core@master] Fix link prefix/suffixes around Category and Language links (take 2).

https://gerrit.wikimedia.org/r/376441

Jdforrester-WMF assigned this task to cscott.
Jdforrester-WMF subscribed.

Presumably the above fixed this?