
External links surrounded by unicode quotation marks break search index
Closed, Resolved, Public

Description

Author: a.koethur

Description:
When a page contains an external link surrounded by Unicode quotation marks (U+201E double low-9 quotation mark and U+201C left double quotation mark), the article's entry in the searchindex table (field si_text) ends up as an empty string.

To reproduce: add the following text to an article, then save it or update the fulltext index:

„http://example.com“

I've done some investigation.
The first problem arises in includes/search/SearchUpdate.php, starting at line 64, where external URLs are supposed to be stripped. preg_replace destroys the trailing quotation mark and leaves an invalid UTF-8 byte sequence in $text. At a later stage of processing, $text gets truncated to an empty string, presumably because of that invalid sequence.
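
The failure mode can be sketched outside MediaWiki. The snippet below is only an illustration in Python, whose byte regexes behave like a PHP preg_replace call without the /u modifier. The pattern is an assumption modeled on the description in this report (a URL character class that accepts bytes 0xA0-0xFF but not 0x80-0x9F); it is not copied from SearchUpdate.php, and the names URL_BYTES, strip_urls and is_valid_utf8 are hypothetical.

```python
import re

# Assumed URL character class: ASCII URL characters plus high bytes 0xA0-0xFF.
# Bytes 0x80-0x9F are NOT included, which is what splits UTF-8 sequences.
URL_BYTES = rb"A-Za-z0-9_/:.,~%\-+&;#?!=()@\xA0-\xFF"

# Byte-oriented pattern: (preceding byte)(protocol):(URL bytes)(first non-URL byte)
PATTERN = re.compile(
    rb"(^|[^\[])(http|https|ftp):[" + URL_BYTES + rb"]+([^" + URL_BYTES + rb"]|$)"
)


def strip_urls(text: str) -> bytes:
    """Replace each URL with a space, keeping the bytes on either side of it."""
    return PATTERN.sub(rb"\1 \3", text.encode("utf-8"))


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+201E, URL, U+201C: the sample text from this report.
result = strip_urls("\u201ehttp://example.com\u201c")
print(result)                 # the closing quote's lead byte 0xE2 is gone
print(is_valid_utf8(result))  # False
```

Under these assumptions the URL match swallows 0xE2, the first byte of the closing quote, and stops at 0x80, so the continuation bytes 0x80 0x9C are left behind without their lead byte.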


Version: 1.18.x
Severity: major

Details

Reference
bz32712

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 12:04 AM
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz32712.

a.koethur wrote:

Testscript to demonstrate preg_replace misbehavior

Attached:

Yeah, that regex'll be breaking off partway through the e2 80 9c sequence for the closing quote.

Need to either change it to proper unicode support, or let it take anything \x80-\xff. This regex dates back to at least 2003, when we still didn't have UTF-8 on everything. :P
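
Both options can be sketched roughly as follows. This is an illustration in Python rather than the MediaWiki code: the character class is an assumption, the function names are hypothetical, and the thread does not say which option the eventual fix adopted. Option 1 keeps byte matching but treats every byte in 0x80-0xFF as a URL character, so a multi-byte sequence is never split. Option 2 matches on decoded text, so the regex only ever sees whole characters.

```python
import re

# Assumed ASCII URL characters (hypothetical, for illustration only).
ASCII_URL = r"A-Za-z0-9_/:.,~%\-+&;#?!=()@"

# Option 1: byte matching where any high byte 0x80-0xFF counts as a URL byte.
WIDE = ASCII_URL.encode("ascii") + rb"\x80-\xFF"
BYTE_PATTERN = re.compile(
    rb"(^|[^\[])(http|https|ftp):[" + WIDE + rb"]+([^" + WIDE + rb"]|$)"
)

# Option 2: Unicode-aware matching on decoded text.
TEXT_PATTERN = re.compile(
    r"(^|[^\[])(http|https|ftp):[" + ASCII_URL + r"]+([^" + ASCII_URL + r"]|$)"
)


def strip_urls_bytes(text: str) -> str:
    """Option 1: strip URLs at the byte level, then decode the result."""
    stripped = BYTE_PATTERN.sub(rb"\1 \3", text.encode("utf-8"))
    return stripped.decode("utf-8")


def strip_urls_text(text: str) -> str:
    """Option 2: strip URLs on the decoded string."""
    return TEXT_PATTERN.sub(r"\1 \3", text)


sample = "\u201ehttp://example.com\u201c"
print(repr(strip_urls_bytes(sample)))  # closing quote is absorbed into the URL
print(repr(strip_urls_text(sample)))   # closing quote survives
```

In this sketch the two options differ in what happens to the closing quote: option 1 drops it together with the URL but leaves valid UTF-8, while option 2 keeps it. For a search index either result is usable, since the text is no longer corrupted.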

Oh handy -- should be possible to turn that test script into a PHPUnit test case! See https://www.mediawiki.org/wiki/Manual:PHP_unit_testing for some background.

Fixed in r104635 / r104636 on trunk, including a unit test. Thanks for the example!

Merged to REL1_18 branch for 1.18.1 in r104637.

Change 182153 had a related patch set uploaded (by Indielives010):
Clarifies the meaning of the function which tests the bug T34712

https://gerrit.wikimedia.org/r/182153

Patch-For-Review

Change 182153 merged by jenkins-bot:
Clarifies the meaning of the function which tests the bug T34712

https://gerrit.wikimedia.org/r/182153