Page MenuHomePhabricator

Special:Listfiles doesn't find anything when search term contains non-ascii characters
Open, NormalPublic

Description

Special:Listfiles doesn't find anything if the search term contains german umlauts (the letters äöüÄÖÜß) - for example, searching for "Möhre" will give zero results, but I have files like "Möhre-Wilde-Blüte.jpg" and "Pfalz-Möhren-1.jpg".

The problem seems to be somewhere in $dbr->buildLike which replaces the "ö" with "\xc3\xb6". This "\xc3\xb6" doesn't match anything in the database.

At the moment, I'm using the patch pasted below as workaround - but it can give too many results (searching for "möhr" will also find "mähr" or even "mxyr").
Additionally, my patch replaces everything except the listed chars, so you might get even more results if a user searches for a special character.
(I can live with that for now - better than finding nothing ;-)

I'm also not sure where this should be fixed - I added my workaround to SpecialListfiles.php to avoid unintentional side effects, but it would probably be better to fix the code in $dbr->buildLike.

Patch with workaround (see limitations/known issues above)

===================================================================
--- SpecialListfiles.php        (Revision 86040)
+++ SpecialListfiles.php        (Arbeitskopie)
@@ -34,11 +34,13 @@ class ImageListPager / __construct()
                }
                $search = $wgRequest->getText( 'ilsearch' );
                if ( $search != '' && !$wgMiserMode ) {
+$search = preg_replace('/[^a-zA-Z0-9_ .-]/', '@@uml@@', $search);
                        $nt = Title::newFromURL( $search );
                        if( $nt ) {
                                $dbr = wfGetDB( DB_SLAVE );
-                               $this->mQueryConds = array( 'LOWER(img_name)' . $dbr->buildLike( $dbr->anyString(), 
-                                       strtolower( $nt->getDBkey() ), $dbr->anyString() ) );
+$cond = $dbr->buildLike( $dbr->anyString(), strtolower( $nt->getDBkey() ), $dbr->anyString() );
+$cond = str_replace('@@uml@@', '_', $cond);
+                               $this->mQueryConds = array( 'LOWER(img_name)' . $cond );
                        }
                }

Version: 1.16.x
Severity: normal

Details

Reference
bz32207

Event Timeline

bzimport raised the priority of this task from to Normal.
bzimport set Reference to bz32207.
bzimport added a subscriber: Unknown Object (MLST).
Cboltz created this task.Nov 4 2011, 4:41 PM
brion added a comment.Nov 4 2011, 5:54 PM

I can't reproduce this on 1.16 or trunk; the ö passed through buildLike just fine:

brion@stormcloud:~/pages/rel1.16$ php maintenance/eval.php 
> return wfGetDB(DB_MASTER)->buildLike("Pfalz-M\xc3\xb6hren-1.jpg");
 LIKE 'Pfalz-Möhren-1.jpg'

(I used the \xc3\xb6 in the double-quoted string here to ensure it passes through the terminal ok; that is just a plain UTF-8 ö in the string.)

Is there anything special about your database configuration? Are you using MySQL or one of the less-well supported databases? Any special options?

Cboltz added a comment.Nov 4 2011, 6:34 PM

Interestingly it works for me on the shell - the output is "LIKE 'Möhre'".

I'm using MySQL 5.0.67. My config isn't very special IMHO, maybe except this:

[client]
default-character-set=latin1
[mysqld]
default-character-set=latin1
default-collation=latin1_german1_ci

Note that this is just a default, and any client (including mediawiki) can specify the charset to use when connecting to MySQL.

On the mediawiki side, my configuration is quite boring and doesn't contain anything related to the charset.

In the meantime I noticed that PHP's error_log() escapes special characters (like umlauts) - if I just echo out the query, it contains "Möhre" in valid UTF-8.
In other words: there must be something wrong on the way between mediawiki and mysql. Let me check...

show create table page

[...]

`page_title` varchar(255) character set latin1 collate latin1_bin NOT NULL,

[...]
) ENGINE=MyISAM AUTO_INCREMENT=6244 DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci

In other words: The page_title is latin1 (aka ISO-8815-1) in the database, which matches my mySQL defaults.

select page_title from page where page_title like '%hre%';

+----------------------------------------------------+

page_title

+----------------------------------------------------+

Möhre

[...]

Yes, the UTF-8 sequence for "ö" is really displayed as two bytes :-(
Looks like mediawiki didn't tell mysql that it will hand over UTF-8 strings, and MySQL handled them as ISO-8859-1 then...

In case it matters: The wiki was started in 2009 (IIRC MediaWiki 1.14) and updated since then.

Let me post an update on this:

The problem is that the column contains utf8, but is labeled as latin1.

I could fix this by modifying the charset in the database. The trick is to change the field to varbinary first and then to varchar utf8. The varbinary step is needed to avoid that MySQL does an automatic charset conversion which would result in double-encoded utf8.

In SQL, this means:

alter table t modify column f varbinary(255);
alter table t modify column f varchar(255) charset utf8;

That's the easy part. I then found out that I had this problem in various tables and columns, so it took me some hours to do this fix on all of them.

To sum it up: This looks an upgrade problem from a (very?) old version. AFAIK the openSUSE wiki was hit by a similar problem (sorry, I don't know exactly in which MediaWiki version).

Ljonka added a subscriber: Ljonka.EditedSep 16 2015, 2:24 PM

I created a patch here: https://gerrit.wikimedia.org/r/#/c/238745/ please give some feedback if its work for you to.

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 16 2015, 2:24 PM
Ciencia_Al_Poder renamed this task from Special:Listfiles doesn't find anything when search term contains umlauts to Special:Listfiles doesn't find anything when search term contains non-ascii characters.Nov 7 2017, 10:21 AM
Ciencia_Al_Poder updated the task description. (Show Details)
Ciencia_Al_Poder removed a subscriber: wikibugs-l-list.