Can't upload file with non-ASCII name (e.g. Cyrillic) on Windows host
Open, Public

Description

Author: vershigora

Description:
I'm running MediaWiki under Apache 2 on Windows 2k, and I can't upload a file
with a Russian name. The file name gets mangled when MediaWiki saves it to disk,
so the link to the file is wrong and returns a 404. I think the solution is to
convert the Cyrillic file name to translit (http://en.wikipedia.org/wiki/Cyr),
but I'm not a very good PHP programmer.

Sorry for my English.


Version: 1.20.x
Severity: normal
OS: Windows XP
Platform: PC
URL: http://meta.wikimedia.org/wiki/Image:Bug_1780_non_ascii_%C3%A4%C3%B6%C3%BC%C3%9F.png
See Also:
https://bugs.php.net/bug.php?id=33350

bzimport added projects: MediaWiki-Uploading, I18n.Via ConduitNov 21 2014, 8:18 PM
bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz1780.
bzimport created this task.Via LegacyMar 30 2005, 8:36 AM
brion added a comment.Via ConduitMar 30 2005, 8:39 AM

May be a similar issue to bug 362; the OS and filesystem expects certain formatting different from what it's getting (in this case, UTF-8).

bzimport added a comment.Via ConduitMar 30 2005, 2:21 PM

vershigora wrote:

Similar, but not the same. The file was created, but with the wrong name.
Should be: Вера.jpg
But it is: ??????????.jpg (I can't paste the real name, because it contains
wrong characters)

bzimport added a comment.Via ConduitMar 30 2005, 4:35 PM

jeluf wrote:

can you provide a link to your wiki that we could use for testing?

bzimport added a comment.Via ConduitMar 31 2005, 4:47 AM

vershigora wrote:

limp.iceberg-m.ru:81/wiki/

brion added a comment.Via ConduitOct 17 2005, 7:14 AM
*** Bug 3724 has been marked as a duplicate of this bug. ***
bzimport added a comment.Via ConduitMay 26 2006, 11:38 PM

gunter.schmidt wrote:

I have the same bug with V.1.6.5.

Try to upload any image with the name: Bug_1780_non_ascii_äöüß.png (hope you can read this on your system)

I tried to show you on mediawiki, but the bug is not there!
http://meta.wikimedia.org/wiki/Image:Bug_1780_non_ascii_%C3%A4%C3%B6%C3%BC%C3%9F.png

Maybe 1.7 works differently?

brion added a comment.Via ConduitMay 26 2006, 11:49 PM

That's because our site doesn't run on Windows servers.

bzimport added a comment.Via ConduitJan 27 2007, 5:15 PM

codemonk wrote:

The problem persists in MediaWiki 1.8 on Windows XP. Generally everything works
fine on Windows except for this bug, which is very disruptive. Is there a way to
work around it, or is it a permanent incompatibility?

bzimport added a comment.Via ConduitMay 29 2007, 12:49 AM

codemonk wrote:

I've got a temporary workaround (at least for my MediaWiki 1.8.2 on Windows XP), though it is far from perfect and relies on the iconv function.

First, in SpecialUpload.php, in the processUpload() function, right before the closing brace of the last "if( $this->saveUploadedFile(..." block, update the source code as follows:

...

} else {
  $wgOut->showFileNotFoundError( $this->mUploadSaveName );
}
rename( $this->mSavedFile, iconv('UTF-8', 'CP1251', $this->mSavedFile) ); # NEW
}

...

Second, in Image.php, in the reallyRenderThumb() function, in the middle of the "elseif ( $wgUseImageMagick ) {..." block, update the source code as follows:

...
wfDebug("reallyRenderThumb: running ImageMagick: $cmd\n");
if (file_exists(iconv('UTF-8', 'CP1251', $thumbPath)) == false) # NEW
  rmdir( substr_replace($thumbPath, '', strrpos($thumbPath, "/"))); # NEW
mkdir( substr_replace( iconv('UTF-8', 'CP1251', $thumbPath), '', # NEW
  strrpos(iconv('UTF-8', 'CP1251', $thumbPath), "/"))); # NEW
$cmd = iconv('UTF-8', 'CP1251', $cmd); # NEW
wfProfileIn( 'convert' );
...

If you use something other than ImageMagick for image processing, you should transfer the second code fragment to appropriate block and adapt it to that program, if required.

IMPORTANT: If your Windows uses a code page other than Windows-1251, then in the code above you should change 'CP1251' to your code page identifier. And DO NOT use this code on non-Windows machines.
brion added a comment.Via ConduitOct 27 2007, 9:06 PM
*** Bug 11758 has been marked as a duplicate of this bug. ***
Mormegil added a comment.Via ConduitMar 20 2008, 4:31 PM

Created attachment 4734
A basic configurable workaround for this bug

The patch adds a global configuration variable $wgLocalFilesystemCharsetOverride that can be set to the charset of the local file system (e.g. 'CP1250'), and all names of the uploaded files are converted to this charset (using iconv) when talking with the filesystem. However, this works correctly only when the destination filename contains only characters from this charset, so this is not a perfect solution.
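
A minimal sketch of the kind of conversion described above, not the contents of EncodingOverride.diff. The helper name toFilesystemName() is hypothetical; only the $wgLocalFilesystemCharsetOverride setting and the use of iconv come from the comment. It returns null when iconv reports that the name cannot be represented in the target charset, which is the limitation mentioned.

```php
<?php
// Hypothetical helper: convert a UTF-8 file name to the local filesystem
// charset, or return null if iconv cannot represent it in that charset.
function toFilesystemName( string $utf8Name, ?string $charsetOverride ): ?string {
    if ( $charsetOverride === null ) {
        // No override configured: pass the name through untouched.
        return $utf8Name;
    }
    // iconv() returns false (and raises a notice, suppressed here) when it
    // detects a character it cannot convert.
    $converted = @iconv( 'UTF-8', $charsetOverride, $utf8Name );
    return $converted === false ? null : $converted;
}

// Assumed setting, as it might appear in LocalSettings.php.
$wgLocalFilesystemCharsetOverride = 'CP1250';

// Every character of this name exists in CP1250.
var_dump( toFilesystemName( 'Veľká Fatra.jpg', $wgLocalFilesystemCharsetOverride ) !== null );

// "œ" is not part of CP1250, so this name cannot be converted faithfully.
var_dump( toFilesystemName( "Hors d'œuvre.jpg", $wgLocalFilesystemCharsetOverride ) );
```

How the second call fails depends on the iconv implementation PHP was built against, so callers should treat any name that does not convert cleanly as unsupported rather than rely on a particular error.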

But the support for file uploads on Windows (and other OSes) is limited in many other ways (there is no filename syntax checking other than stripping path components, which is far from being sufficient on Windows), anyway.

The correct solution to this might depend on the mysterious image backend rewrite. ;-)

Attached: EncodingOverride.diff

brion added a comment.Via ConduitMar 20 2008, 9:00 PM

Yeah, this would still break with other chars, or if iconv() isn't present... the generated URLs might be wrong, too; depends what charset the web server is going to be expecting!

MaxSem added a comment.Via ConduitJul 25 2008, 6:30 PM
*** Bug 14924 has been marked as a duplicate of this bug. ***
bzimport added a comment.Via ConduitJul 25 2008, 8:24 PM

dj.bauch wrote:

(In reply to comment #13)

> *** Bug 14924 has been marked as a duplicate of this bug. ***

Thanks for redirecting me from bug 14924. The patch attachment for this bug, with code page set to CP1250 in LocalSettings.php seems to fix most of the problems I've been seeing with images on IIS6/SQL Server/Windows 2003/Mediawiki 1.13 -- including the one I identified in my bug submission and several others, such as the recent POTD Image:CT of brain of Mikael Häggström large.png and Image:Bandeira do Município do Rio de Janeiro.png. It does not, however fix all of them. For example:
Image:Ostredok, Veľká Fatra (SVK) - NW slope.jpg (http:.../index.php?title=Image:Ostredok%2C_Ve%C4%BEk%C3%A1_Fatra_%28SVK%29_-_NW_slope.jpg) image still does not show up.
Image:Hors d'œuvre (Bosnian).jpg (Image:Hors_d%27%C5%93uvre_%28Bosnian%29.jpg) causes iconv to complain [function.iconv]: Detected an illegal character in input string in W:\Inetpub\wwwroot\mediawiki\includes\filerepo\File.php on line 68

brion added a comment.Via ConduitOct 6 2008, 5:58 PM
*** Bug 15863 has been marked as a duplicate of this bug. ***
brion added a comment.Via ConduitOct 6 2008, 6:12 PM

DJ, CP1250 is for Central Europe and doesn't include the "œ" character, hence the failure.

"Ostredok, Veľká Fatra (SVK) - NW slope.jpg" presumably ought to work, but it's hard to debug without an instance to check... However...

My suspicions:

  1. It's possibly safest to just create UTF-8 URLs -- that is, don't try to encode the generated URLs to the locale charset. IIS is probably smart enough to detect UTF-8 and load the files correctly (the filesystem stores filenames as UTF-16 Unicode.)
  2. Suddenly I'm not sure whether you actually want the "ANSI" codepage or the "OEM" codepage for filesystem storage. *shudder*

Ugh.

The best thing would probably just be to have a switch to encode filenames in some nice ASCII-safe hex encoding, rather than mess around with charsets.
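
One way such an ASCII-safe hex encoding could look, as a sketch only. The function names and the choice of "~" as the escape character are assumptions made for illustration; nothing like this is specified in the thread. Every byte outside [A-Za-z0-9._-] is written as "~" plus two uppercase hex digits, and "~" itself is escaped, so the mapping is reversible.

```php
<?php
// Hypothetical encoder: the pattern has no /u flag, so it matches single
// bytes and each byte of a multi-byte UTF-8 character is escaped separately.
function encodeFileName( string $name ): string {
    return preg_replace_callback(
        '/[^A-Za-z0-9._-]/',
        static function ( array $m ): string {
            return sprintf( '~%02X', ord( $m[0] ) );
        },
        $name
    );
}

// Hypothetical decoder: turn every "~XX" sequence back into its byte.
function decodeFileName( string $encoded ): string {
    return preg_replace_callback(
        '/~([0-9A-F]{2})/',
        static function ( array $m ): string {
            return chr( hexdec( $m[1] ) );
        },
        $encoded
    );
}

$encoded = encodeFileName( 'Вера.jpg' );
echo $encoded, "\n"; // ~D0~92~D0~B5~D1~80~D0~B0.jpg
var_dump( decodeFileName( $encoded ) === 'Вера.jpg' ); // bool(true)
```

The encoded name contains only ASCII, so it does not depend on the code page of the host. The generated URLs would still have to point at the encoded name.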

bzimport added a comment.Via ConduitOct 6 2008, 9:25 PM

fran wrote:

The problem is in PHP's handling, or lack thereof, of Unicode. NTFS uses UTF-16 internally, as Brion pointed out. The Win32 API provides separate wchar_t-oriented versions of the stdio functions (like _wfopen()) for working with Unicode filenames, while the traditional char versions (like fopen()) translate the current legacy 8-bit code page into the corresponding Unicode representation for backwards compatibility. Unfortunately, PHP's innards are completely eight-bit and have no knowledge of wchar_t stdio, so it's limited to characters in the current code page. :/ Using setlocale() to change the code page to UTF-8 might work, but setlocale() looks very brittle and ugly.

Indeed, mangling Unicode characters to ASCII in a predictable way is probably the best/only way to work around it.
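
The code-page translation described above can be simulated on any platform. This sketch assumes a host whose legacy code page is Windows-1251 (the one used in the earlier workaround); it only imitates the misreading with iconv and does not touch the filesystem.

```php
<?php
// The UTF-8 bytes of the name MediaWiki wants to store.
$utf8Name = 'Вера.jpg';

// Simulate an 8-bit API on a CP1251 system: every byte of the UTF-8 string
// is read as one CP1251 character. iconv() is used here only to render that
// misreading as UTF-8 so it can be printed.
$asSeenByFilesystem = iconv( 'CP1251', 'UTF-8', $utf8Name );
echo $asSeenByFilesystem, "\n";

// Converting to the code page first, as the iconv-based workarounds do,
// hands the 8-bit API the bytes it expects.
$cp1251Name = iconv( 'UTF-8', 'CP1251', $utf8Name );
var_dump( iconv( 'CP1251', 'UTF-8', $cp1251Name ) === $utf8Name );
```

The first conversion yields a different, longer name, which is why the stored file no longer matches the link MediaWiki generates.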

brion added a comment.Via ConduitOct 6 2008, 9:37 PM

dumpHtml uses a fun hack that shells out to a VBScript to rename files to a Unicode destination... That's probably not the nicest way to do it in active use. ;)

http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/DumpHTML/rename-hack.vbs?view=markup

Even if we used such a hack to *create* files, we couldn't *manipulate* them again without doing really weird crap like looking up the 8.3 version of the file path. So ASCII mangling is definitely going to be the safest thing.

bzimport added a comment.Via ConduitOct 6 2008, 10:03 PM

dj.bauch wrote:

Brion, et al.,
Thanks for your attention. I'm hoping that the official mechanism does change to one that's more compatible with Windows. In the meantime, I've switched from CP1250 to 'ISO-8859-1//TRANSLIT' as the character set that gives me the best results. Most images work now, but not all. This also doesn't fix problems with filenames that have '%' in the name. Sometimes that appears to be used to indicate the degree of transparency of some icons on Wikipedia, and I've had no luck getting those to display.

demon added a comment.Via ConduitApr 2 2010, 2:58 PM
*** Bug 23028 has been marked as a duplicate of this bug. ***
TheDJ added a comment.Via ConduitNov 9 2010, 8:23 PM

Apparently PHP 6 will have full unicode support:

http://bugs.php.net/bug.php?id=46990

I can't believe that something like PHP still has bugs like this. I ran into it today trying to help a user understand why his images were not working, and first we suspected it was just instantcommons, but eventually tracked it down to this issue.

demon added a comment.Via ConduitDec 14 2010, 1:53 PM

PHP6 is dead, so who knows when this will be fixed.

In the meantime, I'd suggest adding a warning to Special:Upload when wfIsWindows() and you try to upload a file with unicode in the name.
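
A rough sketch of what such a check could look like. All function names and the message text are hypothetical, and isWindowsHost() is only a stand-in for wfIsWindows() so that the example runs outside MediaWiki.

```php
<?php
// Stand-in for MediaWiki's wfIsWindows(), so the sketch runs on its own.
function isWindowsHost(): bool {
    return strtoupper( substr( PHP_OS, 0, 3 ) ) === 'WIN';
}

// Hypothetical helper: true if the name contains any byte outside
// printable ASCII (0x20 to 0x7E).
function hasNonAsciiChars( string $fileName ): bool {
    return preg_match( '/[^\x20-\x7E]/', $fileName ) === 1;
}

// Hypothetical check: returns a warning message, or null if the name is fine.
function getWindowsFileNameWarning( string $fileName, bool $isWindows ): ?string {
    if ( $isWindows && hasNonAsciiChars( $fileName ) ) {
        return 'File names with non-ASCII characters cannot be stored reliably on this server.';
    }
    return null;
}

var_dump( getWindowsFileNameWarning( 'Вера.jpg', true ) !== null ); // bool(true)
var_dump( getWindowsFileNameWarning( 'Vera.jpg', true ) );          // NULL
var_dump( getWindowsFileNameWarning( 'Вера.jpg', isWindowsHost() ) );
```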

bzimport added a comment.Via ConduitMay 15 2011, 12:38 PM

Bryan.TongMinh wrote:

I forbade uploading files with non-ASCII names on Windows in r88165.

bzimport added a comment.Via ConduitMay 15 2011, 5:07 PM

paolobenve wrote:

Well, this isn't a fix, it's a limitation...

bzimport added a comment.Via ConduitMay 15 2011, 5:56 PM

Bryan.TongMinh wrote:

It's a fix in the sense that it is no longer possible to upload a file which then can't be viewed anymore. A proper fix would be to make PHP use wide character functions.

brion added a comment.Via ConduitAug 15 2011, 8:13 PM

Reopening -- doesn't seem to fix it, just makes some of your pages platform-dependent.

bzimport added a comment.Via ConduitAug 16 2011, 7:32 AM

Bryan.TongMinh wrote:

(In reply to comment #26)

> Reopening -- doesn't seem to fix it, just makes some of your pages
> platform-dependent.

A way to fix this would be to make filenames on disk no longer map to titles. We have a bug open for that somewhere.

bzimport added a comment.Via ConduitNov 10 2011, 3:06 AM

sumanah wrote:

Thank you for the patch, Mormegil.
(In reply to comment #12)
Adding the "reviewed" keyword. Also adding the internationalization keyword so the internationalisation/localisation team knows to look at this bug.

demon added a comment.Via ConduitNov 10 2011, 3:18 AM

Not really an i18n bug, it's an issue with filerepo.

bzimport added a comment.Via ConduitJan 1 2013, 9:34 PM

Bryan.TongMinh wrote:

This is now finally fixable with the filebackend!

I'm thinking about writing a custom backend which implements [[quoted-printable]] encoding. Any opinions on the encoding to use? It's a pity that the filebackend implements a listFiles method, otherwise we could have simply used a one-way hashing function.

aaron added a comment.Via ConduitFeb 2 2013, 12:24 AM

(In reply to comment #30)

> This is now finally fixable with the filebackend!
>
> I'm thinking about writing a custom backend which implements
> [[quoted-printable]] encoding. Any opinions on the encoding to use? It's a
> pity that the filebackend implements a listFiles method, otherwise we could
> have simply used a one-way hashing function.

Why not add that to FSFileBackend in the form of configurable escape/unescape functions? The default ones could just pass through the raw input. One issue with any encoding scheme is handling URLs correctly, so users get file/thumbnail URLs that actually are mapped to the encoded file names. I suppose a redirection module could be used. img_auth and thumb_handler would cover some of the obvious cases, though they don't handle RANGE requests. Another option would be a redirector module which would redirect requests to the encoded URL. CDN caching would be slightly trickier in any case.
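
To make the idea of configurable escape/unescape functions concrete, here is a sketch under stated assumptions. PathEncoder is a hypothetical class, not part of FSFileBackend or any MediaWiki API, and bin2hex/hex2bin are used only as an example pair of reversible, ASCII-only functions.

```php
<?php
// Hypothetical wrapper holding a pair of escape/unescape callables.
class PathEncoder {
    /** @var callable */
    private $escape;
    /** @var callable */
    private $unescape;

    public function __construct( ?callable $escape = null, ?callable $unescape = null ) {
        // Default: pass the raw input through unchanged.
        $identity = static function ( string $s ): string {
            return $s;
        };
        $this->escape = $escape ?? $identity;
        $this->unescape = $unescape ?? $identity;
    }

    // Name as it would be written to disk.
    public function toDisk( string $name ): string {
        return ( $this->escape )( $name );
    }

    // Name as it would be reported back, e.g. when listing files.
    public function fromDisk( string $name ): string {
        return ( $this->unescape )( $name );
    }
}

$default = new PathEncoder();
$encoding = new PathEncoder( 'bin2hex', 'hex2bin' );

var_dump( $default->toDisk( 'Вера.jpg' ) === 'Вера.jpg' ); // bool(true)
echo $encoding->toDisk( 'Вера.jpg' ), "\n"; // d092d0b5d180d0b02e6a7067
var_dump( $encoding->fromDisk( $encoding->toDisk( 'Вера.jpg' ) ) === 'Вера.jpg' ); // bool(true)
```

Because the escape function has an inverse, a file listing can still be mapped back to the original names, which a one-way hash would not allow.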

It's hard to resist saying "just use Linux" though...

That said, it would be nice if FileRepo stored files based on hash and used a redirection or service layer to make readable URLs to files anyway. It would solve a lot of problems like weird race conditions, the poor performance and lack of atomicity for file moves/deletes/undeletes and re-uploads (especially for large files or if there are many versions), and issues like this bug as well (what characters a system allows). That's another story though...
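
As an illustration of hash-based storage names, the following sketch derives an ASCII-only path from a title. The function name and the directory layout are assumptions made for this example and are not taken from FileRepo or from any patch mentioned in this thread.

```php
<?php
// Hypothetical layout: <first hex digit>/<first two hex digits>/<sha1>[.ext]
function hashedStoragePath( string $title ): string {
    $hash = sha1( $title );

    // Keep the extension only if it is plain ASCII letters or digits.
    $ext = '';
    if ( preg_match( '/\.([A-Za-z0-9]+)$/', $title, $m ) === 1 ) {
        $ext = '.' . strtolower( $m[1] );
    }

    return $hash[0] . '/' . substr( $hash, 0, 2 ) . '/' . $hash . $ext;
}

echo hashedStoragePath( 'Вера.jpg' ), "\n";
var_dump( hashedStoragePath( 'Вера.jpg' ) === hashedStoragePath( 'Вера.jpg' ) ); // bool(true)
```

The path never contains characters the filesystem might reject, but the title can no longer be recovered from it, so the mapping from title to path has to be stored elsewhere.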

bzimport added a comment.Via ConduitApr 24 2013, 11:27 AM

orbartal wrote:

How to fix the bug for Hebrew (and for any other language that Windows supports)

  1. In Windows, change the language for non-Unicode programs to the language of your MediaWiki, i.e. the language of the file names you wish to upload. Usually it is the same as the $wgLanguageCode language. See how on this link.
  2. On Windows, file names passed through PHP are interpreted in a legacy code page, not ASCII or UTF-8. Check the appropriate code page for your language. For Hebrew I used windows-1255.
  3. Edit the MediaWiki core code and make these 4 changes. Be sure to use the code page for your language rather than windows-1255. I used windows-1255 for Hebrew, but you might need something else.

a. Remove (or comment out) the check added by Bryan Tong Minh that prevents uploading files with non-ASCII names on Windows. The steps below fix the bug, so that filter is no longer required.
See details: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/88165
MediaWiki/includes/upload/UploadBase.php line 756.
b. Go to the source file MediaWiki/includes/filebackend/FSFileBackend.php. In class FileBackendStore, in function FileBackendStore::doStoreInternal at line 206, add the following lines:

if (strtoupper(substr(PHP_OS, 0, 3)) == 'WIN')
{
    $charSetArr = array("ASCII", "JIS", "EUC-JP", "UTF-8", "UTF-16", "windows-1251",
        "ISO-8859-1", "GBK");
    if (mb_detect_encoding($dest, $charSetArr) == "UTF-8")
    {
        $dest = iconv("UTF-8", "windows-1255", $dest);
    }
}
Just before the command that copies the file to the path:
$ok = copy( $params['src'], $dest );

Now you can upload files and images with Hebrew names, but you can't view them as thumbnails. Two more similar code fixes are required to complete the task.

c. Go to the source file MediaWiki\includes\filerepo\file\File.php. In class File, in function File::transform at line 623, add the following lines:
if (strtoupper(substr(PHP_OS, 0, 3)) == 'WIN')
{
    $charSetArr = array("ASCII", "JIS", "EUC-JP", "UTF-8", "UTF-16", "windows-1251",
        "ISO-8859-1", "GBK");
    if (mb_detect_encoding($thumbPath, $charSetArr) == "UTF-8")
    {
        $thumbPath = iconv("UTF-8", "windows-1255", $thumbPath);
    }
}
Right after the command that returns the full path of the thumbnail file:
$thumbPath = $this->getThumbPath( $thumbName ); // final thumb path
d. Go to the source file MediaWiki\includes\media\Bitmap.php. In class BitmapHandler, in function BitmapHandler::transformGd at line 548, add the following lines:
if (strtoupper(substr(PHP_OS, 0, 3)) == 'WIN')
{
    $charSetArr = array("ASCII", "JIS", "EUC-JP", "UTF-8", "UTF-16", "windows-1251",
        "ISO-8859-1", "GBK");
    if (mb_detect_encoding($params['srcPath'], $charSetArr) == "UTF-8")
    {
        $params['srcPath'] = iconv("UTF-8", "windows-1255", $params['srcPath']);
    }
}
Right before the command that tests whether the file exists in that location:
if ( !file_exists( $params['srcPath'] ) )

bzimport added a comment.Via ConduitApr 24 2013, 11:34 AM

orbartal wrote:

How to upload file with non-ASCII name on Windows host

How to enable upload file with non-ASCII name on Windows host with just 3 simple changes to the wiki server.

Attached: How_to_fix_the_bug_in_Hebrew.pdf

bzimport added a comment.Via ConduitSep 9 2013, 9:38 PM

Bryan.TongMinh wrote:

(In reply to comment #31)

> (In reply to comment #30)
> > This is now finally fixable with the filebackend!
> >
> > I'm thinking about writing a custom backend which implements
> > [[quoted-printable]] encoding. Any opinions on the encoding to use? It's a
> > pity that the filebackend implements a listFiles method, otherwise we could
> > have simply used a one-way hashing function.
>
> Why not add that to FSFileBackend in the form of configurable escape/unescape
> functions? The default ones could just pass through the raw input. One issue
> with any encoding scheme is handling URLs correctly, so users get
> file/thumbnail urls that actually are mapped to the encoded file names. I
> suppose a redirection module could be used. img_auth and thumb_handler would
> cover some of the obvious cases, though they don't handle RANGE requests.
> Another option would be a redirector module which would redirect requests to
> the encoded URL. CDN caching would be slightly trickier in any case.
>
> It's hard to resist saying "just use Linux" though...
>
> That said, it would be nice if FileRepo stored files based on hash and used a
> redirection or service layer to make readable URLs to files anyway. It would
> solve a lot of problems like weird race conditions, the poor performance and
> lack of atomicity for file moves/deletes/undeletes and re-uploads (especially
> for large files or if there are many versions), and issues like this bug as
> well (what characters a system allows). That's another story though...

I would not add a complicated redirector, but just modify File::getUrl() to apply the encoding. I can't really find out though if there currently is any interaction between filerepo and filebackend regarding the file url.

brion added a comment.Via ConduitSep 12 2013, 5:53 PM

So I found an old upstream bug from 2005 on the low-level API problem here:
https://bugs.php.net/bug.php?id=33350

Added a comment that this is still a live issue. :)

bzimport added a comment.Via ConduitSep 12 2013, 6:05 PM

Bryan.TongMinh wrote:

As an alternative to hacking filebackend, we could wrap the FileSystemObject using PHP's COM extension, if somebody really wants to put effort into this ;)

gerritbot added a comment.Via ConduitApr 12 2014, 10:28 PM

Change 125573 had a related patch set uploaded by Aaron Schulz:
[WIP] Added path encoding to FileBackendStore for Windows support

https://gerrit.wikimedia.org/r/125573

gerritbot added a comment.Via ConduitMay 8 2014, 8:59 PM

Change 132298 had a related patch set uploaded by Aaron Schulz:
Added better path encoding to FileBackend for Windows

https://gerrit.wikimedia.org/r/132298

gerritbot added a comment.Via ConduitMay 8 2014, 9:01 PM

Change 125573 abandoned by Aaron Schulz:
Added path encoding to FileBackendStore for Windows support

Reason:
Mostly not needed since given the SHA1 storage name patch, which also handles the same problem and more

https://gerrit.wikimedia.org/r/125573

Bawolff added a comment.Via ConduitJul 19 2014, 3:18 PM
*** Bug 68268 has been marked as a duplicate of this bug. ***
bzimport added a comment.Via ConduitOct 5 2014, 12:26 PM

dgiim wrote:

I am using MediaWiki in a Korean environment.

When will this be completely fixed?

I have worked around the upload problem with a hack.

But I cannot see the thumbnails.

Please help.

bzimport added a comment.Via ConduitOct 5 2014, 4:39 PM

orbartal wrote:

Try using the instructions in the PDF file "How to fix the bug in Hebrew". They work for all languages, not just for Hebrew, and they fix the thumbnail bug as well. Tell me if it works. If it doesn't, I will try to help you solve it.

bzimport added a comment.Via ConduitOct 6 2014, 5:11 AM

dgiim wrote:

First of all, thank you for the quick response, orbartal.

I changed File.php as follows to get thumbnails to display:

...
$thumbPath = $this->getThumbPath( $thumbName ); // final thumb path
// CP949 is the Windows charset for Hangul (Korean characters).
$thumbPath = iconv( "UTF-8", "CP949", $thumbPath );
...

I also changed Bitmap.php as follows:

...
$params['srcPath'] = iconv( "UTF-8", "CP949", $params['srcPath'] );
if ( !file_exists( $params['srcPath'] ) ) {
...

Currently, uploading files with Hangul names works well. However, no thumbnail is displayed. Instead, the following error is shown in the thumbnail spot: 'filemissing'

Please help me!

  • MediaWiki Version: 1.23.3
  • System: Windows 7 (hangul)

Thank you.

jayvdb added a comment.Via ConduitOct 7 2014, 3:29 AM

(In reply to Gerrit Notification Bot from comment #40)

> Change 125573 abandoned by Aaron Schulz:
> Added path encoding to FileBackendStore for Windows support
>
> Reason:
> Mostly not needed since given the SHA1 storage name patch, which also
> handles the same problem and more
>
> https://gerrit.wikimedia.org/r/125573

That patch has been abandoned, but I have asked on the changeset whether the patch might still be useful for older versions of MediaWiki which have this bug.

gerritbot added a comment.Via ConduitOct 7 2014, 5:12 PM

Change 125573 restored by Aaron Schulz:
Added path encoding to FileBackendStore for Windows support

Reason:
Rebasing (then closing again)

https://gerrit.wikimedia.org/r/125573

gerritbot added a comment.Via ConduitOct 7 2014, 5:48 PM

Change 125573 abandoned by Aaron Schulz:
Added path encoding to FileBackendStore for Windows support

https://gerrit.wikimedia.org/r/125573

Gilles added a project: Multimedia.Via WebNov 24 2014, 3:41 PM
epriestley added a commit: Unknown Object (Commit).Via DaemonsMar 4 2015, 8:23 AM
