Zimlib is not able to deal correctly with unicode path on Windows
Closed, InvalidPublic
Actions

Assigned To

None

Authored By

	Kelson
	Aug 27 2013, 11:13 AM

Description

Windows uses utf16 to encode unicode paths. The zimlib uses char* to store the paths. For this reason, if you are unlucky (so use paths with characters over the first 8 bits) you are not able to open your ZIM file using the zimlib. We should provide a way to open paths in wchar_t* format, at least on Windows.

Version: unspecified
Severity: major

Details

Reference: bz53398

Event Timeline

• bzimport raised the priority of this task from to Lowest.Nov 22 2014, 1:53 AM

• bzimport added a project: openZIM-zimlib.

• bzimport set Reference to bz53398.

• bzimport added a subscriber: Unknown Object (MLST).

Kelson created this task.Aug 27 2013, 11:13 AM

Can this be done with the standard POSIX file APIs? Or do we need to use Win32 APIs for file access? :P

(Use of legacy 8-bit encodings in POSIX APIs is a problem for PHP, and thus MediaWiki, when running on Windows servers as well.)

Sounds like you gotta use the Win32 wide-char APIs to get Unicode filename support... at least that's what glib does on Windows for this reason -- https://developer.gnome.org/glib/2.26/glib-File-Utilities.html#glib-File-Utilities.description

lvandeve wrote:

Hello,

I have done a little bit of research on this (I must admit it was on a Linux computer however).

I hope this may help at least with fixing this issue.

Most File, FileImpl, zim::ifstream and zim::streambuf works with string. There is one intermediate string -> const char* -> string conversion in FileImpl, which the attached patch removes (to use string for everything). This to hopefully support the NUL-characters that may appear in the octets of UTF-16 strings put in a 8-bit char string.

In addition, the patch has a comment at the locations in fstream.cpp where probably different Windows API needs to be called.

I think that this way, file.h does not need to be changed to have wchar_t or wstring constructors.

What follows below the line is the patch:

diff --git a/zimlib/include/zim/file.h b/zimlib/include/zim/file.h
index a6ac75b..8aa5f0d 100644

a/zimlib/include/zim/file.h

+++ b/zimlib/include/zim/file.h
@@ -39,7 +39,7 @@ namespace zim

File()
  { }
explicit File(const std::string& fname)

: impl(new FileImpl(fname.c_str()))

+ : impl(new FileImpl(fname))

  { }
 
const std::string& getFilename() const   { return impl->getFilename(); }

diff --git a/zimlib/include/zim/fileimpl.h b/zimlib/include/zim/fileimpl.h
index 1cf584d..ada65ee 100644

a/zimlib/include/zim/fileimpl.h

+++ b/zimlib/include/zim/fileimpl.h
@@ -53,7 +53,7 @@ namespace zim

  offset_type getOffset(offset_type ptrOffset, size_type idx);
 
public:

explicit FileImpl(const char* fname);

+ explicit FileImpl(const std::string& fname);

time_t getMTime() const   { return zimFile.getMTime(); }

diff --git a/zimlib/src/fileimpl.cpp b/zimlib/src/fileimpl.cpp
index 8c072eb..d7a7f4a 100644

a/zimlib/src/fileimpl.cpp

+++ b/zimlib/src/fileimpl.cpp
@@ -38,7 +38,7 @@ namespace zim

//////////////////////////////////////////////////////////////////////
// FileImpl
//

FileImpl::FileImpl(const char* fname)

+ FileImpl::FileImpl(const std::string& fname)

: zimFile(fname),
  direntCache(envValue("ZIM_DIRENTCACHE", DIRENT_CACHE_SIZE)),
  clusterCache(envValue("ZIM_CLUSTERCACHE", CLUSTER_CACHE_SIZE))

diff --git a/zimlib/src/fstream.cpp b/zimlib/src/fstream.cpp
index 5ce72f5..e4accf8 100644

a/zimlib/src/fstream.cpp

+++ b/zimlib/src/fstream.cpp
@@ -59,6 +59,24 @@ namespace zim
//
streambuf::OpenfileInfo::OpenfileInfo(const std::string& fname_)

: fname(fname_),

+ I think regular std::string with 8-bit characters
+ is OK up to this point, that is, in the ctor and fields of File,
+ FileImpl, ifstream, streambuf etc... (that is, no wchar_t or wstring
+ necessary there and no API change needed), the string just may have
+ NULL-characters if bytes of UTF-16 are present, but std::string
+ supports that (const char* does not).
+ So c_str() or any C/C++ API that uses C strings should not be used at
+ this point or anywhere before.
+ So here a Windows-specific API that supports UTF-16 is needed for
+ _WIN32, and the fname_ string needs to converted here to what it
+ needs if necessary. That is still TODO, hence the comment here.
+ Apparently the fname received here on Windows is UTF-16 encoded with
+ two chars per wchar_t, if used by Kiwix (the other possibility would
+ have been that it was UTF-8 encoded and needed to be converted to
+ UTF-16). Maybe a comment in the File constructor could also be nice
+ to ensure all users will be consistent with this?
+ This concludes my investigation, I hope it can help fixing the
+ issue :)
#ifdef HAVE_OPEN64

fd(::open64(fname.c_str(), O_RDONLY | O_LARGEFILE | O_BINARY))

#else
@@ -293,6 +311,8 @@ time_t streambuf::getMTime() const

if (mtime || files.empty())
  return mtime;

+ See comment under streambuf::OpenfileInfo::OpenfileInfo about UTF-16:
+ the same applies here as well to the c_str() call and stat.

const char* fname = files.front()->fname.c_str();

#ifdef HAVE_STAT64

The whole bug has been moved here:
https://github.com/openzim/libzim/issues/74

Quite optimistic to get that bug fix in the next 6 months now.

Zimlib is not able to deal correctly with unicode path on WindowsClosed, InvalidPublicActions

Description

Details

Event Timeline

Zimlib is not able to deal correctly with unicode path on Windows
Closed, InvalidPublic
Actions