Page MenuHomePhabricator

Zimlib is not able to deal correctly with unicode path on Windows
Closed, InvalidPublic

Description

Windows uses utf16 to encode unicode paths. The zimlib uses char* to store the paths. For this reason, if you are unlucky (so use paths with characters over the first 8 bits) you are not able to open your ZIM file using the zimlib. We should provide a way to open paths in wchar_t* format, at least on Windows.


Version: unspecified
Severity: major

Details

Reference
bz53398

Event Timeline

bzimport raised the priority of this task from to Lowest.Nov 22 2014, 1:53 AM
bzimport added a project: openZIM-zimlib.
bzimport set Reference to bz53398.
bzimport added a subscriber: Unknown Object (MLST).

Can this be done with the standard POSIX file APIs? Or do we need to use Win32 APIs for file access? :P

(Use of legacy 8-bit encodings in POSIX APIs is a problem for PHP, and thus MediaWiki, when running on Windows servers as well.)

Sounds like you gotta use the Win32 wide-char APIs to get Unicode filename support... at least that's what glib does on Windows for this reason -- https://developer.gnome.org/glib/2.26/glib-File-Utilities.html#glib-File-Utilities.description

lvandeve wrote:

Hello,

I have done a little bit of research on this (I must admit it was on a Linux computer however).

I hope this may help at least with fixing this issue.

Most File, FileImpl, zim::ifstream and zim::streambuf works with string. There is one intermediate string -> const char* -> string conversion in FileImpl, which the attached patch removes (to use string for everything). This to hopefully support the NUL-characters that may appear in the octets of UTF-16 strings put in a 8-bit char string.

In addition, the patch has a comment at the locations in fstream.cpp where probably different Windows API needs to be called.

I think that this way, file.h does not need to be changed to have wchar_t or wstring constructors.

What follows below the line is the patch:


diff --git a/zimlib/include/zim/file.h b/zimlib/include/zim/file.h
index a6ac75b..8aa5f0d 100644

  • a/zimlib/include/zim/file.h

+++ b/zimlib/include/zim/file.h
@@ -39,7 +39,7 @@ namespace zim

File()
  { }
explicit File(const std::string& fname)
  • : impl(new FileImpl(fname.c_str()))

+ : impl(new FileImpl(fname))

  { }
 
const std::string& getFilename() const   { return impl->getFilename(); }

diff --git a/zimlib/include/zim/fileimpl.h b/zimlib/include/zim/fileimpl.h
index 1cf584d..ada65ee 100644

  • a/zimlib/include/zim/fileimpl.h

+++ b/zimlib/include/zim/fileimpl.h
@@ -53,7 +53,7 @@ namespace zim

  offset_type getOffset(offset_type ptrOffset, size_type idx);
 
public:
  • explicit FileImpl(const char* fname);

+ explicit FileImpl(const std::string& fname);

time_t getMTime() const   { return zimFile.getMTime(); }

diff --git a/zimlib/src/fileimpl.cpp b/zimlib/src/fileimpl.cpp
index 8c072eb..d7a7f4a 100644

  • a/zimlib/src/fileimpl.cpp

+++ b/zimlib/src/fileimpl.cpp
@@ -38,7 +38,7 @@ namespace zim

//////////////////////////////////////////////////////////////////////
// FileImpl
//
  • FileImpl::FileImpl(const char* fname)

+ FileImpl::FileImpl(const std::string& fname)

: zimFile(fname),
  direntCache(envValue("ZIM_DIRENTCACHE", DIRENT_CACHE_SIZE)),
  clusterCache(envValue("ZIM_CLUSTERCACHE", CLUSTER_CACHE_SIZE))

diff --git a/zimlib/src/fstream.cpp b/zimlib/src/fstream.cpp
index 5ce72f5..e4accf8 100644

  • a/zimlib/src/fstream.cpp

+++ b/zimlib/src/fstream.cpp
@@ -59,6 +59,24 @@ namespace zim
//
streambuf::OpenfileInfo::OpenfileInfo(const std::string& fname_)

: fname(fname_),

+ I think regular std::string with 8-bit characters
+
is OK up to this point, that is, in the ctor and fields of File,
+ FileImpl, ifstream, streambuf etc... (that is, no wchar_t or wstring
+
necessary there and no API change needed), the string just may have
+ NULL-characters if bytes of UTF-16 are present, but std::string
+
supports that (const char* does not).
+ So c_str() or any C/C++ API that uses C strings should not be used at
+
this point or anywhere before.
+ So here a Windows-specific API that supports UTF-16 is needed for
+
_WIN32, and the fname_ string needs to converted here to what it
+ needs if necessary. That is still TODO, hence the comment here.
+
Apparently the fname received here on Windows is UTF-16 encoded with
+ two chars per wchar_t, if used by Kiwix (the other possibility would
+
have been that it was UTF-8 encoded and needed to be converted to
+ UTF-16). Maybe a comment in the File constructor could also be nice
+
to ensure all users will be consistent with this?
+ This concludes my investigation, I hope it can help fixing the
+
issue :)
#ifdef HAVE_OPEN64

fd(::open64(fname.c_str(), O_RDONLY | O_LARGEFILE | O_BINARY))

#else
@@ -293,6 +311,8 @@ time_t streambuf::getMTime() const

if (mtime || files.empty())
  return mtime;

+ See comment under streambuf::OpenfileInfo::OpenfileInfo about UTF-16:
+
the same applies here as well to the c_str() call and stat.

const char* fname = files.front()->fname.c_str();

#ifdef HAVE_STAT64

The whole bug has been moved here:
https://github.com/openzim/libzim/issues/74

Quite optimistic to get that bug fix in the next 6 months now.