
Writer function inefficient mimetype dirent entry rewriting
Closed, Invalid (Public)

Description

During the development of zimdiff/zimpatch we ran into a problem: two ZIM files were almost identical, but their mimetypes were not sorted in the same way, so the mimetype value of every dirent entry was different.

This was an issue because the files produced by zimpatch were not equal to the original files. To avoid this, zimlib currently forces the order of the mimetypes in the list in the header: they are sorted alphabetically.

Unfortunately, I see two problems with this:

  • This changes the specification of the format (we have not yet changed anything in the specification)
  • The mimetypes are sorted after all articles have been inserted, which requires rewriting all the dirent entries. This is neither efficient nor elegant.
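
To make the second point concrete, here is a minimal sketch of the sort-then-rewrite step. It is not the zimlib code: `Dirent`, `mimeIdx` and `sortMimeTypes` are hypothetical names, and a 16-bit index is assumed for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical stand-in for a directory entry; only the MIME index matters here.
struct Dirent {
    std::string url;
    std::uint16_t mimeIdx;  // index into the MIME type list (16 bits assumed)
};

// Sort `mimeTypes` alphabetically, then rewrite every dirent's index to match.
void sortMimeTypes(std::vector<std::string>& mimeTypes, std::vector<Dirent>& dirents) {
    // order[k] = old index of the MIME type that lands at sorted position k
    std::vector<std::uint16_t> order(mimeTypes.size());
    std::iota(order.begin(), order.end(), std::uint16_t{0});
    std::sort(order.begin(), order.end(),
              [&](std::uint16_t a, std::uint16_t b) { return mimeTypes[a] < mimeTypes[b]; });

    // remap[old index] = new index
    std::vector<std::uint16_t> remap(mimeTypes.size());
    std::vector<std::string> sorted(mimeTypes.size());
    for (std::size_t k = 0; k < order.size(); ++k) {
        remap[order[k]] = static_cast<std::uint16_t>(k);
        sorted[k] = mimeTypes[order[k]];
    }
    mimeTypes.swap(sorted);

    // This loop is the "rewrite all the dirent entries" pass: one visit per entry.
    for (Dirent& d : dirents) {
        assert(d.mimeIdx < remap.size());
        d.mimeIdx = remap[d.mimeIdx];
    }
}
```

In this sketch the rewrite is a single pass over entries already held in memory, but it can only run once the last article has been seen.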

I think an alternative approach would be to allow the mime-type list to be fixed before the articles are inserted. This would bypass the dynamic creation of the mime-type list and consequently avoid the two problems listed above.
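
A rough sketch of what such a pre-declared list could look like follows. `FixedMimeTable` and `indexOf` are hypothetical names, not an existing zimlib interface, and rejecting undeclared types with an exception is just one possible policy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical fixed MIME table: the caller declares the full list up front,
// so every dirent can be given its final index at insertion time.
class FixedMimeTable {
public:
    explicit FixedMimeTable(std::vector<std::string> types) : types_(std::move(types)) {
        // The index must fit the 16-bit field assumed by this sketch.
        assert(types_.size() <= UINT16_MAX);
        for (std::size_t i = 0; i < types_.size(); ++i) {
            // emplace keeps the first position if a type is listed twice
            index_.emplace(types_[i], static_cast<std::uint16_t>(i));
        }
    }

    // Returns the final index; throws if the type was not declared in advance.
    std::uint16_t indexOf(const std::string& mime) const {
        auto it = index_.find(mime);
        if (it == index_.end()) {
            throw std::invalid_argument("undeclared MIME type: " + mime);
        }
        return it->second;
    }

    const std::vector<std::string>& types() const { return types_; }

private:
    std::vector<std::string> types_;                         // order as declared
    std::unordered_map<std::string, std::uint16_t> index_;   // type -> final index
};
```

With a table like this, the order is whatever the caller declares (alphabetical or not), and no entry needs to be touched again after insertion. The cost is that the caller must know every mime type before the first article is added.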


Version: unspecified
Severity: enhancement

Details

Reference
bz55363

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:37 AM
bzimport added a project: openZIM-zimlib.
bzimport set Reference to bz55363.

tommi wrote:

  1. Sorting the mime types does not break the current specification. The specification simply does not require the mime types to be sorted. If we change the specification to require sorted mime types, previously created zim files will violate it. This is not really a big problem, since even files that break the sorting rule remain readable with the new zimlib.
  2. Sorting is done after collecting the directory entries. The directory entries are held in memory anyway, so I do not expect sorting to slow down the generation of zim files. Compressing the data is far more expensive than sorting the directory entries. I see no need to change anything here.

Note that if we force the generator to deliver the mime types prior to the directory entries, we break the generator interface and make it more difficult to implement. Delivering the mime types may be even more expensive in the generator than in the zimcreator.

I am re-assigning the bug to myself. It is not yet clear how we want to proceed.