Please do a server-side upload of a large number of files to Wikimedia Commons
Closed, Resolved, Public

Description

A few months ago I developed the Dememorixer, a tool that lets you download images at full resolution from the Memorix Maior image viewer. A fair number of Dutch cultural heritage institutions use this viewer to reduce server load. I put images that had already been transferred to Wikimedia Commons into Category:Dememorixer.

I've now prepared for upload the images in this category whose metadata contains a proper permalink to their source.

I was hoping this could be done in one batch using a server-side upload.

The zip file with the images can be found at http://veradekok.nl/dememorixer.zip
It's a little over 630 MB.

Each file in the zip has the same filename as its lower-resolution counterpart on Wikimedia Commons.
My username on Commons is also 1Veertje.

Event Timeline

1Veertje raised the priority of this task from to Needs Triage.
1Veertje updated the task description.
1Veertje subscribed.
Frankfurter Wertpapierb�rse plan 01.jpg could not be imported; a valid title cannot be produced
Jan Gabrielsz. Sonj� 001.jpg could not be imported; a valid title cannot be produced
P.Lycklama � Nijeholt.jpg could not be imported; a valid title cannot be produced
Wybrand Hendriks - family Ensched�.jpg could not be imported; a valid title cannot be produced
Anatomische les van dr. Willem R�ell.jpg could not be imported; a valid title cannot be produced
Fran�ois Gerard Abraham Gevers Deynoot (1814-1882), burgemeester van Den Haag.jpg could not be imported; a valid title cannot be produced
Single, la Tour de Jan-Rooden-Poort, et la nouvelle Eglise Lutherienne � Amsterdam.jpg could not be imported; a valid title cannot be produced
Juriaen Pool and Rachel Pool-Ruysch - Family portrait with flower still-life in the making - 1716 - Stadtmuseum D�sseldorf.jpg could not be imported; a valid title cannot be produced

Everything else imported successfully.

That's an encoding problem: those are the items with characters like à, ö or é in the title.
I'll do these manually, but is this a fault on my end or yours? How can it be avoided in the future?

When I use Python's zipfile.ZipFile('dememorixer.zip').printdir(), I get entries like this:

download/Fran�ois Gerard Abraham Gevers Deynoot (1814-1882), burgemeester van Den Haag.jpg 2015-10-27 20:56:42       303122
download/Frankfurter Wertpapierb�rse plan 01.jpg 2015-10-31 12:08:28      2379183
download/Wybrand Hendriks - family Ensched�.jpg 2015-10-31 11:57:04      1879692

etc.
(it was extracted with .extractall() - see http://serverfault.com/a/530117/160305)
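
A minimal sketch of how the member names could be inspected and re-decoded, assuming Python 3. There, zipfile decodes a name as UTF-8 when the entry's UTF-8 flag (general-purpose bit 0x800) is set and as CP437 otherwise, so the raw bytes can be recovered by re-encoding as CP437. The helpers recover_name and list_names and the guess parameter are hypothetical names for illustration, not part of the import script used for this task.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: member name is stored as UTF-8


def recover_name(info, guess="utf-8"):
    """Return the member name, re-decoded with `guess` if the UTF-8 flag is unset.

    Assumes Python 3, where zipfile decodes unflagged names as CP437.
    """
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # already decoded as UTF-8 by zipfile
    raw = info.filename.encode("cp437")  # undo zipfile's CP437 decoding
    try:
        return raw.decode(guess)
    except UnicodeDecodeError:
        return info.filename  # the guess does not fit; keep the CP437 reading


def list_names(path, guess="utf-8"):
    """Return (name as read by zipfile, name re-decoded with `guess`) pairs."""
    with zipfile.ZipFile(path) as archive:
        return [(i.filename, recover_name(i, guess)) for i in archive.infolist()]
```

Calling list_names('dememorixer.zip', guess='cp850') would show both readings side by side, which makes it easier to see which code page produces sensible titles before anything is imported.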

I also found a host with jar installed, but that wasn't happy about the zip at all:

$ jar tvf T117351.zip 
1695237 Tue Oct 27 20:48:48 UTC 2015 download/17th-century Netherlandish artist - Democritus with a skull.jpeg
 32710 Sat Oct 31 11:42:28 UTC 2015 download/Aaltje Maathuis (1790-1835).jpg
1244309 Sat Oct 31 11:42:44 UTC 2015 download/Aarnoud Jan van Beeck Calkoen (1805-1874).jpg
170511 Sat Oct 31 11:42:46 UTC 2015 download/Abraham Gevers2 (1712-1780).jpg
1218174 Tue Oct 27 20:49:10 UTC 2015 download/Adriaan de Lelie - Voordracht over de anatomie door Andreas Bonn voor het departement der Tekenkunde van Felix Meritus (1792).jpg
290100 Tue Oct 27 20:49:14 UTC 2015 download/Adriaan de Lelie 001.jpg
297835 Sat Oct 31 11:42:50 UTC 2015 download/Adriaan Jacob van der Does (1756-1830) door Izaak Schmidt.jpg
825341 Tue Oct 27 20:49:30 UTC 2015 download/Adriaen van der Werff 020.jpg
1398258 Tue Oct 27 20:49:48 UTC 2015 download/Adriaen van Ostade Goyer en Questiers 1650.jpg
415027 Tue Oct 27 20:49:58 UTC 2015 download/Aken Rocky landscape.JPG
268258 Tue Oct 27 20:50:02 UTC 2015 download/Albert van Spiers - A floral frontispiece with a portrait medallion of Agnes Block.jpg
1019608 Sat Oct 31 11:43:00 UTC 2015 download/Amelia Maria Ruhle (1794-1859).jpg
java.lang.IllegalArgumentException: MALFORMED
	at java.util.zip.ZipCoder.toString(ZipCoder.java:58)
	at java.util.zip.ZipFile.getZipEntry(ZipFile.java:531)
	at java.util.zip.ZipFile.access$900(ZipFile.java:56)
	at java.util.zip.ZipFile$1.nextElement(ZipFile.java:513)
	at java.util.zip.ZipFile$1.nextElement(ZipFile.java:483)
	at sun.tools.jar.Main.list(Main.java:1061)
	at sun.tools.jar.Main.run(Main.java:246)
	at sun.tools.jar.Main.main(Main.java:1231)

I think it's on your end.

In the zip file, the ö seems to be encoded as \x94. I can't figure out any encoding where ö is represented as 0x94.

Info-ZIP's unzip on Unix seems to treat it as Ф (which is \x94 in code page 866, a fairly obscure encoding).
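
One way to narrow this down is to decode the single byte 0x94 under a handful of legacy code pages. The sketch below does that with Python's built-in codecs; the candidate list is my own choice, not taken from this task. As far as I can tell, the DOS code pages 437 and 850 both map 0x94 to ö, while code page 866 maps it to Ф. That would be consistent with a zip tool that wrote the names in a DOS code page, though nothing in this task confirms which tool created the archive.

```python
# Candidate encodings to try; this list is an assumption, extend it as needed.
CANDIDATES = ["cp437", "cp850", "cp866", "cp1252", "latin-1"]


def decode_byte(value, encodings=CANDIDATES):
    """Map each encoding name to the character it assigns to one byte value."""
    result = {}
    for enc in encodings:
        try:
            result[enc] = bytes([value]).decode(enc)
        except UnicodeDecodeError:
            result[enc] = None  # byte is undefined in this encoding
    return result


if __name__ == "__main__":
    for enc, char in decode_byte(0x94).items():
        print(f"{enc:8} -> {char!r}")
```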


Anyway, for the future, it's probably best to provide the files in a tar archive instead of a zip (7-Zip can create .tar archives). Ideally all filenames should be encoded in UTF-8 and normalized to NFC, if possible (I have no idea how to do that on most platforms, though).
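
For what it's worth, here is a rough sketch of one way to build such an archive with Python 3's standard library, assuming a flat directory of images. It normalizes each member name to NFC with unicodedata and writes a PAX-format tar, which stores non-ASCII names as UTF-8. The function name make_nfc_tar and both of its parameters are hypothetical, and this has not been tried against the import script used here.

```python
import os
import tarfile
import unicodedata


def make_nfc_tar(src_dir, tar_path):
    """Pack the regular files in src_dir into a tar with UTF-8 NFC member names."""
    names = []
    # PAX format records non-ASCII member names as UTF-8 in extended headers.
    with tarfile.open(tar_path, "w", format=tarfile.PAX_FORMAT,
                      encoding="utf-8") as tar:
        for entry in sorted(os.listdir(src_dir)):
            full = os.path.join(src_dir, entry)
            if not os.path.isfile(full):
                continue  # this sketch only handles a flat directory
            nfc_name = unicodedata.normalize("NFC", entry)
            tar.add(full, arcname=nfc_name)  # store under the normalized name
            names.append(nfc_name)
    return names
```

Write the archive outside src_dir (for example make_nfc_tar('download', 'dememorixer.tar') run from the parent directory) so the tar file does not end up packing itself.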