Please upload large files to Wikimedia Commons
Closed, Resolved (Public)

Description

Please upload 16 large TIFFs

URL: http://www.ub.unibas.ch/digi/wikicommons/out/gt1gb2load.tar (46 GB)
Username: Basel University Library

Thank you very much
Andreas Bigger on behalf of Basel University Library

Event Timeline

Sorry, my mistake. This URL is restricted to the IP of the GLAM WikiToolset.

Use instead: http://www.ub.unibas.ch/digi/a100/diverse_projekte/gt1gb2load.tar

Matanya triaged this task as Medium priority.
Matanya set Security to None.

Unfortunately Wikimedia's internal network proxy responds with 403 when I request this URL... Will try to find out why

The file is too big for the proxy to handle. It doesn't handle files 1GB or larger. But I shouldn't have to proxy this file via my own laptop to get it onto a mediawiki host...
However, bast1001 allows external downloads without needing to go via url-downloader (I checked bast2001 and hooft too but they're both tiny), so:

For this archive, we could get ops to set up rsyncd on terbium (generally the host used for server-side uploads, but too small for this whole archive at once), download the file to bast1001, extract the files and rsync what we can over to terbium, upload, delete uploaded files from terbium, rsync the next batch and repeat until all are done.
Alternatively, we could get ops to set up rsyncd to tin (not generally used for server-side uploads, so probably temporarily only), download the file to bast1001, rsync to tin, extract and upload each file.

(and then remove the extra files so we're not taking up disk space on bast1001/tin/terbium/wherever indefinitely)

We could ask @Basel_University_Library to split the archive up, but that wouldn't really solve any issues, just remove the extract step.

Or can we download it on labs, split it into parts, make the parts accessible over the web, download them on tin and piece it all back together and extract?
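The split-and-reassemble idea could look roughly like the sketch below. It assumes standard `split` and `cat`; the function names and the `.part-` suffix are made up for illustration, and the chunk size would be picked to stay under whatever limit the proxy in between enforces.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not an existing tool on labs or tin.

# split_archive FILE CHUNK_BYTES
# Writes FILE.part-aa, FILE.part-ab, ... each at most CHUNK_BYTES long.
split_archive() {
    split -b "$2" "$1" "$1.part-"
}

# join_archive FILE OUT
# Concatenates the parts in glob (lexical) order, which matches the
# order of split's suffixes, and writes the result to OUT.
join_archive() {
    cat "$1".part-* > "$2"
}
```

On the receiving side, comparing `sha256sum` output for the rebuilt file against the original would confirm that nothing was lost or reordered in transit.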

trying to break this down:

issue:

  • we need to download really large files; then upload them on commons

status:

  • usually terbium is used for that. that's also where people.wm is. so quite a few have access

problem 1 - disk space

  • terbium doesn't have that much disk; there's only 25G right now, which is not enough for requests like this one, so this is not just a one-off problem
  • tin could also be used; it has >100G, but that is still not huge, and it is also behind the bastion
  • bast1001, on the other hand, has 330G on / and another 862G on /srv/home_pmtpa

problem 2 - http proxy issue

  • users on terbium or tin have to use url-downloader as http_proxy, but can't download files larger than 1GB due to the squid3 config
  • I don't think we want to change the squid config to allow files 50 times larger than that, but what's the real limit?
  • from bastions we allow downloads without a proxy. Is that really how we want it to be? It makes users treat bastions as work hosts; maybe that should be another host instead
  • since users can't download directly on terbium, another issue arises: how to copy the file from the bastion to terbium without agent forwarding
  • ... and that is why rsyncd came up. We have that puppetized, so it's relatively easy to add an rsyncd to terbium that allows uploads only from a specific IP (bastion or other). This solves the agent issue.

Hmm, I downloaded a 29 GB OSM dump not so long ago without problems, using curl -O -x webproxy.eqiad.wmnet:8080 <url>

webproxy.eqiad.wmnet seems promising. Downloading to tin in a screen called T111941.

... and the difference is that url-downloader was used as the proxy, and it has maximum_object_size set to 1010 MB (squid config on chromium), while webproxy.eqiad is on carbon and does not have that limit.
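Taking that diagnosis at face value, a quick check of whether a download fits through url-downloader might look like the sketch below. The function name is hypothetical, and it assumes squid counts "MB" as 1024*1024 bytes.

```shell
#!/bin/sh
# Assumption: squid's "MB" is 1024*1024 bytes, so 1010 MB = 1059061760 bytes.
URL_DOWNLOADER_LIMIT=$((1010 * 1024 * 1024))

# fits_url_downloader SIZE_BYTES
# Succeeds (exit 0) if SIZE_BYTES is within the maximum_object_size
# quoted above, and fails (exit 1) otherwise.
fits_url_downloader() {
    [ "$1" -le "$URL_DOWNLOADER_LIMIT" ]
}
```

For the 46 GB archive (roughly 49392123904 bytes), `fits_url_downloader 49392123904` fails, which is the cue to go through `curl -O -x webproxy.eqiad.wmnet:8080 <url>` instead.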

Downloaded and extracted. This one is too big I'm afraid:

-rw-r--r--  1 krenair wikidev  16G Sep  9 11:26 UBBasel_Map_1568_Kartenslg_AA_26-48.tif
-rw-r--r--  1 krenair wikidev 1.6K Sep  9 11:12 UBBasel_Map_1568_Kartenslg_AA_26-48.tif.txt

I think the rest should be fine.

Oh, sorry, I misremembered the limit - it's 4GB rather than 5GB. That also rules out this one:

-rw-r--r-- 1 krenair wikidev 4.5G Sep  9 11:18 UBBasel_Map_Kanton_Bern_1672_Kartenslg_Schw_Cb_2.tif
-rw-r--r-- 1 krenair wikidev 2.2K Sep  9 11:12 UBBasel_Map_Kanton_Bern_1672_Kartenslg_Schw_Cb_2.tif.txt
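Rather than eyeballing `ls` output, the oversized files could be picked out with something like the sketch below. The helper name is made up, and it assumes the limit means 4 GiB (4 * 1024^3 bytes); the thread only says "4GB".

```shell
#!/bin/sh
# Assumption: "4GB" means 4 GiB, i.e. 4294967296 bytes.
UPLOAD_LIMIT=$((4 * 1024 * 1024 * 1024))

# too_big LIMIT_BYTES FILE...
# Prints each FILE whose size in bytes exceeds LIMIT_BYTES.
too_big() {
    limit=$1
    shift
    for f in "$@"; do
        # $(( ... )) strips the padding some wc implementations add.
        size=$(($(wc -c < "$f")))
        if [ "$size" -gt "$limit" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running `too_big "$UPLOAD_LIMIT" *.tif` in the extraction directory would then list only the files that need to be shrunk or skipped.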

Uploading is happening in the same screen on tin. First one is already up

/bin/bash: line 1: 20109 Killed                  '/usr/bin/tiffinfo' '/home/krenair/upload-T111941/out/try2/UBBasel_Map_1700-1799_VB_A2-2-120a.tif' 2>&1

Resulted in a 0x0 but 2.02GB file. Retried with no luck. Any ideas @aaron?

These are done:

-rw-r--r-- 1 krenair wikidev 1.1G Sep  9 11:27 UBBasel_Map_1556_Kartenslg_AA_86-89.tif
-rw-r--r-- 1 krenair wikidev 3.2G Sep  9 11:29 UBBasel_Map_1556_Kartenslg_Schw_A_1a.tif
-rw-r--r-- 1 krenair wikidev 1.4G Sep  9 11:26 UBBasel_Map_1564_Kartenslg_AA_110-113.tif
-rw-r--r-- 1 krenair wikidev 1.4G Sep  9 11:20 UBBasel_Map_1564_Kartenslg_AA_6-7.tif
-rw-r--r-- 1 krenair wikidev 1.4G Sep  9 11:26 UBBasel_Map_1567_Kartenslg_AA_98-99.tif
-rw-r--r-- 1 krenair wikidev 1.4G Sep  9 11:28 UBBasel_Map_1568_Kartenslg_Schw_Ca_1.tif
-rw-r--r-- 1 krenair wikidev 1.6G Sep  9 11:21 UBBasel_Map_1569_Kartenslg_AA_3-5.tif
-rw-r--r-- 1 krenair wikidev 1.4G Sep  9 11:27 UBBasel_Map_1572_Kartenslg_AA_119-120.tif
-rw-r--r-- 1 krenair wikidev 2.3G Sep  9 11:21 UBBasel_Map_1572_Kartenslg_AA_8-10.tif
-rw-r--r-- 1 krenair wikidev 1.2G Sep  9 11:20 UBBasel_Map_18uu-1615_Kartenslg_Schw_Ml_4e.tif
-rw-r--r-- 1 krenair wikidev 2.3G Sep  9 11:28 UBBasel_Map_Bayern_Niederbayern_Oberbayern_1579_Kartenslg_Mappe_246-76.tif
-rw-r--r-- 1 krenair wikidev 3.0G Sep  9 11:18 UBBasel_Map_Kanton_Bern_1672_Kartenslg_Schw_Cb_3.tif
-rw-r--r-- 1 krenair wikidev 2.8G Sep  9 11:19 UBBasel_Map_Kanton_Bern_1672_Kartenslg_Schw_Cb_4.tif

(files were moved around a bit)

krenair@tin:~$ time tiffinfo upload-T111941/broken/UBBasel_Map_1700-1799_VB_A2-2-120a.tif >/dev/null
TIFFReadDirectory: Warning, upload-T111941/broken/UBBasel_Map_1700-1799_VB_A2-2-120a.tif: wrong data type 7 for "RichTIFFIPTC"; tag ignored.
TIFFReadDirectory: Warning, upload-T111941/broken/UBBasel_Map_1700-1799_VB_A2-2-120a.tif: unknown field with tag 37724 (0x935c) encountered.

real	1m59.473s
user	1m57.899s
sys	0m1.368s
krenair@tin:~$
Krenair subscribed.

Try running it through

vips tiffsave UBBasel_Map_Kanton_Bern_1672_Kartenslg_Schw_Cb_2.tif out.tif --compression deflate

That should probably get it below the 4gb limit without any loss of picture detail (Some exif-like metadata might be stripped)

Figured out a way to get the file to an imagescaler (so vips is installed). That command actually gets it below the 1GB limit at which we would run server-side uploads. I'm not entirely convinced though:

krenair@mw2086:~$ ls -alh *.tif
-rw-rw-r-- 1 krenair wikidev 597M Sep 14 13:30 out.tif
-rw-rw-r-- 1 krenair wikidev 4.5G Sep  9 11:18 UBBasel_Map_Kanton_Bern_1672_Kartenslg_Schw_Cb_2.tif

4.5GB -> 0.6GB? really?

And:

-rw-r--r--  1 mwdeploy mwdeploy  16G Sep 14 13:52 UBBasel_Map_1568_Kartenslg_AA_26-48.tif
-rw-rw-r--  1 mwdeploy mwdeploy 239M Sep 14 13:56 compressed_UBBasel_Map_1568_Kartenslg_AA_26-48.tif

:|

Sorry for all the inconvenience caused ...

UBBasel_Map_Kanton_Bern_1672_Kartenslg_Schw_Cb_2.tif is a multitiff. vips seems to keep the first page and throw the rest away ... Not really a solution.

I propose that you ignore this file and UBBasel_Map_1568_Kartenslg_AA_26-48.tif. I will think of a better solution to get them in.

I will also check what is wrong with UBBasel_Map_1700-1799_VB_A2-2-120a.tif.

We all knew it was too good to be true :)
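A suspiciously large size drop like this can be sanity-checked by comparing page counts before and after conversion. The sketch below is one way to do that, assuming `tiffinfo` prints one line starting with "TIFF Directory at offset" per page (directory); the function names are hypothetical.

```shell
#!/bin/sh
# count_tiff_dirs
# Reads tiffinfo output on stdin and prints the number of directories (pages).
# Assumes each page is introduced by a "TIFF Directory at offset" line.
count_tiff_dirs() {
    # grep -c exits non-zero on zero matches but still prints 0.
    grep -c '^TIFF Directory at offset' || true
}

# same_page_count ORIGINAL CONVERTED
# Succeeds only if both files report the same number of pages.
same_page_count() {
    a=$(tiffinfo "$1" 2>/dev/null | count_tiff_dirs)
    b=$(tiffinfo "$2" 2>/dev/null | count_tiff_dirs)
    [ "$a" -eq "$b" ]
}
```

For example, `same_page_count UBBasel_Map_Kanton_Bern_1672_Kartenslg_Schw_Cb_2.tif out.tif || echo "pages were dropped"` would have flagged the vips output before anyone uploaded it.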

I recreated the UBBasel_Map_1700-1799_VB_A2-2-120a.tif

New URL: http://www.ub.unibas.ch/digi/a100/diverse_projekte/UBBasel_Map_1700-1799_VB_A2-2-120a.tif

(it now comes without the warnings and is a bit smaller, but seems to be OK)

I found a different command that should work on multipage tiffs

tiffcp -c zip:p9 infile.tif outfile.tif

I believe tiffcp is part of libtiff.
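Wrapped up for repeated use, that command could be sketched as below. The wrapper name is made up; `-c zip:p9` asks tiffcp for Deflate compression at level 9, which is lossless, and the size report is only there to show whether the result actually lands under the limit.

```shell
#!/bin/sh
# recompress_tiff IN OUT
# Copies every directory (page) of IN to OUT with Deflate ("zip")
# compression at level 9, then prints both sizes in bytes.
recompress_tiff() {
    tiffcp -c zip:p9 "$1" "$2" || return 1
    printf '%s: %s -> %s bytes\n' "$1" \
        "$(($(wc -c < "$1")))" "$(($(wc -c < "$2")))"
}
```

A call such as `recompress_tiff infile.tif outfile.tif` prints one line per file, so it can be used in a loop over a directory of candidates.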

@Bawolff - yes, we used tiffcp to create the multitiffs in the first place. I will see what it can do for UBBasel_Map_1568_Kartenslg_AA_26-48.tif.

For UBBasel_Map_Kanton_Bern_1672_Kartenslg_Schw_Cb_2.tif, I have also prepared a smaller version (I removed two pages that were not really that important)

New URL: http://www.ub.unibas.ch/digi/a100/diverse_projekte/UBBasel_Map_Kanton_Bern_1672_Kartenslg_Schw_Cb_2.tif (3 GB now)

krenair@tin:~/upload-T111941/working$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --overwrite --user=Basel_University_Library .
Import Images

UBBasel_Map_1700-1799_VB_A2-2-120a.tif exists, overwriting...done.

Found: 1
Overwritten: 1
krenair@tin:~/upload-T111941/working$

krenair@tin:~/upload-T111941/working$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=Basel_University_Library .
Import Images

Importing UBBasel_Map_Kanton_Bern_1672_Kartenslg_Schw_Cb_2.tif...done.

Found: 1
Added: 1
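Since these imports rely on `--comment-ext=txt`, it may be worth checking beforehand that every TIFF in the working directory has its description file. The sketch below assumes the `<name>.tif.txt` naming seen in the listings in this thread; the helper name is hypothetical.

```shell
#!/bin/sh
# missing_descriptions DIR
# Prints every *.tif in DIR that has no sibling "<name>.tif.txt" file.
missing_descriptions() {
    for f in "$1"/*.tif; do
        [ -e "$f" ] || continue          # the glob matched nothing
        [ -e "$f.txt" ] || printf '%s\n' "$f"
    done
}
```

If `missing_descriptions .` prints nothing, each file to be imported has a description next to it.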

Anything else to do? I think the only thing missing is the 16GB file UBBasel_Map_1568_Kartenslg_AA_26-48.tif, but it sounds like that's not going to be possible.

Nope for UBBasel_Map_1568_Kartenslg_AA_26-48.tif - it's definitely too big, sorry.

So it's done. Great job! Thanks to all!

btw, be advised that due to quirks in how we render tiff thumbnails, the size limit above which we don't display thumbnails is much higher for the first page than for the other pages, so we might not display later pages of some of your really big files.

Thanks for the information.

I realized that this hasn't been filed as a separate bug, so I did that at T117349