
Undertake a mass upload of 14 million files (1.5 TB) to Commons
Closed, Resolved · Public

Description

I am currently coordinating with a volunteer to upload 14 million public domain image files to Wikimedia Commons by way of the GLAM-Wiki Toolset. These files will total around 1.5 TB in size. I would like to ask Operations whether I can do this without causing any particular disruption to the technical infrastructure of Wikimedia Commons. I am also tagging GLAM-Wiki Toolset in case I need to do anything special with the XML file that will be generated. (Would a 14 million entry XML file be too big?)

Event Timeline

Harej raised the priority of this task from to Needs Triage.
Harej updated the task description.
Harej subscribed.
Restricted Application added a subscriber: Aklapper.

For special cases like this, it might also be an option to send a hard disk.

I happen to live in DC, so it wouldn't take that much effort for me to hand off a hard disk to someone in Ashburn. Except I don't have the files myself; they reside with the Internet Archive. The best option is to have the toolset download the files directly from Internet Archive.

Speaking of which, for those who understand the toolset: how many download requests are sent out at a time? This number is important given the volume, and given that the images are stored on the Internet Archive's servers as ZIP files. The Internet Archive does on-the-fly decompression through their API and serves the decompressed image, so it could be very taxing if too many requests are made at once.
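
(For illustration only: the snippet below is not how GWToolset is implemented; it is a minimal sketch of what capping concurrent downloads from archive.org could look like, with placeholder URLs and an assumed limit of 4 parallel requests.)

import concurrent.futures
import urllib.request

# Placeholder archive.org download links; the real list would come from the
# metadata file. The concurrency cap is an assumption, not a documented limit.
URLS = [
    "https://archive.org/download/example-item/example-image.jpg",
]
MAX_CONCURRENT = 4

def fetch(url):
    # Each request may trigger on-the-fly decompression of a ZIP on the
    # archive.org side, so the cap above is what keeps their load bounded.
    with urllib.request.urlopen(url, timeout=60) as resp:
        return url, len(resp.read())

with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    for url, size in pool.map(fetch, URLS):
        print(url, size, "bytes")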

Yes, a 14 million entry XML file would be too big. I'm not sure how big the XML files actually uploaded through GWT have been; maybe @dan-nl or the multimedia team can advise on that. I wouldn't be surprised if it needs to be done in batches of 10k records or less, to prevent overload (which GWT has caused numerous times before) and possible OOM errors. Uploading a 14M-entry XML file seems like somewhat of a pain to begin with.

Then @fgiunchedi (or another opsen) can advise how much free space we've got in our Swift cluster (where do these aggregated metrics live? total space, space used, space free, etc.), and whether we can safely ingest this volume of new files, or whether we need to look at expanding the Swift cluster's storage capacity. A quick glance at the ms-be hosts in Ganglia suggests a lot of the boxes have less than 20% disk space free.

It turns out all those files are on archive.org, so it's an option to just provide the download links to those files instead of the actual file contents. I'm not sure yet what makes more sense: one way would be shipping a disk, and the other would be downloading the files directly from archive.org to our servers and then mass-importing them.

< mutante> harej: so instead of the actual files you could also just provide the links to all those files?
< mutante> on archive.org ?

< harej> Yes. And meta-data!
< harej> And if you want something other than XML, just let me know. The metadata file is going to be produced specifically for this purpose.

The earlier overload issues weren't related to upload batch size, but rather to image size, and in theory should not happen again (although I don't think that theory has been tested yet); see T67217.

GWToolset re-reads the XML file after every dozen or so uploads, which is probably not going to be efficient with 14M entries.

AFAIK @Fae has been the largest-scale user of GWT to date, with some 100K+ collections, but I think he did those in smaller batches.

Thanks for the heads-up, @Harej! Space-wise, we are in the process of expanding our Swift cluster capacity in T1268 and are currently waiting for the machines to be delivered. The ETA is likely another two weeks for delivery, plus gradual deployment time (three to four weeks perhaps).
What's your timeline for this? To play it safe I'd prefer to wait for the full Swift expansion to be finished, or at least ongoing; we are running at 84% utilization at the moment. A 1.5 TB upload would speed up growth by ~10 days, since we plan for ~140 GB/day of growth.
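
(A quick check of the arithmetic behind the ~10 day figure, using only the numbers quoted above; a back-of-the-envelope sketch, not an official capacity forecast.)

# Figures as stated in the comment above.
upload_tb = 1.5          # planned upload size, TB
daily_growth_gb = 140    # planned growth budget, GB/day

days = upload_tb * 1000 / daily_growth_gb
print(f"equivalent to ~{days:.1f} days of planned growth")  # ~10.7 days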

@Reedy, to answer your question: the related Phabricator task contains a link for daily growth, but there is no explicit graph for Swift utilization yet besides the Ganglia graphs and swift-recon:

ms-fe1001:~$ sudo swift-recon -d --human-readable
===============================================================================
--> Starting reconnaissance on 15 hosts
===============================================================================
[2015-02-06 08:55:46] Checking disk usage now
Distribution Graph:
  7%    1 
  8%    1 
  9%    1 
 10%    2 
 11%    1 
 18%    5 **
 19%    2 
 20%    2 
 21%    4 *
 22%    2 
 23%    1 
 24%    4 *
 26%    2 
 29%    1 
 30%    1 
 83%    1 
 84%  160 *********************************************************************
 85%   19 ********
Disk usage: space used: 335 TB of 399 TB
Disk usage: space free: 63 TB of 399 TB
Disk usage: lowest: 7.5%, highest: 85.43%, avg: 83.9938577939%
===============================================================================

The timeline is whatever you tell me it is. ;)

In any case, the plan was to start off small and then increment from there; nothing that your existing infrastructure couldn't handle. This will help work out kinks in the system in general. But should we wait until about a month from now before turning the toolset loose on the entire collection?

Also, do I read that correctly? Does Wikimedia have a total storage capacity of 399 TB, or is that just the part that handles Commons? (Such a huge number, but it somehow seems small...)

Note that the 399 TB includes 3 copies of each file, so if we lose one host, we don't lose the file.

Starting small and incrementing sounds good to me; 20-30 GB/day shouldn't cause much trouble.

Those 400 TB raw are Swift, i.e. image uploads (originals plus thumbnails) from all wikis. As @Reedy said, the raw figure counts all three replicas, so usable capacity is about a third of that.
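
(To make the replication point concrete, here is a rough reading of the swift-recon numbers above, assuming exactly three replicas of every object; a sketch, not an authoritative capacity report.)

# Raw figures from the swift-recon output above.
raw_total_tb = 399
raw_used_tb = 335
raw_free_tb = 63
replicas = 3

print(f"usable capacity: ~{raw_total_tb / replicas:.0f} TB")    # ~133 TB
print(f"unique data stored: ~{raw_used_tb / replicas:.0f} TB")  # ~112 TB
print(f"unique space free: ~{raw_free_tb / replicas:.0f} TB")   # ~21 TB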

Dzahn triaged this task as Medium priority. Feb 13 2015, 9:37 PM

@Harej,

Would a 14 million entry XML file be too big?

Yes, as @Reedy already mentioned. I believe the largest XML file I used during testing was a 10,000-record file that was 58.7 MB in size; 10,000 records is a good limit for each XML file.
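
(A minimal sketch of how one might split a large record set into 10,000-record XML files before handing them to the toolset. The element names are placeholders, not a required GWToolset schema; the actual mapping is whatever you configure in the tool.)

import xml.etree.ElementTree as ET

BATCH_SIZE = 10_000  # per the size limit suggested above

def write_batches(records, prefix="batch"):
    """records: iterable of (title, url, description) tuples."""
    batch, index = [], 0
    for record in records:
        batch.append(record)
        if len(batch) == BATCH_SIZE:
            write_batch(batch, f"{prefix}-{index:04d}.xml")
            batch, index = [], index + 1
    if batch:
        write_batch(batch, f"{prefix}-{index:04d}.xml")

def write_batch(batch, filename):
    root = ET.Element("records")
    for title, url, description in batch:
        rec = ET.SubElement(root, "record")
        ET.SubElement(rec, "title").text = title
        ET.SubElement(rec, "url_to_the_media_file").text = url
        ET.SubElement(rec, "description").text = description
    ET.ElementTree(root).write(filename, encoding="utf-8", xml_declaration=True)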

how many download requests are sent out at a time? ... it could potentially be very taxing if too many requests are done at once.

There are a few throttles in place that manage the flow of requests, in order to prevent GWToolset from creating a DoS on the external media server and from overloading the Wikimedia servers that process the media files. There have been a few adjustments, so I'm not sure what the current flow is set at; you could take a look at the GWToolset log on Commons to get an idea. And as @Tgr mentioned, the overloading we've seen up until now has been on the Wikimedia side, where the servers that process the media files ran into an issue with processing too many large images at once. That should, in theory, no longer be an issue; good to test it :)

Test runs
I suggest that you start now with a few test runs; maybe 10 records per test on Commons Beta.

Here are a few things to sort out:

  • user account(s) that can use gwtoolset
    • you are already listed as a gwtoolset user on Commons Beta and Commons production, so that's taken care of for your account.
    • add any additional users to the gwtoolset group as necessary on Beta and production
  • whitelist archive.org
  • create the initial test(s) (a quick link check, sketched after this list, may help here)
  • run those tests on Commons Beta in order to work out any issues that may arise
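
(Before the first 10-record run on Commons Beta, it may be worth checking that the archive.org links in the test batch actually resolve. A minimal sketch with placeholder URLs; this is not part of GWToolset.)

import urllib.request

TEST_URLS = [
    "https://archive.org/download/example-item/example-image.jpg",  # placeholder
]

for url in TEST_URLS:
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=30) as resp:
            print(url, resp.status, resp.headers.get("Content-Type"))
    except Exception as exc:  # 404s, timeouts, bad redirects, etc.
        print(url, "FAILED:", exc)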

What's the progress? FWIW we should be fully done with the Swift expansion in ~10 days, but there's no harm in trying out uploads in the meantime.

Haven't started yet. There's a lot I need to work out on my end first.

Jdforrester-WMF renamed this task from "Can Commons support a mass upload of 14 million files (1.5 TB)?" to "Undertake a mass upload of 14 million files (1.5 TB) to Commons". Sep 15 2015, 3:49 PM

afaik this isn't blocked on operations, see https://phabricator.wikimedia.org/T88758#1156490

What's the progress? FWIW we should be fully done with the Swift expansion in ~10 days, but there's no harm in trying out uploads in the meantime.

I presume that's finished? ;)

It is; however, we are in the process of expanding the Swift cluster again, as per https://phabricator.wikimedia.org/T1268#1640922.
What is the status of this, by the way, @Harej? If it could be held off a bit until we have expanded again, that might be better (the ETA, I think, is ~50 days).

The project seems to be delayed indefinitely in general. What's another 50 days? :)

Reedy changed the task status from Open to Stalled. Sep 16 2015, 11:13 AM
matmarex changed the task status from Stalled to Open. Dec 3 2015, 4:20 PM
matmarex subscribed.

78 days have passed; has that happened?

The upload hasn't happened yet, no.

@fgiunchedi just confirmed that the swift cluster has been expanded, so you should be good to do it whenever, if it's not still delayed indefinitely...

Multichill subscribed.

@Harej: Last activity was over a year ago and there are no blockers, so I'm closing this one as resolved. If you're going to do the actual upload, you probably want to coordinate with the Commons community.