Assess approximate total upload size for DPLA projects
Open, Needs Triage, Public

Description

While chatting with @Dominicbm about T248151: Big number of uploads from DPLA bot it emerged that the whole dataset to be uploaded is quite big (good!). For capacity planning purposes I'd like to know what ways we have to estimate the expected total upload bytes for the whole planned uploads from DPLA (and number of files, if possible but not as critical).

Note that I'm not familiar with the project itself so I might be missing some critical details and/or information!

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 7 2020, 9:27 AM
Kizule added a subscriber: Kizule. · Apr 7 2020, 9:40 AM

Hi @fgiunchedi. Can you please associate at least one active project with this task (via the Add Action...Change Project Tags dropdown)? This will allow others to get notified, or see this task when searching via projects. Thanks.

In T249597#6035612, @Zoranzoki21 wrote:

Hi @fgiunchedi. Can you please associate at least one active project with this task (via the Add Action...Change Project Tags dropdown)? This will allow others to get notified, or see this task when searching via projects. Thanks.

I would, but I don't know yet which project would be most appropriate.

RhinosF1 added a subscriber: RhinosF1.

@fgiunchedi: Assuming this task is about uploading to Commons, hence adding that project tag so other people can also find this task when searching via projects.

RhinosF1 moved this task from Incoming to Uploading on the Commons board.Apr 20 2020, 10:35 PM

Hi @Dominicbm, would you (or other folks? not sure) be able to assist with this? Thanks!

It's a bit hard to estimate the total size in bytes, especially because the total is a moving target, based on how many partners sign on. @SandraF_WMF, I recently gave a webinar on DPLA's work (https://youtu.be/0BSoKSYBcBI). Basically, we are now actively doing outreach to our partners (essentially, any US-based GLAM), and doing uploads on request for partners that have appropriate rights and can give us the media. There are 37 million items in DPLA's dataset, of which 1.7 million are currently licensed appropriately for Commons upload. An item can have any number of media files, and we don't really know the aggregate number (or byte size), because we don't host them; we're just working with the partners to download the assets they host and reupload them to Commons.

I can do some *very* rough estimation based on best guesses about average uploads. For example, if we guess the average file is 2 MB and the average item has 20 files, then if we uploaded 500,000 of the items (current opted-in partners account for fewer than 200,000 so far), we'd be at 20 TB by my estimation.

I don't know whether this is a big deal or not, so let me know if there are issues we need to consider with the plan.
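
To make the arithmetic explicit, here is a minimal Python sketch of that back-of-envelope estimate; the per-file size, files-per-item count, and item count are the guesses above, not measured values:

```
# Back-of-envelope upload-size estimate using the guessed averages above.
avg_file_size_mb = 2        # guessed average file size (MB)
avg_files_per_item = 20     # guessed number of media files per item
items_to_upload = 500_000   # hypothetical item count (opted-in partners < 200,000 so far)

total_mb = items_to_upload * avg_files_per_item * avg_file_size_mb
total_tb = total_mb / 1_000_000  # decimal units: 1 TB = 1,000,000 MB
print(f"Estimated total upload size: {total_tb:.0f} TB")  # -> 20 TB
```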

Thank you @Dominicbm for the extensive information and context, this is very useful!

Great, a ballpark estimation is already quite useful in terms of capacity planning. To give you an idea, the ongoing rate of original uploads has averaged ~80 GB/day (i.e. including spikes from bulk uploads) over the last 90 days, so 20 TB would be approximately 8 months of uploads at the current rate. We're in the capacity planning phase for media storage for the next fiscal year, and this information is very helpful for sizing capacity needs!
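
For reference, a quick sketch of where the "approximately 8 months" figure comes from; the 20 TB total is the rough estimate above, and the 80 GB/day rate is the 90-day average quoted in this comment:

```
# Time to ingest the estimated total at the current average upload rate.
estimated_total_gb = 20_000   # ~20 TB, from the rough estimate above (decimal units)
avg_rate_gb_per_day = 80      # 90-day average for original uploads, spikes included

days = estimated_total_gb / avg_rate_gb_per_day
print(f"{days:.0f} days, i.e. roughly {days / 30:.1f} months")  # -> 250 days, roughly 8.3 months
```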