Determine how to upload Zim files to Swift infrastructure
Closed, Invalid, Public

Description

Swift appears to be a candidate for hosting Zim files. Several steps need to be accomplished to validate this assumption:

  1. Figure out how to upload from the Cloud VPS instance where the Zim files will be created
  2. Figure out how to chunk the files so they can be hosted efficiently
  3. Upload several test files which are representative of the Zim files that would be hosted
  4. Test downloading the files to the Android app

For the test files we should generate and upload the following for the en project (Based on this sheet):

  1. Wiki Medicine
  2. Core Wikipedia
  3. 5000 Most Read
  4. 50000 Most Read

Each of those collections should be generated with:

  1. All content
  2. No audio or video
  3. No images, audio or video

This gives a total of 12 collections (4 collections, each in 3 variants).
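As a quick check of the count, the test matrix can be enumerated as below. This is only a sketch: the function name `build_matrix` and the tuple layout are made up for illustration, and the collection and variant labels are copied from the lists above.

```python
from itertools import product

# Labels taken from the task description.
COLLECTIONS = ["Wiki Medicine", "Core Wikipedia", "5000 Most Read", "50000 Most Read"]
VARIANTS = ["all content", "no audio or video", "no images, audio or video"]


def build_matrix(project="en"):
    """Return one (project, collection, variant) tuple per test Zim file."""
    return [(project, c, v) for c, v in product(COLLECTIONS, VARIANTS)]


matrix = build_matrix()
print(len(matrix))  # prints 12
```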

Some references:
https://wikitech.wikimedia.org/wiki/Swift/How_To
https://wikitech.wikimedia.org/wiki/Media_storage
https://wikitech.wikimedia.org/wiki/Swift

Event Timeline

I'll add some thoughts/braindump below:

  1. For production swift access isn't permitted from cloud vps, does the zim generation and upload need to happen in cloud vps?
  1. The simplest way for chunking I believe could be via swift's "dynamic large objects", essentially upload the file chunks with specific filenames (in the same swift container) so that swift can reassemble the chunks correctly. (TODO, verify e.g. range requests work as expected) See also https://docs.openstack.org/swift/latest/overview_large_objects.html#module-swift.common.middleware.dlo

Since for DLO the chunks need to be in the same container we could shard the containers e.g. 256 ways based on the filename hash (like it happens in production for big wikis and commons).
e.g. using the swift commandline client and a file with hash 4b712aeabedb0109b1099b0b5fe34508: swift upload zim_4b -S 1073741824 4b712aeabedb0109b1099b0b5fe34508 and the file would be available for download at https://swift_hostname/zim_4b/4b712aeabedb0109b1099b0b5fe34508
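A minimal sketch of the container sharding described above, assuming the shard is simply the first two hex characters of a 32-character hex hash (which gives 256 containers). The helper names `shard_container` and `download_url` are hypothetical, and how the hash is derived from the filename is not specified here, so the hash is taken as an input.

```python
import re

# Assumption: hashes are 32 lowercase hex characters, as in the example above.
_HEX32 = re.compile(r"^[0-9a-f]{32}$")


def shard_container(file_hash, prefix="zim_"):
    """Map a hash to one of 256 containers using its first two hex digits."""
    if not _HEX32.match(file_hash):
        raise ValueError("expected a 32-character lowercase hex hash")
    return prefix + file_hash[:2]


def download_url(host, file_hash):
    """Build the download URL for an object stored under its hash."""
    return f"https://{host}/{shard_container(file_hash)}/{file_hash}"


print(shard_container("4b712aeabedb0109b1099b0b5fe34508"))  # prints zim_4b
```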

  1. To test download/upload in beta let me know a swift username to be used and I'll create the account in beta and production

Hey @fgiunchedi,

  1. For production swift access isn't permitted from cloud vps, does the zim generation and upload need to happen in cloud vps?

In principle it doesn't have to happen in Cloud VPS, that's just where I'm prototyping right now, though we were thinking (or hoping) it would be an appropriate place for an occasional (~1x/month) zim generation job, if possible to just keep it running there after the prototyping is finished. If that's out of the question, then we'd have to talk to someone about finding some appropriate production cluster hardware...

I shouldn't have any trouble accessing the beta cluster Swift instance from Cloud VPS, correct?

  1. The simplest way for chunking I believe could be via swift's "dynamic large objects", essentially upload the file chunks with specific filenames (in the same swift container) so that swift can reassemble the chunks correctly. (TODO, verify e.g. range requests work as expected) See also https://docs.openstack.org/swift/latest/overview_large_objects.html#module-swift.common.middleware.dlo

Since for DLO the chunks need to be in the same container we could shard the containers e.g. 256 ways based on the filename hash (like it happens in production for big wikis and commons).
e.g. using the swift commandline client and a file with hash 4b712aeabedb0109b1099b0b5fe34508: swift upload zim_4b -S 1073741824 4b712aeabedb0109b1099b0b5fe34508 and the file would be available for download at https://swift_hostname/zim_4b/4b712aeabedb0109b1099b0b5fe34508

Thanks! I'll try that approach first when I get to the uploading stage.
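For planning the upload, here is a rough sketch of how a file would split at the suggested segment size of 1073741824 bytes (1 GiB). The segment naming scheme (`<object>/<zero-padded index>`) is an assumption for illustration, not what the swift client actually generates. The zero padding matters because DLO concatenates segments in sorted name order.

```python
SEGMENT_SIZE = 1073741824  # 1 GiB, the -S value suggested above


def plan_segments(object_name, total_size, segment_size=SEGMENT_SIZE):
    """Return (segment_name, offset, length) tuples covering the whole file."""
    if total_size < 0 or segment_size <= 0:
        raise ValueError("sizes must be non-negative and segment_size positive")
    segments = []
    offset = 0
    index = 0
    while offset < total_size:
        length = min(segment_size, total_size - offset)
        # Hypothetical naming: zero-padded so lexicographic order equals byte order.
        segments.append((f"{object_name}/{index:08d}", offset, length))
        offset += length
        index += 1
    return segments


# A file slightly over 2 GiB needs three segments, the last one very small.
for name, offset, length in plan_segments("example.zim", 2 * SEGMENT_SIZE + 5):
    print(name, offset, length)
```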

  1. To test download/upload in beta let me know a swift username to be used and I'll create the account in beta and production

Let's use the username mdholloway.

Thanks again!

Hey @fgiunchedi,

  1. For production swift access isn't permitted from cloud vps, does the zim generation and upload need to happen in cloud vps?

In principle it doesn't have to happen in Cloud VPS, that's just where I'm prototyping right now, though we were thinking (or hoping) it would be an appropriate place for an occasional (~1x/month) zim generation job, if possible to just keep it running there after the prototyping is finished. If that's out of the question, then we'd have to talk to someone about finding some appropriate production cluster hardware...

Yeah, it shouldn't be hard to find a spot in the production cluster to run the generation job; running everything on a cloud VM for prototyping now is more than adequate.

I shouldn't have any trouble accessing the beta cluster Swift instance from Cloud VPS, correct?

From inside beta it should be already accessible, from other projects I'm not sure but I'm assuming your vm runs in beta already?

  1. To test download/upload in beta let me know a swift username to be used and I'll create the account in beta and production

Let's use the username mdholloway.

I'd like to keep the username tied to the service rather than a person, maybe pagecompilation? zimdumps?

I shouldn't have any trouble accessing the beta cluster Swift instance from Cloud VPS, correct?

From inside beta it should be already accessible, from other projects I'm not sure but I'm assuming your vm runs in beta already?

Looks like the beta cluster is itself a project in Labs/Cloud VPS (is that what you mean by beta?) as well, so yes, I think we're on the same page.

  1. To test download/upload in beta let me know a swift username to be used and I'll create the account in beta and production

Let's use the username mdholloway.

I'd like to keep the username tied to the service rather than a person, maybe pagecompilation? zimdumps?

Good point. I like pagecompilation.

Change 371579 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: create pagecompilation account

https://gerrit.wikimedia.org/r/371579

Change 371579 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: create pagecompilation account

https://gerrit.wikimedia.org/r/371579

ema triaged this task as Medium priority. Sep 28 2017, 2:49 PM

This is stalled, possibly indefinitely. Consider reopening if and when this work picks back up.