Determine where to host zim files for the Android app
Closed, Resolved · Public

Description

We would like to host Zim files on WMF infrastructure for the upcoming Android compilations feature.

These files will range in size from roughly 1 GB to 20 GB.
Clients will download them over HTTP and will need support for byte offsets and ranges, and/or chunking.
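
As a rough illustration of the byte-range requirement above, the sketch below computes the `Range` header values a client could send to fetch a file in fixed-size chunks, or to resume after an interruption. The function name `byte_ranges` and the sizes used are hypothetical, not taken from the app, and the sketch assumes the server honors range requests (answering with `206 Partial Content`).

```python
def byte_ranges(total_size, chunk_size, resume_from=0):
    """Yield HTTP Range header values covering [resume_from, total_size).

    Hypothetical helper for illustration only; the sizes are assumptions.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    start = resume_from
    while start < total_size:
        # Range end offsets are inclusive, hence the "- 1".
        end = min(start + chunk_size, total_size) - 1
        yield f"bytes={start}-{end}"
        start = end + 1


# Example: a 10-byte file fetched in 4-byte chunks.
print(list(byte_ranges(10, 4)))
# Resuming the same download after the first 8 bytes have arrived.
print(list(byte_ranges(10, 4, resume_from=8)))
```

A client would send each value as the `Range` request header and append the response body to the partially downloaded file.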

  1. Where can we host these files?
  2. What caching considerations do we need to make?
  3. What about CDNs?
  4. Do we host any existing large files that we can use as a template?

So far we have 2 candidates for where to host:

  1. Swift (along with Commons media)
  2. Dumps

For example, Kiwix hosts its Zim files on their own server, with various academic and private mirrors in Europe and other continents; MirrorBrain directs every download to a suitable mirror (the dumps host serves only a negligible part of the total bandwidth):
http://wiki.kiwix.org/wiki/Content_in_all_languages

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 17 2017, 4:12 PM

@ArielGlenn It was suggested that the servers that host XML dumps would be a good fit for this use case. Do you have any thoughts here? Do you know if there is space, or do you have any information on how large files are served efficiently (caching, CDN, etc.)?

These would go on the new labstore boxes once they are up (replacing the dumps servers). I don't remember how much space those boxes have, adding @chasemp for that bit. What is the expected total size for all files, as a rough guess?

Thanks @ArielGlenn!

We will know the definitive list (and then the total size) once this task is complete: T169905: Determine the list of content packs to make available in V1

(@JMinor and @atgo are working out which compilations we will host on that task)

@JMinor do you have an estimate of when you will know?

We just figured this out for the first pass, @JMinor should be updating the task soon.

I'm confused about whether this is archival content or something actively used by the app.

@chasemp these Zim files will be used to support offline reading within the Android app. So they will actively be used by the users who are interested in having large amounts of content for offline consumption.

This is not dissimilar to content hosted by Kiwix here:
http://wiki.kiwix.org/wiki/Content_in_all_languages

Except that it is formatted specifically for our clients and the content itself will be specifically curated for our use cases.

Does this answer your question?

bd808 added a subscriber: bd808. · Jul 19 2017, 9:59 PM

@Fjalapeno so you want to stream 20G binary files off of the dumps.wikimedia.org backend server(s) to anyone running the Android app? I think we need to talk a lot more about this before the Cloud Services team can +1 that. Dumps is not as far as I know currently a "tier 1" supported service, and we are at the very beginning of taking it over.

@bd808 np... that’s what this ticket is trying to find out - we don’t know how and where this fits in to our infrastructure.

I’ll set a meeting up to chat. Anyone that should be there besides you?

It was suggested that the servers that host XML dumps would be a good fit for this use case.

Do you have a more complete description of the use-case and a pointer to who pointed you to Dumps as a solution? I'm going to guess that the need for a fairly large pile of disk storage and an HTTP frontend was what led to thinking about the Dumps boxes.

The Swift cluster (which holds all the media for Commons) might be another thing to look into. Swift is a tier 1 service for sure. I can't speak for its ability to support your use-case though. We would need to first know more details and then talk to @fgiunchedi about the suitability.

@bd808 np... that’s what this ticket is trying to find out - we don’t know how and where this fits in to our infrastructure.
I’ll set a meeting up to chat. Anyone that should be there besides you?

I can probably talk to the Cloud Services side of things, at least enough for an initial meeting.

Nuria added a subscriber: Nuria. · Jul 20 2017, 6:48 PM

@Fjalapeno: Do ping analytics too please. Dumps is for archival content, as @bd808 pointed out, and as far as I know it is not fronted by Varnish.

No, it is not. Some of these files are quite large, multiple GB of content. It's possible that in the future Varnish could be used with chunked content transfers, but we have not looked into this.

@ArielGlenn thanks for the confirmation.

@GWicke do you have any insights here on caching/delivering large files? Not sure if you have experience with this in your current infrastructure.

Fjalapeno updated the task description. · Jul 21 2017, 7:12 PM
Nuria added a comment. · Jul 21 2017, 7:24 PM

@Fjalapeno: dumps infrastructure is on labs, which doesn't have the same level of availability as our production infrastructure, nor does it communicate with it for the most part. Given that, it seems not the best choice for a production-level feature, or even a choice at all? (cc @bd808 to keep me honest here in that we do not run production-level services from labs)

@fgiunchedi I was directed to you by @bd808 who said you might have some info on the possibility of hosting these types of files using our Swift cloud storage infrastructure.

Do you have any input here? Would you be able to assist us in determining what is possible? Thanks!

bd808 added a comment. · Jul 21 2017, 9:09 PM

@Fjalapeno: dumps infrastructure is on labs, which doesn't have the same level of availability as our production infrastructure, nor does it communicate with it for the most part. Given that, it seems not the best choice for a production-level feature, or even a choice at all? (cc @bd808 to keep me honest here in that we do not run production-level services from labs)

What is this 'labs' you speak of? ;) Cloud Services hosts a variety of projects, but you are correct that by and large we do not guarantee high availability in those projects. Our most ambitious goal is 99.5% availability for "core" OpenStack services which itself is poorly defined. We have not officially taken over the public facing dumps infrastructure yet nor set any SLA levels for the HTTP component of that service. These are some of the issues I talked through with @Fjalapeno in a conference call today that have not been documented here yet.
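
To put the 99.5% figure in perspective, the arithmetic below converts an availability target into the downtime it permits. This is only a back-of-the-envelope sketch: the helper name `allowed_downtime_hours` is hypothetical, and the 365-day year and 30-day month are simplifying assumptions.

```python
def allowed_downtime_hours(availability_percent, period_hours):
    """Hours of downtime permitted by an availability target over a period."""
    return period_hours * (100.0 - availability_percent) / 100.0


HOURS_PER_YEAR = 365 * 24   # assumption: non-leap year
HOURS_PER_MONTH = 30 * 24   # assumption: 30-day month

# 99.5% availability leaves 0.5% of the period as permitted downtime.
print(round(allowed_downtime_hours(99.5, HOURS_PER_YEAR), 1), "hours per year")
print(round(allowed_downtime_hours(99.5, HOURS_PER_MONTH), 1), "hours per month")
```

Under these assumptions the target allows roughly 43.8 hours of downtime per year, or about 3.6 hours per month.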

@Nuria after discussing with @bd808 we are looking more at Swift than Cloud Services. But Dumps is still being considered as a (temporary) fallback option depending on the outcome and timeline.

For now, the feature will be scoped to just a small set of compilations and a single platform. Usage is projected to be quite low.
We have also determined that we don't necessarily need logging (as we can handle analytics by using EL on the clients).

Having said that, we would like to get this on production hardware rather than Cloud Services. Swift seems to have many HTTP features that we need for downloading large files, and so that seems like the place to investigate first.

@fgiunchedi I was directed to you by @bd808 who said you might have some info on the possibility of hosting these types of files using our Swift cloud storage infrastructure.
Do you have any input here? Would you be able to assist us in determining what is possible? Thanks!

We do store "big" multimedia files in Swift, though Swift's built-in limit is 5GB per file. When a file exceeds that size, there are options in Swift for "large files". Swift is fronted by Varnish, so Traffic should also be looped in.
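
As a sketch of what the "large files" option implies: Swift's large object support stores a big file as several smaller segment objects plus a manifest that ties them together. The snippet below only plans how a file could be split into segments; the function name `plan_segments`, the 1 GiB segment size, and the 20 GiB file size are illustrative assumptions, and nothing here calls Swift itself.

```python
GIB = 1024 ** 3


def plan_segments(file_size, segment_size=GIB):
    """Return (index, offset, length) tuples splitting a file into segments.

    Hypothetical helper: it only does the arithmetic and uploads nothing.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    segments = []
    offset = 0
    index = 0
    while offset < file_size:
        # The final segment may be shorter than segment_size.
        length = min(segment_size, file_size - offset)
        segments.append((index, offset, length))
        offset += length
        index += 1
    return segments


# Example: an assumed 20 GiB file split into 1 GiB segments.
plan = plan_segments(20 * GIB)
print(len(plan), "segments, each at most", GIB, "bytes")
```

Each planned segment would be uploaded as its own object, well under the per-object limit, with the manifest listing them in order.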

Some follow-up questions I have (not looking for exact answers immediately, ballpark figures are fine for now):

  1. expected initial size (number of files and bytes) and growth e.g. month over month (possibly for T169905)
  2. what a full deployment would look like size-wise
  3. how the files will be accessed by clients, since you mentioned chunked transfers and ranges
  4. expected req/s
  5. acceptable latency from swift

Nuria added a comment. · Jul 25 2017, 2:58 PM

Analytics infrastructure to measure Swift traffic already exists, so you would only need to query the data store and set up a job that harvests the data. Once you know the architecture of your feature, I would start thinking of a metric that represents usage (absolute counts of downloads is not a good one). We can advise in that regard.

@fgiunchedi thanks for the info…

I'm working on the answers for you now, based on the outcome of the other ticket you linked to.

Also adding Traffic so it is on their radar.

Per our email, I will set up a meeting so we can discuss. Thanks again!

@Nuria thanks for the info on analytics as well… will keep you looped in as we figure things out.

Restricted Application added a project: Operations. · Jul 25 2017, 3:00 PM
Fjalapeno updated the task description. · Jul 25 2017, 4:51 PM
Krinkle updated the task description. · Jul 27 2017, 2:39 AM
Nemo_bis updated the task description. · Jul 28 2017, 6:45 AM
ema moved this task from Triage to Caching on the Traffic board. · Aug 1 2017, 12:45 PM
Fjalapeno closed this task as Resolved. · Aug 9 2017, 7:18 PM

To follow up here, we most likely are using Swift for this. We have filed several follow up tasks for implementing and testing this solution. Closing this task and tracking progress there. If we find that Swift can't actually be used, we will reopen this task.