Storage request for datasets published by research team
Open, Needs TriagePublic

Description

The research team would like to release public datasets hosted on Wikimedia infrastructure. These datasets are dumps-like in nature, consisting of a number of large files that together represent a full dataset.

Following a conversation with Lukasz, this storage request is for a proof of concept dataset that the research team will use to develop the process and tooling for future datasets. We will also request additional storage resources for the upcoming Misc swift cluster.

PoC Dataset

The dataset serving as a proof of concept is the Commons images at a 300px thumbnail resolution, released for machine learning research. The total size of the dataset is ~3TB, stored as roughly 3000 files of ~1GB each. The files are gzipped, newline-delimited JSON; each line corresponds to an image, and the fields include the base64-encoded image bytes. The dataset is generated using Spark on the data engineering cluster, so the number and size of the files are configurable.

Growth

Generally:

  • no duplicated data, i.e. if there is a monthly schedule to update the dataset, the growth of the dataset comes only from the additional data for the new month
  • not all datasets need to be regularly updated
  • for data privacy reasons, some data might be deleted after a period of time, slightly offsetting growth over time

For the PoC dataset, the growth would mirror the growth of the Commons images, though we don't expect to update the PoC dataset regularly before moving to the new misc Swift cluster.

Expected traffic / hit rate

Datasets released by the team are generally used for academic research. This results in a burst of traffic for the initial download of the data, possibly from a distributed system. In addition, upon initial release of a dataset, multiple users may download it concurrently.

  • overall, few distinct users that download data
  • expect most users to download the full dataset
  • a partitioning scheme can be used to enable downloading subsets of the data. The partitioning depends on the type of dataset; most common is to use wiki project and/or time-based partitions (e.g. "commons/2021/10/15/file001_005.json.gzipped")

Given the above usage description, I am not sure whether the hit rate is meaningful in this context.
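The time-based partition layout mentioned above can be sketched as a small path helper. The naming scheme is an assumption modelled on the "commons/2021/10/15/file001_005.json.gzipped" example, not a fixed convention.

```python
from datetime import date

def partition_path(project, day, shard, total_shards):
    """Build a hypothetical project/date partition path for one shard,
    e.g. commons/2021/10/15/file001_005.json.gz."""
    return (f"{project}/{day.year:04d}/{day.month:02d}/{day.day:02d}/"
            f"file{shard:03d}_{total_shards:03d}.json.gz")
```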

Permissions for writing (new account)

Likely, a new Swift account and/or container is needed?

Misc swift cluster

What is the process for requesting resources on the future misc swift cluster?

Event Timeline

Hi,

Sorry for the delay in getting back to you. I have a couple of questions about your request, if I may:

  1. Are you OK with using the S3 protocol (rather than the Swift protocol)?
  2. Do you have an idea of how much performance/bandwidth you need (read and write)?
  3. What sort of timescale do you need this storage on?

Thanks :)

Thank you for the reply!

  1. Are you OK with using the S3 protocol (rather than the Swift protocol)?

Yes, that would work well with the largish file-like objects we intend to store. However, I was under the impression that the S3 protocol will only be enabled for the new misc Swift cluster, which will not be available for a while.

  2. Do you have an idea of how much performance/bandwidth you need (read and write)?

That is hard to say. Looking at previous dataset releases (which are much smaller in scale, at most a couple of gigabytes), we can expect the number of downloads of the full dataset to be in the hundreds, with few concurrent reads (we could add instructions/tooling for the recommended way to download a full dataset). Do you have statistics about the XML dumps downloads for reference? That would likely be a (high) upper bound of what to expect.

In regards to writes, this will only be done using pipelines maintained by the research team, so we will be able to configure these according to your recommended/requested write load.

  3. What sort of timescale do you need this storage on?

The approach we discussed with Lukasz is to start with a proof of concept dataset to develop our tooling as soon as possible, hosted on the existing swift cluster. This is under the assumption that the scale of the proof of concept dataset is limited given the overall resources available.

Hi,

  1. Are you OK with using the S3 protocol (rather than the Swift protocol)?

Yes, that would work well with the largish file-like objects we intend to store. However, I was under the impression that the S3 protocol will only be enabled for the new misc Swift cluster, which will not be available for a while.

It's true that the main Swift cluster doesn't support S3, but the Thanos cluster does, and that has enough capacity at least for your proof-of-concept needs.

  3. What sort of timescale do you need this storage on?

The approach we discussed with Lukasz is to start with a proof of concept dataset to develop our tooling as soon as possible, hosted on the existing swift cluster. This is under the assumption that the scale of the proof of concept dataset is limited given the overall resources available.

In which case, I think we can use Thanos for the proof-of-concept, and depending on how that goes look to move to MOSS (the newer cluster) in due course.

Does that seem good?

Hi,

That sounds good to me. Thank you!

Please let me know what the next steps are.

Change 737913 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] profile::thanos::swift: add account for research datasets poc

https://gerrit.wikimedia.org/r/737913

Change 737915 had a related patch set uploaded (by MVernon; author: MVernon):

[labs/private@master] profile::thanos::swift: fake creds for research_poc

https://gerrit.wikimedia.org/r/737915

Hi,

Please let me know what the next steps are.

I need to make some credentials (which I've not done before, but there is a process to follow) and then you'll need to get hold of them. Do you have access to the private puppet repository?

[the obvious person to review these CRs is away this week, so I can't promise a rapid turnaround]

Hi,

Thanks for getting this started, no worries about the delay for review.

I presume I don't have access to the private puppet repository, I haven't been involved in changes to that previously.

Change 737913 merged by MVernon:

[operations/puppet@production] profile::thanos::swift: add account for research datasets poc

https://gerrit.wikimedia.org/r/737913

Change 737915 merged by MVernon:

[labs/private@master] profile::thanos::swift: fake creds for research_poc

https://gerrit.wikimedia.org/r/737915

Mentioned in SAL (#wikimedia-operations) [2021-11-23T15:27:01Z] <Emperor> rolling restart of thanos frontends T294380

Account is created; I gather the usual approach is to instruct puppet to write a configuration file with the relevant details in it (taken from profile::thanos::swift::accounts_keys ), similar to how objstore.yaml is written by modules/thanos/manifests/compact.pp or the lookups in modules/profile/manifests/docker_registry_ha/registry.pp.

Thank you for the updates.

I am not familiar with the additional steps required to activate/use the credentials in puppet. Would you be able to provide a more concrete pointer/link to an example or Phab task, or an IRC channel/person I could ask for assistance?

Here's how we do it for the analytics_admin swift account:
https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/cluster/secrets.pp#L55-L64

Where do you need these creds deployed? And in what format? We've only used the Swift auth environment variables ST_AUTH, ST_USER and ST_KEY.
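For reference, the Swift auth environment variables mentioned above look roughly like this; the user and key values here are placeholders, and the real credentials live in the private puppet repo.

```shell
# Placeholder values for illustration only; the real credentials
# come from profile::thanos::swift::accounts_keys in private puppet.
export ST_AUTH=https://thanos-swift.discovery.wmnet/auth/v1.0
export ST_USER='research_poc:user'
export ST_KEY='REPLACE_WITH_PASSPHRASE'
```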

At first we would like to use the Swift credentials from YARN containers, both from Spark and skein-based applications. This will mostly be used for write operations using the S3 protocol.

What format options are you referring to? I was looking at the options in the docs; I don't have a preference, whichever option is easiest to set up and work with.

Hm, I haven't attempted to access swift using the S3 protocol. How does swift auth work there? I was about to render an env file you could use for the swift CLI or python client: https://docs.openstack.org/python-swiftclient/latest/cli/index.html

But will that work with plain S3 REST? Probably not? I suppose I could render the env file, you could source the auth vars you need from it, and then pass them in the appropriate HTTP headers for the S3 REST protocol?

For S3, you need three things: access key, secret key, endpoint.

For thanos, these are:
access key: the username
secret key: the passphrase
endpoint: https://thanos-swift.discovery.wmnet/

Your S3 client will have a way to be passed those things (e.g. ~/.s3cfg for s3cmd)
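For s3cmd, those three values would go into a `~/.s3cfg` along these lines; this is a sketch, with the access key and secret key as placeholders for the real account credentials.

```ini
# Minimal ~/.s3cfg sketch for the Thanos endpoint.
# access_key/secret_key are placeholders, not real values.
[default]
access_key = research_poc:user
secret_key = REPLACE_WITH_PASSPHRASE
host_base = thanos-swift.discovery.wmnet
host_bucket = thanos-swift.discovery.wmnet
use_https = True
```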

@MatthewVernon endpoint should just be the host URL, without the /auth/v1.0 path?

And one more question:

Once the authentication is working, and some files are inserted with a public-read ACL, will these files be publicly accessible from outside the WMF? If not, I assume one option is to use an API service to "route" these files, which is something we plan on doing for other use cases that need to manipulate the Swift data before returning a response. However, it would be great if there is a way to make files publicly available directly.