Storage request for datasets published by research team
The research team would like to release public datasets hosted on Wikimedia infrastructure. These datasets are dumps-like in nature: a number of large files that together represent a full dataset.
Following a conversation with Lukasz, this storage request covers a proof-of-concept dataset that the research team will use to develop the process and tooling for future datasets. We will also request additional storage resources on the upcoming Misc swift cluster.
PoC Dataset
The dataset serving as the proof of concept is the Commons images at a 300px thumbnail resolution, released for machine learning research. The total size of the dataset is ~3TB, split into roughly 3000 files of ~1GB each. The files are gzipped, newline-delimited JSON; each line corresponds to an image, and the fields include the base64-encoded image bytes. The dataset is generated using Spark on the Data Engineering cluster, so the number and size of the files is configurable.
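For illustration, consuming one of these files could look like the sketch below. The field name `image_bytes_b64` is an assumption for the example, not the actual schema:

```python
import base64
import gzip
import json

def read_images(path):
    """Yield (record, image_bytes) pairs from one gzipped NDJSON file.

    Hypothetical sketch: the field name "image_bytes_b64" is an
    assumption, not the dataset's real column name.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)  # one JSON object per line
            image = base64.b64decode(record["image_bytes_b64"])
            yield record, image
```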
Growth
Generally:
- no duplicated data, i.e. if there is a monthly schedule to update the dataset, the growth comes only from the additional data for the new month
- not all datasets need to be regularly updated
- for data privacy reasons, some data might be deleted after a period of time, slightly reducing growth over time
For the PoC dataset, the growth would mirror the growth of the Commons images, though we don't expect to update the PoC dataset regularly before moving to the new Misc swift cluster.
Expected traffic / hit rate
Datasets released by the team are generally used for academic research. This results in a burst of traffic for the initial download of the data, possibly from a distributed system. In addition, upon initial release of a dataset, multiple users may download it concurrently.
- overall, few distinct users that download data
- expect most users to download the full dataset
- a partitioning scheme can be used to enable downloading subsets of the data. The partitioning depends on the type of dataset; most common is to use wiki project and/or time-based partitions (e.g. "commons/2021/10/15/file001_005.json.gzipped")
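The time-based partitioning described above could be sketched as follows; the function name and signature are hypothetical, and the path layout simply mirrors the example given:

```python
import datetime

def partition_path(project, date, part_start, part_end):
    """Build a partitioned object path such as
    'commons/2021/10/15/file001_005.json.gzipped'.

    Hypothetical sketch of the wiki-project + date partitioning
    scheme mentioned in this request; not an agreed-upon layout.
    """
    return (f"{project}/{date.year}/{date.month:02d}/{date.day:02d}/"
            f"file{part_start:03d}_{part_end:03d}.json.gzipped")
```

With such a layout, a researcher who only needs a subset could fetch the objects under a single date prefix rather than the full ~3TB.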
Given the above usage description, I am not sure whether the hit rate is meaningful in this context.
Permissions for writing (new account)
Likely, a new Swift account and/or container is needed for publishing these datasets?
Misc swift cluster
What is the process for requesting resources on the future misc swift cluster?