Reading dumps is painfully slow on Toolforge Kubernetes. Below are some ideas for fixing this, though I'm not sure whether any of them would work.
Steps to reproduce:
- On Toolforge Kubernetes, launch a job that runs the following command: time cat /public/data/public/wikidatawiki/entities/20210215/wikidata-20210215-all.json.bz2 >/dev/null. Since the dump files are compressed, typical tools will read them sequentially; cat seems like a reasonable model for this access pattern.
- Divide 62424 (the file size in Megabytes, assuming 1 Megabyte = 1 million bytes) by the reported time in seconds. This gives the read throughput in MByte/s.
Observed: 4.8 MByte/s
Expected: 100–200 MByte/s
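The reproduction steps above can be sketched as a self-contained script. This is a hedged sketch, not the exact job I ran: it generates a local stand-in file so it runs anywhere; on Toolforge, point DUMP at the actual dump path and drop the dd line.

```shell
# Generate a 64 MiB stand-in file (replace with the real dump on Toolforge).
DUMP=$(mktemp)
dd if=/dev/zero of="$DUMP" bs=1M count=64 status=none

# Time a sequential read of the whole file.
BYTES=$(stat -c %s "$DUMP")
START=$(date +%s.%N)
cat "$DUMP" >/dev/null
END=$(date +%s.%N)

# Throughput in MByte/s, with 1 MByte = 10^6 bytes as in the steps above.
awk -v b="$BYTES" -v s="$START" -v e="$END" \
    'BEGIN { printf "%.1f MByte/s\n", b / 1e6 / (e - s) }'
rm -f "$DUMP"
```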
To arrive at this expected number, I did some benchmarking on Digital Ocean, a commercial cloud provider that, according to their blog, uses Ceph for mounted block storage. I created a new Ubuntu 20.10 virtual machine in their NYC1 datacenter, using the minimal $5/month offering. To that machine, I attached a 100 GB block storage volume and downloaded wikidata-20210215-all.json.bz2 into the mounted volume. After a reboot to clear caches, cat could read the 62424 MB Wikidata dump from Ceph in 296 seconds. So, on Digital Ocean, the read throughput from mounted Ceph storage was about 210 MByte/s, roughly 44 times faster than reading dumps on Toolforge from NFS.
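As a quick sanity check of the 210 MByte/s and 44× figures above:

```python
# Throughput arithmetic from the numbers reported above.
size_mb = 62424          # dump size in MB (1 MB = 10**6 bytes)
do_seconds = 296         # wall-clock time for `cat` on Digital Ocean

do_mbps = size_mb / do_seconds   # Digital Ocean read throughput
toolforge_mbps = 4.8             # observed on Toolforge NFS

print(int(do_mbps))                      # → 210
print(round(do_mbps / toolforge_mbps))   # → 44
```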
Admittedly, this comparison is flawed: on Digital Ocean, a Ceph volume is mounted exclusively into one single machine, whereas on Toolforge, the Wikimedia dump files would be accessed concurrently by multiple Kubernetes nodes. Also, Wikimedia's networking equipment might be different from Digital Ocean's. Still, the current Toolforge setup makes it hard to write tools that process dumps in reasonable time. Some potential approaches, though I'm not sure whether they'd work:
- Kubernetes has a ReadOnlyMany option in its persistent volume claims, which is supported by Ceph RBD (not sure about Cinder). However, this probably can't be used at the same time as ReadWriteOnce, so it's unclear how the dump volume would get populated. On the other hand, as a tool writer, I wouldn't care if the dumps volume doesn't change after the cronjob has been launched, as long as the mount is reasonably current upon the next run.
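For illustration, a ReadOnlyMany claim for a shared dumps volume might look like the sketch below. The storage class name and size are made up, and populating the volume (the ReadWriteOnce side) remains the open question:

```yaml
# Hypothetical sketch; storageClassName and storage size are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dumps-ro
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ceph-rbd   # assumed; depends on cluster setup
  resources:
    requests:
      storage: 200Gi
```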
- Maybe replicate the dump files to Swift, making the S3 API accessible to Toolforge? As a tool developer, I wouldn’t mind calling a special command (or an S3-compatible client library) to fetch the data, as long as it’s significantly faster than the current NFS mount.
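From the tool-developer side, fetching a replicated dump over an S3-compatible API could look like the following sketch. The endpoint, bucket name, key naming scheme, and unsigned access are all assumptions, not an existing service:

```python
# Hypothetical sketch of fetching a dump via an S3-compatible API,
# assuming the dumps were replicated to Swift and exposed as objects.
ENDPOINT = "https://object.toolforge.example"   # made-up endpoint
BUCKET = "dumps"                                # made-up bucket name

def dump_key(wiki: str, date: str) -> str:
    """Build the object key for a dump file (naming scheme assumed)."""
    return f"{wiki}/entities/{date}/wikidata-{date}-all.json.bz2"

def fetch_dump(wiki: str, date: str, dest: str) -> None:
    """Download one dump file to a local path using boto3."""
    import boto3  # imported here so the key helper works without boto3
    s3 = boto3.client("s3", endpoint_url=ENDPOINT)
    s3.download_file(BUCKET, dump_key(wiki, date), dest)
```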
- Could dumps.wikimedia.org offer better bandwidth to Toolforge? Currently, dumps can be fetched from Toolforge over HTTP at about 4 MByte/s, which is even slower than NFS. If the publicly accessible dumps.wikimedia.org server is physically co-located with the Toolforge cluster (not sure if that's the case), perhaps the bandwidth limit could be increased? If it's located elsewhere, perhaps partner with some CDN?