There are a few use cases for downloading dumps on internal hosts (initialization of WDQS hosts from wikidata dumps, for example). With the current rate limit at 2MB/s, this can take more time than needed. I don't think that we have a good reason to rate limit internally, so we probably should not, or we should at least increase this limit for internal clients.
|operations/puppet||production||+1 -1||dumps distribution: increase the rate limit to 5MBps|
The use case being run currently is actually the cirrus dumps to initialize cloudelastic servers. They are downloaded on mwmaint1002 with curl -s https://dumps.wikimedia.org/other/cirrussearch/20190429/enwiki-20190429-cirrussearch-general.json.gz
The other use case would be wdqs downloading https://dumps.wikimedia.org/wikidatawiki/entities/20170123/wikidata-20170123-all-BETA.ttl.bz2 (well, the latest similar dump) to one of the wdqs* servers.
The particular limit we are running into is part of the nginx config in modules/dumps/templates/web/xmldumps/nginx.conf.erb which specifies limit_rate 2048k. I'm not sure about the particulars of varying this between requests from internal and external networks.
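To make the internal vs. external question concrete, here is a minimal sketch of the decision logic only, not of the actual nginx configuration. The network range and the internal cap are placeholders chosen for illustration; only the 2048k figure comes from the config mentioned above.

```python
import ipaddress

# Assumption: a single private range stands in for "internal" clients.
# The real list of internal networks is not given in this task.
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]

EXTERNAL_LIMIT_K = 2048  # current limit_rate value from nginx.conf.erb
INTERNAL_LIMIT_K = 5120  # placeholder value for illustration only


def rate_limit_for(client_ip: str) -> int:
    """Return the per-connection cap (in nginx 'k' units) for a client address."""
    addr = ipaddress.ip_address(client_ip)
    # Membership is False when address and network IP versions differ.
    if any(addr in net for net in INTERNAL_NETWORKS):
        return INTERNAL_LIMIT_K
    return EXTERNAL_LIMIT_K


if __name__ == "__main__":
    for ip in ("10.64.0.1", "198.51.100.7"):
        print(ip, rate_limit_for(ip))
```

Whether the web server should express this split per request is a separate question; the sketch only shows that the classification itself is simple.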
Ok, it took a bit of doing, but here's what I found:
The limits that are in place are based on having a server that was on 10G Ethernet but with less RAM, etc. Interesting thing is that notes in tasks and git state that the problem was that while the pipe was fast-flowing, the disks could not keep up. I see we are using the exact same speed of disk (7200RPM), but we are using SATA now instead of SAS...so we went to a slower and less reliable connection with larger individual disks which also reduces performance in a RAID.
I would expect any problems we saw on dataset1001 to be exactly the same on these servers, in other words. I also see no reason to rate limit internal connections differently than external ones because the problem will be just as bad if not actually worse because of the faster network speeds possible if the bottleneck is the disks. All that said, the restriction seems pretty draconian. It might be reasonable to try stepping up the rate a little bit. I wouldn't just remove it or make a huge jump, though.
I propose lifting the overall limit to 5120k to see how that affects the server. If that's ok, we could try inching it up more or splitting the limits somehow.
Repeating here some things from a short chat in IRC:
The original limits were set because we had one host doing all of
- web service to the public
- nfs service to analytics and labs
- rsync to public mirrors
- back-end nfs share for dumps generation
And one hoggy downloader could bring all of that to a halt (including an internal downloader).
The current web server does just web service and rsync to mirrors, iiuc, plus receiving rsyncs of dump files as they are ready, and grabbing other datasets via rsync periodically. This should mean there's a bunch of headroom to tweak things.
I opened a task about this back in the day: T191491
One thing to keep in mind is that from time to time all the services (web, rsync to mirrors, nfs to labs/analytics) wind up residing on the same host for maintenance, so during those times limits may need to be stricter.
So it looks like for the foreseeable future, using an external dumps mirror will still be the way to go to retrieve full dumps internally. Unless there is more work to do that I don't see, we can probably close this task for now.
As fast as possible! Using an external mirror, we get a download speed of ~60M/s, which means that a Wikidata dump can be downloaded in ~15 minutes. From our own dumps servers, at 5M/s this takes about 3 hours. So ideally the same speed would be great. We would already switch to using our own dump servers even if they were 2x slower than the external mirror.
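For reference, the arithmetic behind those numbers can be sketched as follows. This assumes "M/s" means megabytes per second and derives the dump size from the quoted 60M/s for 15 minutes; the function name is made up for this example.

```python
def transfer_time_minutes(size_mb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to move size_mb megabytes at a steady rate."""
    return size_mb / rate_mb_per_s / 60


# Size implied by the figures above: 60 MB/s sustained for 15 minutes.
size_mb = 60 * 15 * 60

if __name__ == "__main__":
    # 60 = external mirror, 30 = "2x slower", 5 = current cap, 2 = old cap
    for rate in (60, 30, 5, 2):
        minutes = transfer_time_minutes(size_mb, rate)
        print(f"{rate:>3} MB/s: {minutes:.0f} min")
```

Under these assumptions the dump is about 54 GB, which matches the 3 hours quoted for 5M/s and would be 7.5 hours at the old 2M/s limit.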
Can we schedule rsyncs for the specific use cases with a higher bandwidth cap?
Would rsync put less load on the dump servers? From my point of view, we don't care much about rsync vs curl.
The thing about rsync is that we could set up a job to update just the dumps required at the interval wanted, with a single connection open. I'd feel a lot better about high bandwidth limits for something like that than generally opening up web service.
In the case of WDQS, we don't really have a schedule. It's an on demand requirement, whenever we need to do a data reload, which could happen for a number of reasons, and on different servers depending on the need.
I don't think we should spend much more time on this. We have a reasonable solution by using an external mirror. It just felt a bit weird to use an external mirror when we are the source of those dumps, but as it works and does not seem to be shocking for anyone, let's keep it at that!
I'm re-opening this as a follow up from a chat in this CR.
I think that we should find a solution for this, as it is not ideal that we have to rely on an external source for our own data, and doing so raises some concerns:
- the total size of the 3 files to download is over 100GB
- with the external URL I guess you have to use the HTTP proxies, adding unnecessary strain there
- the integrity of those files should be verified against a checksum coming from our internal and authoritative dumps, and this doesn't seem to be the case AFAICT
Some alternatives that are worth investigating:
- Fix the internal rate-limiting issue for internal clients only. The current dumps host has a 10G NIC so it shouldn't be a networking problem; not sure about the disk side of it.
- evaluate rsync for the transfer. For example we could have a slow rsync that copies only the required files to another host periodically and then have the cookbook pick them from this other location (either rsync or curl at that point) quickly.
- evaluate transfer.py for this use case ( see https://wikitech.wikimedia.org/wiki/Transfer.py )
Note that the checksum files for those dumps are available for download as well, since they are provided along with the main dump output files to all mirrors.
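As a rough illustration of what verifying a mirrored file against such a checksum could look like, here is a minimal sketch. It assumes the checksum file uses the common `<hex digest>  <filename>` line format and SHA-1; both the format and the helper names are assumptions for this example, not a description of what the cookbook does today.

```python
import hashlib
from typing import Tuple


def file_digest(path: str, algorithm: str = "sha1", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that multi-GB dumps never sit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def parse_checksum_line(line: str) -> Tuple[str, str]:
    """Split an assumed '<hex digest>  <filename>' line into its two parts."""
    digest, name = line.split(maxsplit=1)
    return digest.lower(), name.strip().lstrip("*")


def verify(path: str, checksum_line: str, algorithm: str = "sha1") -> bool:
    """True if the file at path matches the digest on the checksum line."""
    expected, _name = parse_checksum_line(checksum_line)
    return file_digest(path, algorithm) == expected
```

The point of the sketch is that the checksum line should be fetched from the authoritative dumps host (a tiny download, so the rate limit does not matter), while the large file itself can come from the mirror.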
Someone from WMCS will probably need to look at this (again) if the discussion is being re-opened. They should have insight into the impact on existing services from any change.