Do not rate limit dumps from internal network
Closed, Resolved · Public

Description

There are a few use cases for downloading dumps on internal hosts (initialization of WDQS hosts from wikidata dumps, for example). With the current rate limit at 2MB/s, this can take more time than needed. I don't think that we have a good reason to rate limit internally, so we probably should not, or we should at least increase this limit for internal clients.

Details

Subject	Repo	Branch	Lines +/-
query_service: Allow query hosts to rsync data from clouddumps	operations/puppet	production	+1 -1
Mount labstore to wcqs/wdqs instance for dumps reload	operations/puppet	production	+49 -49
Mount labstore to wcqs/wdqs instance for dumps reload	operations/puppet	production	+66 -1
Mount labstore to wcqs/wdqs instance for dumps reload	operations/puppet	production	+58 -1
dumps distribution: increase the rate limit to 5MBps	operations/puppet	production	+1 -1

Related Objects

Mentioned In: T327689: Use rsync instead of NFS for wdqs data reload cookbook
rCCKB26a7d01d0851: Use proxy server to download dumps
T316236: Reload WCQS from dumps
T161863: Support searching for external links in CirrusSearch
T258709: Add nfs mount point to labstore for wdqs servers (wikibase dumps access)
T128874: How can downloaders get good bandwidth with no impact on dumps production?
Mentioned Here: T327689: Use rsync instead of NFS for wdqs data reload cookbook
T191491: Adjust bandwidth/connection limits, memory settings on labstore1006,7 as appropriate

Event Timeline

Gehel created this task. May 2 2019, 9:34 AM

Restricted Application added a subscriber: Aklapper. May 2 2019, 9:34 AM

Where are these dumps being downloaded from, and by what means? (I may need to add someone else to this task to weigh in, depending on the answer.)

The use case being run currently is actually the cirrus dumps to initialize cloudelastic servers. They are downloaded on mwmaint1002 with curl -s https://dumps.wikimedia.org/other/cirrussearch/20190429/enwiki-20190429-cirrussearch-general.json.gz

The other use case would be wdqs downloading https://dumps.wikimedia.org/wikidatawiki/entities/20170123/wikidata-20170123-all-BETA.ttl.bz2 (well, the latest similar dump) to one of the wdqs* servers.

Ok, so adding @Bstorm to weigh in about these limits or to bounce it to someone else on the WMCS team. There's an open ticket for that too: T191491

EBernhardson moved this task from needs triage to Ops / SRE on the Discovery-Search board. May 2 2019, 5:04 PM

The particular limit we are running into is part of the nginx config in modules/dumps/templates/web/xmldumps/nginx.conf.erb which specifies limit_rate 2048k. I'm not sure about the particulars of varying this between requests from internal and external networks.

Dzahn triaged this task as Medium priority. May 3 2019, 8:39 PM

Gehel added a project: Wikidata-Query-Service. Oct 25 2019, 2:36 PM

Restricted Application added a project: Wikidata. Oct 25 2019, 2:36 PM

Addshore moved this task from incoming to monitoring on the Wikidata board. Oct 30 2019, 1:57 PM

It looks like the limit was last raised 5 years ago. I'll double check a couple things, but I suspect that's old stuff we can raise.

Ok, it took a bit of doing, but here's what I found:
The limits that are in place are based on having a server that was on 10G Ethernet but with less RAM, etc. Interestingly, notes in tasks and git state that the problem was that, while the pipe was fast, the disks could not keep up. I see we are using the exact same speed of disk (7200RPM), but we are using SATA now instead of SAS, so we went to a slower and less reliable connection with larger individual disks, which also reduces performance in a RAID.

I would expect any problems we saw on dataset1001 to be exactly the same on these servers, in other words. I also see no reason to rate limit internal connections differently than external ones because the problem will be just as bad if not actually worse because of the faster network speeds possible if the bottleneck is the disks. All that said, the restriction seems pretty draconian. It might be reasonable to try stepping up the rate a little bit. I wouldn't just remove it or make a huge jump, though.

I propose lifting the overall limit to 5120k to see how that affects the server. If that's ok, we could try inching it up more or splitting the limits somehow.

Change 555632 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: increase the rate limit to 5MBps

https://gerrit.wikimedia.org/r/555632

gerritbot added a project: Patch-For-Review. Dec 7 2019, 12:14 AM

• Bstorm added projects: Data-Services, cloud-services-team (Kanban). Dec 7 2019, 12:15 AM

• Bstorm moved this task from Backlog to Dumps on the Data-Services board.

• Bstorm moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Repeating here some things from a short chat in irc:

The original limits were set because we had one host doing all of

web service to the public
nfs service to analytics and labs
rsync to public mirrors
back-end nfs share for dumps generation

And one hoggy downloader could bring all of that to a halt (including an internal downloader).

The current web server does just web service and rsync to mirrors, iiuc, plus receiving rsyncs of dump files as they are ready, and grabbing other datasets via rsync periodically. This should mean there's a bunch of headroom to tweak things.

I opened a task about this back in the day: T191491

One thing to keep in mind is that from time to time all the services (web, rsync to mirrors, nfs to labs/analytics) wind up residing on the same host for maintenance, so during those times limits may need to be stricter.

Change 555632 merged by Bstorm:
[operations/puppet@production] dumps distribution: increase the rate limit to 5MBps

https://gerrit.wikimedia.org/r/555632

Maintenance_bot removed a project: Patch-For-Review. Dec 17 2019, 4:11 PM

So it looks like for the foreseeable future, using an external dumps mirror will still be the way to go to retrieve full dumps internally. Unless there is more work to do that I don't see, we can probably close this task for now.

How fast a download do folks want? Can we schedule rsyncs for the specific use cases with a higher bandwidth cap?

In T222349#5751029, @ArielGlenn wrote:

How fast a download do folks want?

As fast as possible! Using an external mirror, we get a download speed of ~60MB/s, which means that a Wikidata dump can be downloaded in ~15 minutes. From our own dumps servers, at 5MB/s this takes about 3 hours. So ideally the same speed would be great. We would already switch to using our own dump servers even if they were 2x slower than the external mirror.
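
A rough check of the figures above, as a minimal sketch. It assumes that "M/s" means megabytes per second and that the ~15 minute download ran at a steady ~60 MB/s; the resulting dump size of about 54 GB is derived from those two numbers and is not stated anywhere in this task. The "half of external" row is the "2x slower" case, and the 2 MB/s row is the cap from before the increase.

```python
def transfer_time_seconds(size_mb: float, rate_mb_per_s: float) -> float:
    """Seconds needed to move size_mb megabytes at a sustained rate."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_mb / rate_mb_per_s


def format_duration(seconds: float) -> str:
    """Render a duration as 'Hh MMm', rounded to the nearest minute."""
    minutes = round(seconds / 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


# Assumption: size implied by ~60 MB/s sustained for ~15 minutes.
DUMP_SIZE_MB = 60 * 15 * 60  # 54,000 MB, roughly 54 GB

# Rates in MB/s; the labels are illustrative, not official names.
RATES = [
    ("external mirror", 60),
    ("half of external", 30),
    ("current cap", 5),
    ("previous cap", 2),
]

for label, rate in RATES:
    duration = format_duration(transfer_time_seconds(DUMP_SIZE_MB, rate))
    print(f"{label:>17}: {duration}")
```

Under these assumptions the 5 MB/s cap works out to 3 hours, which matches the figure quoted above, and the earlier 2 MB/s cap to 7.5 hours.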

Can we schedule rsyncs for the specific use cases with a higher bandwidth cap?

Would rsync put less load on the dump servers? From my point of view, we don't care much about rsync vs curl.

In T222349#5753559, @Gehel wrote:

Would rsync put less load on the dump servers? From my point of view, we don't care much about rsync vs curl.

The thing about rsync is that we could set up a job to update just the dumps required at the interval wanted, with a single connection open. I'd feel a lot better about high bandwidth limits for something like that than generally opening up web service.

In the case of WDQS, we don't really have a schedule. It's an on demand requirement, whenever we need to do a data reload, which could happen for a number of reasons, and on different servers depending on the need.

I don't think we should spend much more time on this. We have a reasonable solution by using an external mirror. It just felt a bit weird to use an external mirror when we are the source of those dumps, but as it works and does not seem to be shocking for anyone, let's keep it at that!

Closing this as it seems that we have the rate limiting that we want at this point. We'll continue relying on external mirrors for the time being.

dcausse mentioned this in T128874: How can downloaders get good bandwidth with no impact on dumps production?. Jul 23 2020, 1:02 PM

dcausse mentioned this in T258709: Add nfs mount point to labstore for wdqs servers (wikibase dumps access). Jul 23 2020, 2:10 PM

I'm re-opening this as a follow up from a chat in this CR.
I think that we should find a solution for this, as I find it not ideal that we have to rely on an external source for our own data. It also raises some concerns:

the total size of the 3 files to download is over 100GB
with the external URL I guess you have to use the HTTP proxies, adding unnecessary strain there
the integrity of those files should be verified against a checksum coming from our internal and authoritative dumps, and this doesn't seem to be the case AFAICT
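
The integrity check in the last point could look roughly like the sketch below. It assumes the authoritative checksum file uses the common `sha1sum`-style layout of one `<hex digest>  <filename>` pair per line; that layout is an assumption here, not something confirmed in this task. The function names are illustrative and are not taken from the reload cookbook.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha1", chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so a multi-GB dump never sits in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_digest(checksum_listing: str, filename: str) -> str:
    """Look up filename in a '<hex digest>  <filename>' listing (assumed format)."""
    for line in checksum_listing.splitlines():
        parts = line.split(None, 1)
        # sha1sum marks binary-mode entries with a leading '*' on the name.
        if len(parts) == 2 and parts[1].strip().lstrip("*") == filename:
            return parts[0].lower()
    raise KeyError(f"{filename} not present in checksum listing")


def verify_dump(path: Path, checksum_listing: str, algorithm: str = "sha1") -> bool:
    """True when the local file matches the digest listed for its file name."""
    return file_digest(path, algorithm) == expected_digest(checksum_listing, path.name)
```

The point of the check is that the listing comes from the internal dumps host while the large file comes from the mirror, so a mismatch means the mirrored copy should be discarded and fetched again.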

Some alternatives that are worth investigating:

Fix the rate-limiting issue for internal clients only. The current dumps host has a 10G NIC, so it shouldn't be a networking problem; I'm not sure about the disk side of it.
Evaluate rsync for the transfer. For example, we could have a slow rsync that periodically copies only the required files to another host, and then have the cookbook pick them up quickly from this other location (using either rsync or curl at that point).
Evaluate transfer.py for this use case (see https://wikitech.wikimedia.org/wiki/Transfer.py)
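
As an illustration of the slow-rsync alternative, here is a minimal sketch of a staging job that copies one file over a single, bandwidth-capped connection. The host name, rsync module, and paths are hypothetical placeholders, not real endpoints. The rsync options used (`--archive`, `--partial`, `--bwlimit`) are standard; a bare `--bwlimit` number is read by rsync in units of 1024 bytes per second.

```python
import shlex
import subprocess


def build_rsync_command(source: str, destination: str, bwlimit_kib_per_s: int) -> list[str]:
    """Build an rsync invocation with a per-transfer bandwidth cap."""
    if bwlimit_kib_per_s <= 0:
        raise ValueError("bandwidth limit must be positive")
    return [
        "rsync",
        "--archive",                        # preserve times and permissions
        "--partial",                        # keep partial files so a retry can resume
        f"--bwlimit={bwlimit_kib_per_s}",   # cap in KiB/s
        source,
        destination,
    ]


def stage_dump(source: str, destination: str, bwlimit_kib_per_s: int) -> None:
    """Run the capped copy; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(source, destination, bwlimit_kib_per_s), check=True)


if __name__ == "__main__":
    # Hypothetical source and staging paths, for illustration only.
    command = build_rsync_command(
        "rsync://dumps.example.internal/dumps/wikidata/latest-all.ttl.bz2",
        "/srv/staging/wikidata/",
        5120,
    )
    print(shlex.join(command))
```

A job like this could run on a timer at a low cap, so that an on-demand reload reads the file from the staging host rather than from the dumps web service.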

Adding @ayounsi and @cmooney if they have any comment on the network side of it.

Restricted Application added a project: [DEPRECATED] wdwb-tech. Dec 16 2021, 10:32 AM

Note that the checksum files for those dumps are available for download as well, since they are provided along with the main dump output files to all mirrors.

Someone from WMCS will probably need to look at this (again) if the discussion is being re-opened. They should have insight into the impact on existing services from any change.

Gehel moved this task from Incoming to Watching / Waiting on the Wikidata-Query-Service board. Dec 20 2021, 4:16 PM

Gehel moved this task from Watching / Waiting to Operations/SRE on the Wikidata-Query-Service board. Jan 13 2022, 3:06 PM

Gehel removed Gehel as the assignee of this task. Aug 22 2022, 6:35 PM

EBernhardson mentioned this in T161863: Support searching for external links in CirrusSearch. Sep 15 2022, 7:11 PM

EBernhardson mentioned this in T316236: Reload WCQS from dumps. Sep 15 2022, 7:17 PM

Change 832543 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):