
Sometimes (at peak usage?), dumps.wikimedia.org becomes very slow for users (sometimes unresponsive)
Closed, Declined · Public

Description

I was trying to download the latest MediaWiki installer from http://dumps.wikimedia.org/mediawiki/1.20/mediawiki-1.20.2.tar.gz and it was going super-slow (about 25 KB/sec). My connection seems to be fine.

When I tried to debug the issue, at one point http://dumps.wikimedia.org/mediawiki/1.20/mediawiki-1.20.2.tar.gz threw a 403 Forbidden error. Something seems a bit off.


Version: unspecified
Severity: major

Details

Reference
bz43647

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:21 AM
bzimport set Reference to bz43647.

MZMcBride asked me to test it; using wget, my average speed was 47.3 KB/s and the download took a total of 6m 18s.

<carl-m> I am getting really slow download speeds from dumps.wikimedia.org to the toolserver - slow like 68KB/sec . Is that expected?

Right now it's very fast: it saturates my home connection and reaches about 10 MiB/s on Toolserver, so I'm lowering the priority.
At the time this bug was filed, and for several days afterwards, people from several different places had super-slow downloads, as reported; Ariel didn't see any obvious network problem, and there were people happily downloading at 5 MiB/s.
We also suspected the early-January sitenotice on es.wiki driving people to download Kiwix's ZIM files mirrored on dataset2, but the network graphs don't show any big peak (plus, Kelson comments that his personal server sees about the same traffic, just for Kiwix): http://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&c=Miscellaneous+pmtpa&h=dataset2.wikimedia.org&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS
However, at the moment CPU nice is 6 times the yearly average (7.92 vs. 1.33) and CPU wio is 3 times the average (9.92 vs. 3.61).

Danny_B> dumps site is slow... downloading small rss file takes 10 secs... :-/

Network activity for dataset2 is only 30 MB/s now vs. a 50 MB/s average; downloads of big files start at around 10 KB/s for a while and eventually reach something like 200-300 KB/s on Toolserver.

(In reply to comment #3)

However, at the moment CPU nice is 6 times the yearly average (7.92 vs. 1.33)
and CPU wio is 3 times the average (9.92 vs. 3.61).

This seems to vary with little effect on download speed.
The only other change I can see in ganglia is that, about the time this bug was filed, ~2 TB were freed on disk. http://ganglia.wikimedia.org/latest/graph.php?c=Miscellaneous%20pmtpa&h=dataset2.wikimedia.org&v=7317.328&m=disk_free&r=year&z=default&jr=&js=&st=1361113707&vl=GB&ti=Disk%20Space%20Available

From the weekly graph it would seem that something managed to pull 60 MB/s about a week ago, which matches the observation that, while downloads are very slow for most people, some still manage to download very quickly.

Nemo_bis reopened this task as Open (edited). Feb 12 2015, 7:49 PM

Happening again.

Henrik Abelsson wrote:

I don't know, but I get similar speeds both from the (colocated) server running stats.grok.se and from my laptop at home. Just for testing, I tried downloading an Ubuntu ISO and got 50 MB/sec on the server.

Jhs (also from Scandinavia) reported similar issues and speeds some weeks ago for https://dumps.wikimedia.org/wikidatawiki/20150113/wikidatawiki-20150113-pages-articles.xml.bz2 (which, I noticed, was slower to download than other dumps for me too).

I confirm curl https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-02/pagecounts-20150210-170000.gz > /dev/null is very slow for me from Italy (80 KB/s) and Finland (250 KB/s).

After we moved to the nginx web server on the dumps host, I let the bandwidth and connection caps expire. Sure enough, some folks, unwittingly or not, took advantage of that, so I reinstated them but was too aggressive about the bandwidth limits. You should see better behavior now. The caps are a little lower than they were, but the connection limit is up to 3 per IP from the previous 2.

I still get ~100 KB/sec with the current limits. At that speed, downloading an hour's worth of views takes ~15 minutes, so downloading a day's worth takes ~6 hours. Normally that's not a problem, since it's done fairly continuously, but catching up stats.grok.se after the collection has failed for a week will now take a few days, which is perhaps a bit unfortunate - I'm getting requests. :)
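
For reference, a minimal sketch of the arithmetic behind that estimate; the ~90 MB hourly file size is an assumption implied by "~15 minutes at ~100 KB/sec" (actual pagecounts files vary):

rate_kb_s = 100          # observed capped download speed (KB/sec)
hourly_file_mb = 90      # assumed size of one hourly pagecounts file (MB)

minutes_per_file = hourly_file_mb * 1024 / rate_kb_s / 60   # ~15 min
hours_per_day = 24 * minutes_per_file / 60                  # ~6 h
days_for_week_backlog = 7 * hours_per_day / 24              # ~1.8 days

print(f"{minutes_per_file:.0f} min per file, {hours_per_day:.1f} h per day, "
      f"{days_for_week_backlog:.1f} days for a week's backlog")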

Can you go with three connections all downloading and see how that is? Or is that already with multiple connections?

No, about 100KB/sec is with a single connection. I can code up a parallel downloader to try, but fortunately there's no pressing need at the moment: it's caught up now.

Now, with three parallel downloads, the speed is about 1.3 MB/s for each of them, from a server in Finland.
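
For reference, a minimal sketch of the kind of parallel downloader mentioned above (not abelsson's actual script), assuming a batch of three hourly pagecounts files (the file names are illustrative) and staying within the three-connections-per-IP limit:

import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE = "https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-02/"
FILES = [  # illustrative batch of hourly files; adjust as needed
    "pagecounts-20150210-170000.gz",
    "pagecounts-20150210-180000.gz",
    "pagecounts-20150210-190000.gz",
]

def fetch(name):
    """Stream one file to disk and return its name when done."""
    with urllib.request.urlopen(BASE + name) as resp, open(name, "wb") as out:
        shutil.copyfileobj(resp, out)
    return name

# Three workers to stay within the per-IP connection limit mentioned above.
with ThreadPoolExecutor(max_workers=3) as pool:
    for done in pool.map(fetch, FILES):
        print("finished", done)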

Disk I/O is saturated:

root@dataset1001:~# iostat -h -m -t -x -d 5 dm-0
Linux 3.2.0-75-generic (dataset1001) 	02/16/2015 	_x86_64_	(4 CPU)

02/16/2015 09:33:01 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0
                  0.00     0.00  326.21   65.89    35.84     3.57   205.85     2.94    7.50    6.52   12.40   1.75  68.63

02/16/2015 09:33:06 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0
                  0.00     0.00   45.60  708.40     5.49    87.69   253.08   165.71  226.03  475.91  209.95   1.33 100.00

02/16/2015 09:33:11 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0
                  0.00     0.00   47.00 1231.60     5.49   152.35   252.81   167.46  128.40  398.01  118.11   0.78 100.00
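
As a side note, a rough sketch (not part of the troubleshooting above) of how one might watch for this kind of saturation by piping iostat -x -d 5 output through a small filter; it assumes extended output where %util is the last column, and the 95% threshold is an arbitrary choice:

import sys

pending_dev = None   # some outputs wrap a device name onto its own line
for line in sys.stdin:
    fields = line.split()
    if not fields:
        continue
    if len(fields) == 1:              # bare device name; stats follow on the next line
        pending_dev = fields[0]
        continue
    try:
        util = float(fields[-1])      # %util is the last column of iostat -x rows
    except ValueError:
        pending_dev = None            # header, timestamp or banner line
        continue
    dev = pending_dev or fields[0]
    pending_dev = None
    if util >= 95.0:
        print(f"{dev} saturated: %util = {util:.1f}")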
In T45647#1042316, @ori wrote:

Disk I/O is saturated:

I guess that explains the erratic speeds I saw to a server in the UK: from over 2 MB/s to barely a few KB/s.

reedy@ko-kra:~$ wget https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-02/pagecounts-20150210-170000.gz
--2015-02-16 21:25:46--  https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-02/pagecounts-20150210-170000.gz
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620:0:861:1:208:80:154:11, 208.80.154.11
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620:0:861:1:208:80:154:11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 116432248 (111M) [application/octet-stream]
Saving to: ‘pagecounts-20150210-170000.gz’

100%[==============================================================================================================================>] 116,432,248  850KB/s   in 2m 18s

2015-02-16 21:28:10 (823 KB/s) - ‘pagecounts-20150210-170000.gz’ saved [116432248/116432248]

reedy@ko-kra:~$
In T45647#1042316, @ori wrote:

Disk I/O is saturated:

This is known; this is the reason bandwidth & connection limits were instituted (this was troubleshot on IRC and sadly, not copied here). It was supposed to be a hotfix until we get more resources for it, although I don't remember seeing any hardware request for it. Maybe @ArielGlenn has something?

The speed of downloads from dumps varies considerably depending on the *file* you download, probably due to different disks involved. If some disk/partition happened to be less busy than others, some directory reshuffling could be an easy win.

(This week is probably not very representative because a big sync is ongoing.)

It's an LVM on hardware RAID 6, so files should be spread across the disks. We don't have multiple filesystems/partitions, so there's no shuffling of files to be done.

After scrying through the access logs to check traffic patterns, looking at the behavior before we went to a 10G NIC on dataset1001, and staring some more at iotop, lsof and so on, I think something we could do quickly that should help a lot would be to add ms1001.wikimedia.org as a host that responds to dumps/download.wm.org, via round robin or so. Its rsync is a bit behind, so I'd need to fix that up, make sure it gets an IPv6 address, and so on. It's got comparable disks (but more of them), more memory than dataset1001, and more cores too.

Nemo, what sort of variance are you seeing on file downloads? If you try the same file a few times in a row do you see a change then as well or no?

Note that the kiwix rsync finished up but the ms1001 rsync will throw things off again, though it, like the kiwix rsync, is bw capped.

The ms1001 rsync should kick off in about an hour; I'll give an ETA for the finishing time about an hour after that. Not more than a couple of days, I would think.

Nemo, what sort of variance are you seeing on file downloads?

When jhs reported the Wikidata XML dump was very slow, I tried downloading that and other XML dumps; I remember that day the Wikidata XML dump was three times slower than other (latest) XML dumps of similar size. It didn't matter whether I started the downloads in parallel or sequentially, IIRC, but I didn't rigorously test.

Could it be something about a file being accessed by multiple connections at the same time?

Forgot to give an ETA, but the rsync just finished. We're back to regular rsyncs out of cron now. Waiting for bonding of the ethernet interfaces on the host; I need to figure out the best way to pool it with dataset1001 into a cluster of two, and then we'll see how network traffic is at that point.

Nemo, that's pretty weird and I can't think of a good reason we'd see that behavior now and not in the past, as it's always been the case that when certain files are available the hordes come to download at the same time. I'm going to put off looking into it until we have a second machine serving files and we'll see how that does.

It would be nice to have those nginx stats for dumps hosts: https://ganglia.wikimedia.org/latest/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=.*&mreg%5B%5D=.*nginx.*&gtype=line&glegend=show&aggregate=1

Jeff, can you copy the relevant module from frack puppet to operations/puppet, please?

jcrespo raised the priority of this task from Medium to High. Sep 11 2015, 8:20 PM

@ArielGlenn Please look at the new report at T112190, which I think has the same root issues.

Unless something has changed with the throttling, we need more hardware for the dumps to survive the peaks you commented about.

Setting to High, to match the merged ticket.

jcrespo renamed this task from "dumps.wikimedia.org seems super-slow right now" to "At peak usage, dumps.wikimedia.org becomes very slow for users (sometimes unresponsive)". Sep 11 2015, 8:22 PM

Note that the tech press recently ran some articles (e.g. https://thestack.com/cloud/2015/09/09/wikipedia-anne-hathaway-open-source-web-trends-japan-research/ ) about this research paper: http://arxiv.org/pdf/1509.02218v1.pdf . The paper cites our pageview data on dumps.wm.o as the source they're looking at, and encourages others to look at it, because we're offering some pretty unique and timely information that commercial entities find desirable. I wouldn't be surprised if this led to a recent traffic increase on dumps.

I wouldn't be surprised if this led to a recent traffic increase on dumps.

Traffic doesn't look exceptional, though: around 100 MB/s. There were 4 other similar peaks in the last 7 months: https://ganglia.wikimedia.org/latest/?r=month&cs=02%2F01%2F2015+00%3A00&ce=09%2F30%2F2015+00%3A00&m=cpu_report&c=Miscellaneous+eqiad&h=dataset1001.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS

P.S. The new summary is unproven. Most of the time, when we get such reports, there is no peak in usage at all. Moreover, we often find that only some files are slow, not others, so the total usage of the machine can't be the whole story.

Nemo_bis renamed this task from "At peak usage, dumps.wikimedia.org becomes very slow for users (sometimes unresponsive)" to "Sometimes (at peak usage?), dumps.wikimedia.org becomes very slow for users (sometimes unresponsive)". Sep 12 2015, 9:00 AM

Please note that on the original ticket I noted that I do not think this is a network bandwidth problem at all. I think, however, that there is some kind of contention on the server. The timeouts are real, and the semi-outage is real as detected by our own monitoring. I cannot demonstrate, however, that it is caused by too many requests or by some other problem (the contention could actually make accesses lower than average). These are the hourly access statistics for the last few days (a sketch of how such counts can be derived from access logs follows the table):

hour (from XX:00:00 to XX:59:59) - number of accesses - number of errors (includes throttling)

08 Sep
======
00 38433 26969
01 22771 13009
02 15756 5771
03 29913 3921
04 15263 6093
05 25680 5484
06 16974 7060
07 24549 14151
08 23160 13073
09 37329 10163
10 16517 6391
11 13274 4025
12 16530 4784
13 25188 4963
14 16878 8642
15 34661 8134
16 15991 6435
17 15793 4496
18 16069 3249
19 16731 6419
20 21278 12054
21 48623 10543
22 14862 6369
23 11640 3210

09 Sep
======
00 15473 3586
01 12493 3565
02 12929 2713
03 28769 1319
04 16248 6644
05 32255 6701
06 21405 5103
07 13788 3543
08 19608 8688
09 38852 11051
10 22322 11269
11 24678 13581
12 35360 21701
13 56388 34709
14 30369 20657
15 44061 16708
16 12999 3148
17 10707 2059
18 14828 5101
19 11652 1374
20 12173 2192
21 42421 1911
22 11338 1309
23 11801 1907

10 Sep
======
00 18639 5088
01 21393 8016
02 22728 10088
03 33119 3382
04 19633 3182
05 27269 2356
06 12352 1910
07 11028 1606
08 16244 5675
09 45812 12329
10 17476 7385
11 14299 4635
12 17681 4924
13 35240 9811
14 20304 9796
15 35359 6252
16 12473 1984
17 12521 2168
18 13036 3699
19 19477 9713
20 20427 9682
21 42154 2449
22 16039 2019
23 15045 2074

11 Sep
======
00 16692 2668
01 18877 3537
02 18166 3225
03 23961 3374
04 18301 2579
05 29176 2825
06 31005 5002
07 24364 5052
08 24795 5568
09 42619 7841
10 15882 5092
11 12803 4132
12 20636 7989
13 28609 7659
14 20758 9048
15 38373 9939
16 22916 12973
17 26520 16989
18 24820 14283
19 23422 12904
20 15803 6488
21 45504 6155
22 13896 4461
23 14402 5557

12 Sep
======
00 17119 4581
01 13912 5436
02 12724 4719
03 32781 6813
04 16156 7261
05 36609 13492
06 25919 13464
07 18580 5597
08 15651 4023
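
A sketch of how an hourly table like the one above could be produced from a web server access log in common/combined format; the log path, the regex, and the definition of "errors" (any 4xx/5xx status, which would include throttled requests) are assumptions, not necessarily how these numbers were generated:

import re
from collections import Counter

LOG = "/var/log/nginx/access.log"     # hypothetical path
MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), 1)}
# matches e.g. [16/Feb/2015:21:25:46 +0000] "GET /path HTTP/1.1" 200 ...
line_re = re.compile(r'\[(\d+)/(\w+)/(\d+):(\d+):.*?" (\d{3}) ')

requests, errors = Counter(), Counter()
with open(LOG) as log:
    for line in log:
        m = line_re.search(line)
        if not m:
            continue
        day, month, year, hour, status = m.groups()
        key = (int(year), MONTHS[month], int(day), int(hour))
        requests[key] += 1
        if int(status) >= 400:        # count throttled requests (e.g. 403/429/503) as errors
            errors[key] += 1

for key in sorted(requests):
    year, mon, day, hour = key
    print(f"{day:02d}/{mon:02d} {hour:02d}h {requests[key]:6d} {errors[key]:6d}")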

The CPU wait time went up to 40% and the load went up to 10. I think we are not serving requests as fast as we could. At those points, network speed is lower than normal, if there is any at all: https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=dataset1001.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS_|_cpu_|_disk_|_load_|_network_|_process

I'll see if I can correlate the times to server activity to get a lead on this.

I did not see anything strange happening during this month's run, which has now concluded. Can folks on this task let me know if there was a period during which they saw a real slowdown, and if so, when?

Reminder that the monthly fulls are running again; if people could keep an eye out and let me know particular intervals when they have issues, that would help a lot. Thanks!

I've yet to find a file or occasion where average download speed goes over 1.5-2 MB/s... rather painful when trying to do ad hoc analysis.

Is anyone still suffering from this issue? If not, it should be closed.

Given that the hosting setup for this service is different now, this might as well be closed. If folks notice problems in the future they can create a new task.