
Sometimes (at peak usage?), dumps.wikimedia.org becomes very slow for users (sometimes unresponsive)
Closed, Declined · Public

Description

I was trying to download the latest MediaWiki installer from http://dumps.wikimedia.org/mediawiki/1.20/mediawiki-1.20.2.tar.gz and it was going super-slow (about 25 KB/sec). My connection seems to be fine.

When I tried to debug the issue, at one point http://dumps.wikimedia.org/mediawiki/1.20/mediawiki-1.20.2.tar.gz threw a 403 Forbidden error. Something seems a bit off.


Version: unspecified
Severity: major

Details

Reference
bz43647

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:21 AM
bzimport set Reference to bz43647.

MZMcBride asked me to test it; using wget, my average speed was 47.3 KB/s and the download took a total of 6m 18s.

<carl-m> I am getting really slow download speeds from dumps.wikimedia.org to the toolserver - slow like 68KB/sec . Is that expected?

Right now it's very fast: it saturates my home connection and reaches about 10 MiB/s on Toolserver, so I'm lowering the priority.
At the time this bug was filed, and for several days afterwards, people from several different places had super-slow downloads, as reported; Ariel didn't see any obvious network problem, and there were people happily downloading at 5 MiB/s.
We also suspected the early-January sitenotice on es.wiki driving people to download Kiwix's ZIM files mirrored on dataset2, but the network graphs don't show any big peak (plus, Kelson comments that his personal server sees about the same traffic, just for Kiwix): http://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&c=Miscellaneous+pmtpa&h=dataset2.wikimedia.org&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS
However, at the moment CPU nice is 6 times the yearly average (7.92 vs. 1.33) and CPU wio is 3 times the average (9.92 vs. 3.61).

Danny_B> dumps site is slow... downloading small rss file takes 10 secs... :-/

Network activity for dataset2 is only 30 MB/s now vs. a 50 MB/s average; downloads of big files start at around 10 KB/s for a while and eventually reach something like 200-300 KB/s on Toolserver.

(In reply to comment #3)

However, at the moment CPU nice is 6 times the yearly average (7.92 vs. 1.33)
and CPU wio is 3 times the average (9.92 vs. 3.61).

This seems to vary with little effect on download speed.
The only other change I can see in ganglia is that, about the time this bug was filed, ~2 TB were freed on disk. http://ganglia.wikimedia.org/latest/graph.php?c=Miscellaneous%20pmtpa&h=dataset2.wikimedia.org&v=7317.328&m=disk_free&r=year&z=default&jr=&js=&st=1361113707&vl=GB&ti=Disk%20Space%20Available

From the weekly graph it would seem that something managed to pull 60 MB/s about a week ago, which matches the observation that, while downloads are very slow for most people, some still manage to download very quickly.

Nemo_bis reopened this task as Open (edited). Feb 12 2015, 7:49 PM

Happening again.

Henrik Abelsson wrote:

I don't know, but I get similar speeds both from the (colocated) server running stats.grok.se and from my laptop at home. Just for testing, I tried downloading an Ubuntu ISO and got 50 MB/sec on the server.

Jhs (also from Scandinavia) reported similar issues and speeds some weeks ago for https://dumps.wikimedia.org/wikidatawiki/20150113/wikidatawiki-20150113-pages-articles.xml.bz2 (which, I noticed, was slower to download than other dumps for me too).

I confirm curl https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-02/pagecounts-20150210-170000.gz > /dev/null is very slow for me from Italy (80 KB/s) and Finland (250 KB/s).

After we moved to the nginx web server on the dumps host, I let the bandwidth and connection caps expire. Sure enough, some folks, unwittingly or not, took advantage of that, so I reinstated them but was too aggressive about the bandwidth limits. You should see better behavior now. The caps are a little lower than they were, but the connection limit is up to 3 per IP from the previous 2.

I still get ~100 KB/sec with the current limits. At that speed, downloading an hour's worth of views takes ~15 minutes, so downloading a day's worth takes ~6 hours. Normally that's not a problem, since it's done fairly continuously, but catching up stats.grok.se after the collection has failed for a week will now take a few days, which is perhaps a bit unfortunate - I'm getting requests. :)
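
For reference, a minimal sketch of the arithmetic behind that estimate; the ~90 MB hourly file size is an assumption implied by "~15 minutes at ~100 KB/sec" (actual pagecounts files vary):

rate_kb_s = 100          # observed capped download speed (KB/sec)
hourly_file_mb = 90      # assumed size of one hourly pagecounts file (MB)

minutes_per_file = hourly_file_mb * 1024 / rate_kb_s / 60   # ~15 min
hours_per_day = 24 * minutes_per_file / 60                  # ~6 h
days_for_week_backlog = 7 * hours_per_day / 24              # ~1.8 days

print(f"{minutes_per_file:.0f} min per file, {hours_per_day:.1f} h per day, "
      f"{days_for_week_backlog:.1f} days for a week's backlog")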

Can you go with three connections all downloading and see how that is? Or is that already with multiple connections?

No, about 100KB/sec is with a single connection. I can code up a parallel downloader to try, but fortunately there's no pressing need at the moment: it's caught up now.

Now, with three parallel downloads, the speed is about 1.3 MB/s for each of them, from a server in Finland.
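
For reference, a minimal sketch of the kind of parallel downloader mentioned above (not abelsson's actual script), assuming a batch of three hourly pagecounts files (the file names are illustrative) and staying within the three-connections-per-IP limit:

import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE = "https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-02/"
FILES = [  # illustrative batch of hourly files; adjust as needed
    "pagecounts-20150210-170000.gz",
    "pagecounts-20150210-180000.gz",
    "pagecounts-20150210-190000.gz",
]

def fetch(name):
    """Stream one file to disk and return its name when done."""
    with urllib.request.urlopen(BASE + name) as resp, open(name, "wb") as out:
        shutil.copyfileobj(resp, out)
    return name

# Three workers to stay within the per-IP connection limit mentioned above.
with ThreadPoolExecutor(max_workers=3) as pool:
    for done in pool.map(fetch, FILES):
        print("finished", done)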

Disk I/O is saturated:

root@dataset1001:~# iostat -h -m -t -x -d 5 dm-0
Linux 3.2.0-75-generic (dataset1001) 	02/16/2015 	_x86_64_	(4 CPU)

02/16/2015 09:33:01 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0
                  0.00     0.00  326.21   65.89    35.84     3.57   205.85     2.94    7.50    6.52   12.40   1.75  68.63

02/16/2015 09:33:06 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0
                  0.00     0.00   45.60  708.40     5.49    87.69   253.08   165.71  226.03  475.91  209.95   1.33 100.00

02/16/2015 09:33:11 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0
                  0.00     0.00   47.00 1231.60     5.49   152.35   252.81   167.46  128.40  398.01  118.11   0.78 100.00
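
As a side note, a rough sketch (not part of the troubleshooting above) of how one might watch for this kind of saturation by piping iostat -x -d 5 output through a small filter; it assumes extended output where %util is the last column, and the 95% threshold is an arbitrary choice:

import sys

pending_dev = None   # some outputs wrap a device name onto its own line
for line in sys.stdin:
    fields = line.split()
    if not fields:
        continue
    if len(fields) == 1:              # bare device name; stats follow on the next line
        pending_dev = fields[0]
        continue
    try:
        util = float(fields[-1])      # %util is the last column of iostat -x rows
    except ValueError:
        pending_dev = None            # header, timestamp or banner line
        continue
    dev = pending_dev or fields[0]
    pending_dev = None
    if util >= 95.0:
        print(f"{dev} saturated: %util = {util:.1f}")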
In T45647#1042316, @ori wrote:

Disk I/O is saturated:

I guess that explains the erratic speeds I saw to a server in the UK: from over 2 MB/s to barely a few KB/s.

reedy@ko-kra:~$ wget https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-02/pagecounts-20150210-170000.gz
--2015-02-16 21:25:46--  https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-02/pagecounts-20150210-170000.gz
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 2620:0:861:1:208:80:154:11, 208.80.154.11
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|2620:0:861:1:208:80:154:11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 116432248 (111M) [application/octet-stream]
Saving to: ‘pagecounts-20150210-170000.gz’

100%[==============================================================================================================================>] 116,432,248  850KB/s   in 2m 18s

2015-02-16 21:28:10 (823 KB/s) - ‘pagecounts-20150210-170000.gz’ saved [116432248/116432248]

reedy@ko-kra:~$
In T45647#1042316, @ori wrote:

Disk I/O is saturated:

This is known; this is the reason bandwidth & connection limits were instituted (this was troubleshot on IRC and sadly, not copied here). It was supposed to be a hotfix until we get more resources for it, although I don't remember seeing any hardware request for it. Maybe @ArielGlenn has something?

The speed of downloads from dumps varies considerably depending on the *file* you download, probably due to different disks involved. If some disk/partition happened to be less busy than others, some directory reshuffling could be an easy win.

(This week is probably not very representative because a big sync is ongoing.)

It's an LVM on hardware RAID 6, so files should be spread across the disks. We don't have multiple filesystems/partitions, so there's no shuffling of files to be done.

After scrying through the access logs to check traffic patterns, looking at the behavior before we went to a 10G NIC on dataset1001, and staring some more at iotop, lsof and so on, I think something we could do quickly that should help a lot would be to add ms1001.wikimedia.org as a host that responds to dumps/download.wm.org, via round robin or so. Its rsync is a bit behind, so I'd need to fix that up, make sure it gets an IPv6 address, and so on. It's got comparable disks (but more of them), more memory than dataset1001, and more cores too.

Nemo, what sort of variance are you seeing on file downloads? If you try the same file a few times in a row do you see a change then as well or no?

Note that the kiwix rsync finished up but the ms1001 rsync will throw things off again, though it, like the kiwix rsync, is bw capped.

The ms1001 rsync should kick off in about an hour; I'll give an ETA for the finishing time about an hour after that. Not more than a couple of days, I would think.

Nemo, what sort of variance are you seeing on file downloads?

When jhs reported the Wikidata XML dump was very slow, I tried downloading that and other XML dumps; I remember that day the Wikidata XML dump was three times slower than other (latest) XML dumps of similar size. It didn't matter whether I started the downloads in parallel or sequentially, IIRC, but I didn't rigorously test.

Could it be something about a file being accessed by multiple connections at the same time?

Forgot to give an ETA, but the rsync just finished. We're back to regular rsyncs out of cron now. Waiting for bonding of the ethernet interfaces on the host; I need to figure out the best way to pool it with dataset1001 into a cluster of two, and then we'll see how network traffic is at that point.

Nemo, that's pretty weird and I can't think of a good reason we'd see that behavior now and not in the past, as it's always been the case that when certain files are available the hordes come to download at the same time. I'm going to put off looking into it until we have a second machine serving files and we'll see how that does.

It would be nice to have those nginx stats for dumps hosts: https://ganglia.wikimedia.org/latest/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=.*&mreg%5B%5D=.*nginx.*&gtype=line&glegend=show&aggregate=1

Jeff, can you copy the relevant module from frack puppet to operations/puppet, please?

jcrespo raised the priority of this task from Medium to High. Sep 11 2015, 8:20 PM

@ArielGlenn Please look at the new report at T112190, which I think has the same root issues.

Unless something has changed with the throttling, we need more hardware for the dumps to survive the peaks you commented about.

Setting to High, to match the merged ticket.

jcrespo renamed this task from "dumps.wikimedia.org seems super-slow right now" to "At peak usage, dumps.wikimedia.org becomes very slow for users (sometimes unresponsive)". Sep 11 2015, 8:22 PM

Note that the tech press recently ran some articles (e.g. https://thestack.com/cloud/2015/09/09/wikipedia-anne-hathaway-open-source-web-trends-japan-research/ ) about this research paper: http://arxiv.org/pdf/1509.02218v1.pdf . The paper cites our pageview data on dumps.wm.o as the source they're looking at, and encourages others to look at it, because we're offering some pretty unique and timely information that commercial entities find desirable. I wouldn't be surprised if this led to a recent traffic increase on dumps.

I wouldn't be surprised if this led to a recent traffic increase on dumps.

Traffic doesn't look exceptional, though: around 100 MB/s. There were 4 other similar peaks in the last 7 months: https://ganglia.wikimedia.org/latest/?r=month&cs=02%2F01%2F2015+00%3A00&ce=09%2F30%2F2015+00%3A00&m=cpu_report&c=Miscellaneous+eqiad&h=dataset1001.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS

P.S. The new summary is unproven. Most of the time, when we get such reports, there is no peak in usage at all. Moreover, we often find that only some files are slow, not others, so the total usage of the machine can't be the whole story.

Nemo_bis renamed this task from "At peak usage, dumps.wikimedia.org becomes very slow for users (sometimes unresponsive)" to "Sometimes (at peak usage?), dumps.wikimedia.org becomes very slow for users (sometimes unresponsive)". Sep 12 2015, 9:00 AM

Please note that on the original ticket I noted that I do not think this is a network bandwidth problem at all. I think, however, that there is some kind of contention on the server. The timeouts are real, and the semi-outage is real as detected by our own monitoring. I cannot demonstrate, however, that it is caused by too many requests or by some other problem (the contention could actually make accesses lower than average). These are the hourly access statistics for the last few days (a sketch of how such counts can be derived from access logs follows the table):

hour (from XX:00:00 to XX:59:59) - number of accesses - number of errors (includes throttling)

08 Sep
======
00 38433 26969
01 22771 13009
02 15756 5771
03 29913 3921
04 15263 6093
05 25680 5484
06 16974 7060
07 24549 14151
08 23160 13073
09 37329 10163
10 16517 6391
11 13274 4025
12 16530 4784
13 25188 4963
14 16878 8642
15 34661 8134
16 15991 6435
17 15793 4496
18 16069 3249
19 16731 6419
20 21278 12054
21 48623 10543
22 14862 6369
23 11640 3210

09 Sep
======
00 15473 3586
01 12493 3565
02 12929 2713
03 28769 1319
04 16248 6644
05 32255 6701
06 21405 5103
07 13788 3543
08 19608 8688
09 38852 11051
10 22322 11269
11 24678 13581
12 35360 21701
13 56388 34709
14 30369 20657
15 44061 16708
16 12999 3148
17 10707 2059
18 14828 5101
19 11652 1374
20 12173 2192
21 42421 1911
22 11338 1309
23 11801 1907

10 Sep
======
00 18639 5088
01 21393 8016
02 22728 10088
03 33119 3382
04 19633 3182
05 27269 2356
06 12352 1910
07 11028 1606
08 16244 5675
09 45812 12329
10 17476 7385
11 14299 4635
12 17681 4924
13 35240 9811
14 20304 9796
15 35359 6252
16 12473 1984
17 12521 2168
18 13036 3699
19 19477 9713
20 20427 9682
21 42154 2449
22 16039 2019
23 15045 2074

11 Sep
======
00 16692 2668
01 18877 3537
02 18166 3225
03 23961 3374
04 18301 2579
05 29176 2825
06 31005 5002
07 24364 5052
08 24795 5568
09 42619 7841
10 15882 5092
11 12803 4132
12 20636 7989
13 28609 7659
14 20758 9048
15 38373 9939
16 22916 12973
17 26520 16989
18 24820 14283
19 23422 12904
20 15803 6488
21 45504 6155
22 13896 4461
23 14402 5557

12 Sep
======
00 17119 4581
01 13912 5436
02 12724 4719
03 32781 6813
04 16156 7261
05 36609 13492
06 25919 13464
07 18580 5597
08 15651 4023
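
A sketch of how an hourly table like the one above could be produced from a web server access log in common/combined format; the log path, the regex, and the definition of "errors" (any 4xx/5xx status, which would include throttled requests) are assumptions, not necessarily how these numbers were generated:

import re
from collections import Counter

LOG = "/var/log/nginx/access.log"     # hypothetical path
MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), 1)}
# matches e.g. [16/Feb/2015:21:25:46 +0000] "GET /path HTTP/1.1" 200 ...
line_re = re.compile(r'\[(\d+)/(\w+)/(\d+):(\d+):.*?" (\d{3}) ')

requests, errors = Counter(), Counter()
with open(LOG) as log:
    for line in log:
        m = line_re.search(line)
        if not m:
            continue
        day, month, year, hour, status = m.groups()
        key = (int(year), MONTHS[month], int(day), int(hour))
        requests[key] += 1
        if int(status) >= 400:        # count throttled requests (e.g. 403/429/503) as errors
            errors[key] += 1

for key in sorted(requests):
    year, mon, day, hour = key
    print(f"{day:02d}/{mon:02d} {hour:02d}h {requests[key]:6d} {errors[key]:6d}")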

The CPU wait time went up to 40% and the load went up to 10. I think we are not serving requests as fast as we could. At those points, network speed is lower than normal, if there is any at all: https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=dataset1001.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS_|_cpu_|_disk_|_load_|_network_|_process

I'll see if I can correlate the times to server activity to get a lead on this.

I did not see anything strange happening during this month's run, which has now concluded. Can folks on this task let me know if there was a period during which they saw a real slowdown, and if so, when?

Reminder that the monthly fulls are running again; if people could keep an eye out and let me know particular intervals when they have issues, that would help a lot. Thanks!

I've yet to find a file or occasion where average download speed goes over 1.5-2 MB/s... rather painful when trying to do ad hoc analysis.

Is anyone still suffering from this issue? If not, it should be closed.

Given that the hosting setup for this service is different now, this might as well be closed. If folks notice problems in the future they can create a new task.