
Problem downloading large files from analytics.wikimedia.org
Closed, Resolved · Public

Description

We have been receiving reports from users, particularly in the Machine-Learning-Team, that downloading files with wget or curl from https://analytics.wikimedia.org has become very unreliable since around Feb 1st 2024.

Here is a paste with some test results.
P56346

A specific test case that does not work reliably from my workstation is:

wget -4 --continue https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/multilingual/20230810110019/model.pkl

That request took four attempts to succeed, as shown here:

--2024-02-06 17:07:46--  https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/multilingual/20230810110019/model.pkl
Resolving analytics.wikimedia.org (analytics.wikimedia.org)... 185.15.59.224
Connecting to analytics.wikimedia.org (analytics.wikimedia.org)|185.15.59.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2649622715 (2.5G)
Saving to: ‘model.pkl’

model.pkl                                             31%[====================================>                                                                                 ] 803.00M  12.2MB/s    in 68s     

2024-02-06 17:08:55 (11.7 MB/s) - Connection closed at byte 842006528. Retrying.

--2024-02-06 17:08:56--  (try: 2)  https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/multilingual/20230810110019/model.pkl
Connecting to analytics.wikimedia.org (analytics.wikimedia.org)|185.15.59.224|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 2649622715 (2.5G), 1807616187 (1.7G) remaining
Saving to: ‘model.pkl’

model.pkl                                             59%[+++++++++++++++++++++++++++++++++++++===============================>                                                 ]   1.46G  12.1MB/s    in 68s     

2024-02-06 17:10:04 (10.2 MB/s) - Connection closed at byte 1566572544. Retrying.

--2024-02-06 17:10:06--  (try: 3)  https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/multilingual/20230810110019/model.pkl
Connecting to analytics.wikimedia.org (analytics.wikimedia.org)|185.15.59.224|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 2649622715 (2.5G), 1083050171 (1.0G) remaining
Saving to: ‘model.pkl’

model.pkl                                             83%[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++============================>                    ]   2.06G  11.5MB/s    in 68s     

2024-02-06 17:11:15 (9.12 MB/s) - Connection closed at byte 2216689664. Retrying.

--2024-02-06 17:11:18--  (try: 4)  https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/multilingual/20230810110019/model.pkl
Connecting to analytics.wikimedia.org (analytics.wikimedia.org)|185.15.59.224|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 2649622715 (2.5G), 432933051 (413M) remaining
Saving to: ‘model.pkl’

model.pkl                                            100%[++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++===================>]   2.47G  12.0MB/s    in 57s     

2024-02-06 17:12:15 (7.23 MB/s) - ‘model.pkl’ saved [2649622715/2649622715]

The first recorded occurrence was 2024-01-31T17:22:01+0100 and it has been happening frequently since then.

The web server behind analytics.wikimedia.org (an-web1001) was upgraded to bullseye on Feb 6th under ticket T349398.
However, this behaviour was observed both before and after the upgrade.

This is causing some inconvenience for the ML team, as they have to retry their downloads.

I have tried a wget from stat1004 to analytics.wikimedia.org and that also required two attempts.

btullis@stat1004:~$ wget https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/multilingual/20230810110019/model.pkl
--2024-02-06 17:16:38--  https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/multilingual/20230810110019/model.pkl
Resolving analytics.wikimedia.org (analytics.wikimedia.org)... 2620:0:861:ed1a::1, 208.80.154.224
Connecting to analytics.wikimedia.org (analytics.wikimedia.org)|2620:0:861:ed1a::1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2649622715 (2.5G)
Saving to: ‘model.pkl’

model.pkl                                             74%[=======================================================================================>                              ]   1.85G  28.3MB/s    in 66s     

2024-02-06 17:17:44 (28.6 MB/s) - Connection closed at byte 1987051520. Retrying.

--2024-02-06 17:17:45--  (try: 2)  https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/multilingual/20230810110019/model.pkl
Connecting to analytics.wikimedia.org (analytics.wikimedia.org)|2620:0:861:ed1a::1|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 2649622715 (2.5G), 662571195 (632M) remaining
Saving to: ‘model.pkl’

model.pkl                                            100%[++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++=============================>]   2.47G  28.6MB/s    in 39s     

2024-02-06 17:18:24 (16.2 MB/s) - ‘model.pkl’ saved [2649622715/2649622715]

Event Timeline

BTullis triaged this task as High priority. Feb 6 2024, 5:20 PM

I have a feeling that this is more related to the CDN than to the webserver behind it.

I tried the following test from stat1004, which is still routed through the CDN. It fails repeatedly after about 75%.

btullis@stat1004:~$ curl -o /dev/null https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/multilingual/20230810110019/model.pkl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 75 2526M   75 1898M    0     0  28.7M      0  0:01:28  0:01:06  0:00:22 28.7M
curl: (18) transfer closed with 658481389 bytes remaining to read

btullis@stat1004:~$ curl -o /dev/null https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/multilingual/20230810110019/model.pkl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 75 2526M   75 1895M    0     0  28.6M      0  0:01:28  0:01:06  0:00:22 28.3M
curl: (18) transfer closed with 662217445 bytes remaining to read

Then I tried the following test, which bypasses the CDN and sends the traffic directly to an-web1001 (10.64.21.14) on its envoyproxy port 8443.

btullis@stat1004:~$ curl -o /dev/null --resolve analytics.wikimedia.org:8443:10.64.21.14 https://analytics.wikimedia.org:8443/published/wmf-ml-models/revertrisk/multilingual/20230810110019/model.pkl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2526M  100 2526M    0     0   111M      0  0:00:22  0:00:22 --:--:--  108M

Notice that the speed is about 108 MB/s (roughly 860 Mbps), which saturates the 1 Gbps port on an-web1001 and is roughly 4 times as fast as going through dyna.wikimedia.org.
However, this request is successful each time, without retries.

I'm going to add the Traffic team to see if they have any insight as to why this might have started happening recently.

I can reproduce via text@drmrs, I'll take a look ASAP :)

@BTullis it's origin related:

vgutierrez@bast6003:~$ curl -o /dev/null --resolve analytics.wikimedia.org:8443:10.64.21.14 --limit-rate 30M https://analytics.wikimedia.org:8443/published/wmf-ml-models/revertrisk/multilingual/20230810110019/model.pkl 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 79 2526M   79 2005M    0     0  30.2M      0  0:01:23  0:01:06  0:00:17 27.3M
curl: (18) transfer closed with 546874085 bytes remaining to read

There seems to be a timeout of around 65 seconds.

@Vgutierrez - Thanks so much. Let me know if I can help with anything.

Yes, the timeout happens reliably at around 65 seconds: the rate-limited direct download gets cut off mid-transfer just like the CDN-routed ones, while the full-speed direct download finishes in about 22 seconds and never hits it. Thanks @Vgutierrez - I'll remove the Traffic tag and focus on envoy.

Change 998345 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Configure analytics.wikimedia.org to support large downloads

https://gerrit.wikimedia.org/r/998345

Thanks again @Vgutierrez - That makes perfect sense now. As per: https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/timeouts#route-timeouts

A route timeout is the amount of time that Envoy will wait for the upstream to respond with a complete response.

Then in more detail from here...

This spans between the point at which the entire downstream request (i.e. end-of-stream) has been processed and when the upstream response has been completely processed. A value of 0 will disable the route’s timeout.

I see other places in puppet where we have disabled this timeout, so I believe that it's reasonable to do so on an-web1001 as well.
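For reference, the setting in question is the per-route timeout field in Envoy's route configuration. Below is a minimal sketch of what disabling it looks like in a generic static Envoy config; this is illustrative only, not the actual puppet change in 998345, and the virtual host and cluster names are placeholders:

route_config:
  virtual_hosts:
    - name: analytics_published            # placeholder virtual host name
      domains: ["analytics.wikimedia.org"]
      routes:
        - match: { prefix: "/" }
          route:
            cluster: an_web_backend        # placeholder upstream cluster name
            timeout: 0s                    # 0 disables the route timeout, so large
                                           # downloads are no longer cut off after ~65s

With the route timeout disabled, a response is allowed to stream for as long as it takes, which is what we want for multi-gigabyte model files served at CDN-limited speeds.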

Change 998345 merged by Btullis:

[operations/puppet@production] Configure analytics.wikimedia.org to support large downloads

https://gerrit.wikimedia.org/r/998345