Very slow downloads from Wikimedia sites in eqiad on Wikimania hotel network
Closed, Resolved · Public

Description

On the Wikimania hotel wifi ('hhonors' network) I'm seeing very, VERY slow downloads from en.wikipedia.org, upload.wikimedia.org, phabricator.wikimedia.org, ogvjs-testing.wmflabs.org, and every other Wikimedia site I've tested that's served from eqiad.

Downloads are on the order of 20-30 kbytes/sec, whereas uploads run at a full 500-600 kbytes/sec (the same rate I get when downloading from everything else).

I see no such slowdown when hitting ulsfo and esams load balancers manually from this network:

# slow:
wget --no-check-certificate --header='Host: en.wikipedia.org' 'https://text-lb.eqiad.wikimedia.org/wiki/Mexico'

# fast:
wget --no-check-certificate --header='Host: en.wikipedia.org' 'https://text-lb.ulsfo.wikimedia.org/wiki/Mexico'
wget --no-check-certificate --header='Host: en.wikipedia.org' 'https://text-lb.esams.wikimedia.org/wiki/Mexico'
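To put numbers on the slow/fast comparison above, here is a rough sketch that measures average download speed per load balancer. It assumes curl is available and uses it in place of wget (-k mirrors --no-check-certificate); the helper names lb_url, to_kbytes, measure_kbps and compare_sites are made up for illustration and are not part of any Wikimedia tooling.

```shell
#!/bin/sh
# Build the per-datacenter text-lb URL.
# $1 = site code (eqiad, ulsfo, esams); $2 = path (defaults to the article used above).
lb_url() {
    printf 'https://text-lb.%s.wikimedia.org%s\n' "$1" "${2:-/wiki/Mexico}"
}

# Convert bytes/sec (possibly fractional, e.g. "25600.000") to whole kbytes/sec.
to_kbytes() {
    awk -v b="$1" 'BEGIN { printf "%d\n", b / 1000 }'
}

# Fetch the page from one site, discard the body, print average kbytes/sec.
# curl's %{speed_download} write-out variable reports average bytes per second.
measure_kbps() {
    site=$1
    bytes_per_sec=$(curl -k -s -o /dev/null \
        -H 'Host: en.wikipedia.org' \
        -w '%{speed_download}' "$(lb_url "$site")") || return 1
    to_kbytes "$bytes_per_sec"
}

# Print one "site<TAB>speed" line per site code given as an argument.
compare_sites() {
    for site in "$@"; do
        printf '%s\t%s kbytes/sec\n' "$site" "$(measure_kbps "$site")"
    done
}
```

Running `compare_sites eqiad ulsfo esams` from the affected network should print one line per site. A single small page is a noisy sample, so repeating the run or pointing it at a larger file would give a steadier figure.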

Traceroute (output from 'mtr en.wikipedia.org'):

                                    My traceroute  [v0.86]
Orac.local (0.0.0.0)                                                   Wed Jul 15 20:39:37 2015
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                       Packets               Pings
 Host                                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 172.20.1.1                                        0.0%    40    1.0   1.2   0.9   1.7   0.0
 2. 33.6.149.201.in-addr.arpa                         0.0%    39    7.1  10.3   6.4  19.1   3.2
 3. 49.73.66.200.in-addr.arpa                         0.0%    39    1.9   2.7   1.5  21.2   3.1
 4. 1.11.149.201.in-addr.arpa                         0.0%    39    1.8   2.3   1.6   8.9   1.1
 5. 102.66.52.200.in-addr.arpa                        0.0%    39    1.4   3.8   1.3  33.6   6.6
 6. 73.88.149.201.in-addr.arpa                        0.0%    39    1.7   2.1   1.4  11.8   1.5
 7. te0-0-0-7.rcr21.mex02.atlas.cogentco.com          0.0%    39    1.9   2.3   1.8   3.6   0.2
 8. be2082.rcr21.mex01.atlas.cogentco.com             0.0%    39    3.8   3.1   2.4   6.8   0.8
 9. be2445.rcr21.mfe01.atlas.cogentco.com             0.0%    39   15.9  16.2  15.4  17.9   0.3
10. be2421.ccr22.iah01.atlas.cogentco.com             0.0%    39   23.7  24.2  23.6  26.3   0.3
11. be2443.ccr22.dfw01.atlas.cogentco.com             0.0%    39   30.8  29.4  29.0  32.8   0.6
12. be2032.ccr21.dfw03.atlas.cogentco.com             0.0%    39   29.3  29.9  29.1  34.9   1.1
13. telia.dfw03.atlas.cogentco.com                    0.0%    39   57.4  57.8  57.3  59.7   0.2
14. ash-bb4-link.telia.net                            0.0%    39   86.9  87.4  86.8  89.7   0.5
15. ash-b1-link.telia.net                             0.0%    39   85.3  85.4  84.9  86.3   0.0
16. wikimedia-ic-308845-ash-b2.c.telia.net            0.0%    39  107.2 113.4  88.7 192.9  16.9
17. text-lb.eqiad.wikimedia.org                       0.0%    39  126.8 129.2 108.9 135.3   5.4

Event Timeline

brooke raised the priority of this task to Needs Triage.
brooke updated the task description.
brooke subscribed.
Restricted Application added a subscriber: Aklapper.
Legoktm set Security to None.

Back in the hacking space after 10pm I see up to 100-200 kbytes/sec on eqiad, better than from upstairs but still much slower than it should be. On the Hackathon network, ulsfo pumps out up to 2 MBytes/sec whereas eqiad is still usually under 200 kBytes/sec.

I'm also seeing better performance on the Mexico article (on text-lb), while downloads from upload-lb (upload.wikimedia.org) are still much slower:

# eqiad slow; sustained at 150-200 kBytes/sec on Hackathon network
wget --no-check-certificate --header='Host: upload.wikimedia.org' 'https://upload-lb.eqiad.wikimedia.org/wikipedia/commons/transcoded/b/b7/How_Open_Access_Empowered_a_16-Year-Old_to_Make_Cancer_Breakthrough.ogv/How_Open_Access_Empowered_a_16-Year-Old_to_Make_Cancer_Breakthrough.ogv.360p.ogv'

# ulsfo fast; up to 2 MBytes/sec on Hackathon network
wget --no-check-certificate --header='Host: upload.wikimedia.org' 'https://upload-lb.ulsfo.wikimedia.org/wikipedia/commons/transcoded/b/b7/How_Open_Access_Empowered_a_16-Year-Old_to_Make_Cancer_Breakthrough.ogv/How_Open_Access_Empowered_a_16-Year-Old_to_Make_Cancer_Breakthrough.ogv.360p.ogv'

CC'ing faidon for on-site network investigation if possible :D

Note that the return route from eqiad to the hotel ISP goes through GTT, not through Telia/Cogent as the upstream route does.

mtr 201.149.6.36 from a wmflabs instance:

                                  My traceroute  [v0.85]
ogvjs-testing (0.0.0.0)                                          Thu Jul 16 14:10:34 2015
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                 Packets               Pings
 Host                                          Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 10.68.16.1                                  0.0%    13    0.2   0.2   0.2   0.3   0.0
 2. ae2-1118.cr2-eqiad.wikimedia.org            0.0%    13    3.6   1.1   0.4   5.4   1.4
 3. xe-1-2-0.was10.ip4.gtt.net                  0.0%    13    0.4   0.6   0.4   2.3   0.4
 4. xe-7-2-2.was14.ip4.gtt.net                  0.0%    13    0.4   0.6   0.4   1.0   0.0
 5. 213.200.84.122                              0.0%    13    0.7   1.1   0.5   6.0   1.5
 6. 64.213.104.42                               0.0%    13   63.7  64.1  63.6  67.6   1.0
 7. 74.88.149.201.in-addr.arpa                  0.0%    13   64.1  64.1  63.9  64.5   0.0
 8. 109.66.52.200.in-addr.arpa                  0.0%    13   64.3  64.3  64.1  64.8   0.0
 9. 60.89.52.200.in-addr.arpa                   0.0%    12   64.7  65.6  64.3  73.2   2.4
10. 28.89.52.200.in-addr.arpa                   0.0%    12   64.4  66.0  64.3  71.0   2.4
11. 50.73.66.200.in-addr.arpa                   0.0%    12   70.0  71.1  69.5  79.7   2.7
12. 36.6.149.201.in-addr.arpa                   0.0%    12   64.3  64.3  64.1  64.9   0.0

faidon claimed this task.

The Wikimania network is behind AS 14178 (MEGACABLE). For 14178, we're seeing paths via two major networks on their side: 3549 (GBLX/Level3) and 174 (Cogent), but mostly the former, as the latter is prepended twice.

For routing to 3549, we have routes via 3257 (GTT), 1299 (Telia), and 2914 (NTT), all of equivalent length, from both eqiad & ulsfo; we were picking 3257 semi-randomly. From the report above and my own experience, the path via 3257 seemed to have congestion issues, especially apparent at peak times.

I downprefed the route to 14178 via 3257 at eqiad, effectively making the best path go via 1299, and this seemed to have an immediate effect. Subsequently, I downprefed all of 3549 via 3257, as this is likely a broader issue.
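
As a rough illustration of why downpreffing changes the selected path, the sketch below models only the first two steps of BGP best-path selection: higher local preference wins, then the shorter AS path. It assumes "downpref" means lowering local preference. The values 100 and 90, the three-hop AS paths, and the best_path helper are illustrative assumptions, not the actual router configuration. Remaining ties keep the first route seen, which stands in for the later tie-breakers that made the original choice of 3257 "semi-random".

```shell
#!/bin/sh
# Pick the best route from stdin lines of the form "LOCAL_PREF AS AS AS ...".
# Rule 1: higher local preference wins.
# Rule 2: on equal preference, the shorter AS path wins.
# Any remaining tie keeps the first route seen (placeholder for later BGP tie-breakers).
best_path() {
    awk '
        {
            pref = $1 + 0      # force numeric comparison
            len  = NF - 1      # AS-path length; prepended ASNs count as extra hops
            if (NR == 1 || pref > best_pref || (pref == best_pref && len < best_len)) {
                best_pref = pref; best_len = len; best = $0
            }
        }
        END { print best }
    '
}
```

With all three transits at the same preference, `printf '%s\n' '100 3257 3549 14178' '100 1299 3549 14178' '100 2914 3549 14178' | best_path` prints the 3257 route, because nothing distinguishes the candidates. Changing the first line to `90 3257 3549 14178` makes it print `100 1299 3549 14178`, the Telia path.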