
Commons connectivity issue in Hong-Kong
Closed, Declined (Public)

Description

It appears to be impossible to upload videos to Commons from Hong Kong over a standard Internet connection, due to a lack of bandwidth:

"Because Hong Kong is on the opposite side of the planet from the WMF datacenters, we can only eke out an average speed of 50kB/s even on a "fast" connection in Hong Kong."
Wikimania HK organisation team
http://lists.wikimedia.org/pipermail/wikimania-l/2014-January/005466.html

The problem seems to be neither on the server side, nor on the client side, but somewhere in-between.

From a user perspective, this is a problem: it's not possible to upload big content to Commons, to our projects.


Version: wmf-deployment
Severity: major

Details

Reference
bz60283

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 2:51 AM
bzimport set Reference to bz60283.
bzimport added a subscriber: Unknown Object (MLST).

I don't see how this is fixable via a bug report - this feels like something to discuss with ops instead, maybe.

impossible to upload videos to Commons

Are there specific examples available of chunked uploads from that area *failing*? The email only implies that it is slow and takes a while.

(Organizers could also send the videos via post to the datacenter, if all goes wrong.)

I'm not the uploader myself and I don't live in HK. I can't provide other examples.

But whether Commons fails (generates an error) or is too slow, this is IMO just as bad. The consequence is exactly the same: Wikimania videos are on YouTube and not on our own video platform.

This is an availability incident and indeed something to treat with operations.

If Bugzilla is not the right place to track this, what would be a better place?

This isn't a Wikimedia bug. The same problem applies when you try to upload something from *any* domestic connection in Hong Kong to *any* USA-based website which doesn't have its dedicated worldwide network backbone.

Pre-2007 it was impossible to do en.wp vandalism work from the Far East because the connection was too slow to do anything useful. When WMF staff first visited Hong Kong for CWMC 2006 / around Wikimania 2007, the first thing they exclaimed was "I'm surprised how slow it is to load Wikipedia here."

If we had infinite budget, we could build our own network backbone like Google did (hence YouTube was a sensible uploading option when Commons isn't). But sending large files to a foreign country over the internet is in general a luxury that only North America and Western Europe can afford. It is a problem, but not a Wikimedia bug, that we can't do the same between HK and WMF datacenters.

I disagree that this is an unsolvable problem: it needs to be properly debugged and measured. Hong Kong has a world top20 IX according to [[List of Internet exchange points by size]], we're not talking of the moon or an island in the middle of the ocean.

It's surely necessary to attach some traceroute data from various machines/cities to the WMF servers, to start with, because there are several links: http://www.glif.is/publications/maps/GLIF_5-11_AP_2k.jpg (this website via multichill). We need to know which is/are being used and what's the latency.

Then, can someone point out a procedure to measure speed/quality of down/up connections to WMF servers compared to, say, http://ftp.hk.debian.org/debian/ or whichever fast local server there is?
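One possible way to make such a comparison, as a minimal sketch only: time how long it takes to read a stream to the end and report bytes per second. The helper name `measure_throughput` and the 64 KiB read size are assumptions for illustration, and the demonstration runs against an in-memory buffer so that it needs no network.

```python
import io
import time


def measure_throughput(stream, chunk_size=64 * 1024):
    """Read a binary stream to the end; return (total_bytes, bytes_per_second)."""
    total = 0
    start = time.monotonic()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    # Guard against a zero interval on very fast in-memory reads.
    elapsed = max(time.monotonic() - start, 1e-9)
    return total, total / elapsed


# Offline demonstration with an in-memory stream (hypothetical payload).
demo_bytes, demo_rate = measure_throughput(io.BytesIO(b"\0" * 1_000_000))
print(f"read {demo_bytes} bytes at {demo_rate / 1e6:.1f} MB/s")
```

To compare real servers, the same function could be given the response object returned by `urllib.request.urlopen(url)`, once for a file on the local mirror and once for a file of similar size on a Wikimedia server. This only measures the download direction; upload speed would need a separate test.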

While I agree with Deryck that typically HK connections to the US are not speedy, it's not always a lost cause. When I was based there, you could get fast speeds to the US for certain servers/peers. So simply being "in HK" does not doom you to poor speeds. It's possible to find a triangle route that will give you fast uploads, or other upload sites that might be quick. Try Internet Archive first to see if that might be faster?

If you're struggling to upload files and can find somewhere else to upload them to (which is "fast" enough for you), we can pull them over this way.

Certainly testing elsewhere is a good start

Do speed tests give sensible values for various servers around the world? http://www.speedtest.net/ etc

Speedtests: (from my 10Mbps domestic connection in a building where most other households have 100Mbps)

Hong Kong (STC) D 9.4Mbps, U 9.4Mbps, ping 10ms
Tokyo (Alocac) D 9.5Mbps, U 7.5Mbps, ping 67ms
London (Namesco) D 4.9Mbps, U 3.1Mbps, ping 220ms
San Francisco (Unwired) D 9.3Mbps, U 7.3Mbps, ping 195ms
Irvine, CA (Fireline) D 8.2Mbps, U 4.9Mbps, ping 235ms
Melbourne (Telstra) D 9.2Mbps, U 8.9Mbps, ping 171ms

[[User:Tsugiko]] suggested further that:

  • Accessing Wikimedia sites is significantly slower than similar services in the USA;
  • Much faster upload speeds are achieved when he routed all his traffic to Wikimedia sites via a VPN in Japan.

The second point he made would suggest that some improvement is possible by reconfiguring WMF's IP routing tables.
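One way to read the ping figures in the speedtest list, as a rough sketch rather than a diagnosis: a single TCP connection cannot move more than one window of data per round trip, so its throughput is bounded by window / RTT. The 64 KiB window below is an assumption (a window without scaling); real stacks may use much larger windows, and packet loss lowers the achievable rate further.

```python
def max_throughput_bytes_per_s(window_bytes, rtt_seconds):
    """Upper bound for one TCP connection: one window per round trip."""
    return window_bytes / rtt_seconds


def implied_window_bytes(throughput_bytes_per_s, rtt_seconds):
    """Window size that would explain an observed throughput at a given RTT."""
    return throughput_bytes_per_s * rtt_seconds


ASSUMED_WINDOW = 64 * 1024  # assumption: 64 KiB, no window scaling

# Ping times (ms) taken from the speedtest list above.
PINGS_MS = {
    "Hong Kong": 10,
    "Tokyo": 67,
    "Melbourne": 171,
    "San Francisco": 195,
    "London": 220,
    "Irvine, CA": 235,
}

for city, ping_ms in PINGS_MS.items():
    bound = max_throughput_bytes_per_s(ASSUMED_WINDOW, ping_ms / 1000)
    print(f"{city:15s} {ping_ms:4d} ms  ->  at most {bound / 1000:8.0f} kB/s")

# The reported 50 kB/s at roughly 195 ms would correspond to this window:
print(f"implied window: {implied_window_bytes(50_000, 0.195):.0f} bytes")
```

Under these assumptions, an average of 50 kB/s at transpacific latency would correspond to an effective window of only about 10 kB, which would point at loss or a small window on the path rather than at raw link capacity.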

We have multiple 10Gbps links and plenty of surplus capacity with multiple Tier1/2 providers that have excellent connectivity to Hong Kong & Asia in particular (some were explicitly procured with this criteria).

Nevertheless, it could be some transient problem with some specific ISP that we could raise with them or our providers, or alter the path to go from one of our other carriers or to a different datacenter of ours.

However, it's completely impossible for us to do any kind of troubleshooting without any data other than "coming from Hong Kong". Please provide ISP names, IP addresses/networks that experience issues (anonymized will do, or pristine in private if you'd prefer that) as well as pings & traceroutes to the Wikimedia sites in question.

(In reply to comment #2)

But whether Commons fails (generates an error) or is too slow, this is IMO just as bad.

Could use some clarification about this as well. Is there an actual error message? Or is it just so slow you give up? (As Faidon says, we can't do much with "coming from Hong Kong"; we also can't do much with "generates an error".)

Is it a consistent experience or are some times extra bad?

For all of the above (error messages, slow uploads, slow page loads, pings, traceroutes, etc.), please also provide timestamps. (and timezone!)

p.selitskas wrote:

(In reply to comment #8)

We have multiple 10Gbps links and plenty of surplus capacity with multiple Tier1/2 providers that have excellent connectivity to Hong Kong & Asia in particular (some were explicitly procured with this criteria).

Nevertheless, it could be some transient problem with some specific ISP that we could raise with them or our providers, or alter the path to go from one of our other carriers or to a different datacenter of ours.

However, it's completely impossible for us to do any kind of troubleshooting without any data other than "coming from Hong Kong". Please provide ISP names, IP addresses/networks that experience issues (anonymized will do, or pristine in private if you'd prefer that) as well as pings & traceroutes to the Wikimedia sites in question.

Public lg's in Hong Kong show appropriate paths (via Japan), but different providers show different performance (from Hurricane's 150ms to PCCW's 200+ms), not to mention quite unusual jitter.

What's up with Wikimedia's infrastructure in South Korea? Is it still up? If so, is it utilized for HTML caches? What about uploading? (Just a question for the ops: as a user from Europe, does the file go to the European cluster and then get proxied silently to the US when I start uploading?)

(In reply to comment #10)

What's up with Wikimedia's infrastructure in South Korea? Is it still up? If
so, is it utilized for HTML caches?

Gone ages ago.

Pavel, paths to where and which lgs specifically? We peer directly with Hurricane Electric in all locations and we're one hop to PCCW via GTT (which carries it from Hong Kong to San Jose, then SFO).

South Korea is long gone, but we serve Hong Kong via San Francisco (ulsfo) now. As for your question, yes, the file will go to the local caching center and is then proxied silently (by Varnish) to the Ashburn caches and appservers.

(In reply to comment #12)

We peer directly with
Hurricane Electric in all locations and we're one hop to PCCW via GTT (which
carries it from Hong Kong to San Jose, then SFO).

Isn't HE known to be trash? Internet Archive was quite devastated due to HE till they added another carrier.

I won't comment if they or anyone else is "trash" and please let's avoid having this sort of discussion here (or at all -- do you really think it's good for our peering relationships to have public discussions about which vendor/peer X is trash?). This is a bug report, let's stick to the facts: please contribute data (pings, traceroutes, BGP AS paths, looking glasses that show problematic paths, error messages etc.) about the problem so that we can diagnose it and then evaluate solutions to fix/work around it.

p.selitskas wrote:

(In reply to comment #12)

Pavel, paths to where and which lgs specifically? We peer directly with Hurricane Electric in all locations and we're one hop to PCCW via GTT (which carries it from Hong Kong to San Jose, then SFO).

South Korea is long gone, but we serve Hong Kong via San Francisco (ulsfo) now. As for your question, yes, the file will go to the local caching center and is then proxied silently (by Varnish) to the Ashburn caches and appservers.

I checked Hurricane Electric and PCCW (http://lookingglass.pccwglobal.com/). Hurricane Electric peers directly, just as you said:
1 54 ms 69 ms 53 ms 10ge3-1.core1.tyo1.he.net (184.105.222.105)
2 150 ms 150 ms 150 ms 10ge15-2.core1.lax2.he.net (184.105.223.105)
3 168 ms 159 ms 165 ms 10ge9-5.core1.sjc2.he.net (184.105.213.6)
4 160 ms 170 ms 168 ms 10ge5-2.core1.pao1.he.net (72.52.92.69)
5 161 ms 166 ms 171 ms eqix-sv9.wikimedia.org (198.32.176.214)
6 161 ms 164 ms 161 ms text-lb.ulsfo.wikimedia.org (198.35.26.96)

Looking through the PCCW LG, I can see that they route requests to eqiad, although ulsfo would seem to be the better choice for East Asia (this is how they resolve commons.wikimedia.org through their DNS):
traceroute to text-lb.eqiad.wikimedia.org (208.80.154.224), 64 hops max, 44 byte packets
1 bbs-1-250-0-210.on-nets.com (210.0.250.1) 0.373 ms 0.332 ms 0.320 ms
2 10.2.128.31 (10.2.128.31) 0.226 ms 0.228 ms 0.262 ms
3 203.131.243.121 (203.131.243.121) 1.251 ms 1.245 ms 2.749 ms
4 ae-3.r22.tkokhk01.hk.bb.gin.ntt.net (129.250.6.232) 22.838 ms 13.478 ms 1.011 ms
5 ae-12.r22.osakjp02.jp.bb.gin.ntt.net (129.250.6.234) 60.219 ms 58.072 ms 51.183 ms
6 ae-8.r21.osakjp02.jp.bb.gin.ntt.net (129.250.6.193) 46.586 ms 41.789 ms 44.033 ms
7 * * *
8 ae-8.r21.sttlwa01.us.bb.gin.ntt.net (129.250.6.141) 169.446 ms 193.944 ms 192.196 ms
9 ae-5.r21.asbnva02.us.bb.gin.ntt.net (129.250.4.181) 231.650 ms 221.329 ms 231.565 ms
10 ae-2.r04.asbnva02.us.bb.gin.ntt.net (129.250.4.207) 211.152 ms 247.144 ms 248.435 ms
[ * * * repeats infinitely ]

Yes sorry, I shouldn't have saved the comment above. Deryck and others in HK, let people know here or on IRC (e.g. #wikimedia-tech) if you need help posting more information of the kind requested.

(In reply to comment #4)

I disagree that this is an unsolvable problem: it needs to be properly debugged and measured. Hong Kong has a world top20 IX according to [[List of Internet exchange points by size]], we're not talking of the moon or an island in the middle of the ocean.

It's surely necessary to attach some traceroute data from various machines/cities to the WMF servers, to start with, because there are several links: http://www.glif.is/publications/maps/GLIF_5-11_AP_2k.jpg (this website via multichill). We need to know which is/are being used and what's the latency.

Then, can someone point out a procedure to measure speed/quality of down/up connections to WMF servers compared to, say, http://ftp.hk.debian.org/debian/ or whichever fast local server there is?

Hosting a big IX implies almost nothing about the overall infrastructure, nor about the part of it that is relevant from our perspective.

This kind of issue is actually pretty common (I've already experienced a horrible increase in delay from AS3269 towards Equinix), but as a last possibility I wouldn't exclude a BGP hijack.

Vito

p.selitskas wrote:

I'm sorry, that was a traceroute from HGC servers. For the record, the PCCW one is as follows (it also resolves to the Ashburn entry point):
Tracing the route to text-lb.eqiad.wikimedia.org (208.80.154.224)
1 pos6-0.cr02.ams01.pccwbtn.net (63.218.64.102) [MPLS: Label 42 Exp 0] 4 msec
pos1-0-1.cr02.ams01.pccwbtn.net (63.218.64.122) [MPLS: Label 42 Exp 0] 52 msec
pos1-0.cr02.ams01.pccwbtn.net (63.218.64.98) [MPLS: Label 42 Exp 0] 0 msec
2 TenGE12-2.br02.ams01.pccwbtn.net (63.218.64.110) 20 msec 8 msec 0 msec
3 xe-0-1-0.cir1.amsterdam2-nh.nl.xo.net (195.69.145.200) 48 msec 4 msec 0 msec
4 te0-3-4-0.rar3.washington-dc.us.xo.net (207.88.13.198) [AS 2828] 92 msec 100 msec 88 msec
5 ae0d1.cir1.ashburn-va.us.xo.net (207.88.13.65) [AS 2828] 84 msec 84 msec 84 msec

Trace from equinix in HK:

traceroute -A commons.wikimedia.org
1 202.177.199.133 (202.177.199.133) [AS17819/AS10138] 0.472 ms 0.477 ms 0.471 ms
2 ge-1-2-9.gw1502.hk1.ap.equinix.com (27.111.175.187) [AS17819] 0.437 ms 0.443 ms 0.437 ms
3 EQX-0022.gw1.hkg3.asianetcom.net (203.192.153.25) [AS3549] 1.009 ms 1.015 ms 1.011 ms
4 te0-4-0-1.wr1.hkg0.asianetcom.net (61.14.157.80) [AS10026] 5.194 ms 2.794 ms 2.791 ms
5 te0-1-0-3.gw1.lax3.asianetcom.net (202.147.61.193) [AS10026/AS3549/AS1221] 149.940 ms 150.086 ms 150.079 ms
6 be1.gw2.lax3.asianetcom.net (202.147.61.162) [AS10026/AS3549/AS1221] 150.042 ms 150.213 ms 150.277 ms
7 * * *

oscar.vives wrote:

I suggest solutions:

  • Allow uploads from third-party servers. People uploading a video could upload it to a local server, then give the URL to Wikimedia. The Wikimedia server could then put the download in a queue and fetch it, possibly at a very slow rate.
  • Allow uploads with a custom protocol. Maybe a Java plugin could implement an upload mechanism for videos using a connection that can survive low-bandwidth / high-error conditions?

The general idea is "browsers suck for uploading large files".

p.selitskas wrote:

(In reply to comment #20)

I suggest solutions:

  • Allow uploads from third-party servers. People uploading a video could upload it to a local server, then give the URL to Wikimedia. The Wikimedia server could then put the download in a queue and fetch it, possibly at a very slow rate.
  • Allow uploads with a custom protocol. Maybe a Java plugin could implement an upload mechanism for videos using a connection that can survive low-bandwidth / high-error conditions?

The general idea is "browsers suck for uploading large files".

We haven't had any input on chunked uploads yet [already requested by Andre Klapper]. Before we bring heavy artillery like client-side Java onto the battlefield, let's wait for folks from Hong Kong to try the chunked method first.

To users facing the issue and willing to test the feature: go to your settings on Commons, pick the Upload Wizard tab, and enable "Chunked uploads for files over 1MB in Upload Wizard". Then try to upload a large file and tell us how it went.
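For readers unfamiliar with the idea, chunking means sending a file as a series of byte ranges, so that a failed transfer only costs one piece instead of the whole file. The sketch below shows just the splitting step; it is not the Upload Wizard's implementation, and the helper name `iter_chunks` and the 1 MiB default chunk size are assumptions for illustration.

```python
import io


def iter_chunks(stream, chunk_size=1024 * 1024):
    """Yield (offset, data) pairs covering a binary stream in order."""
    offset = 0
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        yield offset, data
        offset += len(data)


# Hypothetical 2500-byte "file" split into 1000-byte pieces.
for offset, data in iter_chunks(io.BytesIO(b"x" * 2500), chunk_size=1000):
    print(f"chunk at offset {offset}: {len(data)} bytes")
```

A client built on this could retry an individual chunk after a timeout and resume from the last acknowledged offset, which is what makes the approach attractive on slow or lossy paths.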

Pavel, thanks a lot for going through the trouble of collecting data via looking glasses, although I'd personally be more interested to get traceroutes from users that experience the issues. I doubt PCCW's backbone has bandwidth troubles with us.

HK wasn't pointed to ulsfo due to lack of data from probes there, but it makes sense from a geographical/network perspective, so I just did that.

That being said, I don't expect huge differences in bandwidth, as the path was going via our carriers to Hong Kong, which I doubt are congested (i.e. if it enters NTT's network in HK, it doesn't matter whether it ends up at our eqiad or ulsfo node, from a bandwidth perspective).

Oh, and finally, the third traceroute you sent was actually from Amsterdam -- first hop is "ams", plus 84ms from HK to Ashburn is 2-3 times the speed of light or something :)
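The sanity check behind that remark can be sketched as follows: take the great-circle distance between the two endpoints and compute the shortest possible round-trip time. The coordinates are approximate, the 2/3 factor for light in fiber is a common rule of thumb, and real cable routes are longer than the great circle, so actual RTTs sit well above these bounds.

```python
import math

EARTH_RADIUS_KM = 6371.0
C_VACUUM_KM_PER_S = 299_792.458
FIBER_FACTOR = 2 / 3  # assumption: light in fiber travels at roughly 2/3 c


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_rtt_ms(distance_km, speed_factor=1.0):
    """Lower bound on round-trip time over a straight path of this length."""
    return 2 * distance_km / (C_VACUUM_KM_PER_S * speed_factor) * 1000


HONG_KONG = (22.32, 114.17)  # approximate coordinates
ASHBURN = (39.04, -77.49)    # approximate coordinates

distance = great_circle_km(*HONG_KONG, *ASHBURN)
print(f"distance:            {distance:8.0f} km")
print(f"min RTT in vacuum:   {min_rtt_ms(distance):8.1f} ms")
print(f"min RTT in fiber:    {min_rtt_ms(distance, FIBER_FACTOR):8.1f} ms")
```

Under these assumptions even the vacuum bound comes out above 84 ms, so a trace showing 84 ms to Ashburn cannot have started in Hong Kong.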

p.selitskas wrote:

(In reply to comment #22)

Pavel, thanks a lot for going through the trouble of collecting data via looking glasses, although I'd personally be more interested to get traceroutes from users that experience the issues. I doubt PCCW's backbone has bandwidth troubles with us.

Yes, they can adjust priority for traffic coming from different consumers, but that would only cause issues if Hong Kong had just a couple of 1Gb links to the West Coast. :)

Oh, and finally, the third traceroute you sent was actually from Amsterdam --
first hop is "ams", plus 84ms from HK to Ashburn is 2-3 times the speed of
light or something :)

Yeah, sorry, my bad.
3 ge-7-0-0.hkg11.ip4.tinet.net (213.254.227.77) 4 msec 0 msec 4 msec
4 xe-11-0-1.was10.ip4.tinet.net (141.136.110.198) 232 msec 232 msec 236 msec

Here's part of a proper trace from PCCW. Nothing special at first glance: it lands in Seattle, although they route traffic through Tinet SpA, and it takes quite a long time (~230ms vs the standard ~150ms). If you look at their backbone network map, you can see that Tinet doesn't have submarine links coming from Asia directly to Washington; they all terminate in California. I don't know why they do it this way; their maps could just be outdated :) A ping result half again as high as normal is not okay, though.

Tinet/GTT is our transit and Hong Kong to Ashburn is definitely not 150ms (it's 160ms+ from HK to the west coast + 70ms coast-to-coast). In fact, if you do an On-net ping from PCCW's lg from HKG to ASH you get 224ms, not that different at all. The path from PCCW to eqiad is suboptimal, though, since we peer directly with PCCW there, so they're probably filtering by mistake (the outbound path is correct and going directly, though). I'll ask them.

This matters very little though, since I switched HK to ulsfo now, and we don't peer with PCCW there (since we can't -- they're not present in PAIX). We have two equivalent AS paths to PCCW from/to ulsfo and TiNet wins, and while I could traffic engineer it to go via NTT, I'd rather not, unless we have confirmed there's an issue and this would help. So far I have zero indications suggesting that; we're not even sure whether the users having issues connect via PCCW.

So, again, I'd like traceroutes from users that *actually* experience an issue, not just traceroute from large carriers' looking glasses. These are not very useful for the purpose of this bug report.

tracepath from here (HKBN domestic):

deryck@Xenon-RC:~$ tracepath commons.wikimedia.org
1: 014198068245.ctinets.com 0.211ms pmtu 1500
1: 014198068001.ctinets.com 2.332ms
1: 014198068001.ctinets.com 2.272ms
2: 061093141033.ctinets.com 2.123ms
3: 014199252121.ctinets.com 13.955ms
4: 014136128054.ctinets.com 3.554ms asymm 5
5: las-bb1-link.telia.net 165.887ms asymm 6
6: ntt-ic-143926-las-bb1.c.telia.net 182.466ms asymm 7
7: ae-6.r21.lsanca03.us.bb.gin.ntt.net 184.130ms asymm 8
8: no reply
9: ae-1.r06.snjsca04.us.bb.gin.ntt.net 184.650ms asymm 7
10: no reply
11: no reply
12: no reply
13: no reply

[seems to loop indefinitely]

As suggested by jeremyb: (done via HKBN domestic, same as above)

deryck@Xenon-RC:~$ for dc in ulsfo eqiad esams; do echo ${dc}:; mtr -rc 10 text-lb.${dc}.wikimedia.org; echo; done
ulsfo:
HOST: Xenon-RC                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 014198068001.ctinets.com   0.0%    10    1.3   1.3   0.8   2.8   0.6
  2.|-- 061093141033.ctinets.com   0.0%    10    0.8   1.0   0.7   2.9   0.7
  3.|-- 014199252121.ctinets.com   0.0%    10   11.0   8.1   3.3  12.8   3.6
  4.|-- 014136128054.ctinets.com   0.0%    10    2.0   1.8   1.6   2.0   0.1
  5.|-- las-bb1-link.telia.net     0.0%    10  180.2 180.3 180.0 181.1   0.3
  6.|-- ntt-ic-151170-las-bb1.c.t  0.0%    10  180.9 180.9 180.7 181.5   0.2
  7.|-- ae-6.r21.lsanca03.us.bb.g  0.0%    10  182.1 186.9 182.1 206.7   9.1
  8.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
  9.|-- ae-1.r06.snjsca04.us.bb.g  0.0%    10  182.2 182.2 181.5 183.0   0.6
 10.|-- xe-0-2-0-10.r06.snjsca04.  0.0%    10  178.4 180.2 177.4 199.2   6.7
 11.|-- text-lb.ulsfo.wikimedia.o 10.0%    10  177.7 178.1 177.6 179.1   0.6

eqiad:
HOST: Xenon-RC                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 014198068001.ctinets.com   0.0%    10    1.0   1.1   0.8   1.7   0.3
  2.|-- 061093141033.ctinets.com   0.0%    10    0.9   2.5   0.7  17.4   5.2
  3.|-- 014199252121.ctinets.com   0.0%    10   10.1   6.2   2.4  10.1   3.1
  4.|-- 014136128058.ctinets.com   0.0%    10    1.8   3.7   1.7  20.2   5.8
  5.|-- las-bb1-link.telia.net     0.0%    10  176.4 176.6 176.4 176.7   0.1
  6.|-- ntt-ic-143926-las-bb1.c.t  0.0%    10  180.7 180.7 180.5 180.9   0.1
  7.|-- ae-6.r21.lsanca03.us.bb.g  0.0%    10  178.5 185.9 178.5 246.6  21.4
  8.|-- ae-2.r20.asbnva02.us.bb.g 10.0%    10  243.3 245.5 242.4 256.5   4.6
  9.|-- ae-1.r04.asbnva02.us.bb.g  0.0%    10  228.7 230.4 227.5 234.3   2.9
 10.|-- xe-0-7-0-8.r04.asbnva02.u 10.0%    10  236.3 236.8 233.8 242.1   3.1
 11.|-- text-lb.eqiad.wikimedia.o 10.0%    10  234.6 237.3 234.0 240.9   2.5

esams:
HOST: Xenon-RC                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 014198068001.ctinets.com   0.0%    10    0.9   1.4   0.9   5.0   1.3
  2.|-- 061093141033.ctinets.com   0.0%    10    1.0   0.8   0.6   1.0   0.1
  3.|-- 014199252121.ctinets.com   0.0%    10   10.4   7.7   3.8  12.5   3.3
  4.|-- 014136128046.ctinets.com  20.0%    10    1.9   1.8   1.6   2.0   0.1
  5.|-- sjo-bb1-link.telia.net     0.0%    10  143.0 142.9 142.7 143.1   0.1
  6.|-- nyk-bb2-link.telia.net     0.0%    10  223.6 241.6 223.6 322.3  37.5
  7.|-- ldn-bb2-link.telia.net     0.0%    10  289.3 289.2 289.1 289.5   0.1
  8.|-- adm-bb4-link.telia.net     0.0%    10  308.9 308.5 308.3 308.9   0.2
  9.|-- adm-b5-link.telia.net     10.0%    10  294.8 294.8 294.7 295.0   0.1
 10.|-- wikimedia-ic-129908-adm-b 10.0%    10  250.1 245.7 244.6 250.1   1.8
 11.|-- text-lb.esams.wikimedia.o 10.0%    10  246.6 244.1 243.3 246.6   1.0
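To compare the three reports without reading every row, the final hop of each can be parsed and ranked by average latency. This is a small sketch that assumes the `mtr -r` line layout shown in the reports above (hop, host, loss, sent, last, avg, best, worst, stdev); the regular expression and helper names are illustrative, not part of mtr.

```python
import re

# One report line: hop number, host, loss (with or without "%"), sent count,
# then last/avg/best/worst/stdev in milliseconds.
HOP_RE = re.compile(
    r"^\s*(?P<hop>\d+)\.\|--\s+(?P<host>\S+)\s+(?P<loss>[\d.]+)%?\s+(?P<snt>\d+)"
    r"\s+(?P<last>[\d.]+)\s+(?P<avg>[\d.]+)\s+(?P<best>[\d.]+)"
    r"\s+(?P<wrst>[\d.]+)\s+(?P<stdev>[\d.]+)\s*$"
)


def parse_hop(line):
    """Return a dict for one hop line, or None if the line is not a hop."""
    m = HOP_RE.match(line)
    if m is None:
        return None
    return {
        "hop": int(m["hop"]),
        "host": m["host"],
        "loss_pct": float(m["loss"]),
        "avg_ms": float(m["avg"]),
    }


# Final hops copied from the three reports above.
FINAL_HOPS = {
    "ulsfo": "11.|-- text-lb.ulsfo.wikimedia.o 10.0% 10 177.7 178.1 177.6 179.1 0.6",
    "eqiad": "11.|-- text-lb.eqiad.wikimedia.o 10.0% 10 234.6 237.3 234.0 240.9 2.5",
    "esams": "11.|-- text-lb.esams.wikimedia.o 10.0% 10 246.6 244.1 243.3 246.6 1.0",
}


def closest_site(final_hops):
    """Name of the site whose final hop has the lowest average latency."""
    parsed = {site: parse_hop(line) for site, line in final_hops.items()}
    return min(parsed, key=lambda site: parsed[site]["avg_ms"])


for site, line in FINAL_HOPS.items():
    hop = parse_hop(line)
    print(f"{site}: avg {hop['avg_ms']} ms, loss {hop['loss_pct']}%")
print("lowest average latency:", closest_site(FINAL_HOPS))
```

With only ten probes per run, the 10% loss figures correspond to a single lost packet each, so they say little on their own; longer runs would be needed to tell real loss from rate-limited ICMP replies.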

Some feedback on this? or do you need more information from the field?

(In reply to Kelson [Emmanuel Engelhart] from comment #27)

Some feedback on this? or do you need more information from the field?

comment 25 (and 26 too) was a trace from a user's home connection.

I focused too much on the quality of the trace tool and didn't ask if that was actually a connection with demonstrated slow uploads.

Deryck, have you seen slow uploads recently?

I heard back from PCCW; the asymmetric peering in Ashburn is apparently per their policy so it won't change. It shouldn't matter for the purposes of this bug or at all, though. We also peered with HKBN directly since then, although this shouldn't matter much either, since our transits are far from being congested. Nevertheless, it'd be interesting to hear if Deryck still has issues (if ever).

Other than that, nothing else I can comment on. Yes, I'd like more information from the field in the sense that I'd like to hear from users that actually experience the issue and gather information from them, as stated above (name of the ISP, traceroutes).

(In reply to Faidon Liambotis from comment #29)

it'd be interesting to hear if Deryck still has issues (if ever).

Deryck: ping?

(In reply to Andre Klapper from comment #30)

(In reply to Faidon Liambotis from comment #29)

it'd be interesting to hear if Deryck still has issues (if ever).

Deryck: ping?

Haven't tried big uploads since February, sorry! Also I spent half a year in northeastern China (aka Manchuria) which had really really slow internet so Hong Kong suddenly felt very fast when I returned home.

Closing as WORKSFORME for the time being, as there isn't enough information to act here (see comment 29).
Anybody please feel free to reopen once information is available.