
http connections to European Bits server often time out for some users since ~2013-05-06
Closed, ResolvedPublic

Description

Author: mr.heat

Description:
The server http://bits.wikimedia.org/ has been extremely slow for two days. Requests almost never return anything; they time out instead. This leaves all MediaWiki projects (including Commons) naked, without any CSS (except for my user CSS).

Maybe a DNS issue?

Is there a DoS going on?

I'm sure this is not an issue on my side because I tested this on different computers using different internet connections. It's the same everywhere.

I'm in Germany. Here is the relevant part of a tracert:

C:\>tracert bits.wikimedia.org
Tracing route to bits-lb.esams.wikimedia.org [91.198.174.233]:
[...]
  8    50 ms    51 ms    52 ms  ge0-1-0-cr0.ixf.de.as6908.net [80.81.192.244]
  9    58 ms    56 ms    59 ms  xe-5-1-0-core0.nknik.nl.as6908.net [62.149.50.42]
 10    54 ms    55 ms    54 ms  xe-0-0-1.cr2-knams.wikimedia.org [78.41.155.38]
 11    57 ms    56 ms    56 ms  bits-lb.esams.wikimedia.org [91.198.174.233]
Trace complete.

I can't explain why the tracert looks so good. Requesting any bits URL in the browser almost always times out.


Version: wmf-deployment
Severity: critical
See Also:
https://rt.wikimedia.org/Ticket/Display.html?id=5118

Details

Reference
bz48257

Event Timeline

bzimport raised the priority of this task to Unbreak Now! (Nov 22 2014, 1:21 AM).
bzimport set Reference to bz48257.
bzimport added a subscriber: Unknown Object (MLST).

It seems completely down at the moment:

Failed to load resource: the server responded with a status of 503 (Service Unavailable)

mr.heat wrote:

Just to let you know: it's much better now, but not solved. Currently it feels like 5% of the requests in the German Wikipedia time out. Nothing happens for a minute, and then a "server does not respond" message is shown. When I try again, it works most of the time. Some edits are lost because of this. Multiple users have reported the same problem.

Raising priority then, adding the 'ops' keyword.

(In reply to comment #2)

Multiple users reported the same problem.

URLs welcome, as I haven't seen anything on the usual Commons forums that I try to follow.

There was an outage on Wednesday, 13:30 - 14:00 UTC, due to a memcached server going offline. "As usual this caused all kinds of cascading failures on other clusters such as Squid/Varnish. When not overloaded, these clusters would only serve cached pages at that point."
That would not cover "since 2 days" but that's what people immediately mentioned when I brought up this bug report in the operations channel.

I'm currently also in Germany and I ran "mtr" on my Linux machine for a while:

My traceroute  [v0.82]
embrace.foo (0.0.0.0)                              Fri May 10 02:59:33 2013
Resolver: Received error response 2. (server failure)

                                              Packets               Pings
Host                                          Loss%    Snt   Last    Avg   Best   Wrst  StDev
 1. fritz.box                                  0.0%    158    1.1    9.1    1.0  606.0   58.0
 2. 217.0.117.142                              0.0%    158   20.1   30.4   19.1  529.5   48.8
 3. 87.186.195.6                               0.0%    158   22.3   31.6   20.4  434.4   39.0
 4. hh-ea4-i.HH.DE.NET.DTAG.DE                 0.6%    158   27.7   41.4   27.0  338.0   35.6
 5. 194.25.208.234                             0.0%    158   30.2   45.8   27.7  1023.   81.4
    80.156.160.242
    80.150.168.162
    80.156.163.126
 6. hbg-bb1-link.telia.net                     0.0%    158   27.4   48.4   27.2  999.9   80.3
    hbg-bb1-link.telia.net
    hbg-bb1-link.telia.net
 7. adm-bb3-link.telia.net                     0.0%    158   33.5   50.3   32.6  1029.  102.1
    adm-bb3-link.telia.net
    adm-bb3-link.telia.net
    adm-bb3-link.telia.net
    adm-bb3-link.telia.net
    adm-bb3-link.telia.net
    adm-bb3-link.telia.net
    adm-bb3-link.telia.net
 8. adm-b5-link.telia.net                      0.0%    158   35.1   51.0   33.9  943.4   92.1
 9. wikimedia-ic-129908-adm-b3.c.telia.net     5.7%    158   37.0   49.0   34.6  846.6   68.5
10. bits.esams.wikimedia.org                   0.0%    158   35.2   47.8   35.2  746.2   69.9

(In reply to comment #0)

The server http://bits.wikimedia.org/ has been extremely slow for two days. Requests almost never return anything; they time out instead. This leaves all MediaWiki projects (including Commons) naked, without any CSS (except for my user CSS).

I would note that your user CSS is also served via bits. Which URLs specifically are timing out, or is it random?

There was an outage on Wednesday, 13:30 - 14:00 UTC, due to a memcached server going offline.

Shouldn't these sorts of things show up in the server admin log...

I already answered in bug 42653 (comments 14 - 17), but I will repeat the key points here. It seems that HTTP on bits-lb.esams.wikimedia.org is broken. The IP itself answers pings, and HTTPS links work fine.

E.g., this works:

This will fail most of the time:

Error is:
curl: (7) Failed to connect to 2620:0:862:ed1a::a: Network is unreachable

Out of curiosity, does
curl -i -4 http://bits.wikimedia.org/
also give you errors?

(In reply to comment #9)

Yes

To clarify, does it give the same error? (It definitely should not.)

To clarify, does it give the same error? (It definitely should not.)

Error message is:
curl -i -4 http://bits.wikimedia.org/
curl: (7) Failed connect to bits.wikimedia.org:80; Connection timed out

When the connection works, the response is pretty much instant. So it is not that the HTTP server is too slow; it either works or it doesn't.

Example response from an HTTP query that worked:
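The works-or-fails pattern described here can be quantified by opening many short TCP connections and counting how many never complete. The following is only a minimal sketch and was not run as part of this report: the helper names `probe` and `failure_rate`, the attempt count, and the 5-second timeout are assumptions chosen for illustration.

```python
import socket
import time


def probe(host, port=80, attempts=20, timeout=5.0, connect=socket.create_connection):
    """Open `attempts` TCP connections and record how each one ended.

    Returns a list of (ok, seconds, error_text) tuples. `connect` is
    injectable so the loop can be exercised without network access.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError as exc:  # timeouts, refusals and unreachable networks
            results.append((False, time.monotonic() - start, str(exc)))
        else:
            conn.close()
            results.append((True, time.monotonic() - start, ""))
    return results


def failure_rate(results):
    """Fraction of attempts that did not produce a connection."""
    if not results:
        return 0.0
    return sum(1 for ok, _, _ in results if not ok) / len(results)
```

Running `probe("bits.wikimedia.org", 80)` and `probe("bits.wikimedia.org", 443)` from an affected network and comparing the two `failure_rate` values would show whether plain HTTP fails more often than HTTPS, and the recorded durations would show whether successful connections really are near-instant.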

HTTP/1.1 200 OK
Server: Apache
Last-Modified: Thu, 12 Aug 2010 16:12:20 GMT
ETag: "b2-48da2a1772100"
Content-Type: text/html
X-Varnish: 1991165982
Via: 1.1 varnish
Content-Length: 178
Accept-Ranges: bytes
Date: Fri, 10 May 2013 06:05:39 GMT
X-Varnish: 3599832084
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: sq67 miss (0), cp3022 miss (0)

<html>
<head><title>bits and pieces</title>

		<meta http-equiv="refresh" content="1;url=http://www.wikimedia.org/" />

</head>
<body>
bits and pieces live here!
</body>
</html>

real 0m0.281s
user 0m0.004s
sys 0m0.004s

mr.heat wrote:

At the moment all Wikimedia projects are kind of dead and unusable because of this. Here are some example URLs that all time out:

http://bits.wikimedia.org/de.wikipedia.org/load.php?debug=false&lang=de&modules=startup&only=scripts&skin=vector&*
http://bits.wikimedia.org/commons.wikimedia.org/load.php?debug=false&lang=de&modules=startup&only=scripts&skin=vector&*
http://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=ext.gadget.ReferenceTooltips%2Ccharinsert%2Ctoolbaralert2%7Cext.wikihiero%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmw.PopUpMediaTransform%7Cskins.vector&only=styles&skin=vector&*

It's like comment #12 said. Some requests to bits.wikimedia.org return immediately, some take a very long time (about 30 seconds), and some never return (they time out).

Again, I'm sitting in Germany.

C:\>tracert bits.wikimedia.org
Tracing route to bits-lb.esams.wikimedia.org [91.198.174.233]:
[...]
  8    53 ms    51 ms    51 ms  ge0-1-0-cr0.ixf.de.as6908.net [80.81.192.244]
  9    58 ms    57 ms    57 ms  xe-5-1-0-core0.nknik.nl.as6908.net [62.149.50.42]
 10    55 ms    55 ms    55 ms  xe-0-0-1.cr2-knams.wikimedia.org [78.41.155.38]
 11    59 ms    55 ms    57 ms  bits-lb.esams.wikimedia.org [91.198.174.233]

mr.heat wrote:

Bits and Meta subdomain requests time out

Here is a screenshot from the Opera Dragonfly debugger. Please note that it's not only bits.wikimedia.org (all URLs that start with load.php); some meta.wikimedia.org URLs time out as well.

Attached:

timeout-2013-05-10.png (1×1 px, 124 KB)

mr.heat wrote:

(In reply to comment #6)

https links are working fine.

Wow, you are right. The problem is immediately solved when I switch from http to https. I guess this is the reason why most of the users can't reproduce my problem.

https://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia#Wikipedia-Server_sterbenslahm

(In reply to comment #16)

And now both http and https have the same problem and are unusable.

That may or may not be related. In Italy, HTTPS has been down for me for about 20 minutes, while HTTP sometimes loads after a long time (with or without styles).

Another easy personal workaround is to switch to the Google DNS servers, so that bits.wikimedia.org resolves to bits-lb.eqiad.wikimedia.org, which works fine. This is one reason why the problem mainly affects users in Europe.
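To check which cluster a given DNS setup hands out, the canonical name behind bits.wikimedia.org can be inspected. This is a minimal sketch under stated assumptions: `site_of` and `resolve_site` are hypothetical helpers, the `<service>.<site>.wikimedia.org` layout is inferred only from the two names mentioned in this report, and `socket.gethostbyname_ex` uses the system resolver and covers IPv4 only.

```python
import socket


def site_of(canonical_name):
    """Return the site label from a name like 'bits-lb.esams.wikimedia.org'.

    Assumes the '<service>.<site>.wikimedia.org' layout seen in this report;
    returns None for names that do not match it.
    """
    labels = canonical_name.rstrip(".").lower().split(".")
    if len(labels) == 4 and labels[-2:] == ["wikimedia", "org"]:
        return labels[1]
    return None


def resolve_site(hostname, resolver=socket.gethostbyname_ex):
    """Resolve `hostname` with the system resolver and report (site, addresses)."""
    canonical, _aliases, addresses = resolver(hostname)
    return site_of(canonical), addresses
```

Calling `resolve_site("bits.wikimedia.org")` before and after changing the DNS server would be expected to show the site label switching between `esams` and `eqiad`, provided the resolver reports the CNAME target as the canonical name.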

The URLs in comment 13 and comment 16 load fine for me in Firefox 18 (same for using http:// instead of https://), no matter how often I try to reload, and I am based in Germany too currently.

I assume you bypass the cache when trying to reload these URLs?
http://en.wikipedia.org/wiki/Wikipedia:Bypass_your_cache

Summarizing the aforementioned VP/forum threads (thanks for the links!):

Since I don't yet see indications that a large number of users in Europe are affected, I'll set this back to "highest" priority and "critical".

mr.heat wrote:

(In reply to comment #19)

Firefox

I'm sure the browser does not matter. I tried both Firefox and Opera.

I assume you bypass the cache when trying to reload these URLs?

Yes, I know that and tried everything. In this case bypassing the browser cache made the problem worse. I tried to do the opposite, forcing the browser to never reload these resources if they are in the cache. But it seems there is no setting to do this. As far as I understand the browser always does a HEAD request to check if the cached resources changed. Some of these HEAD requests timed out.
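For context, browsers typically revalidate a cached resource with a conditional GET rather than a HEAD request, and each such revalidation is a full round trip that can time out like any other request. The sketch below illustrates that mechanism using the ETag and Last-Modified values from the response quoted earlier in this thread. The helper names `build_revalidation` and `is_still_fresh` are hypothetical, and the 10-second timeout is an assumption.

```python
import urllib.error
import urllib.request


def build_revalidation(url, etag=None, last_modified=None):
    """Build a conditional GET that asks whether the cached copy is still valid."""
    request = urllib.request.Request(url, method="GET")
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def is_still_fresh(request, timeout=10.0, opener=urllib.request.urlopen):
    """Return True on 304 Not Modified, False when a new body was sent.

    Timeouts and other network errors propagate to the caller.
    """
    try:
        with opener(request, timeout=timeout) as response:
            return response.status == 304
    except urllib.error.HTTPError as exc:
        if exc.code == 304:  # urllib reports 304 as an HTTPError
            return True
        raise
```

For example, `is_still_fresh(build_revalidation("http://bits.wikimedia.org/", etag='"b2-48da2a1772100"'))` either returns quickly or raises on a timeout, which matches the observation that even cached pages stalled while their resources were being revalidated.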

Currently everything seems to work. Both http and https.

I still think there was an overload, maybe caused by a DoS. We will see if the problem comes back every 24 hours.

We migrated the network in Europe (esams) to a new topology on Friday (May 10th), which probably also explains why this hasn't happened since.