
HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed
Closed, Resolved, Public

Description

This Icinga check has been flapping since Saturday.

PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed
RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 53 days)
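A rough way to reproduce the failure by hand from the monitoring host (a sketch; the exact Icinga plugin invocation isn't shown here) is to attempt the TLS handshake directly:

# Attempt a TLS handshake and print the certificate validity dates.
# A hang or "SSL connect attempt failed"-style error here corresponds to the
# CRITICAL state above; a clean handshake corresponds to the RECOVERY state.
echo | openssl s_client -connect blog.wikimedia.org:443 -servername blog.wikimedia.org 2>/dev/null \
  | openssl x509 -noout -dates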

From IRC scrollback:

vgutierrez mutante: it looks like there is a backend behind their LB misbehaving
vgutierrez I'm ACKing the check till Monday

So my understanding is that it's an issue on the Automattic side. @Vgutierrez might be able to give more information.

Event Timeline

@Varnent Can you pass this report on to Automattic?

I am on vacation this week - but @EdErhart-WMF is on it. :)

@CDanis @ayounsi @Vgutierrez WordPress VIP has sent back the following queries which I'm not equipped to answer. Could you help?

  • "When exactly did those errors start appearing?
  • "How many occurrences of these have you seen, and when?
  • "Is the problem still ongoing?
  • "Can you please provide a traceroute, mtr, or some form of connection trace from your side. At the very least the source network IP and timestamp."

I've been on vacation since yesterday, so I won't be able to answer everything, but: the errors began appearing intermittently on Saturday. The source IP is the A record that icinga.wikimedia.org resolves to.

  • "When exactly did those errors start appearing?

2019-09-07T03:43:55 UTC

  • "How many occurrences of these have you seen, and when?

About one flap every 5 minutes.

  • "Is the problem still ongoing?

Yes

  • "Can you please provide a traceroute, mtr, or some form of connection trace from your side. At the very least the source network IP and timestamp."
ayounsi@icinga1001:~$ mtr -z -n --report-wide blog.wikimedia.org
Start: Wed Sep 11 15:10:33 2019
HOST: icinga1001               Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS14907  208.80.154.67     0.0%    10    0.7   0.5   0.4   0.8   0.0
  2. AS???    206.126.237.124   0.0%    10    1.5   0.9   0.5   1.7   0.0
  3. AS2635   198.181.119.95    0.0%    10    0.8   0.7   0.5   1.1   0.0
  4. AS???    100.68.10.7       0.0%    10    0.7   0.7   0.5   0.9   0.0
  5. AS2635   192.0.79.33       0.0%    10    0.5   0.5   0.5   0.6   0.0

Source is 208.80.154.84
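For reference, the source address can be cross-checked against DNS (a trivial sketch; the record may of course change over time):

# Confirm that icinga.wikimedia.org still resolves to 208.80.154.84.
dig +short icinga.wikimedia.org A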

Thanks very much, @ayounsi! I've passed this info on to WordPress. Will let y'all know when I hear more!

@CDanis @ayounsi @Vgutierrez Hello all, happy Monday! WordPress is still investigating what's happening, and they're wondering if we have a more detailed error message that we could share with them. I could also try to cc you on the WordPress ticket, if that would be more helpful. Thanks for your time—y'all are fantastic!

I am downtiming the check for a couple of days because it is polluting our alerts.

Mentioned in SAL (#wikimedia-operations) [2019-09-17T04:56:59Z] <effie> Downtiming HTTPS-blog on icing - T232412


@Vgutierrez had a quick look and determined that the TLS handshake sometimes takes more than 6 seconds, which is why our tests fail.
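One hedged way to see those slow handshakes from a shell (not necessarily how it was measured here) is curl's timing write-out variables:

# Print TCP connect time and TLS handshake completion time for 20 attempts.
# time_appconnect is when the TLS handshake finished; values above ~6 s
# would match the timeouts described above.
for i in $(seq 1 20); do
  curl -so /dev/null https://blog.wikimedia.org/ \
    -w 'connect=%{time_connect}s tls=%{time_appconnect}s\n'
done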

I am disabling this alert for another week.

I ran some SSL tests with RIPE Atlas, in case that's useful; of 500 probes in the US, about 10% failed: https://atlas.ripe.net/measurements/22878642/#!probes (sort by validity)

Similarly, of 500 worldwide probes, ~10% failed the handshake: https://atlas.ripe.net/measurements/22878669/#!probes

I'm getting the WordPress replies through a separate thread.

My guess would be that it's an MTU issue, tracked in T232602. The flapping nature of the issue made it harder to troubleshoot, as an MTU problem is usually a hard down. My guess is that in this case the reply's size changes over time, and some of those replies exceed the path MTU.
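One quick way to probe for a path MTU problem from the monitoring host is don't-fragment pings of different sizes (a sketch; 1472 bytes of ICMP payload makes a 1500-byte IP packet):

# Linux ping: -M do sets the don't-fragment bit, -s is the ICMP payload size.
# If full-size packets are dropped somewhere on the path while smaller ones
# get through, that points at an MTU/PMTUD problem rather than plain loss.
ping -M do -s 1472 -c 3 blog.wikimedia.org   # 1500-byte packets
ping -M do -s 1400 -c 3 blog.wikimedia.org   # comfortably below a tunnel-reduced MTU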

It has been green since:

Last State Change: 2019-09-19 18:15:59

That matches when we pushed the ADV-MSS change on our routers: T232602#5507189

It *might* be a sign that WordPress doesn't respect ICMP Packet Too Big messages sent by the Cloudflare gateway, which should trigger a retransmission of smaller packets on the WP side. Or CF's algorithms are aggressively blocking retransmits. Many possibilities...
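We can only observe our half of the path, but a sketch of what to watch from the monitoring host (assuming tcpdump is available and eth0 is the relevant interface) is incoming ICMP "fragmentation needed" messages and the MSS we now advertise on SYNs:

# ICMP type 3 code 4 = "fragmentation needed and DF set" (Packet Too Big).
tcpdump -ni eth0 'icmp and icmp[0] == 3 and icmp[1] == 4'

# TCP SYNs to/from the blog address seen as the final hop in the mtr above
# (192.0.79.33), verbose so the advertised MSS option is printed.
tcpdump -ni eth0 -v 'tcp[tcpflags] & tcp-syn != 0 and host 192.0.79.33'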

But we should be good to close that task.

Hey all,

It looks like @ayounsi has already seen WordPress VIP's response, but I wanted to copy it below for y'all's input as well:

Thanks for the report and the additional details. We looked into the intermittent SSL handshake timeout failures as well as the RIPE Atlas failures and it doesn't seem like they are related. The RIPE Atlas failures are deterministic and repeatable (can run the same test over and over again from the same probe and it fails every time). The reason is that a small subset of probes only support a subset of TLS ciphers that are unsupported on WordPress.com (for example DES-CBC3-SHA over TLSv1.2). I can't find any legitimate clients that support the same subset of ciphers, so I'm hesitant to modify our configs to permit less secure ciphers only to allow these probes to complete the handshake.

As far as the intermittent timeouts between your monitoring and blog.wikimedia.org - it seems related to Cloudflare's Magic Transit service. We started receiving your prefixes via Cloudflare exactly 2 weeks ago, which seems to line up exactly with when this problem started. While the traffic from your servers to ours is direct via Equinix Ashburn, our return path to you goes via Cloudflare.

# traceroute -A  icinga.wikimedia.org -I
traceroute to icinga.wikimedia.org (208.80.154.84), 30 hops max, 60 byte packets
 1  100.68.15.157 (100.68.15.157) [*]  0.147 ms 100.68.15.156 (100.68.15.156) [*]  0.132 ms 100.68.15.157 (100.68.15.157) [*]  0.355 ms
 2  100.68.14.0 (100.68.14.0) [*]  0.126 ms 100.68.14.6 (100.68.14.6) [*]  0.117 ms 100.68.14.0 (100.68.14.0) [*]  0.310 ms
 3  wordpress.com (198.181.119.92) [AS2635]  1.635 ms wordpress.com (198.181.119.88) [AS2635]  0.155 ms wordpress.com (198.181.119.92) [AS2635]  1.809 ms
 4  wordpress.com (198.181.119.47) [AS2635]  0.445 ms  0.303 ms  0.536 ms
 5  206.126.237.30 (206.126.237.30) [AS5773]  8.855 ms  9.137 ms  9.258 ms
 6  162.158.76.25 (162.158.76.25) [AS13335]  0.296 ms  0.302 ms  0.328 ms
 7  162.158.76.25 (162.158.76.25) [AS13335]  0.331 ms  0.531 ms  0.638 ms
 8  icinga1001.wikimedia.org (208.80.154.84) [AS14907]  0.581 ms  0.838 ms  0.969 ms

I'm guessing the implementation of this service is related to the recent DDoS attacks mentioned here? https://techcrunch.com/2019/09/07/wikipedia-blames-malicious-ddos-attack-after-site-goes-down-across-europe-middle-east/

Is this service something you plan on keeping enabled for the long term? If so, it might be best for you to work with Cloudflare to see if there is any evidence on their side of the traffic being dropped. We aren't filtering any of the traffic on our end and it seems pretty likely that if traffic between our two networks is impacted by this configuration that other traffic exchanged with your network is being affected as well.

Hope this helps narrow things down, and if you need anything else from our side, please let us know.
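WordPress's cipher explanation for the RIPE Atlas failures is easy to verify independently; a sketch using OpenSSL's cipher names:

# Offer only the legacy 3DES suite over TLS 1.2; an endpoint that has dropped
# it (as WordPress describes) should refuse the handshake.
openssl s_client -connect blog.wikimedia.org:443 \
  -servername blog.wikimedia.org -tls1_2 -cipher 'DES-CBC3-SHA' </dev/null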

Should be fixed on our end now. If it starts happening again, we'll escalate to Cloudflare to investigate. Thanks for passing this along to WordPress!

CDanis claimed this task.