HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed
Open, Needs Triage · Public

Description

This Icinga check has been flapping since Saturday.

PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed
RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 53 days)

From IRC scrollback:

vgutierrez mutante: it looks like there is a backend behind their LB misbehaving
vgutierrez I'm ACKing the check till Monday

So my understanding is that it's an issue on the Automattic side. @Vgutierrez might be able to give more information.

Event Timeline

ayounsi created this task. · Mon, Sep 9, 10:58 PM
Restricted Application added a subscriber: Aklapper. · Mon, Sep 9, 10:58 PM

@Varnent Can you pass on this report to Automattic?

ayounsi removed a subscriber: ayounsi. · Mon, Sep 9, 11:34 PM

@CDanis the report has been passed on!

I am on vacation this week - but @EdErhart-WMF is on it. :)

EdErhart-WMF added a comment. · Edited · Tue, Sep 10, 4:38 PM

@CDanis @ayounsi @Vgutierrez WordPress VIP has sent back the following queries which I'm not equipped to answer. Could you help?

  • "When exactly did those errors start appearing?"
  • "How many occurrences of these have you seen, and when?"
  • "Is the problem still ongoing?"
  • "Can you please provide a traceroute, mtr, or some form of connection trace from your side. At the very least the source network IP and timestamp."

I’ve been on vacation since yesterday, so I won’t be able to answer everything,
but: errors began appearing intermittently on Saturday. The source IP is the A
record that icinga.wikimedia.org resolves to.

  • "When exactly did those errors start appearing?"

2019-09-07T03:43:55 UTC

  • "How many occurrences of these have you seen, and when?"

About one flap every 5 minutes.

  • "Is the problem still ongoing?"

Yes

  • "Can you please provide a traceroute, mtr, or some form of connection trace from your side. At the very least the source network IP and timestamp."

ayounsi@icinga1001:~$ mtr -z -n --report-wide blog.wikimedia.org
Start: Wed Sep 11 15:10:33 2019
HOST: icinga1001               Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS14907  208.80.154.67     0.0%    10    0.7   0.5   0.4   0.8   0.0
  2. AS???    206.126.237.124   0.0%    10    1.5   0.9   0.5   1.7   0.0
  3. AS2635   198.181.119.95    0.0%    10    0.8   0.7   0.5   1.1   0.0
  4. AS???    100.68.10.7       0.0%    10    0.7   0.7   0.5   0.9   0.0
  5. AS2635   192.0.79.33       0.0%    10    0.5   0.5   0.5   0.6   0.0

Source is 208.80.154.84

Thanks very much, @ayounsi—I've passed this info onto WordPress. Will let y'all know when I hear more!

@CDanis @ayounsi @Vgutierrez Hello all, happy Monday! WordPress is still investigating what's happening, and they're wondering if we have a more detailed error message that we could share with them. I could also try to cc you on the WordPress ticket, if that would be more helpful. Thanks for your time—y'all are fantastic!

jijiki added a subscriber: jijiki. · Tue, Sep 17, 4:54 AM

I am downtiming the check for a couple of days because it is polluting our alerts.

Mentioned in SAL (#wikimedia-operations) [2019-09-17T04:56:59Z] <effie> Downtiming HTTPS-blog on icing - T232412

@Vgutierrez had a quick look and determined that the TLS handshake sometimes takes more than 6s, which is why our tests fail.

I am disabling this alert for another week.
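
In case it helps whoever picks this up, here is a minimal sketch of how the slow handshakes could be measured from any host with Python 3. It is not the Icinga check itself: `time_handshake` and `summarize` are hypothetical helper names, and the 6s threshold is taken from the comment above rather than from the actual check configuration, which I have not looked at.

```python
import socket
import ssl
import time
from typing import List, Optional


def time_handshake(host: str, port: int = 443, timeout: float = 10.0) -> Optional[float]:
    """Return the TLS handshake duration in seconds, or None if connecting
    or handshaking fails (including hitting the timeout)."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # Start the clock after the TCP connect so only the TLS part is timed.
            start = time.monotonic()
            # wrap_socket performs the handshake before it returns.
            with context.wrap_socket(sock, server_hostname=host):
                return time.monotonic() - start
    except OSError:
        # ssl.SSLError and socket timeouts are subclasses of OSError.
        return None


def summarize(samples: List[Optional[float]], threshold: float = 6.0) -> dict:
    """Count failed (None), slow (above threshold) and ok samples.
    The 6.0 default is an assumption based on the comment in this task."""
    failed = sum(1 for s in samples if s is None)
    slow = sum(1 for s in samples if s is not None and s > threshold)
    return {
        "total": len(samples),
        "failed": failed,
        "slow": slow,
        "ok": len(samples) - failed - slow,
    }
```

Something like `summarize([time_handshake("blog.wikimedia.org") for _ in range(20)])` would take 20 samples in a row. Timestamps of the slow or failed ones are probably the kind of detail WordPress VIP is asking for.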

fgiunchedi added a subscriber: fgiunchedi. · Edited · Thu, Sep 19, 2:34 PM

I ran some SSL tests with RIPE Atlas, in case that's useful. Of 500 probes in the US, about 10% failed: https://atlas.ripe.net/measurements/22878642/#!probes (sort by validity)

Similarly, of 500 worldwide probes, ~10% failed the handshake: https://atlas.ripe.net/measurements/22878669/#!probes