Page MenuHomePhabricator

ats-tls is having issues when varnish-fe goes away
Closed, ResolvedPublic

Description

During last week incident (2020-01-08 - MW API outage), ats-tls experienced major issues on esams when varnish-fe crashed on several text nodes. Specifically ats-tls took way longer than pybal realizing that varnish-fe came back after an automatic restart.

timeline:

  • pybal on lvs3005 notices that something is wrong with cp3050 port 443 at 15:05:46
Jan  8 15:05:46 lvs3005 pybal[13763]: [textlb6_443 ProxyFetch] WARN: cp3050.esams.wmnet (enabled/up/pooled): Fetch failed (https://varnishcheck.wikimedia.org/from/pybal), 5.001 s
  • ats-tls notices that varnish-fe is having issues at 15:05:54
Jan 08 15:05:54 cp3050 traffic_manager[21052]: [Jan  8 15:05:54.239] {0x2b0667ffd700} NOTE: Failure threshold met failcount:10 >= threshold:10, http parent proxy 10.20.0.50:3120 marked down
  • pybal reports varnish-fe issues on port 80 at 15:06:23:
Jan  8 15:06:23 lvs3005 pybal[13763]: [textlb_80] ERROR: Monitoring instance IdleConnection reports server cp3050.esams.wmnet (enabled/up/pooled) down: Connection to the other side was lost in a non-clean fashion.
  • pybal reports varnish-fe on port 80 coming back online at 15:06:26 (3 seconds later):
Jan  8 15:06:26 lvs3005 pybal[13763]: [textlb_80] INFO: Server cp3050.esams.wmnet (enabled/partially up/not pooled) is up
  • meanwhile port 443 goes up & down several times according to pybal:
Jan  8 15:06:22 lvs3005 pybal[13763]: [textlb_443 ProxyFetch] WARN: cp3050.esams.wmnet (enabled/partially up/not pooled): Fetch failed (https://varnishcheck.wikimedia.org/from/pybal), 5.002 s
Jan  8 15:06:36 lvs3005 pybal[13763]: [textlb_443] INFO: Server cp3050.esams.wmnet (enabled/partially up/not pooled) is up
Jan  8 15:07:14 lvs3005 pybal[13763]: [textlb_443 ProxyFetch] WARN: cp3050.esams.wmnet (enabled/up/pooled): Fetch failed (https://varnishcheck.wikimedia.org/from/pybal), 5.002 s
Jan  8 15:07:29 lvs3005 pybal[13763]: [textlb_443 ProxyFetch] WARN: cp3050.esams.wmnet (enabled/partially up/not pooled): Fetch failed (https://varnishcheck.wikimedia.org/from/pybal), 5.002 s
Jan  8 15:07:40 lvs3005 pybal[13763]: [textlb_443] INFO: Server cp3050.esams.wmnet (enabled/partially up/not pooled) is up
Jan  8 15:08:17 lvs3005 pybal[13763]: [textlb_443 ProxyFetch] WARN: cp3050.esams.wmnet (enabled/up/pooled): Fetch failed (https://varnishcheck.wikimedia.org/from/pybal), 5.002 s
Jan  8 15:08:17 lvs3005 pybal[13763]: [textlb_443] ERROR: Monitoring instance ProxyFetch reports server cp3050.esams.wmnet (enabled/up/pooled) down: Getting https://varnishcheck.wikimedia.org/from/pybal took longer than 5 seconds.
Jan  8 15:08:32 lvs3005 pybal[13763]: [textlb_443] INFO: Server cp3050.esams.wmnet (enabled/partially up/not pooled) is up
  • ats-tls reports some recoveries on ports 3120 && 3121 (nothing about 3122-3127):
Jan 08 15:07:16 cp3050 traffic_manager[21052]: [Jan  8 15:06:36.208] {0x2b0697926700} NOTE: http parent proxy 10.20.0.50:3120 restored
Jan 08 15:07:17 cp3050 traffic_manager[21052]: [Jan  8 15:07:17.325] {0x2b0667365700} NOTE: http parent proxy 10.20.0.50:3121 restored
Jan 08 15:07:57 cp3050 traffic_manager[21052]: [Jan  8 15:07:57.224] {0x2b0694e98700} NOTE: http parent proxy 10.20.0.50:3120 restored
Jan 08 15:07:57 cp3050 traffic_manager[21052]: [Jan  8 15:07:57.376] {0x2b0697926700} NOTE: http parent proxy 10.20.0.50:3120 restored

We should further debug what's going on with:

  • as mentioned on the timeline, ats-tls only reported recoveries for ports 3120 and 3121 while it is using ports 3120-3127 to communicate with varnish-fe. The remaining ports never recovered? did we lose the recovery messages?

Pain points:

  • ats doesn't tell between connect and TTFB timeout, this is forcing us to have a 180 secs timeout on proxy.config.http.parent_proxy.connect_attempts_timeout. This is especially bad when varnish-fe crashes.

ats-tls currently has the following configuration:

dest_domain=. parent="10.20.0.50:3120,10.20.0.50:3121,10.20.0.50:3122,10.20.0.50:3123,10.20.0.50:3124,10.20.0.50:3125,10.20.0.50:3126,10.20.0.50:3127" parent_is_proxy=false round_robin=strict
---
proxy.config.http.parent_proxy.fail_threshold: 10
proxy.config.http.parent_proxy.total_connect_attempts: 4
proxy.config.http.parent_proxy.per_parent_connect_attempts: 2

This means that ats-tls is going to perform 2 connects per parent (each varnish-fe port) waiting up to 180 secs per attempt and up to fail_threshold (10) / total_connect_attempts (4) = 2 attempts.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Vgutierrez moved this task from Backlog to TLS on the Traffic board.

Apparently ATS solves our current issue about TTFB VS connect timeout on https://github.com/apache/trafficserver/pull/4028, this is currently backported to 7.x and 9.x, I'm checking with upstream what's the current blocker for 8,x.

Change 564584 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/trafficserver@master] Release 8.0.5-1wm12

https://gerrit.wikimedia.org/r/564584

Mentioned in SAL (#wikimedia-operations) [2020-01-14T10:40:08Z] <vgutierrez> upgrade ats to 8.0.5-1wm12 in cp4026 and cp4032 - T242620

Mentioned in SAL (#wikimedia-operations) [2020-01-14T14:09:48Z] <vgutierrez> upgrade ats to 8.0.5-1wm12 in cp5006 and cp5012 - T242620

Change 564708 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Set connect timeout and TTFB timeouts to different values

https://gerrit.wikimedia.org/r/564708

Change 564711 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS cp40[26|32], cp50[06|12]: Set connect timeout and TTFB timeouts to different values

https://gerrit.wikimedia.org/r/564711

Change 564711 merged by Vgutierrez:
[operations/puppet@production] ATS: Set connect timeout and TTFB timeouts to different values (test hosts only)

https://gerrit.wikimedia.org/r/564711

Change 564584 merged by Vgutierrez:
[operations/debs/trafficserver@master] Release 8.0.5-1wm12

https://gerrit.wikimedia.org/r/564584

Mentioned in SAL (#wikimedia-traffic) [2020-01-20T15:03:10Z] <vgutierrez> uploaded trafficserver 8.0.5-1wm12 to apt.wm.o (stretch) - T242620 T242778

Mentioned in SAL (#wikimedia-operations) [2020-01-20T15:20:08Z] <vgutierrez> rolling upgrade of ats to version 8.0.5-1wm12 - T242620 T242778

Change 564708 merged by Vgutierrez:
[operations/puppet@production] ATS: Set connect timeout and TTFB timeouts to different values

https://gerrit.wikimedia.org/r/564708

Vgutierrez claimed this task.