During last week incident (2020-01-08 - MW API outage), ats-tls experienced major issues on esams when varnish-fe crashed on several text nodes. Specifically ats-tls took way longer than pybal realizing that varnish-fe came back after an automatic restart.
timeline:
- pybal on lvs3005 notices that something is wrong with cp3050 port 443 at 15:05:46
Jan 8 15:05:46 lvs3005 pybal[13763]: [textlb6_443 ProxyFetch] WARN: cp3050.esams.wmnet (enabled/up/pooled): Fetch failed (https://varnishcheck.wikimedia.org/from/pybal), 5.001 s
- ats-tls notices that varnish-fe is having issues at 15:05:54
Jan 08 15:05:54 cp3050 traffic_manager[21052]: [Jan 8 15:05:54.239] {0x2b0667ffd700} NOTE: Failure threshold met failcount:10 >= threshold:10, http parent proxy 10.20.0.50:3120 marked down
- pybal reports varnish-fe issues on port 80 at 15:06:23:
Jan 8 15:06:23 lvs3005 pybal[13763]: [textlb_80] ERROR: Monitoring instance IdleConnection reports server cp3050.esams.wmnet (enabled/up/pooled) down: Connection to the other side was lost in a non-clean fashion.
- pybal reports varnish-fe on port 80 coming back online at 15:06:26 (3 seconds later):
Jan 8 15:06:26 lvs3005 pybal[13763]: [textlb_80] INFO: Server cp3050.esams.wmnet (enabled/partially up/not pooled) is up
- meanwhile port 443 goes up & down several times according to pybal:
Jan 8 15:06:22 lvs3005 pybal[13763]: [textlb_443 ProxyFetch] WARN: cp3050.esams.wmnet (enabled/partially up/not pooled): Fetch failed (https://varnishcheck.wikimedia.org/from/pybal), 5.002 s Jan 8 15:06:36 lvs3005 pybal[13763]: [textlb_443] INFO: Server cp3050.esams.wmnet (enabled/partially up/not pooled) is up Jan 8 15:07:14 lvs3005 pybal[13763]: [textlb_443 ProxyFetch] WARN: cp3050.esams.wmnet (enabled/up/pooled): Fetch failed (https://varnishcheck.wikimedia.org/from/pybal), 5.002 s Jan 8 15:07:29 lvs3005 pybal[13763]: [textlb_443 ProxyFetch] WARN: cp3050.esams.wmnet (enabled/partially up/not pooled): Fetch failed (https://varnishcheck.wikimedia.org/from/pybal), 5.002 s Jan 8 15:07:40 lvs3005 pybal[13763]: [textlb_443] INFO: Server cp3050.esams.wmnet (enabled/partially up/not pooled) is up Jan 8 15:08:17 lvs3005 pybal[13763]: [textlb_443 ProxyFetch] WARN: cp3050.esams.wmnet (enabled/up/pooled): Fetch failed (https://varnishcheck.wikimedia.org/from/pybal), 5.002 s Jan 8 15:08:17 lvs3005 pybal[13763]: [textlb_443] ERROR: Monitoring instance ProxyFetch reports server cp3050.esams.wmnet (enabled/up/pooled) down: Getting https://varnishcheck.wikimedia.org/from/pybal took longer than 5 seconds. Jan 8 15:08:32 lvs3005 pybal[13763]: [textlb_443] INFO: Server cp3050.esams.wmnet (enabled/partially up/not pooled) is up
- ats-tls reports some recoveries on ports 3120 && 3121 (nothing about 3122-3127):
Jan 08 15:07:16 cp3050 traffic_manager[21052]: [Jan 8 15:06:36.208] {0x2b0697926700} NOTE: http parent proxy 10.20.0.50:3120 restored Jan 08 15:07:17 cp3050 traffic_manager[21052]: [Jan 8 15:07:17.325] {0x2b0667365700} NOTE: http parent proxy 10.20.0.50:3121 restored Jan 08 15:07:57 cp3050 traffic_manager[21052]: [Jan 8 15:07:57.224] {0x2b0694e98700} NOTE: http parent proxy 10.20.0.50:3120 restored Jan 08 15:07:57 cp3050 traffic_manager[21052]: [Jan 8 15:07:57.376] {0x2b0697926700} NOTE: http parent proxy 10.20.0.50:3120 restored
We should further debug what's going on with:
- as mentioned on the timeline, ats-tls only reported recoveries for ports 3120 and 3121 while it is using ports 3120-3127 to communicate with varnish-fe. The remaining ports never recovered? did we lose the recovery messages?
Pain points:
- ats doesn't tell between connect and TTFB timeout, this is forcing us to have a 180 secs timeout on proxy.config.http.parent_proxy.connect_attempts_timeout. This is especially bad when varnish-fe crashes.
ats-tls currently has the following configuration:
dest_domain=. parent="10.20.0.50:3120,10.20.0.50:3121,10.20.0.50:3122,10.20.0.50:3123,10.20.0.50:3124,10.20.0.50:3125,10.20.0.50:3126,10.20.0.50:3127" parent_is_proxy=false round_robin=strict --- proxy.config.http.parent_proxy.fail_threshold: 10 proxy.config.http.parent_proxy.total_connect_attempts: 4 proxy.config.http.parent_proxy.per_parent_connect_attempts: 2
This means that ats-tls is going to perform 2 connects per parent (each varnish-fe port) waiting up to 180 secs per attempt and up to fail_threshold (10) / total_connect_attempts (4) = 2 attempts.