
ats-be on the text cluster is experiencing broken connections
Open, Medium, Public

Description

As can be seen for any text instance running ats-be on the "ATS Instance Drilldown" Grafana dashboard, these instances are experiencing broken connections to origin servers.

The errors causing the broken connections are logged in /var/log/trafficserver/error.log and look like this:

vgutierrez@cp5007:/var/log/trafficserver$ fgrep BAD_INCOMING_RESPONSE error.log |head -1
20191028.15h06m23s CONNECT:[0] could not connect [BAD_INCOMING_RESPONSE] to 10.2.2.1 for 'https://appservers-rw.discovery.wmnet/wiki/Special:EntityPage/Q170564'
vgutierrez@cp5007:/var/log/trafficserver$ fgrep CONNECTION_CLOSED error.log |head -1
20191028.13h50m20s CONNECT:[0] could not connect [CONNECTION_CLOSED] to 10.2.2.1 for 'https://appservers-rw.discovery.wmnet/wiki/Special:%E7%9B%91%E8%A7%86%E5%88%97%E8%A1%A8?hidepreviousrevisions=1&hidecategorization=1&hideWikibase=1&limit=1000&days=14&urlversion=2&peek=1&from=20191028134609&isAnon=false&action=render&enhanced=0'
vgutierrez@cp5007:/var/log/trafficserver$ fgrep CONNECTION_ERROR error.log |head -1
20191028.13h53m10s CONNECT:[0] could not connect [CONNECTION_ERROR] to 10.2.2.22 for 'https://api-rw.discovery.wmnet/w/api.php?format=json&formatversion=2&errorformat=plaintext&action=query&meta=notifications&notprop=list&notfilter=!read&notlimit=1'
vgutierrez@cp5007:/var/log/trafficserver$ fgrep INACTIVE_TIMEOUT error.log |head -1
20191028.13h56m33s CONNECT:[0] could not connect [INACTIVE_TIMEOUT] to 10.2.2.1 for 'https://appservers-rw.discovery.wmnet/wiki/Night_of_the_Living_Dead_(film_series)'
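
To see which error types and origin IPs dominate, rather than sampling one line per type, the log can be tallied. The following is a minimal sketch in Python, assuming every relevant line follows the "could not connect [REASON] to IP for 'URL'" shape shown above. The function name tally_connect_errors and the regular expression are illustrative and not part of ATS or any existing tooling.

```python
import re
from collections import Counter

# Assumed line shape, taken from the samples in this task:
#   <timestamp> CONNECT:[0] could not connect [REASON] to <ip> for '<url>'
LINE_RE = re.compile(
    r"could not connect \[(?P<reason>[A-Z_]+)\] to (?P<ip>\S+) for '(?P<url>[^']*)'"
)


def tally_connect_errors(lines):
    """Count (reason, origin IP) pairs; lines that do not match are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group("reason"), match.group("ip"))] += 1
    return counts


# Three of the log lines quoted above, used here as sample input.
sample = [
    "20191028.15h06m23s CONNECT:[0] could not connect [BAD_INCOMING_RESPONSE] to 10.2.2.1 "
    "for 'https://appservers-rw.discovery.wmnet/wiki/Special:EntityPage/Q170564'",
    "20191028.13h53m10s CONNECT:[0] could not connect [CONNECTION_ERROR] to 10.2.2.22 "
    "for 'https://api-rw.discovery.wmnet/w/api.php?format=json&formatversion=2"
    "&errorformat=plaintext&action=query&meta=notifications&notprop=list"
    "&notfilter=!read&notlimit=1'",
    "20191028.13h56m33s CONNECT:[0] could not connect [INACTIVE_TIMEOUT] to 10.2.2.1 "
    "for 'https://appservers-rw.discovery.wmnet/wiki/Night_of_the_Living_Dead_(film_series)'",
]

if __name__ == "__main__":
    for (reason, ip), count in tally_connect_errors(sample).most_common():
        print(f"{count:6d}  {reason:<24} {ip}")
```

On a cache host, the same function could be fed an open file handle for error.log instead of the sample list, since it only iterates over lines.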

Of course, some of these are bound to happen due to operations on the origin servers; others could be a symptom of a misconfigured service (e.g. a public hostname missing from the TLS certificate's SAN list).
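
For the misconfiguration case, a quick way to rule a service in or out is to check whether the requested hostname is covered by the origin certificate's SAN entries. The sketch below assumes the certificate has already been fetched and is available as a dict in the shape returned by Python's ssl.SSLSocket.getpeercert(). The helper name hostname_in_san and the example.org names are hypothetical, and the wildcard handling is deliberately simplified to a single left-most label.

```python
def hostname_in_san(hostname, cert):
    """Return True if hostname matches a DNS entry in the cert's subjectAltName.

    cert is expected to look like the dict returned by
    ssl.SSLSocket.getpeercert(), e.g.
    {"subjectAltName": (("DNS", "*.example.org"), ("DNS", "example.org"))}.
    """
    hostname = hostname.lower().rstrip(".")
    for kind, value in cert.get("subjectAltName", ()):
        if kind != "DNS":
            continue  # skip IP address and other SAN entry types
        pattern = value.lower().rstrip(".")
        if pattern == hostname:
            return True
        if pattern.startswith("*."):
            # Simplified rule: the wildcard stands for exactly one left-most label.
            _, _, rest = hostname.partition(".")
            if rest and rest == pattern[2:]:
                return True
    return False


if __name__ == "__main__":
    # Hypothetical certificate contents, for illustration only.
    cert = {"subjectAltName": (("DNS", "*.example.org"), ("DNS", "example.org"))}
    for name in ("www.example.org", "example.org", "a.b.example.org"):
        print(name, hostname_in_san(name, cert))
```

A hostname that comes back False here would be a candidate for the kind of SAN misconfiguration described above; it does not by itself explain any particular error type in the log.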

Event Timeline

BBlack subscribed.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

@Vgutierrez: https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&from=now-30d&to=now&viewPanel=59 shows a lower frequency of broken connections than the ticket description implies. Would you say that these incidents are not related to this ticket? I am not seeing any BAD_INCOMING_RESPONSE entries in the logs (granted, we recently refreshed those servers via T321309).