Page MenuHomePhabricator

handle logstash timeouts separately from spikes in errors reported by logstash
Closed, ResolvedPublic

Description

scap should catch connection timeout errors when connecting to logstash to monitor canary deploys. Currently we treat logstash failures the same as error spikes. We should at least catch the exception and report the error in a sensible way. (See T143982: scap on beta cluster does not run anymore due to logstash being down)

Scap runs /usr/local/bin/logstash_checker.py to perform the logstash canary checks and it uses its exit status to determine pass/fail. A possible fix is to arrange for logstash_checker.py terminate with a distinguished exit status if the error rate threshold has been exceeded. Other errors will result in the non-distinguished exit status of 1.

In the case of "other errors", scap will prompt the user to what to do next (exit, retry logstash check, or proceed).

Details

TitleReferenceAuthorSource BranchDest Branch
Handle logstash timeouts separately from spikes in errors reported by logstashrepos/releng/scap!221dancymaster-Ifd410026dfea14cdd5f90a7ddad9000e2ef4db05master
Customize query in GitLab

Event Timeline

greg lowered the priority of this task from Unbreak Now! to High.Aug 29 2016, 6:24 PM

Sorry I didn't mean to do that. Creating a sub-task should not clone to parent's priority ;)

stack trace of timout errors

20:16:43 Executing check 'Logstash Error rate for mw2374.codfw.wmnet'
20:16:43 Executing check 'Logstash Error rate for mw2376.codfw.wmnet'
20:16:43 Executing check 'Logstash Error rate for mw2271.codfw.wmnet'
20:16:43 Executing check 'Logstash Error rate for mw2272.codfw.wmnet'
20:16:53 Check 'Logstash Error rate for mw2374.codfw.wmnet' failed: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1356, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 319, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 280, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 65, in fetch_url
    **kw
  File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 323, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 367, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='logstash1023.eqiad.wmnet', port=9200): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 411, in <module>
    main()
  File "/usr/local/bin/logstash_checker.py", line 407, in main
    sys.exit(checker.run())
  File "/usr/local/bin/logstash_checker.py", line 266, in run
    body=query_object
  File "/usr/local/bin/logstash_checker.py", line 81, in fetch_url
    "downloading {}".format(url))
__main__.CheckServiceError: Timeout on connection while downloading logstash1023.eqiad.wmnet:9200/logstash-*/_search

20:16:53 Check 'Logstash Error rate for mw2376.codfw.wmnet' failed: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1356, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 319, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 280, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 65, in fetch_url
    **kw
  File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 323, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 367, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='logstash1023.eqiad.wmnet', port=9200): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 411, in <module>
    main()
  File "/usr/local/bin/logstash_checker.py", line 407, in main
    sys.exit(checker.run())
  File "/usr/local/bin/logstash_checker.py", line 266, in run
    body=query_object
  File "/usr/local/bin/logstash_checker.py", line 81, in fetch_url
    "downloading {}".format(url))
__main__.CheckServiceError: Timeout on connection while downloading logstash1023.eqiad.wmnet:9200/logstash-*/_search

20:16:53 Check 'Logstash Error rate for mw2271.codfw.wmnet' failed: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1356, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 319, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 280, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 65, in fetch_url
    **kw
  File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 323, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 367, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='logstash1023.eqiad.wmnet', port=9200): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 411, in <module>
    main()
  File "/usr/local/bin/logstash_checker.py", line 407, in main
    sys.exit(checker.run())
  File "/usr/local/bin/logstash_checker.py", line 266, in run
    body=query_object
  File "/usr/local/bin/logstash_checker.py", line 81, in fetch_url
    "downloading {}".format(url))
__main__.CheckServiceError: Timeout on connection while downloading logstash1023.eqiad.wmnet:9200/logstash-*/_search

20:16:53 Check 'Logstash Error rate for mw2272.codfw.wmnet' failed: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1356, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 319, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 280, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 65, in fetch_url
    **kw
  File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 323, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 367, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='logstash1023.eqiad.wmnet', port=9200): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 411, in <module>
    main()
  File "/usr/local/bin/logstash_checker.py", line 407, in main
    sys.exit(checker.run())
  File "/usr/local/bin/logstash_checker.py", line 266, in run
    body=query_object
  File "/usr/local/bin/logstash_checker.py", line 81, in fetch_url
    "downloading {}".format(url))
__main__.CheckServiceError: Timeout on connection while downloading logstash1023.eqiad.wmnet:9200/logstash-*/_search

20:16:53 scap failed: average error rate on 4/4 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details)
20:16:53 Finished sync-check-canaries (duration: 00m 50s)
20:16:53 Unhandled error:
Traceback (most recent call last):
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/cli.py", line 542, in run
    exit_status = app.main(app.extra_arguments)
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/main.py", line 179, in main
    self.canary_checks(canaries, timer)
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/main.py", line 552, in canary_checks
    raise RuntimeError(canary_fail_msg)
RuntimeError: scap failed: average error rate on 4/4 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details)
20:16:53 sync-wikiversions failed: <RuntimeError> scap failed: average error rate on 4/4 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details)
20:16:54 deploy-promote failed: <CalledProcessError> Command '['/usr/bin/scap', 'sync-wikiversions', 'group1 wikis to 1.42.0-wmf.19  refs T354437']' returned non-zero exit status 70.
20:16:54 train failed: <CalledProcessError> Command '['/usr/bin/scap', 'deploy-promote', '--train', '--old-version', '1.42.0-wmf.18', 'group1', '1.42.0-wmf.19']' returned non-zero exit status 70.
dancy updated the task description. (Show Details)

Change 1005610 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] logstash_checker.py: Exit 10 if over error threshold

https://gerrit.wikimedia.org/r/1005610

Change 1005610 merged by JHathaway:

[operations/puppet@production] logstash_checker.py: Exit 10 if over error threshold

https://gerrit.wikimedia.org/r/1005610

dancy opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/221

Handle logstash timeouts separately from spikes in errors reported by logstash

dancy merged https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/221

Handle logstash timeouts separately from spikes in errors reported by logstash

scap 4.67.0 has been deployed with this change.