handle logstash timeouts separately from spikes in errors reported by logstash
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• mmodell
	Aug 26 2016, 5:29 PM

Description

scap should catch connection timeout errors when connecting to logstash to monitor canary deploys. Currently we treat logstash failures the same as error spikes. We should at least catch the exception and report the error in a sensible way. (See T143982: scap on beta cluster does not run anymore due to logstash being down)

Scap runs /usr/local/bin/logstash_checker.py to perform the logstash canary checks and it uses its exit status to determine pass/fail. A possible fix is to arrange for logstash_checker.py terminate with a distinguished exit status if the error rate threshold has been exceeded. Other errors will result in the non-distinguished exit status of 1.

In the case of "other errors", scap will prompt the user to what to do next (exit, retry logstash check, or proceed).

Details

	Subject	Repo	Branch	Lines +/-
	logstash_checker.py: Exit 10 if over error threshold	operations/puppet	production	+13 -4

Customize query in gerrit

	Title	Reference	Author	Source Branch	Dest Branch
	Handle logstash timeouts separately from spikes in errors reported by logstash	repos/releng/scap!221	dancy	master-Ifd410026dfea14cdd5f90a7ddad9000e2ef4db05	master

Customize query in GitLab

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Resolved		hashar	T143982 scap on beta cluster does not run anymore due to logstash being down
		Resolved		dancy	T144033 handle logstash timeouts separately from spikes in errors reported by logstash

Event Timeline

• mmodell created this task.Aug 26 2016, 5:29 PM

• mmodell removed hashar as the assignee of this task.Aug 26 2016, 9:37 PM

greg lowered the priority of this task from Unbreak Now! to High.Aug 29 2016, 6:24 PM

Sorry I didn't mean to do that. Creating a sub-task should not clone to parent's priority ;)

• mmodell edited projects, added Scap; removed scap2.Feb 10 2017, 6:22 PM

• Phabricator_maintenance moved this task from To Triage to Backlog (Tech) on the Deployments board.Sep 26 2017, 11:23 PM

• mmodell moved this task from Needs triage to Debt on the Scap board.Feb 1 2018, 12:22 AM

dancy edited projects, added Release-Engineering-Team (Now this 🫠); removed Scap.Feb 21 2024, 9:13 PM

stack trace of timout errors

20:16:43 Executing check 'Logstash Error rate for mw2374.codfw.wmnet'
20:16:43 Executing check 'Logstash Error rate for mw2376.codfw.wmnet'
20:16:43 Executing check 'Logstash Error rate for mw2271.codfw.wmnet'
20:16:43 Executing check 'Logstash Error rate for mw2272.codfw.wmnet'
20:16:53 Check 'Logstash Error rate for mw2374.codfw.wmnet' failed: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1356, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 319, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 280, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 65, in fetch_url
    **kw
  File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 323, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 367, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='logstash1023.eqiad.wmnet', port=9200): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 411, in <module>
    main()
  File "/usr/local/bin/logstash_checker.py", line 407, in main
    sys.exit(checker.run())
  File "/usr/local/bin/logstash_checker.py", line 266, in run
    body=query_object
  File "/usr/local/bin/logstash_checker.py", line 81, in fetch_url
    "downloading {}".format(url))
__main__.CheckServiceError: Timeout on connection while downloading logstash1023.eqiad.wmnet:9200/logstash-*/_search

20:16:53 Check 'Logstash Error rate for mw2376.codfw.wmnet' failed: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1356, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 319, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 280, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 65, in fetch_url
    **kw
  File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 323, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 367, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='logstash1023.eqiad.wmnet', port=9200): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 411, in <module>
    main()
  File "/usr/local/bin/logstash_checker.py", line 407, in main
    sys.exit(checker.run())
  File "/usr/local/bin/logstash_checker.py", line 266, in run
    body=query_object
  File "/usr/local/bin/logstash_checker.py", line 81, in fetch_url
    "downloading {}".format(url))
__main__.CheckServiceError: Timeout on connection while downloading logstash1023.eqiad.wmnet:9200/logstash-*/_search

20:16:53 Check 'Logstash Error rate for mw2271.codfw.wmnet' failed: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1356, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 319, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 280, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 65, in fetch_url
    **kw
  File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 323, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 367, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='logstash1023.eqiad.wmnet', port=9200): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 411, in <module>
    main()
  File "/usr/local/bin/logstash_checker.py", line 407, in main
    sys.exit(checker.run())
  File "/usr/local/bin/logstash_checker.py", line 266, in run
    body=query_object
  File "/usr/local/bin/logstash_checker.py", line 81, in fetch_url
    "downloading {}".format(url))
__main__.CheckServiceError: Timeout on connection while downloading logstash1023.eqiad.wmnet:9200/logstash-*/_search

20:16:53 Check 'Logstash Error rate for mw2272.codfw.wmnet' failed: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1356, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 319, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 280, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 65, in fetch_url
    **kw
  File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 323, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 367, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='logstash1023.eqiad.wmnet', port=9200): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 411, in <module>
    main()
  File "/usr/local/bin/logstash_checker.py", line 407, in main
    sys.exit(checker.run())
  File "/usr/local/bin/logstash_checker.py", line 266, in run
    body=query_object
  File "/usr/local/bin/logstash_checker.py", line 81, in fetch_url
    "downloading {}".format(url))
__main__.CheckServiceError: Timeout on connection while downloading logstash1023.eqiad.wmnet:9200/logstash-*/_search

20:16:53 scap failed: average error rate on 4/4 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details)
20:16:53 Finished sync-check-canaries (duration: 00m 50s)
20:16:53 Unhandled error:
Traceback (most recent call last):
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/cli.py", line 542, in run
    exit_status = app.main(app.extra_arguments)
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/main.py", line 179, in main
    self.canary_checks(canaries, timer)
  File "/var/lib/scap/scap/lib/python3.7/site-packages/scap/main.py", line 552, in canary_checks
    raise RuntimeError(canary_fail_msg)
RuntimeError: scap failed: average error rate on 4/4 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details)
20:16:53 sync-wikiversions failed: <RuntimeError> scap failed: average error rate on 4/4 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details)
20:16:54 deploy-promote failed: <CalledProcessError> Command '['/usr/bin/scap', 'sync-wikiversions', 'group1 wikis to 1.42.0-wmf.19  refs T354437']' returned non-zero exit status 70.
20:16:54 train failed: <CalledProcessError> Command '['/usr/bin/scap', 'deploy-promote', '--train', '--old-version', '1.42.0-wmf.18', 'group1', '1.42.0-wmf.19']' returned non-zero exit status 70.

dancy updated the task description. (Show Details)Feb 21 2024, 9:26 PM

dancy updated the task description. (Show Details)

Change 1005610 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] logstash_checker.py: Exit 10 if over error threshold

https://gerrit.wikimedia.org/r/1005610

gerritbot added a project: Patch-For-Review.Feb 21 2024, 9:58 PM

dancy moved this task from Backlog to Waiting for review on the Release-Engineering-Team (Now this 🫠) board.Feb 21 2024, 10:26 PM

dancy updated the task description. (Show Details)Feb 21 2024, 11:59 PM

Change 1005610 merged by JHathaway:

[operations/puppet@production] logstash_checker.py: Exit 10 if over error threshold

https://gerrit.wikimedia.org/r/1005610

Maintenance_bot removed a project: Patch-For-Review.Feb 22 2024, 5:30 PM