On 2019-05-21 aprox 12:00 UTC:
<icinga-wm> PROBLEM - toolschecker: Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on
On 2019-05-21 aprox 12:00 UTC:
<icinga-wm> PROBLEM - toolschecker: Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on
diamond collector issue may have some hints:
May 21 12:06:41 tools-redis-1002 diamond[578]: Collector failed! May 21 12:06:41 tools-redis-1002 diamond[578]: Traceback (most recent call last): May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/lib/python2.7/dist-packages/diamond/utils/scheduler.py", line 73, in collector_process May 21 12:06:41 tools-redis-1002 diamond[578]: collector._run() May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/lib/python2.7/dist-packages/diamond/collector.py", line 477, in _run May 21 12:06:41 tools-redis-1002 diamond[578]: self.collect() May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/share/diamond/collectors/redisstat/redisstat.py", line 361, in collect May 21 12:06:41 tools-redis-1002 diamond[578]: self.collect_instance(nick, host, int(port), unix_socket, auth) May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/share/diamond/collectors/redisstat/redisstat.py", line 309, in collect_instance May 21 12:06:41 tools-redis-1002 diamond[578]: 'maxmemory') May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/share/diamond/collectors/redisstat/redisstat.py", line 282, in _get_config May 21 12:06:41 tools-redis-1002 diamond[578]: config_value = client.config_get(config_key) May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/lib/python2.7/dist-packages/redis/client.py", line 576, in config_get May 21 12:06:41 tools-redis-1002 diamond[578]: return self.execute_command('CONFIG GET', pattern) May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/lib/python2.7/dist-packages/redis/client.py", line 529, in execute_command May 21 12:06:41 tools-redis-1002 diamond[578]: return self.parse_response(connection, command_name, **options) May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/lib/python2.7/dist-packages/redis/client.py", line 541, in parse_response May 21 12:06:41 tools-redis-1002 diamond[578]: response = connection.read_response() May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/lib/python2.7/dist-packages/redis/connection.py", line 550, in read_response May 21 12:06:41 tools-redis-1002 diamond[578]: raise response May 21 12:06:41 tools-redis-1002 diamond[578]: ResponseError: unknown command 'CONFIG'
The redis-server service is in failed state in tools-redis-1002.
When trying to restart the redis-server service:
[28908] 21 May 12:09:36.210 # You requested maxclients of 10000 requiring at least 10032 max file descriptors. [28908] 21 May 12:09:36.212 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted. [28908] 21 May 12:09:36.212 # Current maximum open files is 1024. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'. [28908] 21 May 12:09:36.212 # Creating Server TCP listening socket 127.0.0.1:6379: bind: Address already in use [28915] 21 May 12:09:36.441 # You requested maxclients of 10000 requiring at least 10032 max file descriptors. [28915] 21 May 12:09:36.441 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted. [28915] 21 May 12:09:36.441 # Current maximum open files is 1024. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'. [28915] 21 May 12:09:36.445 # Creating Server TCP listening socket 127.0.0.1:6379: bind: Address already in use [28921] 21 May 12:09:36.689 # You requested maxclients of 10000 requiring at least 10032 max file descriptors. [28921] 21 May 12:09:36.689 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted. [28921] 21 May 12:09:36.689 # Current maximum open files is 1024. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'. [28921] 21 May 12:09:36.689 # Creating Server TCP listening socket 127.0.0.1:6379: bind: Address already in use [28927] 21 May 12:09:36.942 # You requested maxclients of 10000 requiring at least 10032 max file descriptors. [28927] 21 May 12:09:36.944 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted. [28927] 21 May 12:09:36.944 # Current maximum open files is 1024. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'. [28927] 21 May 12:09:36.944 # Creating Server TCP listening socket 127.0.0.1:6379: bind: Address already in use [28933] 21 May 12:09:37.187 # You requested maxclients of 10000 requiring at least 10032 max file descriptors. [28933] 21 May 12:09:37.187 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted. [28933] 21 May 12:09:37.187 # Current maximum open files is 1024. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'. [28933] 21 May 12:09:37.188 # Creating Server TCP listening socket 127.0.0.1:6379: bind: Address already in use
Mentioned in SAL (#wikimedia-cloud) [2019-05-21T12:35:37Z] <arturo> T223992 rebooting tools-redis-1002
I tried restarting both redis-server.service and redis-instance-tcp_6379.service with no luck, so rebooting server to see if it comes in a clean state by chance.
Other hint, the toolschecker message itself:
Caught exception: Error -2 connecting to tools-redis:6379. Name or service not known.
Could this be related to my recent merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/506019 ?
Indeed, this is related to that DNS name:
aborrero@tools-redis-1002:~$ host tools-redis.eqiad.wmflabs Host tools-redis.eqiad.wmflabs not found: 3(NXDOMAIN)
And the wmcs-wikireplica-dns has some issues:
root@cloudcontrol1003:~# wmcs-wikireplica-dns 2019-05-21T12:41:04Z mwopenstackclients.DnsManager WARNING : Creating tools-redis.svc.eqiad.wmflabs. 2019-05-21T12:41:04Z mwopenstackclients.DnsManager WARNING : Creating tools-redis.tools.eqiad.wmflabs. 2019-05-21T12:41:05Z mwopenstackclients.DnsManager WARNING : Creating tools-redis.eqiad.wmflabs. 2019-05-21T12:41:06Z mwopenstackclients.DnsManager WARNING : Creating s6.{u'status': u'ACTIVE', u'masters': [], u'name': u'analytics.db.svc.eqiad.wmflabs.', u'links': {u'self': u'http://cloudservices1003.wikimedia.org:9001/v2/zones/04c45c1f-214d-450b-a733-028dcdc87a12'}, u'transferred_at': None, u'created_at': u'2017-09-15T23:16:54.000000', u'pool_id': u'794ccc2c-d751-44fe-b57f-8894c9f5c842', u'updated_at': u'2019-04-23T20:55:06.000000', u'email': u'root@wmflabs.org', u'version': 2864, u'ttl': 60, u'action': u'NONE', u'attributes': {}, u'serial': 1556052260, u'project_id': u'noauth-project', u'type': u'PRIMARY', u'id': u'04c45c1f-214d-450b-a733-028dcdc87a12', u'description': u'long running wiki replica queries'} 2019-05-21T12:41:06Z root WARNING : Error response from http://cloudservices1003.wikimedia.org:9001/v2/zones/04c45c1f-214d-450b-a733-028dcdc87a12/recordsets: Traceback (most recent call last): File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 481, in handle_one_response result = self.application(self.environ, start_response) File "/usr/lib/python2.7/dist-packages/paste/urlmap.py", line 206, in __call__ return app(environ, start_response) File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 130, in __call__ resp = self.call_func(req, *args, **self.kwargs) File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 195, in call_func return self.func(req, *args, **kwargs) File "/usr/lib/python2.7/dist-packages/keystonemiddleware/auth_token/__init__.py", line 467, in __call__ response = req.get_response(self._app) File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1299, in send application, catch_exc_info=False) File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1263, in call_application app_iter = application(self.environ, start_response) File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 130, in __call__ resp = self.call_func(req, *args, **self.kwargs) File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 195, in call_func return self.func(req, *args, **kwargs) File "/usr/lib/python2.7/dist-packages/oslo_middleware/base.py", line 114, in __call__ response = req.get_response(self.application) File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1299, in send application, catch_exc_info=False) File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1263, in call_application app_iter = application(self.environ, start_response) File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 130, in __call__ resp = self.call_func(req, *args, **self.kwargs) File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 195, in call_func return self.func(req, *args, **kwargs) File "/usr/lib/python2.7/dist-packages/oslo_middleware/base.py", line 114, in __call__ response = req.get_response(self.application) File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1299, in send application, catch_exc_info=False) File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1263, in call_application app_iter = application(self.environ, start_response) File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 130, in __call__ resp = self.call_func(req, *args, **self.kwargs) File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 195, in call_func return self.func(req, *args, **kwargs) File "/usr/lib/python2.7/dist-packages/designate/api/middleware.py", line 262, in __call__ return self._handle_exception(request, e, status, response) File "/usr/lib/python2.7/dist-packages/designate/api/middleware.py", line 312, in _handle_exception response=json.dumps(response)) File "/usr/lib/python2.7/dist-packages/oslo_serialization/jsonutils.py", line 178, in dumps return json.dumps(obj, default=default, **kwargs) File "/usr/lib/python2.7/json/__init__.py", line 250, in dumps sort_keys=sort_keys, **kw).encode(obj) File "/usr/lib/python2.7/json/encoder.py", line 207, in encode chunks = self.iterencode(o, _one_shot=True) File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode return _iterencode(o, 0) ValueError: Circular reference detected Traceback (most recent call last): File "/usr/local/sbin/wmcs-wikireplica-dns", line 146, in <module> main() File "/usr/local/sbin/wmcs-wikireplica-dns", line 117, in main dns.ensure_recordset(zone_id, fqdn, 'A', ips) File "/usr/lib/python2.7/dist-packages/mwopenstackclients.py", line 329, in ensure_recordset rs = self.create_recordset(zone, name, type_, records) File "/usr/lib/python2.7/dist-packages/mwopenstackclients.py", line 305, in create_recordset r = self._post('/v2/zones/{}/recordsets'.format(uuid), data=data) File "/usr/lib/python2.7/dist-packages/mwopenstackclients.py", line 232, in _post return self._req('POST', *args, **kwargs) File "/usr/lib/python2.7/dist-packages/mwopenstackclients.py", line 224, in _req r.raise_for_status() File "/usr/lib/python2.7/dist-packages/requests/models.py", line 893, in raise_for_status raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://cloudservices1003.wikimedia.org:9001/v2/zones/04c45c1f-214d-450b-a733-028dcdc87a12/recordsets
Despite the errors, the script was able to create the required DNS records:
aborrero@tools-redis-1002:~$ host tools-redis.svc.eqiad.wmflabs tools-redis.svc.eqiad.wmflabs has address 172.16.5.84 aborrero@tools-redis-1002:~$ host tools-redis.eqiad.wmflabs tools-redis.eqiad.wmflabs is an alias for tools-redis.svc.eqiad.wmflabs. tools-redis.svc.eqiad.wmflabs has address 172.16.5.84 aborrero@tools-redis-1002:~$ host tools-redis.tools.eqiad.wmflabs tools-redis.tools.eqiad.wmflabs is an alias for tools-redis.svc.eqiad.wmflabs. tools-redis.svc.eqiad.wmflabs has address 172.16.5.84
And the icinga warning is now gone:
12:45 <+icinga-wm> RECOVERY - toolschecker: Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
Will open a separated phab task for the script error.