Page MenuHomePhabricator

toolforge: redis issue
Closed, ResolvedPublic

Description

On 2019-05-21 aprox 12:00 UTC:

<icinga-wm> PROBLEM - toolschecker: Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on

Event Timeline

diamond collector issue may have some hints:

May 21 12:06:41 tools-redis-1002 diamond[578]: Collector failed!
May 21 12:06:41 tools-redis-1002 diamond[578]: Traceback (most recent call last):
May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/lib/python2.7/dist-packages/diamond/utils/scheduler.py", line 73, in collector_process
May 21 12:06:41 tools-redis-1002 diamond[578]: collector._run()
May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/lib/python2.7/dist-packages/diamond/collector.py", line 477, in _run
May 21 12:06:41 tools-redis-1002 diamond[578]: self.collect()
May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/share/diamond/collectors/redisstat/redisstat.py", line 361, in collect
May 21 12:06:41 tools-redis-1002 diamond[578]: self.collect_instance(nick, host, int(port), unix_socket, auth)
May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/share/diamond/collectors/redisstat/redisstat.py", line 309, in collect_instance
May 21 12:06:41 tools-redis-1002 diamond[578]: 'maxmemory')
May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/share/diamond/collectors/redisstat/redisstat.py", line 282, in _get_config
May 21 12:06:41 tools-redis-1002 diamond[578]: config_value = client.config_get(config_key)
May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/lib/python2.7/dist-packages/redis/client.py", line 576, in config_get
May 21 12:06:41 tools-redis-1002 diamond[578]: return self.execute_command('CONFIG GET', pattern)
May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/lib/python2.7/dist-packages/redis/client.py", line 529, in execute_command
May 21 12:06:41 tools-redis-1002 diamond[578]: return self.parse_response(connection, command_name, **options)
May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/lib/python2.7/dist-packages/redis/client.py", line 541, in parse_response
May 21 12:06:41 tools-redis-1002 diamond[578]: response = connection.read_response()
May 21 12:06:41 tools-redis-1002 diamond[578]: File "/usr/lib/python2.7/dist-packages/redis/connection.py", line 550, in read_response
May 21 12:06:41 tools-redis-1002 diamond[578]: raise response
May 21 12:06:41 tools-redis-1002 diamond[578]: ResponseError: unknown command 'CONFIG'

The redis-server service is in failed state in tools-redis-1002.

When trying to restart the redis-server service:

[28908] 21 May 12:09:36.210 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
[28908] 21 May 12:09:36.212 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted.
[28908] 21 May 12:09:36.212 # Current maximum open files is 1024. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
[28908] 21 May 12:09:36.212 # Creating Server TCP listening socket 127.0.0.1:6379: bind: Address already in use
[28915] 21 May 12:09:36.441 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
[28915] 21 May 12:09:36.441 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted.
[28915] 21 May 12:09:36.441 # Current maximum open files is 1024. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
[28915] 21 May 12:09:36.445 # Creating Server TCP listening socket 127.0.0.1:6379: bind: Address already in use
[28921] 21 May 12:09:36.689 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
[28921] 21 May 12:09:36.689 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted.
[28921] 21 May 12:09:36.689 # Current maximum open files is 1024. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
[28921] 21 May 12:09:36.689 # Creating Server TCP listening socket 127.0.0.1:6379: bind: Address already in use
[28927] 21 May 12:09:36.942 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
[28927] 21 May 12:09:36.944 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted.
[28927] 21 May 12:09:36.944 # Current maximum open files is 1024. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
[28927] 21 May 12:09:36.944 # Creating Server TCP listening socket 127.0.0.1:6379: bind: Address already in use
[28933] 21 May 12:09:37.187 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
[28933] 21 May 12:09:37.187 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted.
[28933] 21 May 12:09:37.187 # Current maximum open files is 1024. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
[28933] 21 May 12:09:37.188 # Creating Server TCP listening socket 127.0.0.1:6379: bind: Address already in use

Mentioned in SAL (#wikimedia-cloud) [2019-05-21T12:35:37Z] <arturo> T223992 rebooting tools-redis-1002

I tried restarting both redis-server.service and redis-instance-tcp_6379.service with no luck, so rebooting server to see if it comes in a clean state by chance.

Other hint, the toolschecker message itself:

Caught exception: Error -2 connecting to tools-redis:6379. Name or service not known.

Could this be related to my recent merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/506019 ?

Indeed, this is related to that DNS name:

aborrero@tools-redis-1002:~$ host tools-redis.eqiad.wmflabs
Host tools-redis.eqiad.wmflabs not found: 3(NXDOMAIN)

And the wmcs-wikireplica-dns has some issues:

root@cloudcontrol1003:~# wmcs-wikireplica-dns 
2019-05-21T12:41:04Z mwopenstackclients.DnsManager WARNING : Creating tools-redis.svc.eqiad.wmflabs.
2019-05-21T12:41:04Z mwopenstackclients.DnsManager WARNING : Creating tools-redis.tools.eqiad.wmflabs.
2019-05-21T12:41:05Z mwopenstackclients.DnsManager WARNING : Creating tools-redis.eqiad.wmflabs.
2019-05-21T12:41:06Z mwopenstackclients.DnsManager WARNING : Creating s6.{u'status': u'ACTIVE', u'masters': [], u'name': u'analytics.db.svc.eqiad.wmflabs.', u'links': {u'self': u'http://cloudservices1003.wikimedia.org:9001/v2/zones/04c45c1f-214d-450b-a733-028dcdc87a12'}, u'transferred_at': None, u'created_at': u'2017-09-15T23:16:54.000000', u'pool_id': u'794ccc2c-d751-44fe-b57f-8894c9f5c842', u'updated_at': u'2019-04-23T20:55:06.000000', u'email': u'root@wmflabs.org', u'version': 2864, u'ttl': 60, u'action': u'NONE', u'attributes': {}, u'serial': 1556052260, u'project_id': u'noauth-project', u'type': u'PRIMARY', u'id': u'04c45c1f-214d-450b-a733-028dcdc87a12', u'description': u'long running wiki replica queries'}
2019-05-21T12:41:06Z root         WARNING : Error response from http://cloudservices1003.wikimedia.org:9001/v2/zones/04c45c1f-214d-450b-a733-028dcdc87a12/recordsets:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 481, in handle_one_response
    result = self.application(self.environ, start_response)
  File "/usr/lib/python2.7/dist-packages/paste/urlmap.py", line 206, in __call__
    return app(environ, start_response)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 130, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 195, in call_func
    return self.func(req, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/keystonemiddleware/auth_token/__init__.py", line 467, in __call__
    response = req.get_response(self._app)
  File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1299, in send
    application, catch_exc_info=False)
  File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1263, in call_application
    app_iter = application(self.environ, start_response)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 130, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 195, in call_func
    return self.func(req, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/oslo_middleware/base.py", line 114, in __call__
    response = req.get_response(self.application)
  File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1299, in send
    application, catch_exc_info=False)
  File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1263, in call_application
    app_iter = application(self.environ, start_response)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 130, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 195, in call_func
    return self.func(req, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/oslo_middleware/base.py", line 114, in __call__
    response = req.get_response(self.application)
  File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1299, in send
    application, catch_exc_info=False)
  File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1263, in call_application
    app_iter = application(self.environ, start_response)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 130, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 195, in call_func
    return self.func(req, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/designate/api/middleware.py", line 262, in __call__
    return self._handle_exception(request, e, status, response)
  File "/usr/lib/python2.7/dist-packages/designate/api/middleware.py", line 312, in _handle_exception
    response=json.dumps(response))
  File "/usr/lib/python2.7/dist-packages/oslo_serialization/jsonutils.py", line 178, in dumps
    return json.dumps(obj, default=default, **kwargs)
  File "/usr/lib/python2.7/json/__init__.py", line 250, in dumps
    sort_keys=sort_keys, **kw).encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
ValueError: Circular reference detected

Traceback (most recent call last):
  File "/usr/local/sbin/wmcs-wikireplica-dns", line 146, in <module>
    main()
  File "/usr/local/sbin/wmcs-wikireplica-dns", line 117, in main
    dns.ensure_recordset(zone_id, fqdn, 'A', ips)
  File "/usr/lib/python2.7/dist-packages/mwopenstackclients.py", line 329, in ensure_recordset
    rs = self.create_recordset(zone, name, type_, records)
  File "/usr/lib/python2.7/dist-packages/mwopenstackclients.py", line 305, in create_recordset
    r = self._post('/v2/zones/{}/recordsets'.format(uuid), data=data)
  File "/usr/lib/python2.7/dist-packages/mwopenstackclients.py", line 232, in _post
    return self._req('POST', *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/mwopenstackclients.py", line 224, in _req
    r.raise_for_status()
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 893, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://cloudservices1003.wikimedia.org:9001/v2/zones/04c45c1f-214d-450b-a733-028dcdc87a12/recordsets
aborrero claimed this task.

Despite the errors, the script was able to create the required DNS records:

aborrero@tools-redis-1002:~$ host tools-redis.svc.eqiad.wmflabs
tools-redis.svc.eqiad.wmflabs has address 172.16.5.84
aborrero@tools-redis-1002:~$ host tools-redis.eqiad.wmflabs
tools-redis.eqiad.wmflabs is an alias for tools-redis.svc.eqiad.wmflabs.
tools-redis.svc.eqiad.wmflabs has address 172.16.5.84
aborrero@tools-redis-1002:~$ host tools-redis.tools.eqiad.wmflabs
tools-redis.tools.eqiad.wmflabs is an alias for tools-redis.svc.eqiad.wmflabs.
tools-redis.svc.eqiad.wmflabs has address 172.16.5.84

And the icinga warning is now gone:

12:45 <+icinga-wm> RECOVERY - toolschecker: Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker

Will open a separated phab task for the script error.