[Investigate] Periodic redis related errors in wmflabs
Closed, ResolvedPublic

Description

@-jem- reported that ORES is reporting 500 errors intermittently. Here's the error body of the response:

Traceback (most recent call last):
  File "/srv/ores/venv/lib/python3.4/site-packages/redis/client.py", line 573, in execute_command
    return self.parse_response(connection, command_name, **options)
  File "/srv/ores/venv/lib/python3.4/site-packages/redis/client.py", line 585, in parse_response
    response = connection.read_response()
  File "/srv/ores/venv/lib/python3.4/site-packages/redis/connection.py", line 577, in read_response
    response = self._parser.read_response()
  File "/srv/ores/venv/lib/python3.4/site-packages/redis/connection.py", line 255, in read_response
    raise error
redis.exceptions.BusyLoadingError: Redis is loading the dataset in memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./ores/wsgi/routes/v1/scores.py", line 67, in score_model_revisions
    context, models, rev_ids, precache=precache)
  File "./ores/scoring_systems/scoring_system.py", line 41, in score
    include_model_info=include_model_info)
  File "./ores/scoring_systems/celery_queue.py", line 207, in _score
    return super()._score(*args, **kwargs)
  File "./ores/scoring_systems/scoring_system.py", line 71, in _score
    injection_caches=injection_caches)
  File "./ores/scoring_systems/scoring_system.py", line 244, in _lookup_cached_scores
    injection_cache=injection_cache)
  File "./ores/scoring_systems/scoring_system.py", line 260, in _lookup_cached_score
    version=version, injection_cache=injection_cache)
  File "./ores/score_caches/redis.py", line 27, in lookup
    value = self.redis.get(key)
  File "/srv/ores/venv/lib/python3.4/site-packages/redis/client.py", line 880, in get
    return self.execute_command('GET', name)
  File "/srv/ores/venv/lib/python3.4/site-packages/redis/client.py", line 579, in execute_command
    return self.parse_response(connection, command_name, **options)
  File "/srv/ores/venv/lib/python3.4/site-packages/redis/client.py", line 585, in parse_response
    response = connection.read_response()
  File "/srv/ores/venv/lib/python3.4/site-packages/redis/connection.py", line 577, in read_response
    response = self._parser.read_response()
  File "/srv/ores/venv/lib/python3.4/site-packages/redis/connection.py", line 255, in read_response
    raise error
redis.exceptions.BusyLoadingError: Redis is loading the dataset in memory
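
For clients hitting this intermittently, one workaround is to retry reads while the server is still replaying its dataset after a restart. A minimal sketch with redis-py, assuming a hypothetical host name and retry policy (this is not ORES's actual cache code):

import time
import redis

client = redis.StrictRedis(host="ores-redis-02", port=6379)  # hypothetical host name

def get_with_retry(key, attempts=5, delay=2.0):
    # BusyLoadingError clears once Redis finishes loading its dataset,
    # so short loading windows can be papered over with a bounded retry.
    for _ in range(attempts):
        try:
            return client.get(key)
        except redis.exceptions.BusyLoadingError:
            time.sleep(delay)
    raise RuntimeError("Redis still loading after %d attempts" % attempts)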

Event Timeline

Halfak renamed this task from [Investigate] Period redis related errors in wmflabs to [Investigate] Periodic redis related errors in wmflabs.Aug 2 2016, 11:14 PM

The problem has been happening for several days now, and it can last from a few minutes to a few hours. As my patroller bot makes continuous use of the ORES API, this is quite a problem for me. Thanks in advance.

I looked into this with @akosiaris. We think this issue was caused by the redis node going into swap. We started up ores-redis-02 with more memory and directed all of the nodes to use it. We haven't seen the error since.
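
For anyone debugging something similar: a quick way to see whether a Redis instance is under memory pressure is the INFO memory section. A minimal sketch with redis-py, where the host name is a placeholder:

import redis

client = redis.StrictRedis(host="ores-redis-02")  # hypothetical host name

info = client.info("memory")
# used_memory is what Redis has allocated; used_memory_rss is what the OS
# actually holds in RAM. If the host starts swapping, latency spikes and
# an OOM kill becomes likely once other processes demand memory too.
print(info["used_memory_human"], info["used_memory_rss"])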

Some more info. We now know the issue was caused by the Linux out-of-memory (OOM) killer: the VM's memory was fully utilized, processes kept requesting more, and the kernel killed redis. systemd would then restart redis, and while it was loading the dataset back from disk, queries would fail with the "Redis is loading the dataset in memory" error seen above.
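
Since INFO is one of the few commands Redis serves while loading, a client (or a deploy script) can poll the loading flag instead of hammering the instance with reads. A sketch under the same hypothetical-host assumption:

import time
import redis

client = redis.StrictRedis(host="ores-redis-02")  # hypothetical host name

def wait_until_loaded(timeout=300, poll=5):
    # The 'loading' flag in the persistence section is 1 while the
    # dataset is being read back from disk after a restart.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if client.info("persistence").get("loading", 0) == 0:
            return True
        time.sleep(poll)
    return False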

@-jem-: One rather strange question: why are you not using the production cluster (ores.wikimedia.org), which is more stable?
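
For reference, switching a client to the production cluster is just a matter of pointing it at ores.wikimedia.org. A minimal sketch with requests, where the wiki, model, revision ID, and exact v1-style route are assumptions based on the paths in the traceback above:

import requests

# Production ORES endpoint; the route and parameters here are an
# assumption modeled on the v1 routes referenced in the traceback.
resp = requests.get(
    "https://ores.wikimedia.org/scores/enwiki/",
    params={"models": "damaging", "revids": 123456},
)
resp.raise_for_status()
print(resp.json())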

FWIW, I am pretty certain we have pinpointed the issue, and this can be marked as resolved.

Yup, it's in the "Done" column on our workboard. We mark tasks resolved after the weekly meeting (which is later today).

@Ladsgroup: Probably because I used the first URL I found when searching, and until now I hadn't realized there was a better choice. I have changed my code. And thanks to everyone; I confirm that the problem hasn't happened again.