
ORES service gets stuck reporting "server overloaded" even after load returns to normal
Closed, Resolved (Public)

Description

I ran stress tests a few times, but the servers quickly locked up with ScoreProcessorOverloaded. Make sure the production service can't die in this way.

When the Redis "celery" queue fills with pending jobs, we can hit a limit (configured to 100 normally, 400 for the stress tests) at which new jobs cannot be processed. I'm not sure how it got there, but I currently see 481 items in the queue. This number neither grows nor shrinks, regardless of new requests.
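To illustrate why a queue stuck above the limit keeps the service overloaded indefinitely, here is a minimal sketch of a length-based admission check. It is not the actual ORES code: the function name `check_queue_capacity`, the `queue_maxsize` parameter, and the exception class defined here are assumptions for illustration. `client` is expected to be a redis-py style client exposing `llen`.

```python
class ScoreProcessorOverloaded(RuntimeError):
    """Illustrative stand-in for the error named in this task."""


def check_queue_capacity(client, queue_name="celery", queue_maxsize=100):
    """Refuse new work when the pending-job list is already at the limit.

    `client` is any object with a redis-py style `llen(name)` method.
    Returns the current queue length when there is still room.
    """
    pending = client.llen(queue_name)
    if pending >= queue_maxsize:
        # If nothing ever drains the list, this branch is taken forever,
        # which matches the "stuck at 481 items" symptom described above.
        raise ScoreProcessorOverloaded(
            "%d pending jobs in %r (limit %d)" % (pending, queue_name, queue_maxsize)
        )
    return pending
```

With a check like this, 481 stale items against a limit of 400 means every new request is refused until something removes the old entries.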

The fix is probably to have a job that expires old pending jobs. A Redis TTL won't work here, because expiry applies to a whole key, not to individual items inside a list.
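A rough sketch of what such an expiry job could look like, under loud assumptions: each queued message is a JSON object carrying a hypothetical `enqueued_at` Unix timestamp (real Celery messages may not include such a field), and `client` is a redis-py style client exposing `lrange` and `lrem`. The function name and the 15-second default are also made up for illustration.

```python
import json
import time


def expire_stale_jobs(client, queue_name="celery", max_age_seconds=15.0, now=None):
    """Remove pending jobs older than `max_age_seconds` from a Redis list.

    Assumes each message is JSON with a hypothetical `enqueued_at` field.
    Returns the number of messages removed.
    """
    now = time.time() if now is None else now
    removed = 0
    # Snapshot the list, then delete stale entries by value.
    for raw in client.lrange(queue_name, 0, -1):
        try:
            enqueued_at = float(json.loads(raw)["enqueued_at"])
        except (ValueError, KeyError, TypeError):
            # Leave anything we cannot interpret untouched.
            continue
        if now - enqueued_at > max_age_seconds:
            # LREM with count=1 removes the first element equal to `raw`.
            removed += client.lrem(queue_name, 1, raw)
    return removed
```

The snapshot-then-remove approach is not atomic: a worker may pop a message between the `lrange` and the `lrem`, in which case `lrem` simply removes nothing for that entry. Running this periodically would keep the list from staying pinned above the limit.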

This problem appeared because all jobs were timing out.

Event Timeline

awight updated the task description.
Halfak triaged this task as High priority. Oct 9 2017, 9:14 PM
Halfak added a project: ORES.
Halfak moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.

@awight you said this is done. Should we close this?