ORES service gets stuck reporting "server overloaded" even after load returns to normal
Closed, Resolved · Public

Assigned To

Authored By

	awight
	Sep 12 2017, 12:08 AM

Description

I ran stress tests a few times, and each time the servers quickly locked up with ScoreProcessorOverloaded. We need to make sure the production service can't die in this way.

When the Redis "celery" queue fills with pending jobs, we can hit a limit (normally configured to 100, raised to 400 for the stress tests) at which new jobs are no longer accepted. I'm not sure how, but I currently see 481 items in the queue. This number doesn't grow or shrink, regardless of new requests.
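A minimal sketch of why the service stays stuck with these numbers, assuming the overload check simply compares the pending-queue length against the configured limit. The exception class here is a local stand-in, and `check_capacity`, `accepts_new_jobs`, and `queue_maxsize` are hypothetical names, not the actual ORES code.

```python
class ScoreProcessorOverloaded(Exception):
    """Local stand-in for the overload error reported by the service."""


def check_capacity(queue_length, queue_maxsize=100):
    # Assumption: new work is refused once the pending queue reaches the limit.
    if queue_length >= queue_maxsize:
        raise ScoreProcessorOverloaded(
            "{} pending jobs (limit {})".format(queue_length, queue_maxsize)
        )


def accepts_new_jobs(queue_length, queue_maxsize=100):
    # Convenience wrapper: True when a new job would be accepted.
    try:
        check_capacity(queue_length, queue_maxsize)
    except ScoreProcessorOverloaded:
        return False
    return True
```

Under this assumption, `accepts_new_jobs(481, 400)` is false on every request. If nothing removes the 481 stale items, the length never drops below the limit and the service keeps reporting overload after the real load is gone.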

The fix is probably to have a job that expires old pending jobs (a Redis TTL applies to a whole key, so it can't expire individual items inside the list).

This problem appeared because all jobs are timing out.
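The expiry job proposed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the fix that was actually deployed: it assumes each queue entry is a JSON object carrying a hypothetical `enqueued_at` Unix timestamp, and that `client` behaves like a redis-py 3.0+ client (`lrange(name, start, end)` and `lrem(name, count, value)`). The function name, the `"celery"` default key, and the `max_age` default are placeholders.

```python
import json
import time


def expire_pending(client, key="celery", max_age=15.0, now=None):
    """Remove queue entries older than max_age seconds; return the count removed."""
    now = time.time() if now is None else now
    removed = 0
    # Take a snapshot of the list, then remove stale entries by value.
    for raw in client.lrange(key, 0, -1):
        try:
            # Hypothetical field: real messages may need a different lookup.
            enqueued_at = float(json.loads(raw)["enqueued_at"])
        except (ValueError, KeyError, TypeError):
            continue  # leave entries we cannot date untouched
        if now - enqueued_at > max_age:
            # count=1 removes at most one matching entry per call.
            removed += client.lrem(key, 1, raw)
    return removed
```

Run periodically, something like this would let the queue length fall back below the limit once jobs stop timing out. Removing by value keeps the sketch simple, but it is not atomic with respect to workers popping from the same list, so a production version would need to account for that.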

Related Objects

Mentioned In: T174402: Review and fix file handle management in worker and celery processes

Event Timeline

awight created this task. Sep 12 2017, 12:08 AM

Restricted Application added a subscriber: Aklapper. Sep 12 2017, 12:08 AM

awight updated the task description. Sep 12 2017, 12:18 AM

awight updated the task description. Sep 12 2017, 12:38 AM

awight updated the task description.

awight updated the task description. Sep 12 2017, 12:45 AM

awight mentioned this in T174402: Review and fix file handle management in worker and celery processes. Sep 12 2017, 1:24 AM

Halfak triaged this task as High priority. Oct 9 2017, 9:14 PM

Halfak added a project: ORES.

Halfak moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.

@awight you said this is done. Should we close this?

This is done.

Restricted Application added a project: User-Ladsgroup. Sep 20 2018, 10:35 AM

Ladsgroup moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board. Sep 20 2018, 10:35 AM
