Here are some facts:
- The ORES celery queue is capped at 100 revisions. Once it holds more than 100 revisions, we start to send 503 errors.
- We have 16 workers per node (32 overall).
Here's my hypothesis:
It happens when:
- We haven't reached the queue size limit of 100, but we are close to it.
- Our workers are busy and can't get to a revision in time, so ORES itself abandons the request and sends a timeout error instead of an overload error.
The solution in that case is to increase the number of workers or reduce the queue size (obviously the former is more desirable). Let's wait until the refactor is in prod so we have some free memory to use.
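To make the hypothesis concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch, not how ORES or celery actually schedule work: the function name `outcome`, the 4-second scoring time, and the 15-second timeout are assumptions picked for illustration, and it assumes every worker is busy and every revision takes the same time to score. Only the queue limit of 100 and the worker counts (32 and 48) come from this task.

```python
import math

QUEUE_LIMIT = 100  # queue cap stated in the facts above


def outcome(queued, workers, seconds_per_revision, timeout_seconds):
    """Classify what happens to a revision that arrives while `queued`
    revisions are already waiting and every worker is busy.

    Hypothetical model: revisions are processed in batches of `workers`,
    and each revision takes `seconds_per_revision` to score.
    """
    if queued >= QUEUE_LIMIT:
        # Adding this revision would push the queue past the limit.
        return "overload (503)"
    # Batches that must finish before a worker picks up the new revision.
    batches_ahead = math.ceil(queued / workers)
    # Time until the new revision itself has been scored.
    finish_time = (batches_ahead + 1) * seconds_per_revision
    if finish_time > timeout_seconds:
        return "timeout"
    return "scored"


# Assumed values for illustration only: 4 s per revision, 15 s timeout.
print(outcome(90, 32, 4, 15))   # timeout
print(outcome(90, 48, 4, 15))   # scored
print(outcome(100, 48, 4, 15))  # overload (503)
```

Under these assumed numbers, a queue of 90 is below the limit, yet with 32 workers the new revision finishes too late and times out, while with 48 workers it is scored in time. Reducing the queue size would instead turn those timeouts into earlier 503s.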
Steps to fix this:
- Increase the number of workers. We have now increased it to 48 (from 32), and that drastically reduced the ratio of failed jobs.
- Make the ORES extension job just log a warning instead of throwing an exception (T141978: ORES extension jobs should just fail when scoring is errored not to throw exception)