Page MenuHomePhabricator

Tasks should not be marked permanently revoked due to high system load
Closed, ResolvedPublic

Description

When ORES hits the wall, it starts returning "server overloaded" errors. We have a mechanism for "revoking" tasks, e.g. when the scoring of a particular revision takes too long and we want to avoid potential DoS, however we're abusing it by marking requests made during a general overcapacity interval as revoked. This is incorrect, and makes the requested revisions unscoreable for 24 hours.

We should distinguish between general overload and an expensive request. The former should be refused with a 5xx HTTP status code (currently returns 200 but the contents are all errors), and the revisions in question should not be penalized.

Event Timeline

The cause seems to be the scoring_systems.timeout config, which is set to 15 seconds. Our signal() timeout is set using the same variable, therefore loses the race condition and will never fire. If a request isn't served within the timeout, the Celery apply_async expires will cause the task to automatically be marked revoked by Celery.

Storing the revoked status in Redis causes us to reply with revoked model errors.

We should intercept that TaskRevokedError and somehow determine whether the task was worked on and timed out (continue with revocation) or never started (do not revoke, just return a timeout).

See also T169367.

maybe limit the tasks that cause ORES to overload the server to certain time/size/query time/etc

I did an experiment in which I set the apply_async expires=timeout+1, so the signal would fire before celery times out. IMO we should do this, and the long-term solution will be to configure the celery timeout separately.

This backfired a bit, but in a learning sort of way. When the signal kills the celery worker, a catastrophic memory surge damages the service.

[2017-09-13 15:44:37,701: INFO/Worker-30] Looking up draftquality in 082e41a9-715a-4bc8-a23b-9a63e3a445ef
[2017-09-13 15:44:37,712: WARNING/Worker-30] /home/awight/venv/lib/python3.5/site-packages/celery/result.py:45: RuntimeWarning: Never call result.get() within a task!
See http://docs.celeryq.org/en/latest/userguide/tasks.html#task-synchronous-subtasks

In Celery 3.2 this will result in an exception being
raised instead of just being a warning.

  warnings.warn(RuntimeWarning(E_WOULDBLOCK))
[2017-09-13 15:44:37,696: DEBUG/Worker-12] Extracted features for enwiki:damaging:0.3.0 in 0.004 secs
[2017-09-13 15:44:37,733: DEBUG/Worker-12] Scored features for enwiki:damaging:0.3.0 in 0.02 secs
[2017-09-13 15:44:37,999: DEBUG/Worker-38] score_timed_out: enwiki:{'damaging', 'draftquality', 'goodfaith', 'reverted', 'wp10'} in 15.724 seconds
[2017-09-13 15:44:38,032: INFO/Worker-38] Generating a score map for 'enwiki':699151985:frozenset({'goodfaith', 'wp10', 'draftquality', 'damaging', 'reverted'})(features, model_info=["version"])
[2017-09-13 15:44:38,044: ERROR/MainProcess] Error on stopping Pool: KeyError(None,)
Traceback (most recent call last):
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/pool.py", line 1260, in maintain_pool
    self._maintain_pool()
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/pool.py", line 1252, in _maintain_pool
    self._repopulate_pool(joined)
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/pool.py", line 1237, in _repopulate_pool
    self._create_worker_process(self._avail_index())
  File "/home/awight/venv/lib/python3.5/site-packages/celery/concurrency/asynpool.py", line 415, in _create_worker_process
    return super(AsynPool, self)._create_worker_process(i)
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/pool.py", line 1068, in _create_worker_process
    w.start()
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/process.py", line 137, in start
    self._popen = Popen(self)
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/forking.py", line 99, in __init__
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/awight/venv/lib/python3.5/site-packages/celery/worker/__init__.py", line 206, in start
    self.blueprint.start(self)
  File "/home/awight/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/home/awight/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 374, in start
    return self.obj.start()
  File "/home/awight/venv/lib/python3.5/site-packages/celery/worker/consumer.py", line 280, in start
    blueprint.start(self)
  File "/home/awight/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/home/awight/venv/lib/python3.5/site-packages/celery/worker/consumer.py", line 884, in start
    c.loop(*c.loop_args())
  File "/home/awight/venv/lib/python3.5/site-packages/celery/worker/loops.py", line 76, in asynloop
    next(loop)
  File "/home/awight/venv/lib/python3.5/site-packages/kombu/async/hub.py", line 340, in create_loop
    cb(*cbargs)
  File "/home/awight/venv/lib/python3.5/site-packages/celery/concurrency/asynpool.py", line 420, in _event_process_exit
    self.maintain_pool()
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/pool.py", line 1269, in maintain_pool
    sys.exc_info()[2])
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/five.py", line 125, in reraise
    raise value.with_traceback(tb)
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/pool.py", line 1260, in maintain_pool
    self._maintain_pool()
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/pool.py", line 1252, in _maintain_pool
    self._repopulate_pool(joined)
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/pool.py", line 1237, in _repopulate_pool
    self._create_worker_process(self._avail_index())
  File "/home/awight/venv/lib/python3.5/site-packages/celery/concurrency/asynpool.py", line 415, in _create_worker_process
    return super(AsynPool, self)._create_worker_process(i)
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/pool.py", line 1068, in _create_worker_process
    w.start()
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/process.py", line 137, in start
    self._popen = Popen(self)
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/forking.py", line 99, in __init__
    self.pid = os.fork()
MemoryError: [Errno 12] Cannot allocate memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/awight/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 155, in send_all
    fun(parent, *args)
  File "/home/awight/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 378, in stop
    return self.obj.stop()
  File "/home/awight/venv/lib/python3.5/site-packages/celery/concurrency/base.py", line 123, in stop
    self.on_stop()
  File "/home/awight/venv/lib/python3.5/site-packages/celery/concurrency/prefork.py", line 147, in on_stop
    self._pool.join()
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/pool.py", line 1532, in join
    stop_if_not_current(self._result_handler)
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/pool.py", line 151, in stop_if_not_current
    thread.stop(timeout)
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/pool.py", line 500, in stop
    self.on_stop_not_started()
  File "/home/awight/venv/lib/python3.5/site-packages/celery/concurrency/asynpool.py", line 317, in on_stop_not_started
    join_exited_workers(shutdown=True)
  File "/home/awight/venv/lib/python3.5/site-packages/billiard/pool.py", line 1119, in _join_exited_workers
    del self._poolctrl[worker.pid]
KeyError: None
[2017-09-13 15:44:38,045: WARNING/MainProcess] Restoring 96 unacknowledged message(s).
Ladsgroup triaged this task as Medium priority.Nov 26 2018, 4:57 PM
Ladsgroup added a project: TestMe.
Ladsgroup subscribed.

This should be rechecked because of upgrading to celery4, it has more robust way of handling tasks.