Upgrade celery to 4.1.0 for ORES
Closed, ResolvedPublic

Description

FY2018-19 Q2 goal

What's new in celery 4: http://docs.celeryproject.org/en/v4.1.0/whatsnew-4.0.html

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2017-10-26T16:33:38Z] <awight@tin> Finished deploy [ores/deploy@971be22]: Push ORES w/ Celery 4 support to new cluster (take 3), T178441 (duration: 15m 33s)

Mentioned in SAL (#wikimedia-operations) [2017-10-26T16:43:06Z] <awight@tin> Started deploy [ores/deploy@971be22]: Push ORES w/ Celery 4 support to new cluster (try to rebuild venv), T178441

Mentioned in SAL (#wikimedia-operations) [2017-10-26T16:44:07Z] <awight@tin> Finished deploy [ores/deploy@971be22]: Push ORES w/ Celery 4 support to new cluster (try to rebuild venv), T178441 (duration: 01m 02s)

Mentioned in SAL (#wikimedia-operations) [2017-10-26T17:11:37Z] <awight@tin> Started deploy [ores/deploy@971be22]: ORES w/ revscoring 2 and Celery 4, T175180 T178441

Mentioned in SAL (#wikimedia-operations) [2017-10-26T17:14:57Z] <awight@tin> Finished deploy [ores/deploy@971be22]: ORES w/ revscoring 2 and Celery 4, T175180 T178441 (duration: 03m 20s)

Mentioned in SAL (#wikimedia-operations) [2017-10-26T17:16:49Z] <awight@tin> Started deploy [ores/deploy@971be22]: Rolling back scb1002, T175180 T178441

Mentioned in SAL (#wikimedia-operations) [2017-10-26T17:17:22Z] <awight@tin> Finished deploy [ores/deploy@971be22]: Rolling back scb1002, T175180 T178441 (duration: 00m 32s)

I'm seeing a new error,

http://0.0.0.0:8081/v3/scores/enwiki/123456/damaging

Traceback (most recent call last):
  File "./ores/wsgi/routes/v3/scores.py", line 28, in process_score_request
    score_response = scoring_system.score(score_request)
  File "./ores/scoring_systems/scoring_system.py", line 42, in score
    response = self._score(request)
  File "./ores/scoring_systems/celery_queue.py", line 194, in _score
    return super()._score(*args, **kwargs)
  File "./ores/scoring_systems/scoring_system.py", line 98, in _score
    inprogress_results=inprogress_results)
  File "./ores/scoring_systems/celery_queue.py", line 117, in _process_missing_scores
    request, missing_models, rev_id, root_cache)
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/app/task.py", line 453, in delay
    return self.apply_async(args, kwargs)
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/app/task.py", line 565, in apply_async
    **dict(self._get_exec_options(), **options)
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/app/base.py", line 354, in send_task
    reply_to=reply_to or self.oid, **options
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/app/amqp.py", line 297, in publish_task
    retry_policy=retry_policy,
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/utils/dispatch/signal.py", line 166, in send
    response = receiver(signal=self, sender=sender, **named)
  File "./ores/scoring_systems/celery_queue.py", line 31, in update_sent_state
    task_id = properties['correlation_id']
KeyError: 'correlation_id'

Point of information: I botched the deployment, and the above error occurred with Celery 3, so it isn't related to this task.
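
For reference, a minimal sketch of a publish-signal handler that does not depend on properties['correlation_id'] being present. It assumes the documented Celery message layouts: under task protocol 2 (the Celery 4 default) the task id lives in the message headers, while under protocol 1 (Celery 3) it lives in the message body. extract_task_id is a hypothetical helper, not the ORES code, and the Celery wiring is left as a comment so the snippet runs without Celery installed.

```python
def extract_task_id(headers=None, body=None, properties=None):
    """Return the task id from a before_task_publish payload, or None.

    Hypothetical helper: tries each place the id may live instead of
    assuming one Celery version's message layout.
    """
    headers = headers or {}
    properties = properties or {}

    # Task protocol 2 (Celery 4 default): task metadata is in the headers.
    if "id" in headers:
        return headers["id"]

    # Task protocol 1 (Celery 3): the body is a dict carrying the id.
    # Under protocol 2 the body is a tuple, so guard the type.
    if isinstance(body, dict) and "id" in body:
        return body["id"]

    # Last resort: the AMQP correlation id, when the caller provides it.
    return properties.get("correlation_id")


def update_sent_state(sender=None, headers=None, body=None,
                      properties=None, **kwargs):
    """Signal receiver sketch: returns the id it would record, or None."""
    return extract_task_id(headers, body, properties)


# Assumed wiring (requires Celery, so left commented out):
# from celery.signals import before_task_publish
# before_task_publish.connect(update_sent_state)
```

With this shape, a payload that lacks correlation_id yields None (or the id found elsewhere) rather than raising KeyError, so the same handler can tolerate either Celery version during a rollback.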

@Halfak I'm having fun investigating T179524, but my instinct is currently to unblock Revscoring 2 by reverting this change, and trying to upgrade to Celery 4 separately. What do you think?

Change 388065 had a related patch set uploaded (by Awight; owner: Awight):
[research/ores/wheels@master] Rebuild wheels, downgrade to Celery 3

https://gerrit.wikimedia.org/r/388065

Change 388066 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Downgrade to Celery 3, bump other requirements

https://gerrit.wikimedia.org/r/388066

Change 388065 merged by Ladsgroup:
[research/ores/wheels@master] Rebuild wheels, downgrade to Celery 3

https://gerrit.wikimedia.org/r/388065

Change 388070 had a related patch set uploaded (by Awight; owner: Awight):
[research/ores/wheels@master] Remove wheels which are included as source

https://gerrit.wikimedia.org/r/388070

Change 388070 merged by Ladsgroup:
[research/ores/wheels@master] Remove wheels which are included as source

https://gerrit.wikimedia.org/r/388070

Change 388071 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Remove local source packages from wheels

https://gerrit.wikimedia.org/r/388071

Change 388066 merged by Awight:
[mediawiki/services/ores/deploy@master] Downgrade to Celery 3, bump other requirements

https://gerrit.wikimedia.org/r/388066

Change 388071 merged by Awight:
[mediawiki/services/ores/deploy@master] Remove local source packages from wheels

https://gerrit.wikimedia.org/r/388071

I've created CELERY_4 branches for ores, wheels, and ores-prod-deploy to continue the integration work.

Change 388160 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@CELERY_4] Use branched ores to continue Celery 4 work

https://gerrit.wikimedia.org/r/388160

Change 388160 merged by Awight:
[mediawiki/services/ores/deploy@CELERY_4] Use branched ores to continue Celery 4 work

https://gerrit.wikimedia.org/r/388160

A little something to keep track of: I test-deployed the Celery 4 code to beta, and most of the wheels installed, but some did not. Installing the *.whl files manually, I saw the following message listing the missed packages:

Installing collected packages: scipy, six, websocket-client, socketIO-client, sseclient, statistics, statsd, stopit, tabulate, textstat, tqdm, urllib3, vine, Werkzeug, yamlconf
  Found existing installation: scipy 0.14.0
    Not uninstalling scipy at /usr/lib/python3/dist-packages, outside environment /srv/deployment/ores/venv
Successfully installed Werkzeug-0.12.2 scipy-0.17.1 six-1.11.0 socketIO-client-0.5.6 sseclient-0.0.18 statistics-1.0.3.5 statsd-3.2.1 stopit-1.1.1 tabulate-0.7.7 textstat-0.3.1 tqdm-4.15.0 urllib3-1.22 vine-1.1.4 websocket-client-0.44.0 yamlconf-0.2.3
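
To make this kind of gap easier to spot next time, here is a rough sketch that compares wheel filenames against the versions actually installed in the running environment. It assumes standard wheel filenames (name-version-tags.whl); the function names and the example filenames below are hypothetical, not taken from the wheels repo.

```python
import re
from importlib import metadata  # Python 3.8+


def normalize(name):
    """Normalize a project name (PEP 503 style) for comparison."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_wheel_filename(filename):
    """Split 'name-version-...whl' into (normalized name, version)."""
    stem = filename[: -len(".whl")]
    parts = stem.split("-")
    return normalize(parts[0]), parts[1]


def installed_versions():
    """Map normalized project name -> version for the current environment."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            versions[normalize(name)] = dist.version
    return versions


def find_mismatches(wheel_filenames, installed):
    """Return (name, wheel_version, installed_version_or_None) for each
    wheel whose version is not the one installed."""
    problems = []
    for filename in wheel_filenames:
        name, version = parse_wheel_filename(filename)
        have = installed.get(name)
        if have != version:
            problems.append((name, version, have))
    return problems
```

For example, find_mismatches(os.listdir(wheel_dir), installed_versions()) run with the venv's interpreter would list wheels that were skipped, and would also flag a case like scipy above, where an older system-wide copy is being picked up instead of the wheel version.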

Update: beta is still on Celery 3. I'm now getting timeout errors when running Celery 4 code locally.

Change 391264 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@CELERY_4] Bump revscoring and ores

https://gerrit.wikimedia.org/r/391264

Dropping from the current workboard.

We decoupled this from the new cluster deployment, so I'll disconnect the parent task.

Change 391264 abandoned by Awight:
Bump revscoring and ores

https://gerrit.wikimedia.org/r/391264

Ladsgroup triaged this task as High priority.
Ladsgroup updated the task description.
Ladsgroup subscribed.

Change 470870 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[research/ores/wheels@master] Rebuild wheels for new flask and celery 4

https://gerrit.wikimedia.org/r/470870

Change 470870 merged by Ladsgroup:
[research/ores/wheels@master] Rebuild wheels for new flask and celery 4

https://gerrit.wikimedia.org/r/470870

Change 470878 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/services/ores/deploy@master] Bump ores and wheels to HEAD

https://gerrit.wikimedia.org/r/470878

Change 470878 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Bump ores and wheels to HEAD

https://gerrit.wikimedia.org/r/470878

Mentioned in SAL (#wikimedia-operations) [2018-10-31T20:27:56Z] <ladsgroup@deploy1001> Finished deploy [ores/deploy@70ba14b]: Upgrade to celery4 and flask 0.12.4, logstash fixes: T181546 T181630 T168921 T205256 T169586 T208258 T178441 (duration: 21m 29s)

With the upgrade to Celery 4 we gained another large reduction in response time, with the median dropping by about 60%:

image.png (888×1 px, 125 KB)

The most likely cause is -Ofair, which is the default behavior in Celery 4 (instead of -Ofast in Celery 3). It is very useful when the time a task takes to finish varies widely, as it does in our case.
More data:

  • 95%: 1.5s vs. 1.17s
  • 75%: 961ms vs. 500ms
  • median: 840ms vs. 330ms
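
To illustrate why the scheduling change could matter, here is a toy simulation, not a model of Celery's actual prefork pool. It assumes all tasks are queued at time zero, that the -Ofast-like strategy hands tasks to workers round-robin ahead of time, and that the -Ofair-like strategy gives each task to whichever worker frees up first. The function names and the task durations are made up for illustration.

```python
import heapq
from statistics import median


def simulate_preassigned(durations, n_workers):
    """Tasks are dealt out round-robin in advance; each task waits behind
    everything already queued on its worker. Returns completion times."""
    busy_until = [0.0] * n_workers
    completions = []
    for i, duration in enumerate(durations):
        worker = i % n_workers
        busy_until[worker] += duration
        completions.append(busy_until[worker])
    return completions


def simulate_fair(durations, n_workers):
    """Each task goes to the worker that becomes free first.
    Returns completion times."""
    free_at = [0.0] * n_workers  # min-heap of times workers become idle
    heapq.heapify(free_at)
    completions = []
    for duration in durations:
        start = heapq.heappop(free_at)
        done = start + duration
        heapq.heappush(free_at, done)
        completions.append(done)
    return completions


if __name__ == "__main__":
    # One slow task followed by several quick ones, two workers.
    durations = [10, 1, 1, 1, 1, 1]
    print("preassigned median:", median(simulate_preassigned(durations, 2)))
    print("fair median:", median(simulate_fair(durations, 2)))
```

In the preassigned case, quick tasks that land behind the slow one sit waiting while the other worker goes idle, which drags the median up; in the fair case only the slow task itself finishes late. That is the same shape as the numbers above, where the median improved much more than the 95th percentile.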