Investigate failed ORES deployment
Closed, Resolved (Public)

Assigned To

Authored By

	Halfak
	Feb 8 2017, 9:43 PM

Description

I tried to deploy ores/deploy@7c80636 today and the process failed at the canary node with "Internal Server Error". I found the traceback below in the logs. I think this error happened because revscoring==1.3.6 wasn't successfully installed from its wheel: the pip freeze output after the traceback shows that revscoring==1.2.8 was the installed version on the node.

Traceback (most recent call last):
  File "/srv/deployment/ores/deploy/ores_wsgi.py", line 6, in <module>
    application = wsgi.build()
  File "./ores/applications/wsgi.py", line 71, in build
    return server.configure(config)
  File "./ores/wsgi/server.py", line 28, in configure
    scoring_system = ScoringSystem.from_config(config, ss_name)
  File "./ores/scoring_systems/scoring_system.py", line 329, in from_config
    return Class.from_config(config, name)
  File "./ores/scoring_systems/celery_queue.py", line 242, in from_config
    config, name, section_key=section_key)
  File "./ores/scoring_systems/scoring_system.py", line 298, in _kwargs_from_config
    config, name, section_key=section_key)
  File "./ores/scoring_systems/celery_queue.py", line 234, in _build_context_map
    for name in section['scoring_contexts']}
  File "./ores/scoring_systems/celery_queue.py", line 234, in <dictcomp>
    for name in section['scoring_contexts']}
  File "./ores/scoring_context.py", line 222, in from_config
    scorer_model = ScorerModel.from_config(config, key)
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/revscoring/scorer_models/scorer_model.py", line 96, in from_config
    return Class.from_config(config, name, section_key=section_key)
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/revscoring/scorer_models/scorer_model.py", line 160, in from_config
    return cls.load(open(section['model_file'], 'rb'))
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/revscoring/scorer_models/scorer_model.py", line 73, in load
    return pickle.load(f)
ImportError: No module named 'revscoring.scorer_models.test_statistics.recall_at_precision'
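
The ImportError above follows from how pickle stores objects: it records the defining module's dotted path and the class name, not the class itself, so loading a model file only works if that exact module can be imported. A minimal sketch of the failure mode, using a made-up module name (fake_revscoring_stats) and a placeholder class rather than the real revscoring code:

```python
import pickle
import sys
import types

# Hypothetical stand-in for a module that exists only in the newer release.
MODULE_NAME = "fake_revscoring_stats"
module = types.ModuleType(MODULE_NAME)

# Placeholder class that claims to live in that module (not real revscoring code).
RecallAtPrecision = type("RecallAtPrecision", (), {"__module__": MODULE_NAME})
module.RecallAtPrecision = RecallAtPrecision
sys.modules[MODULE_NAME] = module

# The pickle records only the module path and class name, not the class body.
payload = pickle.dumps(RecallAtPrecision())

# Simulate the older install, where the module does not exist.
del sys.modules[MODULE_NAME]

try:
    pickle.loads(payload)
    load_error = None
except ImportError as exc:
    load_error = exc

print(type(load_error).__name__, load_error)
```

A model file pickled against a newer library version fails in the same way when an older version is the one actually installed.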

(venv)halfak@scb1002:/srv/deployment/ores/venv$ pip freeze | grep revscoring
revscoring==1.2.8

Related Objects

Mentioned Here: T157694: scap should handle changes to .gitmodules

Event Timeline

Halfak created this task. Feb 8 2017, 9:43 PM

Restricted Application added a subscriber: Aklapper. Feb 8 2017, 9:43 PM

Halfak triaged this task as High priority. Feb 8 2017, 9:43 PM

Here's the check that should have been run:

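venv="/srv/deployment/ores/venv"
deploy_dir="/srv/deployment/ores/deploy"
# Create the virtualenv (with access to system site-packages) if needed
mkdir -p "$venv"
virtualenv --python python3 --system-site-packages "$venv"
# Remove everything currently installed in the venv
"$venv"/bin/pip freeze | xargs "$venv"/bin/pip uninstall -y
# Reinstall only from the wheels shipped in the deploy repo's submodule
"$venv"/bin/pip install --use-wheel --no-deps "$deploy_dir"/submodules/wheels/*.whl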

Somehow that last command doesn't seem to have been run as expected.
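
One way to catch this kind of silent failure would be to compare the versions encoded in the wheel filenames against what pip freeze reports after the install step. The following is a rough sketch, not part of the ORES deploy tooling: the function names and the example wheel filename are assumptions for illustration, and it relies only on the standard wheel naming scheme (name-version-tags.whl).

```python
import re


def normalize(name):
    """Normalize a project name so wheel and pip freeze spellings compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def expected_from_wheels(wheel_filenames):
    """Map project name -> version, read from wheel filenames."""
    expected = {}
    for filename in wheel_filenames:
        if not filename.endswith(".whl"):
            continue
        parts = filename[:-len(".whl")].split("-")
        if len(parts) < 2:
            continue  # not a well-formed wheel filename
        expected[normalize(parts[0])] = parts[1]
    return expected


def installed_from_freeze(freeze_output):
    """Map project name -> version, read from `pip freeze` lines (name==version)."""
    installed = {}
    for line in freeze_output.splitlines():
        name, sep, version = line.strip().partition("==")
        if sep:
            installed[normalize(name)] = version
    return installed


def find_mismatches(wheel_filenames, freeze_output):
    """Return {name: (expected, installed)} for wheels not installed at their version."""
    installed = installed_from_freeze(freeze_output)
    return {
        name: (version, installed.get(name))
        for name, version in expected_from_wheels(wheel_filenames).items()
        if installed.get(name) != version
    }


# Hypothetical wheel filename; the freeze line is the one reported in this task.
mismatches = find_mismatches(
    ["revscoring-1.3.6-py2.py3-none-any.whl"],
    "revscoring==1.2.8\n",
)
print(mismatches)
```

A deploy check could fail the canary step whenever the returned mapping is non-empty, instead of waiting for the service to hit an ImportError at startup.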

Ladsgroup awarded a token. Feb 9 2017, 6:43 AM

Hypothesis: changing a submodule URL requires running git submodule sync before an existing checkout picks up the new URL. We saw this on the deploy node (tin.eqiad.wmnet). It could be that the worker nodes (scb100[1-4]) need a git submodule sync as well.

FWIW, git submodule sync is idempotent, so we can run it over and over again with no ill effect.
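
To make the hypothesis concrete: git submodule sync copies each submodule's URL from .gitmodules into the repository's local config (.git/config), so a checkout whose local config still holds the old URL keeps fetching from the old location. A minimal sketch that detects that drift from the text of the two files; the function names, the submodule name, and the example.org URLs are invented for illustration:

```python
import re

SECTION = re.compile(r'^\[submodule "(?P<name>[^"]+)"\]$')
URL = re.compile(r"^url\s*=\s*(?P<url>\S+)$")


def submodule_urls(config_text):
    """Return {submodule name: url} from git-config style text."""
    urls = {}
    current = None
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        section = SECTION.match(line)
        if section:
            current = section.group("name")
            continue
        if line.startswith("["):
            current = None  # some other section, e.g. [core]
            continue
        url = URL.match(line)
        if url and current is not None:
            urls[current] = url.group("url")
    return urls


def needs_sync(gitmodules_text, git_config_text):
    """List submodules whose URL in .git/config differs from .gitmodules."""
    wanted = submodule_urls(gitmodules_text)
    local = submodule_urls(git_config_text)
    return sorted(
        name for name, url in wanted.items()
        if name in local and local[name] != url
    )


# Placeholder contents: .gitmodules points at a new URL, .git/config at the old one.
gitmodules = (
    '[submodule "submodules/wheels"]\n'
    "\tpath = submodules/wheels\n"
    "\turl = https://example.org/new/wheels.git\n"
)
git_config = (
    "[core]\n"
    "\tbare = false\n"
    '[submodule "submodules/wheels"]\n'
    "\turl = https://example.org/old/wheels.git\n"
)
stale = needs_sync(gitmodules, git_config)
print(stale)
```

Any name in the returned list is a submodule that would keep updating from its old URL until git submodule sync is run in that checkout.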

T157694: scap should handle changes to .gitmodules

In the meantime, let's add "submodule sync" to https://phabricator.wikimedia.org/diffusion/1880/browse/master/scap/cmd_worker.sh

Halfak moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board. Feb 9 2017, 7:56 PM

Halfak claimed this task. Feb 10 2017, 10:36 PM

Halfak closed this task as Resolved. Mar 16 2017, 9:21 PM
