
Investigate failed ORES deployment
Closed, Resolved, Public


I tried to deploy ores/deploy@7c80636 today, and the process failed at the canary node with "Internal Server Error". I found the following in the logs. I think this error happened because revscoring==1.3.6 wasn't successfully installed from its wheel. The paste below confirms that revscoring==1.2.8 was installed during the deployment instead.

Traceback (most recent call last):
  File "/srv/deployment/ores/deploy/", line 6, in <module>
    application =
  File "./ores/applications/", line 71, in build
    return server.configure(config)
  File "./ores/wsgi/", line 28, in configure
    scoring_system = ScoringSystem.from_config(config, ss_name)
  File "./ores/scoring_systems/", line 329, in from_config
    return Class.from_config(config, name)
  File "./ores/scoring_systems/", line 242, in from_config
    config, name, section_key=section_key)
  File "./ores/scoring_systems/", line 298, in _kwargs_from_config
    config, name, section_key=section_key)
  File "./ores/scoring_systems/", line 234, in _build_context_map
    for name in section['scoring_contexts']}
  File "./ores/scoring_systems/", line 234, in <dictcomp>
    for name in section['scoring_contexts']}
  File "./ores/", line 222, in from_config
    scorer_model = ScorerModel.from_config(config, key)
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/revscoring/scorer_models/", line 96, in from_config
    return Class.from_config(config, name, section_key=section_key)
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/revscoring/scorer_models/", line 160, in from_config
    return cls.load(open(section['model_file'], 'rb'))
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/revscoring/scorer_models/", line 73, in load
    return pickle.load(f)
ImportError: No module named 'revscoring.scorer_models.test_statistics.recall_at_precision'
(venv)halfak@scb1002:/srv/deployment/ores/venv$ pip freeze | grep revscoring
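The ImportError at the bottom of the traceback is raised inside pickle.load: a pickled model records the dotted module path of every class it references, and unpickling re-imports those paths. A model serialized with a revscoring release containing a module that the installed release lacks will therefore fail to load in exactly this way. The self-contained sketch below reproduces that failure mode with a throwaway in-memory module; the module and class names are hypothetical stand-ins and no revscoring code is involved.

```python
import pickle
import sys
import types

# Hypothetical stand-in for a module that only exists in the newer library release
# (in this task: revscoring.scorer_models.test_statistics.recall_at_precision).
MODULE_NAME = "hypothetical_newer_stats_module"


def make_module() -> types.ModuleType:
    """Build an in-memory module holding one class, as a newer release might."""
    module = types.ModuleType(MODULE_NAME)

    class RecallAtPrecision:
        def __init__(self, value: float) -> None:
            self.value = value

    # Make pickle record the class as MODULE_NAME.RecallAtPrecision.
    RecallAtPrecision.__module__ = MODULE_NAME
    RecallAtPrecision.__qualname__ = "RecallAtPrecision"
    module.RecallAtPrecision = RecallAtPrecision
    return module


def load_with_module_missing() -> ImportError:
    """Pickle an instance while its module is importable, then unpickle without it."""
    module = make_module()
    sys.modules[MODULE_NAME] = module  # simulates "newer release installed"
    try:
        payload = pickle.dumps(module.RecallAtPrecision(0.9))
    finally:
        del sys.modules[MODULE_NAME]  # simulates "older release installed"
    try:
        pickle.loads(payload)
    except ImportError as err:  # ModuleNotFoundError subclasses ImportError (3.6+)
        return err
    raise AssertionError("unpickling unexpectedly succeeded")


if __name__ == "__main__":
    print(load_with_module_missing())
```

Running it prints an ImportError message naming the missing module, which mirrors the last line of the traceback above.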

Event Timeline

Halfak triaged this task as High priority. Feb 8 2017, 9:43 PM

Here are the commands that should have been run:

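# Create the virtualenv (it can also see the system site-packages)
mkdir -p $venv
virtualenv --python python3 --system-site-packages $venv
# Uninstall everything pip freeze lists, so no stale versions linger in the venv
$venv/bin/pip freeze | xargs $venv/bin/pip uninstall -y
# Install exactly the pinned wheels from the wheels submodule, without resolving dependencies
$venv/bin/pip install --use-wheel --no-deps $deploy_dir/submodules/wheels/*.whl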

That last command does not seem to have run as expected.
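One way to catch this kind of failure before the canary serves traffic is to compare the versions in the wheels directory with what pip reports as installed in the venv. Below is a minimal sketch, assuming Python 3.7+ for the checker itself (it only shells out to the venv's own pip, so the venv can stay on Python 3.4) and the $venv and submodules/wheels layout from the commands above. The script and function names are hypothetical, not part of the ORES deploy tooling.

```python
import os
import re
import subprocess
import sys
from typing import Dict, Iterable, List, Optional, Tuple


def normalize(name: str) -> str:
    """PEP 503 name normalization: lowercase, runs of '-', '_', '.' become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def wheel_versions(filenames: Iterable[str]) -> Dict[str, str]:
    """Map project name -> version, parsed from PEP 427 wheel filenames."""
    versions = {}
    for filename in filenames:
        if not filename.endswith(".whl"):
            continue
        # {name}-{version}(-{build})?-{python tag}-{abi tag}-{platform tag}.whl
        name, version = filename[: -len(".whl")].split("-")[:2]
        versions[normalize(name)] = version
    return versions


def freeze_versions(freeze_output: str) -> Dict[str, str]:
    """Map project name -> version from `pip freeze` lines of the form name==version."""
    versions = {}
    for line in freeze_output.splitlines():
        line = line.strip()
        if "==" not in line:
            continue  # skip editable installs, VCS URLs, and blank lines
        name, version = line.split("==", 1)
        versions[normalize(name)] = version
    return versions


def mismatches(
    wheels: Dict[str, str], installed: Dict[str, str]
) -> List[Tuple[str, str, Optional[str]]]:
    """(name, wheel version, installed version or None) for each wheel not installed as-is."""
    return sorted(
        (name, version, installed.get(name))
        for name, version in wheels.items()
        if installed.get(name) != version
    )


def check(venv: str, wheel_dir: str) -> List[Tuple[str, str, Optional[str]]]:
    """Compare the wheels directory against what the venv's own pip reports."""
    freeze = subprocess.run(
        [os.path.join(venv, "bin", "pip"), "freeze"],
        capture_output=True, text=True, check=True,
    ).stdout
    return mismatches(wheel_versions(os.listdir(wheel_dir)), freeze_versions(freeze))


if __name__ == "__main__":
    # Hypothetical usage: python check_wheels.py $venv $deploy_dir/submodules/wheels
    problems = check(sys.argv[1], sys.argv[2])
    for name, wanted, found in problems:
        print(f"{name}: wheel {wanted}, installed {found}")
    sys.exit(1 if problems else 0)
```

A non-zero exit code makes it easy to wire this into a deploy step as a gate; in this incident it would have flagged revscoring (wheel 1.3.6, installed 1.2.8) before any model was loaded.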

Hypothesis: the problem is that after a submodule URL changes, git submodule sync must be run before the new URL takes effect. We saw this on the deploy node (tin.eqiad.wmnet). The worker nodes (scb100[1-4]) may need a git submodule sync as well.

FWIW, git submodule sync is idempotent, so we can run it over and over again with no ill effect.
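To test the hypothesis on a worker node, one can compare the submodule URLs declared in .gitmodules with the ones recorded in .git/config; git submodule sync copies the former into the latter (and into each submodule's remote). The sketch below assumes Python 3.7+ and git on the PATH; the script and its output are hypothetical, not part of the ORES deploy tooling.

```python
import subprocess
import sys
from typing import Dict, List

URL_PATTERN = r"^submodule\..*\.url$"


def parse_url_config(output: str) -> Dict[str, str]:
    """Parse `git config --get-regexp` lines like `submodule.<name>.url <url>`."""
    urls = {}
    for line in output.splitlines():
        key, _, url = line.partition(" ")
        if key.startswith("submodule.") and key.endswith(".url"):
            urls[key[len("submodule."):-len(".url")]] = url.strip()
    return urls


def stale_submodules(declared: Dict[str, str], configured: Dict[str, str]) -> List[str]:
    """Initialized submodules whose URL in .git/config differs from .gitmodules."""
    # Submodules absent from .git/config are uninitialized; `git submodule sync`
    # leaves those alone, so they are not reported here.
    return sorted(
        name for name, url in declared.items()
        if name in configured and configured[name] != url
    )


def submodule_urls(repo: str, *config_args: str) -> Dict[str, str]:
    """Read submodule URLs via `git config`; pass ("--file", ".gitmodules") for declared ones."""
    proc = subprocess.run(
        ["git", "-C", repo, "config", *config_args, "--get-regexp", URL_PATTERN],
        capture_output=True, text=True,
    )
    # `git config --get-regexp` exits with status 1 when nothing matches.
    if proc.returncode not in (0, 1):
        raise RuntimeError(proc.stderr.strip())
    return parse_url_config(proc.stdout)


if __name__ == "__main__":
    repo = sys.argv[1] if len(sys.argv) > 1 else "."
    stale = stale_submodules(
        submodule_urls(repo, "--file", ".gitmodules"), submodule_urls(repo)
    )
    for name in stale:
        print(f"needs git submodule sync: {name}")
    sys.exit(1 if stale else 0)
```

Because sync is idempotent, running it unconditionally on every node is also safe; a check like this only confirms whether the URL mismatch is actually present before and after.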