After a deployment today, we're getting poor performance in CODFW, but fine performance in EQIAD. I'll post notes in the discussion blow.
Description
Related Objects
Event Timeline
Here's some scorings directly to EQIAD (Note the 1.3-1.5s speed)
halfak@scb1002:~$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242326 { "scores": { "fiwiki": { "goodfaith": { "scores": { "3242326": { "prediction": true, "probability": { "false": 0.04025263981650706, "true": 0.9597473601834929 } } }, "version": "0.3.0" } } } } real 0m1.383s user 0m0.004s sys 0m0.004s halfak@scb1002:~$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242327 { "scores": { "fiwiki": { "goodfaith": { "scores": { "3242327": { "prediction": true, "probability": { "false": 0.33890319809272784, "true": 0.6610968019072722 } } }, "version": "0.3.0" } } } } real 0m1.367s user 0m0.008s sys 0m0.000s halfak@scb1002:~$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242328 { "scores": { "fiwiki": { "goodfaith": { "scores": { "3242328": { "prediction": true, "probability": { "false": 0.03130334081841879, "true": 0.9686966591815812 } } }, "version": "0.3.0" } } } } real 0m1.445s user 0m0.004s sys 0m0.004s
Same scorings to a server in CODFW (2 timeout and 1 runs in 12s)
halfak@scb2002:~$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242326 { "scores": { "fiwiki": { "goodfaith": { "scores": { "3242326": { "error": { "message": "Timed out after 15 seconds.", "type": "TimeoutError" } } }, "version": "0.3.0" } } } } real 0m15.296s user 0m0.012s sys 0m0.000s halfak@scb2002:~$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242327 { "scores": { "fiwiki": { "goodfaith": { "scores": { "3242327": { "error": { "message": "Timed out after 15 seconds.", "type": "TimeoutError" } } }, "version": "0.3.0" } } } } real 0m18.208s user 0m0.004s sys 0m0.004s halfak@scb2002:~$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242328 { "scores": { "fiwiki": { "goodfaith": { "scores": { "3242328": { "prediction": true, "probability": { "false": 0.03130334081841879, "true": 0.9686966591815812 } } }, "version": "0.3.0" } } } } real 0m12.460s user 0m0.012s sys 0m0.004s
I get this in the applog on scb2002:
Traceback (most recent call last): File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/app/trace.py", line 240, in trace_task R = retval = fun(*args, **kwargs) File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/app/trace.py", line 438, in __protected_call__ return self.run(*args, **kwargs) File "/srv/deployment/ores/deploy-cache/revs/cc12103e95b91e1677c5e1dd40eb93764233cfec/ores/scoring_systems/celery_queue.py", line 82, in _lookup_score_in_map score_map = score_map_result.get(timeout=self.timeout) File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/result.py", line 175, in get raise meta['result'] TypeError: _process_score_map() got an unexpected keyword argument 'include_features'
That didn't help anything. Still seeing this error in the logs. I've confirmed that new code was in fact deployed to scb2002 and for some reason uwsgi is just not using it.
Here's some IRC:
[16:43:36] <mutante> halfak: done.. how is it now? [16:43:50] * halfak watches the errors suddenly stop on codfw [16:43:54] <halfak> checking on grafana [16:45:31] <halfak> mutante, it's looking good. [16:45:34] <mutante> yay! [16:45:39] <halfak> Thanks so much :))))) [16:45:48] * halfak continues to monitor [16:45:55] <mutante> you're welcome ! i'll hop on a bus now but i'll have wifi, heh [16:46:27] <halfak> I'm gonna buy you the biggest beer (or equivalent) next time I see you :) [16:47:20] <mutante> i'll be back in a little [16:47:29] <mutante> halfak: ok :) [16:47:34] <mutante> alright [16:47:46] <halfak> mutante, confirmed no errors on eqiad. [16:47:50] <halfak> Declaring victory [16:48:46] <mutante> sweet!
So switching back to eqiad saved us for now.
Have we figured this out yet ?
FWIW, I think one important step was missed in the above sequence of actions, which is reverting the deploy. Whatever was deployed should have been un-deployed ASAP and then, assuming that did not work, the traffic rerouted back to eqiad.
Hey @akosiaris. Thanks for your thoughts on this. From what I could tell at the time, the deploy system (scap) had failed -- not the code that was deployed (new code was running in some but not all places -- only affected one datacenter), so I did not have faith that a rollback deploy would have addressed the issue as opposed to making it worse. It was clear that a switch to EQIAD would guarantee recovery. In hindsight, I don't think it was scap, but rather some configuration difference between CODFW and EQIAD.
Oh and to answer your question, the issue is resolved with redirecting traffic to EQIAD, but we still don't know what caused the problem. See T163950: Investigate failed deploy to CODFW
Yes, I get why a rollback did not feel like a solution at that point in time. That being said, experience has shown more than once that a rollback that seems like it would not have fixed the issue, actually did. Which would have happened in this case as well from what I gather from T163950. Anyway, I am happy we figured out the cause of the issue (and have a plan to finally fix that cause).