
Timeouts on CODFW
Closed, ResolvedPublic

Description

After a deployment today, we're getting poor performance in CODFW, but fine performance in EQIAD. I'll post notes in the discussion below.

Event Timeline

Halfak triaged this task as Unbreak Now! priority.Apr 26 2017, 9:01 PM

Here are some scoring requests made directly against EQIAD (note the 1.3-1.5s response times):

halfak@scb1002:~$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242326
{
  "scores": {
    "fiwiki": {
      "goodfaith": {
        "scores": {
          "3242326": {
            "prediction": true,
            "probability": {
              "false": 0.04025263981650706,
              "true": 0.9597473601834929
            }
          }
        },
        "version": "0.3.0"
      }
    }
  }
}
real	0m1.383s
user	0m0.004s
sys	0m0.004s
halfak@scb1002:~$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242327
{
  "scores": {
    "fiwiki": {
      "goodfaith": {
        "scores": {
          "3242327": {
            "prediction": true,
            "probability": {
              "false": 0.33890319809272784,
              "true": 0.6610968019072722
            }
          }
        },
        "version": "0.3.0"
      }
    }
  }
}
real	0m1.367s
user	0m0.008s
sys	0m0.000s
halfak@scb1002:~$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242328
{
  "scores": {
    "fiwiki": {
      "goodfaith": {
        "scores": {
          "3242328": {
            "prediction": true,
            "probability": {
              "false": 0.03130334081841879,
              "true": 0.9686966591815812
            }
          }
        },
        "version": "0.3.0"
      }
    }
  }
}
real	0m1.445s
user	0m0.004s
sys	0m0.004s

The same scoring requests against a server in CODFW (two time out and one takes 12s):

halfak@scb2002:~$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242326
{
  "scores": {
    "fiwiki": {
      "goodfaith": {
        "scores": {
          "3242326": {
            "error": {
              "message": "Timed out after 15 seconds.",
              "type": "TimeoutError"
            }
          }
        },
        "version": "0.3.0"
      }
    }
  }
}
real	0m15.296s
user	0m0.012s
sys	0m0.000s
halfak@scb2002:~$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242327
{
  "scores": {
    "fiwiki": {
      "goodfaith": {
        "scores": {
          "3242327": {
            "error": {
              "message": "Timed out after 15 seconds.",
              "type": "TimeoutError"
            }
          }
        },
        "version": "0.3.0"
      }
    }
  }
}
real	0m18.208s
user	0m0.004s
sys	0m0.004s
halfak@scb2002:~$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242328
{
  "scores": {
    "fiwiki": {
      "goodfaith": {
        "scores": {
          "3242328": {
            "prediction": true,
            "probability": {
              "false": 0.03130334081841879,
              "true": 0.9686966591815812
            }
          }
        },
        "version": "0.3.0"
      }
    }
  }
}
real	0m12.460s
user	0m0.012s
sys	0m0.004s
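The manual `time curl` spot checks above can be scripted for repeated comparisons between hosts. A hedged convenience sketch using only the stdlib; the host/port and revision IDs come from the transcript, and the 20-second client timeout is an assumption chosen to outlast the service's 15-second timeout:

```python
# Time one goodfaith scoring request, mirroring the manual checks above.
# Run on the scb host itself, since the service listens on 0.0.0.0:8081.
import time
import urllib.request

def time_score(rev, base="http://0.0.0.0:8081"):
    """Return (elapsed_seconds, response_body) for one scoring request."""
    url = f"{base}/v2/scores/fiwiki/goodfaith/{rev}"
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=20) as resp:
        body = resp.read()
    return time.monotonic() - start, body

# Usage on an scb host:
#   for rev in (3242326, 3242327, 3242328):
#       print(rev, time_score(rev)[0])
```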

I get this in the applog on scb2002:

Traceback (most recent call last):
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/srv/deployment/ores/deploy-cache/revs/cc12103e95b91e1677c5e1dd40eb93764233cfec/ores/scoring_systems/celery_queue.py", line 82, in _lookup_score_in_map
    score_map = score_map_result.get(timeout=self.timeout)
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/result.py", line 175, in get
    raise meta['result']
TypeError: _process_score_map() got an unexpected keyword argument 'include_features'
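The TypeError above is the classic symptom of version skew between the task producer and a worker still running old code: the caller passes a keyword argument that the old function signature does not accept. A minimal sketch of that failure mode, with simplified stand-in names (not the actual ORES code):

```python
# A worker still running the old code has a function without the new
# keyword argument.
def process_score_map_old(score_map):
    # Old worker signature: no include_features parameter.
    return {"processed": score_map}

# The newly deployed producer code always passes the new keyword argument.
def submit_task(worker_fn, score_map):
    return worker_fn(score_map, include_features=False)

# When a new-code producer's task lands on an old-code worker, Python
# raises the same TypeError seen in the applog.
try:
    submit_task(process_score_map_old, {"rev": 3242326})
except TypeError as exc:
    print(exc)  # ... got an unexpected keyword argument 'include_features'
```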

I just ran `scap deploy --service-restart`

That didn't help. We're still seeing this error in the logs. I've confirmed that the new code was in fact deployed to scb2002; for some reason uwsgi is just not using it.
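One way to check which code a running process actually has loaded, rather than trusting what is on disk, is to inspect its file-backed memory mappings under `/proc`. A hedged sketch, assuming a Linux host and permission to read the target process's maps file:

```python
# List file-backed mappings of a process; on scb2002 you could grep the
# output of each uwsgi PID for "deploy-cache" to see which revision
# directory is actually mapped into the running workers.
def mapped_code_paths(pid):
    """Return the sorted set of file paths mapped into process `pid`."""
    paths = set()
    with open(f"/proc/{pid}/maps") as f:
        for line in f:
            # maps format: address perms offset dev inode [pathname]
            parts = line.split(None, 5)
            if len(parts) == 6 and parts[5].startswith("/"):
                paths.add(parts[5].strip())
    return sorted(paths)
```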

Here's some IRC:

[16:43:36] <mutante> halfak: done.. how is it now?
[16:43:50] * halfak watches the errors suddenly stop on codfw
[16:43:54] <halfak> checking on grafana
[16:45:31] <halfak> mutante, it's looking good. 
[16:45:34] <mutante> yay!
[16:45:39] <halfak> Thanks so much :))))) 
[16:45:48] * halfak continues to monitor
[16:45:55] <mutante> you're welcome ! i'll hop on a bus now but i'll have wifi, heh
[16:46:27] <halfak> I'm gonna buy you the biggest beer (or equivalent) next time I see you :) 
[16:47:20] <mutante> i'll be back in a little 
[16:47:29] <mutante> halfak: ok :)
[16:47:34] <mutante> alright
[16:47:46] <halfak> mutante, confirmed no errors on eqiad. 
[16:47:50] <halfak> Declaring victory
[16:48:46] <mutante> sweet!

So switching back to eqiad saved us for now.

Have we figured this out yet?

FWIW, I think one important step was missed in the sequence of actions above: reverting the deploy. Whatever was deployed should have been un-deployed ASAP, and then, assuming that did not work, the traffic rerouted back to eqiad.

Hey @akosiaris. Thanks for your thoughts on this. From what I could tell at the time, the deploy system (scap) had failed -- not the code that was deployed: the new code was running in some places but not others, and only one datacenter was affected. So I did not have faith that a rollback deploy would have addressed the issue rather than making it worse, whereas a switch to EQIAD was guaranteed to restore service. In hindsight, I don't think scap was at fault, but rather some configuration difference between CODFW and EQIAD.

Oh, and to answer your question: the issue is resolved by redirecting traffic to EQIAD, but we still don't know what caused the problem. See T163950: Investigate failed deploy to CODFW

Yes, I get why a rollback did not feel like a solution at that point in time. That said, experience has shown more than once that a rollback which seemed like it would not fix the issue actually did, and from what I gather from T163950, that would have happened in this case as well. Anyway, I am happy we figured out the cause of the issue (and have a plan to finally fix it).