
Add slow-logs for ML isvcs
Open, Needs Triage, Public

Description

In T362503 we found that some requests caused high processing time during preprocess(), ending in a total failure of the isvc (clients hung for several seconds before failing). This is likely due to a bug in revscoring, and since we don't log the JSON payload of every request landing on our isvcs, we have no good way to find cases that reproduce the issue.

We should think about adding one of the following (or both?):

  • Log the JSON payload of every request landing on our isvcs. This seems to be the easiest option, but it would produce a lot of logs and add some noise when reviewing our access logs (more data is not always better). Also, if a JSON payload contains very large strings (say one that is 100 lines long), we may not want it printed every time. We shouldn't have such use cases yet, but it is worth mentioning.
  • Log the request's JSON payload verbosely only if preprocess() or process() fails or takes too long to complete (say more than X seconds).

If we had had the second option before the ruwiki outage, we'd have seen slow logs in the isvc's access logs (or in Logstash) with a clear way to reproduce the problem. A minimal sketch of what such a wrapper could look like is below.
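
The sketch below illustrates the second option only, assuming a KServe-style model class whose preprocess()/predict() receive the request payload as a dict. The decorator name, logger name, and 5-second threshold are illustrative assumptions, not the actual implementation in the inference-services repo.

```python
# Sketch: log the request payload only when the wrapped call fails or is slow.
# All names and the threshold are hypothetical; adjust to the isvc's logging setup.
import functools
import json
import logging
import time

logger = logging.getLogger("slow_request_logger")

SLOW_THRESHOLD_SECONDS = 5.0  # hypothetical cutoff for "too slow"


def log_slow_or_failed(func):
    """Log the JSON payload only if the wrapped call raises or exceeds the threshold."""

    @functools.wraps(func)
    def wrapper(self, payload, *args, **kwargs):
        start = time.monotonic()
        try:
            return func(self, payload, *args, **kwargs)
        except Exception:
            # Failure case: dump the payload so the error can be reproduced.
            logger.exception(
                "%s failed, payload: %s", func.__name__, json.dumps(payload, default=str)
            )
            raise
        finally:
            elapsed = time.monotonic() - start
            if elapsed > SLOW_THRESHOLD_SECONDS:
                # Slow case: dump the payload alongside the elapsed time.
                logger.warning(
                    "%s took %.2fs (threshold %.1fs), payload: %s",
                    func.__name__,
                    elapsed,
                    SLOW_THRESHOLD_SECONDS,
                    json.dumps(payload, default=str),
                )

    return wrapper
```

Usage would then be a matter of decorating preprocess() (and possibly predict()) in the model class, so the access logs stay quiet for healthy requests and only problematic payloads are recorded.
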

Event Timeline

Change #1021923 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: add request payload logging to all revscoring isvcs

https://gerrit.wikimedia.org/r/1021923

Change #1021923 merged by Elukey:

[operations/deployment-charts@master] ml-services: add request payload logging to all revscoring isvcs

https://gerrit.wikimedia.org/r/1021923

Change #1023877 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] revscoring_model: log request_id alongside with inputs

https://gerrit.wikimedia.org/r/1023877

Change #1023877 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revscoring_model: log request_id alongside with inputs

https://gerrit.wikimedia.org/r/1023877

Change #1023880 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update Docker image for revscoring-editquality-damaging

https://gerrit.wikimedia.org/r/1023880

Change #1023880 merged by Elukey:

[operations/deployment-charts@master] ml-services: update Docker image for revscoring-editquality-damaging

https://gerrit.wikimedia.org/r/1023880

Change #1024425 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] utils: slow function execution wrapper

https://gerrit.wikimedia.org/r/1024425