
Add slow-logs for ML isvcs
Open, Needs Triage, Public

Description

In T362503 we found that some requests caused high processing time during preprocess(), ending in a total failure of the isvc (clients hung for several seconds before failing). This is likely due to a bug in revscoring, and since we don't log the JSON payload of every request landing on our isvcs, we have no good way to find cases that reproduce the issue.

We should think about adding one of the following (or both?):

  • Log the JSON payload of every request landing on our isvcs. This seems to be the easiest option, but it would produce a lot of logs and add some noise when reviewing our access logs (more data is not always better). Also, if a JSON payload contains very large strings (say one that is 100 lines long), we may not want it printed every time. We shouldn't have such use cases yet, but it is worth mentioning.
  • Log the request's JSON payload verbosely only if preprocess() or process() fails or takes too long to complete (say more than X seconds).

If we had had the second option before the ruwiki outage, we'd have seen slow logs in the isvc's access logs (or in Logstash) with a clear way to reproduce the problem. A minimal sketch of what such a wrapper could look like is below.
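
The sketch below illustrates the second option only, assuming a KServe-style model class whose preprocess()/predict() receive the request payload as a dict. The decorator name, logger name, and 5-second threshold are illustrative assumptions, not the actual implementation in the inference-services repo.

```python
# Sketch: log the request payload only when the wrapped call fails or is slow.
# All names and the threshold are hypothetical; adjust to the isvc's logging setup.
import functools
import json
import logging
import time

logger = logging.getLogger("slow_request_logger")

SLOW_THRESHOLD_SECONDS = 5.0  # hypothetical cutoff for "too slow"


def log_slow_or_failed(func):
    """Log the JSON payload only if the wrapped call raises or exceeds the threshold."""

    @functools.wraps(func)
    def wrapper(self, payload, *args, **kwargs):
        start = time.monotonic()
        try:
            return func(self, payload, *args, **kwargs)
        except Exception:
            # Failure case: dump the payload so the error can be reproduced.
            logger.exception(
                "%s failed, payload: %s", func.__name__, json.dumps(payload, default=str)
            )
            raise
        finally:
            elapsed = time.monotonic() - start
            if elapsed > SLOW_THRESHOLD_SECONDS:
                # Slow case: dump the payload alongside the elapsed time.
                logger.warning(
                    "%s took %.2fs (threshold %.1fs), payload: %s",
                    func.__name__,
                    elapsed,
                    SLOW_THRESHOLD_SECONDS,
                    json.dumps(payload, default=str),
                )

    return wrapper
```

Usage would then be a matter of decorating preprocess() (and possibly predict()) in the model class, so the access logs stay quiet for healthy requests and only problematic payloads are recorded.
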

Event Timeline

Change #1021923 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: add request payload logging to all revscoring isvcs

https://gerrit.wikimedia.org/r/1021923

Change #1021923 merged by Elukey:

[operations/deployment-charts@master] ml-services: add request payload logging to all revscoring isvcs

https://gerrit.wikimedia.org/r/1021923

Change #1023877 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] revscoring_model: log request_id alongside with inputs

https://gerrit.wikimedia.org/r/1023877

Change #1023877 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revscoring_model: log request_id alongside with inputs

https://gerrit.wikimedia.org/r/1023877

Change #1023880 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update Docker image for revscoring-editquality-damaging

https://gerrit.wikimedia.org/r/1023880

Change #1023880 merged by Elukey:

[operations/deployment-charts@master] ml-services: update Docker image for revscoring-editquality-damaging

https://gerrit.wikimedia.org/r/1023880

Change #1024425 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] utils: slow function execution wrapper

https://gerrit.wikimedia.org/r/1024425