
Investigate temporary high latency in revscoring service for wikidata
Closed, Resolved | Public | 3 Estimated Story Points

Description

On 2024-03-25, we were alerted to a high backlog/latency in the Changeprop Kafka consumers:

https://grafana.wikimedia.org/goto/mEhasC1Ik?orgId=1

Investigating, I found that the wikidata revscoring services on Lift Wing experienced high query volume and latency:

https://grafana.wikimedia.org/goto/B-QqsC1Sk?orgId=1
https://grafana.wikimedia.org/goto/LOVUyj1Ik?orgId=1

It does not look like we served an increased number of errors at any point, so the higher latency was the only symptom.

Digging through kserve logstash, Luca found entries like:

preprocess_ms: 1303.630828857,
Function get_revscoring_extractor_cache took 1.2865 seconds to execute.

We should investigate whether there is an underlying/deeper problem here that needs addressing.
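
As a starting point for that investigation, timing lines like the one above could be pulled out of exported logs and filtered by duration. The following is only a minimal sketch: it assumes the log message format matches the entry quoted above, and the helper name `slow_calls` and the 1.0 second threshold are hypothetical, not part of kserve or revscoring.

```python
import re

# Matches messages of the form quoted in this task, e.g.
# "Function get_revscoring_extractor_cache took 1.2865 seconds to execute."
TIMING_RE = re.compile(r"Function (\w+) took ([0-9.]+) seconds to execute\.")


def slow_calls(lines, threshold_s=1.0):
    """Return (function_name, seconds) pairs at or above threshold_s.

    threshold_s is an assumed cutoff for illustration, not a value
    taken from the service configuration.
    """
    hits = []
    for line in lines:
        match = TIMING_RE.search(line)
        if match is None:
            continue  # not a timing message, e.g. a preprocess_ms field
        seconds = float(match.group(2))
        if seconds >= threshold_s:
            hits.append((match.group(1), seconds))
    return hits


sample = [
    "preprocess_ms: 1303.630828857,",
    "Function get_revscoring_extractor_cache took 1.2865 seconds to execute.",
]
print(slow_calls(sample))
```

Counting the hits per function over the incident window would show whether the extractor cache lookup accounts for most of the preprocess time.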

Eventually, the situation resolved.

Event Timeline

klausman set the point value for this task to 3. (Mar 26 2024, 2:16 PM)
klausman moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board.

Since this has not re-occurred, I am closing the task for now. If it happens again, we can always re-open.