Page MenuHomePhabricator

Investigate temporary high latency in revscoring service for wikidata
Open, Needs TriagePublic3 Estimated Story Points

Description

On 2024-03-25, we were alerted to a high backlog/latency in the Changeprop Kafka consumers:

https://grafana.wikimedia.org/goto/mEhasC1Ik?orgId=1

Investigating, I found that the wikidata revscoring services on Lift Wing experienced high query volume and latency:

https://grafana.wikimedia.org/goto/B-QqsC1Sk?orgId=1
https://grafana.wikimedia.org/goto/LOVUyj1Ik?orgId=1

It does not look like we ever served increased numbers of errors, so the increased latency is the only symptom.

Digging through kserve logstash, Luca found entries like:

preprocess_ms: 1303.630828857,
Function get_revscoring_extractor_cache took 1.2865 seconds to execute.

We should investigate whether there is an underlying/deeper problem here that needs addressing.

Eventually the situation resolved