
User report: on the Wikipedia recent changes list, the edit highlighting by ORES has disappeared
Closed, Resolved (Public)

Description

From Seawolf35: https://www.mediawiki.org/wiki/Topic:Xplk44hmlwwu97ki

"Recently, I have found that on the Wikipedia recent changes list the edit highlighting by ORES has disappeared. Is this because of the new open source infrastructure? Any solutions other than to just wait."

Event Timeline

First reported at the en.wiki help desk (permalink).

This appears to affect only relatively recent edits.

The edit to the help desk, made 64 minutes ago, is highlighted, while the edit to the Teahouse, made 10 minutes ago, is not.

[Attachment: delete.jpg (screenshot, 293 KB)]

As @lettherebedarklight mentioned, it seems that older revisions are properly highlighted while newer ones are not.

Digging a bit deeper, I found out that for the revisions that are not highlighted yet, the score is absent from the ores_classification table (the table the ORES extension uses to fetch scores).
An example is the following revision:

SELECT * FROM ores_classification
WHERE oresc_rev = 1175364403;
-- returns an empty result set: no score has been stored for this revision yet

At the same time, Lift Wing returns a proper response for the above revision:

curl https://api.wikimedia.org/service/lw/inference/v1/models/enwiki-goodfaith:predict -X POST -d '{"rev_id": 1175364403}'
{"enwiki":{"models":{"goodfaith":{"version":"0.5.1"}},"scores":{"1175364403":{"goodfaith":{"score":{"prediction":true,"probability":{"false":0.006275536794257741,"true":0.9937244632057423}}}}}}}

If I understand correctly, this means that although the jobs that score the revisions do eventually complete, they take too long to do so.

Looking at the failed jobs in Logstash, there isn't a significant change in their distribution (although the number of failures has increased).

However, looking at the Grafana dashboard for the Job Queue, there is a spike in Job Backlog time, which is the delay between a job being inserted and the job being run.

[Attachment: Screenshot 2023-09-14 at 7.45.03 PM.png — Job Backlog time spike (93 KB)]

Really nice finding! It matches exactly https://sal.toolforge.org/log/HIAEhIoBGiVuUzOdDi6t, which is when we moved wikidata and enwiki to Lift Wing. Maybe it is only a matter of adding more pods?

I suggest upping the concurrency value of ORESFetchScoreJob in changeprop (it's in helmfile.d/services/changeprop-jobqueue/values.yaml in the deployment-charts repository; it's currently 20, which I think is a bit too low right now).
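For reference, a minimal sketch of how to locate the setting, assuming a local checkout of the operations/deployment-charts repository (the exact YAML key layout around the job entry is an assumption and may differ):

# Sketch only: find the ORESFetchScoreJob block in the changeprop-jobqueue values,
# then raise its concurrency value (currently 20) and redeploy via helmfile as usual.
grep -n -B2 -A4 'ORESFetchScoreJob' helmfile.d/services/changeprop-jobqueue/values.yaml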

The Kafka consumer lag dashboard shows what Ilias pointed out, namely that changeprop is lagging in consuming (and processing) ORESFetchScoreJob events.
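For anyone wanting to cross-check the dashboard against Kafka itself, something along these lines should work from a main-eqiad broker; the consumer group name is a placeholder, and the CLI wrapper name on our brokers may differ from the upstream kafka-consumer-groups.sh tool:

# List consumer groups to find the changeprop-jobqueue one, then inspect the LAG
# column for the ORESFetchScoreJob partitions.
kafka-consumer-groups.sh --bootstrap-server kafka-main1001.eqiad.wmnet:9092 --list | grep -i jobqueue
kafka-consumer-groups.sh --bootstrap-server kafka-main1001.eqiad.wmnet:9092 \
    --describe --group <changeprop-jobqueue-group> | grep -i ORESFetchScoreJob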

Just to confirm, the number of messages didn't change.

Change 957864 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: increase concurrency in ORESFetchScoreJob's changeprop cfg

https://gerrit.wikimedia.org/r/957864

Change 957864 merged by Elukey:

[operations/deployment-charts@master] services: increase concurrency in ORESFetchScoreJob's changeprop cfg

https://gerrit.wikimedia.org/r/957864

Another interesting thing seen in the dashboards (and in the attached image) is that job running time went up to 25s (from 3-4s) when we enabled Lift Wing for most wikis (except enwiki and wikidata). However, this alone didn't trigger the big lag that we are seeing in the previous chart. The lag seems to have been caused by the two things combined: the longer job duration plus the increased number of jobs introduced by adding enwiki and wikidata.

[Attachment: Screenshot 2023-09-15 at 12.22.00 PM.png — job run time increase (207 KB)]
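A quick back-of-envelope check of why the two factors together matter (numbers taken from the charts above, so only approximate, and assuming each job occupies a concurrency slot for its whole duration):

# Max throughput is roughly concurrency / job duration.
echo "scale=1; 20/4"  | bc   # ~5 jobs/s at the old 3-4s duration
echo "scale=1; 20/25" | bc   # ~0.8 jobs/s at the new ~25s duration
# So once enwiki + wikidata pushed the incoming rate above roughly 0.8 jobs/s,
# the backlog could only grow until concurrency was raised.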

The metrics now look better! One thing that I noticed is that we have a lot of events in the ORESFetchScoreJob retry topic.

I checked via kafkacat -C -t eqiad.cpjobqueue.retry.mediawiki.job.ORESFetchScoreJob -b kafka-main1001.eqiad.wmnet:9092 -o latest on stat1004 and I see a continuous stream of HTTP 500 errors; we should figure out what's happening.
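A rough way to quantify those failures from the same host; the exact payload layout of the retry events is not something I have verified, so the grep pattern is only indicative:

# Sample the last ~1000 events from the retry topic and count how many mention a 500.
kafkacat -C -t eqiad.cpjobqueue.retry.mediawiki.job.ORESFetchScoreJob \
    -b kafka-main1001.eqiad.wmnet:9092 -o -1000 -e | grep -c '500'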

Great work @elukey and @Ladsgroup! I'll keep an eye on this and, if all continues well, we can resolve this issue.
Summary:
Increasing the concurrency of ORESFetchScoreJob in changeprop from 20 to 30 seems to have done the trick, and Job Backlog time has fallen back to the ms/s levels seen before.

We found a serious bug though: sometimes the kserve container inside an isvc pod stops working for some reason, blackholing traffic. We noticed this because the retry queue in changeprop for ORESFetchScoreJob has been increasing over the past days, and the related Kafka topic was constantly getting new events inserted. The related Kafka consumer lag is now decreasing, but we need to file a new task to investigate this problem (since it can happen again at any time).
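For the follow-up task, a starting point for spotting a wedged kserve container could look roughly like this; the namespace and pod names are placeholders, and the container name simply follows the usual KServe convention:

# Sketch only: check restart counts / readiness of the isvc pods, then pull recent
# logs from the kserve container of a suspicious pod.
kubectl -n <isvc-namespace> get pods
kubectl -n <isvc-namespace> describe pod <isvc-pod> | grep -A6 'kserve-container'
kubectl -n <isvc-namespace> logs <isvc-pod> -c kserve-container --tail=100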

Thanks for this work. Let's talk about that ticket on Tuesday.