
User report: on the Wikipedia recent changes list, the edit highlighting by ORES has disappeared
Closed, Resolved (Public)

Description

From Seawolf35: https://www.mediawiki.org/wiki/Topic:Xplk44hmlwwu97ki

"Recently, I have found that on the Wikipedia recent changes list the edit highlighting by ORES has disappeared. Is this because of the new open source infrastructure? Any solutions other than to just wait."

Event Timeline

First reported at the en.wiki help desk (permalink).

This appears to affect only relatively recent edits.

The edit to the help desk, made 64 minutes ago, is highlighted, while the edit to the Teahouse, made 10 minutes ago, is not.

[Attachment: delete.jpg (screenshot, 293 KB)]

As @lettherebedarklight mentioned, it seems that older revisions are properly highlighted while newer ones are not.

Digging a bit deeper, I found out that for the revisions that are not highlighted yet, the score is absent from the ores_classification table (the table the ORES extension uses to fetch scores).
An example is the following revision:

SELECT * FROM ores_classification
WHERE oresc_rev = 1175364403;
-- returns an empty result set: no score has been stored for this revision yet

At the same time, Lift Wing returns a proper response for the above revision:

curl https://api.wikimedia.org/service/lw/inference/v1/models/enwiki-goodfaith:predict -X POST -d '{"rev_id": 1175364403}'
{"enwiki":{"models":{"goodfaith":{"version":"0.5.1"}},"scores":{"1175364403":{"goodfaith":{"score":{"prediction":true,"probability":{"false":0.006275536794257741,"true":0.9937244632057423}}}}}}}

If I understand correctly, this means that although the jobs that score the revisions do eventually complete, they take too long to do so.

Looking at the failed jobs in Logstash, there isn't a significant change in their distribution (although the number of failures has increased).

However, looking at the Grafana dashboard for the Job Queue, there is a spike in Job Backlog time, which is the delay between a job being inserted and the job being run.

[Attachment: Screenshot 2023-09-14 at 7.45.03 PM.png — Job Backlog time spike (93 KB)]

Really nice finding! It matches exactly https://sal.toolforge.org/log/HIAEhIoBGiVuUzOdDi6t, which is when we moved wikidata and enwiki to Lift Wing. Maybe it is only a matter of adding more pods?

I suggest upping the concurrency value of ORESFetchScoreJob in changeprop (it's in helmfile.d/services/changeprop-jobqueue/values.yaml in the deployment-charts repository; it's currently 20, which I think is a bit too low right now).
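For reference, a minimal sketch of how to locate the setting, assuming a local checkout of the operations/deployment-charts repository (the exact YAML key layout around the job entry is an assumption and may differ):

# Sketch only: find the ORESFetchScoreJob block in the changeprop-jobqueue values,
# then raise its concurrency value (currently 20) and redeploy via helmfile as usual.
grep -n -B2 -A4 'ORESFetchScoreJob' helmfile.d/services/changeprop-jobqueue/values.yaml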

The Kafka consumer lag dashboard shows what Ilias pointed out, namely that changeprop is lagging in consuming (and processing) ORESFetchScoreJob events.
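For anyone wanting to cross-check the dashboard against Kafka itself, something along these lines should work from a main-eqiad broker; the consumer group name is a placeholder, and the CLI wrapper name on our brokers may differ from the upstream kafka-consumer-groups.sh tool:

# List consumer groups to find the changeprop-jobqueue one, then inspect the LAG
# column for the ORESFetchScoreJob partitions.
kafka-consumer-groups.sh --bootstrap-server kafka-main1001.eqiad.wmnet:9092 --list | grep -i jobqueue
kafka-consumer-groups.sh --bootstrap-server kafka-main1001.eqiad.wmnet:9092 \
    --describe --group <changeprop-jobqueue-group> | grep -i ORESFetchScoreJob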

Just to confirm, the number of messages didn't change.

Change 957864 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: increase concurrency in ORESFetchScoreJob's changeprop cfg

https://gerrit.wikimedia.org/r/957864

Change 957864 merged by Elukey:

[operations/deployment-charts@master] services: increase concurrency in ORESFetchScoreJob's changeprop cfg

https://gerrit.wikimedia.org/r/957864

Another interesting thing seen in the dashboards (and in the attached image) is that job running time went up to 25s (from 3-4s) when we enabled Lift Wing for most wikis (except enwiki and wikidata). However, this alone didn't trigger the big lag that we are seeing in the previous chart. The lag seems to have been caused by the two things combined: the longer job duration plus the increased number of jobs introduced by adding enwiki and wikidata.

[Attachment: Screenshot 2023-09-15 at 12.22.00 PM.png — job run time increase (207 KB)]
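A quick back-of-envelope check of why the two factors together matter (numbers taken from the charts above, so only approximate, and assuming each job occupies a concurrency slot for its whole duration):

# Max throughput is roughly concurrency / job duration.
echo "scale=1; 20/4"  | bc   # ~5 jobs/s at the old 3-4s duration
echo "scale=1; 20/25" | bc   # ~0.8 jobs/s at the new ~25s duration
# So once enwiki + wikidata pushed the incoming rate above roughly 0.8 jobs/s,
# the backlog could only grow until concurrency was raised.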

The metrics now look better! One thing that I noticed is that we have a lot of events in the ORESFetchScoreJob retry topic.

I checked via kafkacat -C -t eqiad.cpjobqueue.retry.mediawiki.job.ORESFetchScoreJob -b kafka-main1001.eqiad.wmnet:9092 -o latest on stat1004 and I see a continuous stream of HTTP 500 errors; we should figure out what's happening.
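A rough way to quantify those failures from the same host; the exact payload layout of the retry events is not something I have verified, so the grep pattern is only indicative:

# Sample the last ~1000 events from the retry topic and count how many mention a 500.
kafkacat -C -t eqiad.cpjobqueue.retry.mediawiki.job.ORESFetchScoreJob \
    -b kafka-main1001.eqiad.wmnet:9092 -o -1000 -e | grep -c '500'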

Great work @elukey and @Ladsgroup! I'll keep an eye on this and, if all continues well, we can resolve this issue.
Summary:
Increasing the concurrency of ORESFetchScoreJob in changeprop from 20 to 30 seems to have done the trick, and Job Backlog time has fallen back to the ms/s levels seen before.

We found a serious bug though: sometimes the kserve container inside an isvc pod stops working for some reason, blackholing traffic. We noticed this because the retry queue in changeprop for ORESFetchScoreJob has been increasing over the past days, and the related Kafka topic was constantly getting new events inserted. The related Kafka consumer lag is now decreasing, but we need to file a new task to investigate this problem (since it can happen again at any time).
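For the follow-up task, a starting point for spotting a wedged kserve container could look roughly like this; the namespace and pod names are placeholders, and the container name simply follows the usual KServe convention:

# Sketch only: check restart counts / readiness of the isvc pods, then pull recent
# logs from the kserve container of a suspicious pod.
kubectl -n <isvc-namespace> get pods
kubectl -n <isvc-namespace> describe pod <isvc-pod> | grep -A6 'kserve-container'
kubectl -n <isvc-namespace> logs <isvc-pod> -c kserve-container --tail=100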

Thanks for this work. Let's talk about that ticket on Tuesday.