
Review ORES traffic to better understand Lift Wing's requirements
Closed, Resolved, Public

Description

We have 3 major sources of traffic for ORES:

  • ChangeProp - hitting ORES after every wiki-edit to generate the revision-score stream.
  • ORES MW Extension - hitting ORES for some wikis when an edit happens, to show augmented info on special wiki pages. It surely overlaps with ChangeProp, but its data ends up in dedicated MediaWiki DB tables.
  • External users/bots/etc.

Some ideas about how to find good traffic info:

  • https://logstash.wikimedia.org/app/dashboards#/view/ORES - this shows all the traffic (internal + external) hitting ORES.
  • The webrequest_text data in the Data Lake should contain traffic for ores.wikimedia.org, so we could understand which bots and users hit us (and with what traffic patterns).

The final goal is to have an informative report about what traffic patterns/volumes we expect in Lift Wing.

Event Timeline

I checked the ORES dashboard (https://grafana-rw.wikimedia.org/d/HIRrxQ6mk/ores) and Thanos (https://thanos.wikimedia.org), and I don't see metrics related to specific models, just aggregates. ORES doesn't support Prometheus metrics natively; it pushes statsd metrics locally to a prometheus-statsd-exporter, which is configured to filter out some metrics. So I checked with tcpdump:

elukey@ores1001:~$ sudo tcpdump port 9125 -A -i lo | grep damaging
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
ores.ores1001.precache_cache_miss.ruwiki.damaging:1|c
ores.ores1001.score_processed.zhwiki.damaging:225.579500|ms
ores.ores1001.datasources_extracted.ruwiki.damaging:113.399982|ms
ores.ores1001.score_processed.wikidatawiki.damaging:75.897455|ms
ores.ores1001.precache_request.ruwiki.damaging:184.121609|ms
ores.ores1001.score_cache_hit.trwiki.damaging:1|c
ores.ores1001.datasources_extracted.trwiki.damaging:0.009060|ms
ores.ores1001.scores_request.trwiki.damaging:3.347874|ms
ores.ores1001.revision_scored.trwiki.damaging:1|c
ores.ores1001.score_processed.wikidatawiki.damaging:28.475046|ms
ores.ores1001.precache_cache_miss.arwiki.damaging:1|c
ores.ores1001.precache_cache_miss.arwiki.damaging:1|c
ores.ores1001.score_processed.wikidatawiki.damaging:32.613039|ms
ores.ores1001.datasources_extracted.arwiki.damaging:114.239454|ms
ores.ores1001.datasources_extracted.arwiki.damaging:120.795965|ms
ores.ores1001.precache_request.arwiki.damaging:220.360041|ms
ores.ores1001.score_processed.nlwiki.damaging:15.190363|ms
ores.ores1001.precache_request.arwiki.damaging:364.669561|ms
ores.ores1001.score_cache_hit.frwiki.damaging:1|c
ores.ores1001.datasources_extracted.frwiki.damaging:0.008345|ms
ores.ores1001.scores_request.frwiki.damaging:5.069256|ms
ores.ores1001.revision_scored.frwiki.damaging:1|c
ores.ores1001.precache_cache_miss.wikidatawiki.damaging:1|c
ores.ores1001.score_processed.wikidatawiki.damaging:89.386702|ms
ores.ores1001.datasources_extracted.wikidatawiki.damaging:137.208462|ms
ores.ores1001.score_processed.enwiki.damaging:636.673212|ms
ores.ores1001.precache_request.wikidatawiki.damaging:204.805374|ms
ores.ores1001.precache_cache_miss.wikidatawiki.damaging:1|c
ores.ores1001.datasources_extracted.wikidatawiki.damaging:150.700092|ms

So the metrics are emitted but dropped by the exporter, probably to reduce the amount of data stored on the Prometheus master nodes.
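As a side note, each statsd line above follows a fixed ores.<host>.<metric>.<wiki>.<model>:<value>|<type> naming scheme, so a capture like the one above can be aggregated per wiki/model offline. A minimal sketch (the capture file name is just a placeholder):

```python
import re
from collections import Counter

# Matches lines like:
#   ores.ores1001.score_processed.zhwiki.damaging:225.579500|ms
STATSD_RE = re.compile(
    r"^ores\.(?P<host>[^.]+)\.(?P<metric>[^.]+)\.(?P<wiki>[^.]+)\.(?P<model>[^.]+)"
    r":(?P<value>[0-9.]+)\|(?P<type>\w+)$"
)

def summarize(path):
    """Count statsd samples per (metric, wiki, model) in a saved tcpdump/grep capture."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            m = STATSD_RE.match(line.strip())
            if m:
                counts[(m["metric"], m["wiki"], m["model"])] += 1
    return counts

if __name__ == "__main__":
    # "ores-statsd-capture.txt" is a hypothetical file holding the grep output above.
    for key, n in summarize("ores-statsd-capture.txt").most_common(10):
        print(key, n)
```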

The alternative, as suggested by Ilias, would be to use Spark on the Webrequest data to figure out how many calls we receive from external clients (though in theory we wouldn't see the majority of the traffic, which is internal). In my opinion there is still some value in adding more metrics: it would be a change to the statsd exporter and not to ORES itself, so it should be safe enough.
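For reference, a rough sketch of what the Spark query could look like, assuming the standard wmf.webrequest table in the Data Lake (the partition values and column choices below are illustrative, not a final query):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ores-external-traffic").getOrCreate()

# Hypothetical one-day slice of webrequest data for ores.wikimedia.org.
df = (
    spark.table("wmf.webrequest")
    .where(
        (F.col("webrequest_source") == "text")
        & (F.col("year") == 2023) & (F.col("month") == 2) & (F.col("day") == 1)
        & (F.col("uri_host") == "ores.wikimedia.org")
    )
)

# Requests per user agent, to see which external clients/bots hit ORES the most.
(
    df.groupBy("user_agent", "agent_type")
    .count()
    .orderBy(F.col("count").desc())
    .show(50, truncate=False)
)
```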

Change 887732 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] ores: add per-model metrics and fix label for response codes

https://gerrit.wikimedia.org/r/887732

Change 887732 merged by Elukey:

[operations/puppet@production] ores: add per-model metrics and fix label for response codes

https://gerrit.wikimedia.org/r/887732

Mentioned in SAL (#wikimedia-operations) [2023-02-09T13:40:15Z] <elukey> restart prometheus-statsd-exporter on ores nodes to pick up label change - T325763

Refactored the https://grafana.wikimedia.org/d/HIRrxQ6mk/ores dashboard using the new per-model metrics that the exporter returns. The main open question is the difference between:

  • precache_request
  • score_processed
  • scores_request
  • datasources_extracted

From this test in ORES, I think that the meaning of the above metrics is:

  • precache_request tracks the HTTP requests to the /precache endpoint.
  • scores_request (plural) is the total time taken to process a scoring request, which may ask for one or more scores (since ORES' API allows multiple scores to be generated at once).
  • score_processed (singular) is the time taken to process a single score. In theory a single scores_request value may correspond to multiple score_processed ones.
  • datasources_extracted should be the time taken for feature computation.
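With the per-model metrics now exposed, the same breakdown can also be pulled programmatically from Prometheus/Thanos to estimate per-model volumes for Lift Wing. A rough sketch against the standard Prometheus HTTP API (the endpoint URL and the exported metric name are assumptions; they depend on the actual exporter mapping):

```python
import requests

# Hypothetical Thanos/Prometheus query endpoint; replace with the real internal URL.
PROM_URL = "http://thanos-query.example.internal/api/v1/query"

# Assumed metric name produced by the statsd exporter mapping for scores_request;
# the real name depends on the configured mapping.
QUERY = "sum by (model, wiki) (rate(ores_scores_request_count[1h]))"

def per_model_rates():
    """Return the scores_request rate (req/s over the last hour) per wiki and model."""
    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=30)
    resp.raise_for_status()
    return {
        (r["metric"].get("wiki"), r["metric"].get("model")): float(r["value"][1])
        for r in resp.json()["data"]["result"]
    }

if __name__ == "__main__":
    for (wiki, model), rate in sorted(per_model_rates().items()):
        print(f"{wiki}/{model}: {rate:.3f} req/s")
```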

https://w.wiki/6Pbp shows the UAs calling the ores.wikimedia.org endpoint (so all external clients).

We also have Wikimedia Enterprise, which is a surprise: apparently it is calling ORES directly and fetching data from EventStreams.

Change 893672 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::logstash: improve the ORES filter

https://gerrit.wikimedia.org/r/893672

Change 893672 merged by Cwhite:

[operations/puppet@production] profile::logstash: improve the ORES filter

https://gerrit.wikimedia.org/r/893672

Change 898697 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: tune autoscaling for ORES model servers

https://gerrit.wikimedia.org/r/898697

Change 898697 merged by Elukey:

[operations/deployment-charts@master] ml-services: tune autoscaling for ORES model servers

https://gerrit.wikimedia.org/r/898697

Change 900239 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: allow scale-to-zero for staging deployments

https://gerrit.wikimedia.org/r/900239

Change 900239 merged by Elukey:

[operations/deployment-charts@master] ml-services: allow scale-to-zero for staging deployments

https://gerrit.wikimedia.org/r/900239