
Review ORES traffic to better understand Lift Wing's requirements
Closed, Resolved, Public

Description

We have 3 major sources of traffic for ORES:

  • ChangeProp - hitting ORES after every wiki-edit to generate the revision-score stream.
  • ORES MW Extension - hitting ORES for some wikis when an edit happens, to show augmented info on special wiki pages. It surely overlaps with ChangeProp, but its data ends up in dedicated MediaWiki DB tables.
  • External users/bots/etc.

Some ideas about how to find good traffic info:

  • https://logstash.wikimedia.org/app/dashboards#/view/ORES - this shows all the traffic (internal + external) hitting ORES.
  • The webrequest_text data in the Data Lake should contain traffic for ores.wikimedia.org, so we could understand which bots and users hit us (and with what traffic patterns).

The final goal is to have an informative report about what traffic patterns/volumes we expect in Lift Wing.

Event Timeline

I checked the ORES dashboard (https://grafana-rw.wikimedia.org/d/HIRrxQ6mk/ores) and Thanos (https://thanos.wikimedia.org), and I don't see metrics related to specific models, just aggregates. ORES doesn't support Prometheus metrics natively; it pushes statsd metrics locally to a prometheus-statsd-exporter, which is configured to filter out some metrics. So I checked with tcpdump:

elukey@ores1001:~$ sudo tcpdump port 9125 -A -i lo | grep damaging
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
ores.ores1001.precache_cache_miss.ruwiki.damaging:1|c
ores.ores1001.score_processed.zhwiki.damaging:225.579500|ms
ores.ores1001.datasources_extracted.ruwiki.damaging:113.399982|ms
ores.ores1001.score_processed.wikidatawiki.damaging:75.897455|ms
ores.ores1001.precache_request.ruwiki.damaging:184.121609|ms
ores.ores1001.score_cache_hit.trwiki.damaging:1|c
ores.ores1001.datasources_extracted.trwiki.damaging:0.009060|ms
ores.ores1001.scores_request.trwiki.damaging:3.347874|ms
ores.ores1001.revision_scored.trwiki.damaging:1|c
ores.ores1001.score_processed.wikidatawiki.damaging:28.475046|ms
ores.ores1001.precache_cache_miss.arwiki.damaging:1|c
ores.ores1001.precache_cache_miss.arwiki.damaging:1|c
ores.ores1001.score_processed.wikidatawiki.damaging:32.613039|ms
ores.ores1001.datasources_extracted.arwiki.damaging:114.239454|ms
ores.ores1001.datasources_extracted.arwiki.damaging:120.795965|ms
ores.ores1001.precache_request.arwiki.damaging:220.360041|ms
ores.ores1001.score_processed.nlwiki.damaging:15.190363|ms
ores.ores1001.precache_request.arwiki.damaging:364.669561|ms
ores.ores1001.score_cache_hit.frwiki.damaging:1|c
ores.ores1001.datasources_extracted.frwiki.damaging:0.008345|ms
ores.ores1001.scores_request.frwiki.damaging:5.069256|ms
ores.ores1001.revision_scored.frwiki.damaging:1|c
ores.ores1001.precache_cache_miss.wikidatawiki.damaging:1|c
ores.ores1001.score_processed.wikidatawiki.damaging:89.386702|ms
ores.ores1001.datasources_extracted.wikidatawiki.damaging:137.208462|ms
ores.ores1001.score_processed.enwiki.damaging:636.673212|ms
ores.ores1001.precache_request.wikidatawiki.damaging:204.805374|ms
ores.ores1001.precache_cache_miss.wikidatawiki.damaging:1|c
ores.ores1001.datasources_extracted.wikidatawiki.damaging:150.700092|ms

So the metrics are emitted but dropped by the exporter, probably to reduce the amount of data stored on the Prometheus master nodes.
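As a side note, each statsd line above follows a fixed ores.<host>.<metric>.<wiki>.<model>:<value>|<type> naming scheme, so a capture like the one above can be aggregated per wiki/model offline. A minimal sketch (the capture file name is just a placeholder):

```python
import re
from collections import Counter

# Matches lines like:
#   ores.ores1001.score_processed.zhwiki.damaging:225.579500|ms
STATSD_RE = re.compile(
    r"^ores\.(?P<host>[^.]+)\.(?P<metric>[^.]+)\.(?P<wiki>[^.]+)\.(?P<model>[^.]+)"
    r":(?P<value>[0-9.]+)\|(?P<type>\w+)$"
)

def summarize(path):
    """Count statsd samples per (metric, wiki, model) in a saved tcpdump/grep capture."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            m = STATSD_RE.match(line.strip())
            if m:
                counts[(m["metric"], m["wiki"], m["model"])] += 1
    return counts

if __name__ == "__main__":
    # "ores-statsd-capture.txt" is a hypothetical file holding the grep output above.
    for key, n in summarize("ores-statsd-capture.txt").most_common(10):
        print(key, n)
```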

The alternative, as suggested by Ilias, would be to use Spark on the Webrequest data to figure out how many calls we receive from external clients (though in theory we wouldn't see the majority of the traffic, which is internal). In my opinion there is still some value in adding more metrics: it would be a change to the statsd exporter and not to ORES itself, so it should be safe enough.
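For reference, a rough sketch of what the Spark query could look like, assuming the standard wmf.webrequest table in the Data Lake (the partition values and column choices below are illustrative, not a final query):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ores-external-traffic").getOrCreate()

# Hypothetical one-day slice of webrequest data for ores.wikimedia.org.
df = (
    spark.table("wmf.webrequest")
    .where(
        (F.col("webrequest_source") == "text")
        & (F.col("year") == 2023) & (F.col("month") == 2) & (F.col("day") == 1)
        & (F.col("uri_host") == "ores.wikimedia.org")
    )
)

# Requests per user agent, to see which external clients/bots hit ORES the most.
(
    df.groupBy("user_agent", "agent_type")
    .count()
    .orderBy(F.col("count").desc())
    .show(50, truncate=False)
)
```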

Change 887732 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] ores: add per-model metrics and fix label for response codes

https://gerrit.wikimedia.org/r/887732

Change 887732 merged by Elukey:

[operations/puppet@production] ores: add per-model metrics and fix label for response codes

https://gerrit.wikimedia.org/r/887732

Mentioned in SAL (#wikimedia-operations) [2023-02-09T13:40:15Z] <elukey> restart prometheus-statsd-exporter on ores nodes to pick up label change - T325763

Refactored the https://grafana.wikimedia.org/d/HIRrxQ6mk/ores dashboard using the new per-model metrics that the exporter returns. The main open question is the difference between:

  • precache_request
  • score_processed
  • scores_request
  • datasources_extracted

From this test in ORES, I think that the meaning of the above metrics is:

  • precache_request tracks the HTTP requests to the /precache endpoint.
  • scores_request (plural) is the total time taken to process a scoring request, which may ask for one or more scores (since ORES' API allows multiple scores to be generated at once).
  • score_processed (singular) is the time taken to process a single score. In theory a single scores_request value may correspond to multiple score_processed ones.
  • datasources_extracted should be the time taken for feature computation.
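With the per-model metrics now exposed, the same breakdown can also be pulled programmatically from Prometheus/Thanos to estimate per-model volumes for Lift Wing. A rough sketch against the standard Prometheus HTTP API (the endpoint URL and the exported metric name are assumptions; they depend on the actual exporter mapping):

```python
import requests

# Hypothetical Thanos/Prometheus query endpoint; replace with the real internal URL.
PROM_URL = "http://thanos-query.example.internal/api/v1/query"

# Assumed metric name produced by the statsd exporter mapping for scores_request;
# the real name depends on the configured mapping.
QUERY = "sum by (model, wiki) (rate(ores_scores_request_count[1h]))"

def per_model_rates():
    """Return the scores_request rate (req/s over the last hour) per wiki and model."""
    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=30)
    resp.raise_for_status()
    return {
        (r["metric"].get("wiki"), r["metric"].get("model")): float(r["value"][1])
        for r in resp.json()["data"]["result"]
    }

if __name__ == "__main__":
    for (wiki, model), rate in sorted(per_model_rates().items()):
        print(f"{wiki}/{model}: {rate:.3f} req/s")
```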

https://w.wiki/6Pbp shows the UAs calling the ores.wikimedia.org endpoint (so all external clients).

We also have Wikimedia Enterprise, which is a surprise: apparently it is calling ORES directly and fetching data from EventStreams.

Change 893672 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::logstash: improve the ORES filter

https://gerrit.wikimedia.org/r/893672

Change 893672 merged by Cwhite:

[operations/puppet@production] profile::logstash: improve the ORES filter

https://gerrit.wikimedia.org/r/893672

Change 898697 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: tune autoscaling for ORES model servers

https://gerrit.wikimedia.org/r/898697

Change 898697 merged by Elukey:

[operations/deployment-charts@master] ml-services: tune autoscaling for ORES model servers

https://gerrit.wikimedia.org/r/898697

Change 900239 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: allow scale-to-zero for staging deployments

https://gerrit.wikimedia.org/r/900239

Change 900239 merged by Elukey:

[operations/deployment-charts@master] ml-services: allow scale-to-zero for staging deployments

https://gerrit.wikimedia.org/r/900239