
Apply common settings to publish events from Lift Wing staging to EventGate
Closed, Resolved · Public · 2 Estimated Story Points

Description

We are currently publishing events from Lift Wing to Event Gate:

  • All revscoring model servers can read a page-change event, generate a revision-score event, and publish it to Event Gate (the stream name varies; usually it is mediawiki.revision_score_$model).
  • The outlink model server can read a page-change event, generate a prediction_classification_change event, and publish it to Event Gate (the stream name is mediawiki.page_outlink_topic_prediction_change.v1).

The stream names are defined in MediaWiki config. When we test in staging we have the following setting:

  • revscoring model servers POST events to Event Gate production, stream mediawiki.revision-score-test.
  • The outlink model server POSTs events to Event Gate production as well, but using the "prod" stream (mediawiki.page_outlink_topic_prediction_change.v1).
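
To make the overlap concrete, the settings above can be written down as plain data and checked mechanically. This is only a minimal sketch: the `STREAMS` mapping and the `shared_streams` helper are hypothetical names invented for illustration and do not exist in the Lift Wing or isvc configuration; the stream names are the ones listed above.

```python
# Streams each kind of model server publishes to, per environment, as listed above.
# The dict layout is illustrative only; it is NOT the real isvc configuration.
STREAMS = {
    "revscoring": {
        # $model is a placeholder that varies per model server.
        "prod": "mediawiki.revision_score_$model",
        "staging": "mediawiki.revision-score-test",
    },
    "outlink": {
        "prod": "mediawiki.page_outlink_topic_prediction_change.v1",
        "staging": "mediawiki.page_outlink_topic_prediction_change.v1",
    },
}


def shared_streams(streams):
    """Return the model servers whose staging stream is also their prod stream."""
    return sorted(
        name for name, env in streams.items() if env["staging"] == env["prod"]
    )


print(shared_streams(STREAMS))
```

With the settings described above, only the outlink entry is reported, which is the case this task wants to fix.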

The latter is not ideal, of course, because we cannot easily test the pipeline without interfering with the prod streams. The future of our streams is unclear (whether Lift Wing will keep emitting events, or whether a stream processor will do it instead), but we should find a strategy for the current settings.

Multiple possibilities:

  1. We keep using Event Gate production for Lift Wing staging, but we create a new testing stream for prediction-change events (the schema is and will be shared by multiple model servers). This will allow us to have something like mediawiki.revision-score-test and test current and future models.
  2. We can use the staging endpoint of Event Gate for Lift Wing staging, which is configured to prefix staging. to all target topics (instead of eqiad|codfw, like the prod ones), so in Kafka the testing data will end up in a different topic, keeping things separated. In this case the stream names in the isvc configs will stay the same; we'll vary only the eventgate endpoint.
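
The effect of option 2 on Kafka topic names can be sketched as follows. This assumes the topic is simply the prefix, a dot, and the stream name, which matches topic names seen elsewhere in this task (for example codfw.mediawiki.page_prediction_change.rc0); `kafka_topic` is a hypothetical helper, not part of EventGate.

```python
def kafka_topic(prefix: str, stream: str) -> str:
    """Join a topic prefix and a stream name as '<prefix>.<stream>'.

    Assumption: this mirrors how the prefix is applied to target topics;
    the real logic lives in the EventGate configuration.
    """
    return f"{prefix}.{stream}"


stream = "mediawiki.page_outlink_topic_prediction_change.v1"

# Same stream name in the isvc config; only the endpoint (and so the prefix) differs.
for prefix in ("eqiad", "codfw", "staging"):
    print(kafka_topic(prefix, stream))
```

The stream name stays constant across environments, so production consumers reading the eqiad and codfw topics would never see the staging data.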

The main drawback of the latter is that there is no discovery endpoint for wikikube staging, but only these endpoints:

staging.svc.{eqiad,codfw}.wmnet are simple CNAMEs to some kubestage worker nodes, and it is not clear which endpoint we should call at any given time. For example, eqiad currently works and codfw hangs, which suggests Event Gate staging pods run only in eqiad, but will it stay like this in the future?
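
One way to see which of the two endpoints answers at a given moment is a probe with an explicit timeout, so that a hanging endpoint fails fast instead of blocking. A minimal sketch: `tcp_reachable` and `reachable_endpoints` are hypothetical helpers, the port is left as a parameter because it is not stated in this task, and a TCP connect only shows that something is listening (a hang at the HTTP level would need a real request with a timeout).

```python
import socket
from typing import Callable, Iterable, List


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS resolution failures.
        return False


def reachable_endpoints(
    hosts: Iterable[str],
    port: int,
    probe: Callable[[str, int], bool] = tcp_reachable,
) -> List[str]:
    """Return the hosts for which `probe` succeeds, in the order given."""
    return [host for host in hosts if probe(host, port)]


STAGING_HOSTS = ["staging.svc.eqiad.wmnet", "staging.svc.codfw.wmnet"]
```

Calling `reachable_endpoints(STAGING_HOSTS, port)` with the Event Gate staging port would return whichever hosts currently accept connections; the `probe` argument exists so the selection logic can be exercised without network access.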

Whatever we decide, we also need to update https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Streams

Event Timeline

no discovery endpoint for wikikube staging

Perhaps it is possible to make one?

Or, we could deploy a new eventgate-staging|eventgate-dev|eventgate-test instance to eqiad and codfw that produces to kafka test or something!

no discovery endpoint for wikikube staging

Perhaps it is possible to make one?

If the eventgate's chart is migrated to the ingress module we would be able to use the istio ingress discovery endpoints that Janis created: k8s-ingress-staging.discovery.wmnet. We could try to work on it, but it is a delicate change for production, lemme know what you prefer!

I'm all for it, please go for it!

We need to do some vendor module updates and eventgate chart work in T349823: [Event Platform] Gracefully handle pod termination in eventgate Helm chart too.

My brain doesn't remember exactly what this did, but is "If the eventgate's chart is migrated to the ingress module" done now that https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/959181 is deployed?

achou set the point value for this task to 2.

Change 982873 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/mediawiki-config@master] Add a testing stream for page-prediction-change events

https://gerrit.wikimedia.org/r/982873

Change 982873 merged by jenkins-bot:

[operations/mediawiki-config@master] Add a testing stream for page-prediction-change events

https://gerrit.wikimedia.org/r/982873

Mentioned in SAL (#wikimedia-operations) [2023-12-18T14:24:03Z] <urbanecm@deploy2002> Started scap: Backport for [[gerrit:982873|Add a testing stream for page-prediction-change events (T349919)]], [[gerrit:983178|CheckUser: Enable read new for event tables migration everywhere (T341829)]]

Mentioned in SAL (#wikimedia-operations) [2023-12-18T14:34:15Z] <urbanecm@deploy2002> dreamyjazz and aikochou and urbanecm: Backport for [[gerrit:982873|Add a testing stream for page-prediction-change events (T349919)]], [[gerrit:983178|CheckUser: Enable read new for event tables migration everywhere (T341829)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-12-18T14:43:03Z] <urbanecm@deploy2002> Finished scap: Backport for [[gerrit:982873|Add a testing stream for page-prediction-change events (T349919)]], [[gerrit:983178|CheckUser: Enable read new for event tables migration everywhere (T341829)]] (duration: 19m 00s)

Change 984135 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: change EventGate stream value for outlink

https://gerrit.wikimedia.org/r/984135

Change 984135 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: change EventGate stream value for outlink

https://gerrit.wikimedia.org/r/984135

I tested the new testing stream for prediction-change events with the outlink model server.

  1. First, I collected an event from mediawiki.page_change.v1
$ cat page_change_en.json 
{"changelog_kind":"update","page_change_kind":"edit","dt":"2023-12-19T11:45:37Z","wiki_id":"enwiki","page":{"page_id":69760631,"page_title":"Moga_Municipal_Corporation","namespace_id":0,"is_redirect":false},"performer":{"user_text":"PrinceofPunjab","groups":["extendedconfirmed","*","user","autoconfirmed"],"is_bot":false,"is_system":false,"is_temp":false,"user_id":31398767,"registration_dt":"2017-06-27T10:12:43Z","edit_count":3183},"revision":{"rev_id":1190712671,"rev_dt":"2023-12-19T11:45:37Z", 
[...]
  2. I sent the event to staging.liftwing.test-outlink-events, which is configured in Change-Prop's values-staging.yaml
$ cat page_change_en.json | kafkacat -P -t staging.liftwing.test-outlink-events -b kafka-main1001.eqiad.wmnet:9093 -X security.protocol=ssl -X ssl.ca.location=/etc/ssl/certs/wmf-ca-certificates.crt
  3. In the outlink model server's logs on Lift Wing staging, an access log entry for the new request appeared.
WARNING:charset_normalizer:Encoding detection on empty bytes, assuming utf_8 intention.
2023-12-19 12:13:22.750 kserve.trace requestId: 1e601225-2e1d-444a-b99a-fd9aebd23fb7, preprocess_ms: 0.028371811, explain_ms: 0, predict_ms: 26.603221893, postprocess_ms: 0.000953674
2023-12-19 12:13:22.750 uvicorn.access INFO:     127.0.0.6:0 1 - "POST /v1/models/outlink-topic-model%3Apredict HTTP/1.1" 200 OK
  4. I verified on the target Kafka topic codfw.mediawiki.page_prediction_change.rc0 that the event was posted correctly
$ kafkacat -C -b kafka-main1001.eqiad.wmnet:9093 -t codfw.mediawiki.page_prediction_change.rc0 -X security.protocol=ssl -X ssl.ca.location=/etc/ssl/certs/wmf-ca-certificates.crt -o latest -c 1 -e
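
When repeating this test, the kserve.trace line from the model server logs above can be parsed to confirm that a request was handled and how long each phase took. A minimal sketch, assuming the log format shown above stays stable; `parse_kserve_trace` is a hypothetical helper, not part of KServe.

```python
import re

# Patterns derived from the kserve.trace line shown in the logs above.
_REQUEST_ID = re.compile(r"requestId:\s*([0-9a-fA-F-]+)")
_TIMING = re.compile(r"(\w+_ms):\s*([0-9]+(?:\.[0-9]+)?)")


def parse_kserve_trace(line: str) -> dict:
    """Extract the request id and the per-phase timings (ms) from a trace line."""
    match = _REQUEST_ID.search(line)
    timings = {name: float(value) for name, value in _TIMING.findall(line)}
    return {
        "request_id": match.group(1) if match else None,
        "timings_ms": timings,
    }


line = (
    "2023-12-19 12:13:22.750 kserve.trace requestId: "
    "1e601225-2e1d-444a-b99a-fd9aebd23fb7, preprocess_ms: 0.028371811, "
    "explain_ms: 0, predict_ms: 26.603221893, postprocess_ms: 0.000953674"
)
parsed = parse_kserve_trace(line)
print(parsed["request_id"], parsed["timings_ms"]["predict_ms"])
```

A line without a requestId yields `request_id` set to `None` and an empty timings dict, so unrelated log lines can be filtered out.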

I drew a simple plot yesterday to help myself recall the testing flow in staging while I was reading https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Streams. :)

[Attached image: IMG_0319.jpg]

I will also update the docs with the changes we have made.

Updated https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Streams. Let me know if anything is unclear. Going to resolve this task. :)