
Send score to eventgate when requested
Closed, Resolved · Public

Description

There is a complex workflow that powers the Kafka stream mediawiki.revision-score, used in various parts of our infrastructure:

  • An editor makes a change to a wiki page; MediaWiki saves the edit and creates a mediawiki.revision-create event.
  • The event is sent to EventGate, which validates it against a schema and saves it in a Kafka topic (called mediawiki.revision-create).
  • ChangeProp is configured to read every event in the Kafka revision-create topic and call ORES (via the /precache URI) to get a score. The score is then sent to EventGate as a mediawiki.revision-score event.
  • EventGate validates it, and inserts it into the related Kafka revision-score topic.

Multiple consumers of the mediawiki.revision-create topic use the data for various purposes.

The main problem with the above architecture is that ChangeProp has a special hack to call ORES and create the revision-score event, whereas it should be ORES sending it (leaving ChangeProp simpler and more agnostic).

Since we are building Lift Wing, we should probably add the config/code needed to send the revision-score event when requested. Some thoughts:

  • The first thing that comes to mind is Knative eventing, but we are not currently deploying it via helm and our version of Knative is very old (since more recent ones require a higher version of K8s). So I'd avoid using it for the moment.
  • We could add a simple Python function that sends an HTTP POST to EventGate; it should be relatively easy to do.
  • We can implement the above with a flag passed as a parameter, but then we shouldn't expose it to API-Gateway (since we want only some clients to be able to trigger this). ChangeProp will probably call the inference.discovery.wmnet endpoint, so it could be the only one allowed to use the feature flag.
  • To avoid duplicated events, we should probably use (at least for the moment) a new EventGate schema, not mediawiki.revision-score. We can come up with a new name and talk with Data Engineering about it. Then we could experiment with it, maybe instructing Changeprop to call Lift Wing once the MVP is ready. We'd have a nice test pipeline to see how things go, and where fixes are needed.
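The "simple Python function + feature flag" idea above could be sketched roughly as follows. The EventGate URL is the one used later in this task; the `send_events` flag name is an assumption, and the Content-Type/json.dumps details anticipate fixes that later patches in this task actually make:

```python
import json
import urllib.request

# EventGate main endpoint as used by the curl test later in this task.
EVENTGATE_URL = "https://eventgate-main.discovery.wmnet:4492/v1/events"


def build_request(event):
    """Build the HTTP POST for EventGate.

    The body must be JSON-encoded (json.dumps, not urlencode) and the
    Content-Type header must be set, as later patches in this task show.
    """
    return urllib.request.Request(
        EVENTGATE_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_event(event, send_events=False):
    """Send the event only when the caller opted in via the feature flag,
    so the behaviour is not exposed through API-Gateway by default."""
    if not send_events:
        return None
    return urllib.request.urlopen(build_request(event))
```

Only a trusted internal caller (e.g. ChangeProp via inference.discovery.wmnet) would pass `send_events=True`.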

Details

Related Changes in Gerrit:
Repo                                         Branch  Lines +/-
operations/deployment-charts                 master  +14 -1
machinelearning/liftwing/inference-services  main    +10 -14
machinelearning/liftwing/inference-services  main    +21 -1
operations/deployment-charts                 master  +1 -3
operations/deployment-charts                 master  +14 -0
machinelearning/liftwing/inference-services  main    +52 -28
operations/deployment-charts                 master  +5 -8
machinelearning/liftwing/inference-services  main    +51 -27
operations/deployment-charts                 master  +12 -7
operations/deployment-charts                 master  +15 -0
operations/deployment-charts                 master  +27 -4
machinelearning/liftwing/inference-services  main    +51 -27
integration/config                           master  +6 -0
machinelearning/liftwing/inference-services  main    +40 -29
machinelearning/liftwing/inference-services  main    +129 -94
operations/deployment-charts                 master  +1 -1
machinelearning/liftwing/inference-services  main    +1 -0
operations/deployment-charts                 master  +1 -1
machinelearning/liftwing/inference-services  main    +2 -2
operations/deployment-charts                 master  +12 -5
machinelearning/liftwing/inference-services  main    +114 -9
operations/deployment-charts                 master  +42 -0
operations/mediawiki-config                  master  +9 -0

Related Objects

Status    Assigned
Resolved  None
Resolved  elukey

Event Timeline


Totally makes sense, thanks for the clarifications!

Given the work to do on the Python side (no time for the new lib sorry), I am wondering if we could (for the moment) do the following:

  1. Add a new stream to mediawiki-config called mediawiki.revision-score-editquality.
  2. Push events from Lift Wing, related to the editquality models, to eventgate. The events would follow the revision-score schema.

In this way Lift Wing would just need to send a POST to eventgate, without extra bits. We'd use the new stream to test the new workflow, waiting for the new events mentioned in T301878#8008932.

Would it be feasible? I had totally missed your comment related to a new stream earlier on, sorry for the confusion and thanks for the patience :)

Ya that sounds good!

FWIW, if you are just trying to test stuff, you are welcome to produce directly to a test kafka topic. There will be no validation or automated Hive ingestion, but you can at least test the stream content?

Let's bikeshed final stream/name schema later in CR. :)

Ya that sounds good!

FWIW, if you are just trying to test stuff, you are welcome to produce directly to a test kafka topic. There will be no validation or automated Hive ingestion, but you can at least test the stream content?

Yep yep but I'd prefer to test the whole pipeline, so the new stream seems better overall! If I can avoid a kafka dependency + code in my pods is also good :)

Let's bikeshed final stream/name schema later in CR. :)

Sure! Going to prep some work and add you later on. Thanks again!

Change 810007 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/mediawiki-config@master] Add a new Eventgate stream for revision-score events

https://gerrit.wikimedia.org/r/810007

Looks good. My only worry is that making these new streams now, and planning to refactor their data model later based on T308017, might confuse folks? How do you expect people to use 'mediawiki.revision-score-editquality' now?

Looks good. My only worry is that making these new streams now, and planning to refactor their data model later based on T308017, might confuse folks? How do you expect people to use 'mediawiki.revision-score-editquality' now?

There is a little risk, yes, but people will likely reach out to ML or DE if they want to use it, and we can explain the use case. My idea is to test one new stream end to end (possibly using Data Platform's workflows instead of ChangeProp) to validate Lift Wing and the whole pipeline while we wait for the new refactor. If you want we could rename it 'mediawiki.revision-score-editquality-test' or 'mediawiki.revision-score-test' to add more clarity, what do you think?

Otherwise I can go back to the kafka approach, but it would limit a bit the overall testing.

One thing to note: it is a bit difficult to change the schema a stream uses after the fact. So maybe while doing the test and development, using a 'test' stream name of some kind does make sense.

Ah ok I was assuming that the revision score streams would have changed their name anyway after T308017, but it may not be the case so keeping the names available is good.

I updated the mediawiki config change, thanks!

Ah ok I was assuming that the revision score streams would have changed their name anyway after T308017

Yeah they might, so maybe it doesn't matter! But, probably best to keep them in a 'test' namespace anyway unless you are excited about someone finding them and using them :)

@Ottomata if you have time I'd need some help in deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/810007, I haven't done this in a while :) (otherwise I can add it to the deployment window schedule).

After the mw deployment, do I need to roll restart the eventgate main pods?

Change 810007 merged by jenkins-bot:

[operations/mediawiki-config@master] Add a new Eventgate stream for revision-score events

https://gerrit.wikimedia.org/r/810007

Mentioned in SAL (#wikimedia-operations) [2022-07-06T13:53:21Z] <urbanecm@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:810007|Add a new Eventgate stream for revision-score events (T301878)]] (duration: 03m 46s)

I tried to send an event manually with curl (see below) and I am getting:

context":{"message":"event 50571b69-47d4-4923-8502-8524fe5dfc61 of schema at /mediawiki/revision/score/2.0.0 destined to stream mediawiki.revision-score-test is not allowed in stream; mediawiki.revision-score-test is not configured.
{
   "$schema":"/mediawiki/revision/score/2.0.0",
   "meta":{
      "stream":"mediawiki.revision-score-test"
   },
   "database":"enwiki",
   "page_id":70553880,
   "page_title":"Victoria_Jubilee_Government_High_School",
   "page_namespace":0,
   "page_is_redirect":false,
   "performer":{
      "user_text":"Tmail4770",
      "user_groups":[
         "*",
         "user"
      ],
      "user_is_bot":false,
      "user_id":43710286,
      "user_registration_dt":"2022-04-12T11:55:08Z",
      "user_edit_count":1
   },
   "rev_id":1096766853,
   "rev_parent_id":1096765819,
   "rev_timestamp":"2022-07-06T14:14:40Z",
   "model_name":{
      "model_name":"goodfaith",
      "model_version":"0.5.1",
      "predictions":{
         "prediction":true,
         "probability":{
            "false":0.034048135316042005,
            "true":0.965951864683958
         }
      }
   }
}
curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d @test-revision-score.json https://eventgate-main.discovery.wmnet:4492/v1/events

Yup! And just fixed that link, thank you!

Mentioned in SAL (#wikimedia-operations) [2022-07-07T13:19:29Z] <elukey> roll restart eventgate-main pods to add a new stream - T301878

The curl command above works now! I can see the eqiad.mediawiki.revision-score-test topic in Kafka main too.

Change 812010 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: Add knative and egress config for eventgate-main

https://gerrit.wikimedia.org/r/812010

Change 812010 merged by Elukey:

[operations/deployment-charts@master] ml-services: Add knative and egress config for eventgate-main

https://gerrit.wikimedia.org/r/812010

Change 808247 merged by Elukey:

[machinelearning/liftwing/inference-services@main] editquality: add support for revision-score events

https://gerrit.wikimedia.org/r/808247

Change 813633 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: test event production for editquality in ml-staging

https://gerrit.wikimedia.org/r/813633

Change 813633 merged by Elukey:

[operations/deployment-charts@master] ml-services: test event production for editquality in ml-staging

https://gerrit.wikimedia.org/r/813633

Change 813834 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] editquality: use json.dumps instead of urlencode for EventGate

https://gerrit.wikimedia.org/r/813834

Change 813834 merged by Elukey:

[machinelearning/liftwing/inference-services@main] editquality: use json.dumps instead of urlencode for EventGate

https://gerrit.wikimedia.org/r/813834

Change 813866 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update image in ml-staging for enwiki editquality-goodfaith

https://gerrit.wikimedia.org/r/813866

Change 813866 merged by Elukey:

[operations/deployment-charts@master] ml-services: update image in ml-staging for enwiki editquality-goodfaith

https://gerrit.wikimedia.org/r/813866

Change 814098 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] editquality: set Content-type when sending events to EventGate

https://gerrit.wikimedia.org/r/814098

Change 814098 merged by Elukey:

[machinelearning/liftwing/inference-services@main] editquality: set Content-type when sending events to EventGate

https://gerrit.wikimedia.org/r/814098

Change 814718 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update Docker image for editquality goodfaith

https://gerrit.wikimedia.org/r/814718

Change 814718 merged by Elukey:

[operations/deployment-charts@master] ml-services: update Docker image for editquality goodfaith

https://gerrit.wikimedia.org/r/814718

I was able to generate a revision-score-test event from the enwiki editquality goodfaith model in ml-staging (verified that the event landed correctly on kafka).

Next steps: extend the event generation code to all other revscoring-based models. We should probably use this task to figure out if there is a way to factor out some common code in a base/common module, to avoid code repetition (now and in the future).

Change 820134 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] editquality: refactor Blubber config to share code

https://gerrit.wikimedia.org/r/820134

Change 820134 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] editquality: refactor Blubber config to share code

https://gerrit.wikimedia.org/r/820134

Change 820381 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] editquality: move rev-id preprocess functions to a separate module

https://gerrit.wikimedia.org/r/820381

Change 820388 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] articlequality: add code to send events to EventGate

https://gerrit.wikimedia.org/r/820388

Change 820401 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] draftquality: add code to send events to EventGate

https://gerrit.wikimedia.org/r/820401

Change 820418 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] drafttopic: add code to send events to EventGate

https://gerrit.wikimedia.org/r/820418

Change 820456 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] outlink: move Blubber config to the new standard

https://gerrit.wikimedia.org/r/820456

Change 821166 had a related patch set uploaded (by Elukey; author: Elukey):

[integration/config@master] zuul: trigger inference-services Docker image rebuild with more files

https://gerrit.wikimedia.org/r/821166

Change 821167 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] python: Add more info about Docker image rebuild

https://gerrit.wikimedia.org/r/821167

Change 820381 merged by Elukey:

[machinelearning/liftwing/inference-services@main] editquality: move rev-id preprocess functions to a separate module

https://gerrit.wikimedia.org/r/820381

Change 820388 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] articlequality: add code to send events to EventGate

https://gerrit.wikimedia.org/r/820388

Change 821166 merged by jenkins-bot:

[integration/config@master] zuul: trigger inference-services Docker image rebuild with more files

https://gerrit.wikimedia.org/r/821166

Change 821247 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update editquality's Docker image and settings

https://gerrit.wikimedia.org/r/821247

Change 821248 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: test the new Docker image for articlequality

https://gerrit.wikimedia.org/r/821248

Change 821247 merged by Elukey:

[operations/deployment-charts@master] ml-services: update editquality's Docker image and settings

https://gerrit.wikimedia.org/r/821247

Change 821248 merged by Elukey:

[operations/deployment-charts@master] ml-services: test the new Docker image for articlequality

https://gerrit.wikimedia.org/r/821248

Change 821263 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: add environment variables to editquality pods/isvcs

https://gerrit.wikimedia.org/r/821263

Change 821263 merged by Elukey:

[operations/deployment-charts@master] ml-services: add environment variables to editquality pods/isvcs

https://gerrit.wikimedia.org/r/821263

@Ottomata Hi! I am slowly rolling out the code that allows all revscoring-based models to push mediawiki.revision-score events to EventGate main (precisely, to a test stream). Now I'd like to do the next step, namely having something that:

  • listens to mediawiki.revision-create events
  • based on some easy rules, decides what Lift Wing endpoint/model to call (for example, call articlequality/editquality/etc. when an enwiki revision is created)

The current workflow, based on ChangeProp, assumes that ORES knows which models to call for any given revision/wiki combination. So ChangeProp simply calls ORES passing a revision-create event, and ORES generates as many scores as it is configured to produce. Lift Wing is more granular and its models are more independent, so our end goal is to have separate streams for separate models. I thought about Airflow at first, but it is a "batch" tool running every X minutes, while ChangeProp seems more oriented towards streaming. What do you think is best? Any new tools in DE that I should try? :)
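The "easy rules" part of the question above could look something like the sketch below: a small routing table mapping a revision-create event to the Lift Wing model servers that should score it. The routing table contents and the URL path are hypothetical (the `:predict` path is just the usual KServe convention); only the inference.discovery.wmnet hostname comes from this task:

```python
# Hypothetical routing table: which Lift Wing models score revisions
# from which wiki. In a real consumer this would live in config.
ROUTES = {
    "enwiki": ["enwiki-goodfaith", "enwiki-articlequality", "enwiki-drafttopic"],
}


def models_for(rev_create_event):
    """Return the list of Lift Wing models to call for this
    revision-create event (empty list if no rule matches)."""
    return ROUTES.get(rev_create_event.get("database"), [])


def endpoint_for(model):
    """Internal discovery endpoint; only this caller would be allowed
    to use the event-emission feature flag. Path is illustrative."""
    return f"https://inference.discovery.wmnet/v1/models/{model}:predict"
```

Whatever consumes revision-create events (Spark streaming, Flink, or a plain Kafka consumer, as discussed below) would call `models_for` on each event and POST the event to each returned endpoint.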

Hello! Separate streams for different models seems fine, but perhaps what you want are separate events for each model, not necessarily different streams? I guess it depends on how you expect people to consume these events. If you expect a given consumer to ever only be interested in a single model, then separate streams makes sense. However, if you expect a consumer to want some or all model scores, then keeping them in the same stream, or even in the same event, might be better. It makes reasoning about ordering easier. If a page is edited often, you'll want to make it as easy as possible to consume the scores in the order that the edits happen. The more events and streams you have the more you have to worry about ordering.

Anyway, more directly, I think you want a streaming model, no? We don't use spark streaming much, but here's a python example that pulls together Event Platform stuff like stream config and schemas, to get you a Spark streaming DataFrame: https://gist.github.com/ottomata/26e35c39ff4025d73238fba87c4e8e68

There are also ways to do a similar thing with Flink, which might be nice. Old example here (but don't use this as is, talk to me if you want to use Flink, we have better stuff now).

You could also just consume with a Kafka consumer :)

Change 821599 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update articlequality's docker image

https://gerrit.wikimedia.org/r/821599

Change 820401 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] draftquality: add code to send events to EventGate

https://gerrit.wikimedia.org/r/820401

Change 821599 merged by Elukey:

[operations/deployment-charts@master] ml-services: update articlequality's docker image

https://gerrit.wikimedia.org/r/821599

Change 821696 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update settings for draftquality to test the new image

https://gerrit.wikimedia.org/r/821696

Change 820418 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] drafttopic: add code to send events to EventGate

https://gerrit.wikimedia.org/r/820418

Change 821696 merged by Elukey:

[operations/deployment-charts@master] ml-services: update settings for draftquality to test the new image

https://gerrit.wikimedia.org/r/821696

Change 821699 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: apply the staging's draftquality image to prod

https://gerrit.wikimedia.org/r/821699

Change 821699 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: apply the staging's draftquality image to prod

https://gerrit.wikimedia.org/r/821699

Change 820456 merged by Elukey:

[machinelearning/liftwing/inference-services@main] outlink: move Blubber config to the new standard

https://gerrit.wikimedia.org/r/820456

Change 821167 merged by Elukey:

[machinelearning/liftwing/inference-services@main] python: Add more info about Docker image rebuild

https://gerrit.wikimedia.org/r/821167

Change 821749 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update drafttopic docker image and settings

https://gerrit.wikimedia.org/r/821749

Change 821749 merged by Elukey:

[operations/deployment-charts@master] ml-services: update drafttopic docker image and settings

https://gerrit.wikimedia.org/r/821749

All revscoring-based models are now able to accept a revision-create event, generate a revision-score one and send it to EventGate! \o/
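The transformation the model servers now perform can be sketched as below: copy page/revision fields out of the incoming revision-create event and wrap the model output in a revision-score-style body. The field set and nesting follow the sample event earlier in this task; keying the score block by model name is an assumption, and the function name is hypothetical:

```python
def make_score_event(rev_create, stream, model_name, model_version,
                     prediction, probability):
    """Build a revision-score-style event from a revision-create event
    plus a model prediction (sketch, not the actual server code)."""
    return {
        "$schema": "/mediawiki/revision/score/2.0.0",
        "meta": {"stream": stream},
        # Fields carried over from the revision-create event.
        "database": rev_create["database"],
        "page_id": rev_create["page_id"],
        "page_title": rev_create["page_title"],
        "rev_id": rev_create["rev_id"],
        "rev_timestamp": rev_create["rev_timestamp"],
        # Model output, keyed by model name (assumption).
        model_name: {
            "model_name": model_name,
            "model_version": model_version,
            "predictions": {
                "prediction": prediction,
                "probability": probability,
            },
        },
    }
```

The resulting dict is what gets POSTed to EventGate, which validates it against the /mediawiki/revision/score schema before producing it to Kafka.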

Hello! Separate streams for different models seems fine, but perhaps what you want are separate events for each model, not necessarily different streams? I guess it depends on how you expect people to consume these events. If you expect a given consumer to ever only be interested in a single model, then separate streams makes sense. However, if you expect a consumer to want some or all model scores, then keeping them in the same stream, or even in the same event, might be better. It makes reasoning about ordering easier. If a page is edited often, you'll want to make it as easy as possible to consume the scores in the order that the edits happen. The more events and streams you have the more you have to worry about ordering.

Definitely, the ordering is a valid point. From my team's perspective people should care about single model predictions (in theory), so we are more oriented towards separate streams. Moreover, the Lift Wing architecture points people to this view of things, since neither the API nor the way models are deployed via KServe offers a way to score a rev-id with multiple models and aggregate the result. We can proceed with this assumption and change course if anybody has doubts/suggestions as we move forward :)

Anyway, more directly, I think you want a streaming model, no? We don't use spark streaming much, but here's a python example that pulls together Event Platform stuff like stream config and schemas, to get you a Spark streaming DataFrame: https://gist.github.com/ottomata/26e35c39ff4025d73238fba87c4e8e68

Very nice, thanks a lot! So this Spark job would run on the Hadoop cluster as usual, configured via Puppet etc. It may be a good alternative to the current ChangeProp ORES config and code!

There are also ways to do a similar thing with Flink, which might be nice. Old example here (but don't use this as is, talk to me if you want to use Flink, we have better stuff now).

You could also just consume with a Kafka consumer :)

Will explore some options and report back, thanks as always!

So this spark job would run on the Hadoop cluster as always, configured via puppet

We don't currently maintain any streaming jobs in Hadoop for production, and I'd be reluctant to do so. However, it's a great way to develop!

You'd probably want to run whatever streaming job in k8s.

Resolved?!

Yes sorry we were grooming and I was supposed to add some wrap-up notes, you were too quick :D

  • The basic functionality to send events to EventGate from Lift Wing is deployed and working for most of the current model-servers.
  • The stream processor choice is a big one, T319214 and T320374 should give us good info about the direction to take.
  • The deprecation of revision-score is being discussed/worked-on in T317768.

As far as I can see there is no more work for this task, so I called the work done. If anything is missing lemme know and I'll reopen :)

Cool! Just curious as to the 'deployed/prod' state of the thing. Sounds like the WIP works, but prod things are still not decided/in flux?

No no, you can now hit Lift Wing with a request carrying a revision-create event and a corresponding revision-score event will be emitted to EventGate. I'd say that the functionality is prod-ready; we are now doing more in-depth testing with Benthos etc.

At the moment the Lift Wing code handles only revision-create events; to generalize the code we'll need more inputs (especially from the tasks mentioned in T301878#8008932), but it shouldn't take too much work.