Implement new mediawiki.revision-score streams with Lift Wing
Closed, ResolvedPublic

Description

In T317768 new streams to replace mediawiki.revision-score have been created (see https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/884155).

We now need to:

  1. Add scale-up options to Lift Wing for ORES' most requested wikis (see https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&refresh=1m). Basically, we need to add settings in deployment-charts to support scaling some model servers up/down (say, all the enwiki ones) so that they can handle the traffic generated by the next point.
  2. Add the rules to change-prop to effectively implement the new streams.
  3. Confirm that the new streams are pulled by DE on Hadoop/Hive, and decide if we want to have them exposed in Event Streams as well.
  4. Ask folks using mediawiki.revision-score to migrate to the new streams (see https://wikitech.wikimedia.org/wiki/Search/articletopic).

Details

Repo                                          Branch   Lines +/-
machinelearning/liftwing/inference-services   main     +120 -48
operations/deployment-charts                  master   +1 -1
operations/deployment-charts                  master   +1 -1
operations/deployment-charts                  master   +1 -1
machinelearning/liftwing/inference-services   main     +6 -6
operations/deployment-charts                  master   +2 -1
operations/deployment-charts                  master   +1 -1
machinelearning/liftwing/inference-services   main     +5 -1
operations/deployment-charts                  master   +1 -0
operations/deployment-charts                  master   +1 -1
operations/deployment-charts                  master   +2 -1
operations/deployment-charts                  master   +14 -6
operations/deployment-charts                  master   +1 -1
operations/deployment-charts                  master   +14 -0
operations/deployment-charts                  master   +4 -0
operations/deployment-charts                  master   +9 -9
operations/deployment-charts                  master   +2 -2
operations/deployment-charts                  master   +9 -1

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2023-02-02T09:11:32Z] <elukey> roll restart of eventgate-main pods in wikikube eqiad/codfw to pick up new stream configs - T328576

Change 886917 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: add autoscaling to en/wikidata for goodfaith

https://gerrit.wikimedia.org/r/886917

Change 886918 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: add the first lift wing stream to change-prop

https://gerrit.wikimedia.org/r/886918

Decided to start with goodfaith, and filed two code reviews:

  • one to allow autoscaling for enwiki and wikidatawiki; the rest don't seem to need the scale-up settings, afaics. We could consider setting minReplicas to two by default, though.
  • one to add the first mediawiki.revision-score-goodfaith stream :)

Change 886917 merged by Elukey:

[operations/deployment-charts@master] ml-services: add autoscaling to en/wikidata for goodfaith

https://gerrit.wikimedia.org/r/886917

Change 888190 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] WIP - events: support multiple source events

https://gerrit.wikimedia.org/r/888190

@Ottomata I tried to add support for page_change in Lift Wing, it shouldn't be hard :) As far as I can see, all the info that we need to create a revision-score event is in page_change, so on that side we are good. I noticed that for the moment the RC streams are only on Jumbo, and due to how ChangeProp works it may be difficult for us to use them.. Is there a plan to mirror them to main?

Change 888653 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] changeprop: use a more generic name for events in liftwing's config

https://gerrit.wikimedia.org/r/888653

Awesome! Now if only we can move to the new data model too! :D But naw, that is probably for newer ML streams, right? These you are just trying to get off the old ORES backend? Or would you like to remodel these too?

I noticed that for the moment the RC streams are only on Jumbo, and due to how ChangeProp works it may be difficult for us to use them.. Is there a plan to mirror them to main?

mediawiki.page_change should be on main. We are at rc1, so on kafka-main1001, eqiad.rc1.mediawiki.page_change exists.

rc1.mediawiki.page_content_change is only on kafka-jumbo for now. That's the one that we hope to move over to wikikube + kafka main by the end of the quarter (with no SLOs yet).

@Ottomata I tried kafkacat -C -t eqiad.rc1.mediawiki.page_change -b localhost:9092 on kafka-main1001 and I don't get any event, meanwhile if I do it on Jumbo I see a stream of events, this is why I am asking (maybe there is something that I am missing).

OHH!!! YOU are right! We are producing the rc1s to eventgate-analytics-external, I forgot. Sorry about that.

Yes, the intention is to move to eventgate-main. Today I will increase deployment of this to group 1 wikis. After that we will go all wikis for a while. Then after that, we will remove the rc1 prefix, and move to eventgate-main.

Change 888653 merged by Elukey:

[operations/deployment-charts@master] changeprop: use a more generic name for events in liftwing's config

https://gerrit.wikimedia.org/r/888653

Updates:

  • Andrew rolled out rc1 page_change stream to Kafka Main, so we can test it with ChangeProp (thanks!).
  • After a brief chat on IRC, it seems that we should consider only page_change events whose page_change_kind field is either create or edit (to mimic revision-create as closely as possible). More brainstorming with @Ottomata is needed, but we can start testing it.
  • The code review to configure the first revision-score sub-stream with ChangeProp is ready; it has been changed to use page_change.
  • Before adding the rule to ChangeProp, we need to complete the review of https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/888190 and roll it out to Lift Wing.
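
The filtering rule discussed in the updates above can be sketched in Python. This is only an illustration: it assumes the event is already decoded into a dict and that `page_change_kind` is a top-level string field. The helper name `is_scorable` is hypothetical and is not part of Lift Wing or ChangeProp.

```python
# Kinds of page_change events that mimic the old revision-create stream.
SCORABLE_KINDS = {"create", "edit"}


def is_scorable(event: dict) -> bool:
    """Return True if a page_change event should trigger a score request.

    Assumes `page_change_kind` is a top-level string field of the event.
    """
    return event.get("page_change_kind") in SCORABLE_KINDS


events = [
    {"page_change_kind": "create"},
    {"page_change_kind": "edit"},
    {"page_change_kind": "delete"},
    {},  # malformed event without the field
]
# Only the create and edit events survive the filter.
kept = [e for e in events if is_scorable(e)]
```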

Change 888190 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] events: support multiple source events

https://gerrit.wikimedia.org/r/888190

Change 889773 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update docker images for outlink and revscoring

https://gerrit.wikimedia.org/r/889773

Before proceeding, let's list the ORES models that we currently have:

  • goodfaith
  • reverted
  • damaging
  • articletopic + itemtopic (wikidata)
  • drafttopic
  • draftquality
  • articlequality + itemquality (wikidata)

After a chat with Research it seems that they don't really need/care about any of the above streams, and that the following will happen:

Moreover from T328276, it seems that the Search team will need drafttopic.

The idea of the ML team was to add one stream for each of the ORES models, but as far as we can see we may want to reduce their number (it will surely increase maintainability and efficiency). So I propose the following:

  • In T328276 we'll add a new outlink stream.
  • In this task we start with drafttopic, so that the Search team will be able to move away from ORES' revision-score asap.
  • In T326179 we are working on the revert-risk one.

After the above work, we'll be able to decide what streams to add next. In my opinion we could simply add them if consumers have a use case, if not just avoid multiple streams when not needed.

Change 889773 merged by Elukey:

[operations/deployment-charts@master] ml-services: update docker images for outlink and revscoring

https://gerrit.wikimedia.org/r/889773

I know this is more work, and maybe not worth it since we want to eventually deprecate these ORES models (right?), but ...

The mediawiki/revision/score schema's scores field is a map field, meaning the data is actually a little bit difficult to query. If you are now only emitting one model score per event, you can do away with the complicated map field. Perhaps it would be worth slightly re-modeling to make a new mediawiki/revision/ores_score schema (or something)? It could be the same as mediawiki/revision/score, except with a single score field.

That is of course...you want to remodel based on entity state change (like page change)...JK! I know not for these older streams.
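
To illustrate the remodeling idea, here is a sketch of how a map-valued `scores` field could be flattened into a single `score` object when an event carries exactly one model. The field names inside each score (`model_name`, `prediction`, `probability`) and the sample values are assumptions made for the example, not copied from the schema, and `single_score` is a hypothetical helper.

```python
def single_score(event: dict) -> dict:
    """Return a copy of the event with `scores` (a map) replaced by `score`."""
    scores = event.get("scores", {})
    if len(scores) != 1:
        raise ValueError(f"expected exactly one model score, got {len(scores)}")
    # The map has exactly one entry, so unpack its only value.
    (score,) = scores.values()
    flattened = {k: v for k, v in event.items() if k != "scores"}
    flattened["score"] = score
    return flattened


# Hypothetical event in the map-based layout.
old_style = {
    "rev_id": 123,
    "scores": {
        "drafttopic": {
            "model_name": "drafttopic",
            "prediction": ["Culture.Media.Films"],
            "probability": {"Culture.Media.Films": 0.9},
        }
    },
}
new_style = single_score(old_style)
```

With a single `score` object, a query no longer needs to know the map key in order to reach the prediction.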

Makes sense! Does it make sense if we create version 3.x of the revision-score schema instead of another one? It seems to be the best compromise in my opinion - if we'll ever need to make changes to revision-score 2.x (I doubt it) we'll have the 2.x versioning to do it.

Does it make sense if we create version 3.x of the revision-score schema

Yes, that makes sense. And since these are new streams anyway, there are no backwards compatibility problems with doing this. This is exactly the reason why we don't do backwards compatibility checks for major versions!

:)

Research it seems that they don't really need/care about any of the above streams

Research might not, but what about the community at large? mediawiki.revision-score is exposed publicly at stream.wikimedia.org. We should probably do a big announcement with a long deprecation period before we turn off mediawiki.revision-score, eh?

Definitely yes, we are reaching out also to Enterprise since they have some code mentioning it. What I'd like to avoid is to keep around something that is not used only because we think that the community might use it, keeping tech debt and spending less time on more useful and productive streams.

Change 901671 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: add autoscaling settings for enwiki drafttopic

https://gerrit.wikimedia.org/r/901671

Change 901671 merged by Elukey:

[operations/deployment-charts@master] ml-services: add autoscaling settings for enwiki drafttopic

https://gerrit.wikimedia.org/r/901671

Change 886918 merged by Elukey:

[operations/deployment-charts@master] services: add the first lift wing stream to change-prop

https://gerrit.wikimedia.org/r/886918

Change 902123 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: stop changeprop's lift wing test

https://gerrit.wikimedia.org/r/902123

Change 902123 merged by Elukey:

[operations/deployment-charts@master] services: stop changeprop's lift wing test

https://gerrit.wikimedia.org/r/902123

Tried to deploy, and it didn't work as expected. The main reason is:

uri: '{{ $.Values.main_app.changeprop.liftwing.uri }}/v1/models/{{ `{{message.database}}` }}-{{ $model_name }}:predict'

I naively hardcoded "message.database" to select the wiki part of the host header for Lift Wing, but events like page_change don't have that field. I need to go back to the changeprop logic and tweak this bit :)
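
The problem can be pictured with a small Python sketch of the URI construction. It assumes, hypothetically, that older events expose the wiki as `database` while page_change events expose it as `wiki_id`. The helper names and the base URL are made up for the example and do not reflect the actual ChangeProp configuration.

```python
def wiki_of(event: dict) -> str:
    """Pick the wiki identifier from whichever field the event carries."""
    # Assumption: older events use `database`, page_change uses `wiki_id`.
    for field in ("database", "wiki_id"):
        value = event.get(field)
        if value:
            return value
    raise KeyError("event has neither 'database' nor 'wiki_id'")


def predict_uri(base: str, event: dict, model_name: str) -> str:
    """Build the `<base>/v1/models/<wiki>-<model>:predict` URI."""
    return f"{base}/v1/models/{wiki_of(event)}-{model_name}:predict"


# Hypothetical base URL and event, for illustration only.
uri = predict_uri("https://liftwing.example", {"wiki_id": "enwiki"}, "drafttopic")
```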

Change 902237 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] changeprop: improve liftwing streams configurability

https://gerrit.wikimedia.org/r/902237

Change 902307 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: add the first lift wing stream in changeprop

https://gerrit.wikimedia.org/r/902307

Change 902237 merged by Elukey:

[operations/deployment-charts@master] changeprop: improve liftwing streams configurability

https://gerrit.wikimedia.org/r/902237

Change 902307 merged by Elukey:

[operations/deployment-charts@master] services: add the first lift wing stream in changeprop

https://gerrit.wikimedia.org/r/902307

Change 902400 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: add option for lift wing config in changeprop staging

https://gerrit.wikimedia.org/r/902400

Change 902406 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: disable lift wing stream on changeprop

https://gerrit.wikimedia.org/r/902406

Tried to enable the stream, changeprop now works as expected but we need to fix drafttopic's support for page_change. The following is a validation error from eventgate:

"message":"'.performer' should have required property 'user_groups', '.performer' should have required property 'user_is_bot'","$schema":"/error/1.0.0","errored_schema_uri":"/mediawiki/revision/score/2.0.0","errored_stream_name":"mediawiki.revision_score_drafttopic"}

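A check like the one EventGate reported can be sketched in Python as a pre-flight validation. The two required property names come from the error message above; the function name and the sample values are hypothetical, and real validation would be done against the JSON schema itself.

```python
REQUIRED_PERFORMER_FIELDS = ("user_groups", "user_is_bot")


def missing_performer_fields(event: dict) -> list:
    """List required `.performer` properties that are absent from the event."""
    performer = event.get("performer", {})
    return [f for f in REQUIRED_PERFORMER_FIELDS if f not in performer]


# A performer block lacking both required properties, as in the error above.
incomplete = {"performer": {"user_text": "Example"}}
# A performer block that carries both of them.
complete = {
    "performer": {
        "user_text": "Example",
        "user_groups": ["*"],
        "user_is_bot": False,
    }
}
```

Running such a check before producing the event would surface the missing fields in the model server rather than at EventGate.
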
Change 902406 merged by Elukey:

[operations/deployment-charts@master] services: disable lift wing stream on changeprop

https://gerrit.wikimedia.org/r/902406

Change 902400 merged by Elukey:

[operations/deployment-charts@master] services: add option for lift wing config in changeprop staging

https://gerrit.wikimedia.org/r/902400

Change 902430 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] events: fix revision_score support for page_change

https://gerrit.wikimedia.org/r/902430

Change 902430 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] events: fix revision_score support for page_change

https://gerrit.wikimedia.org/r/902430

Change 902678 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update docker image for draft topic model servers

https://gerrit.wikimedia.org/r/902678

Change 902678 merged by Elukey:

[operations/deployment-charts@master] ml-services: update docker image for draft topic model servers

https://gerrit.wikimedia.org/r/902678

Change 902684 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: update lift wing config for changeprop's staging env

https://gerrit.wikimedia.org/r/902684

Change 902684 merged by Elukey:

[operations/deployment-charts@master] services: update lift wing config for changeprop's staging env

https://gerrit.wikimedia.org/r/902684

Change 902689 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] events.py: prioritize the excp handling of ClientResponseError

https://gerrit.wikimedia.org/r/902689

Change 902689 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] events.py: prioritize the excp handling of ClientResponseError

https://gerrit.wikimedia.org/r/902689

Change 902721 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update docker image for draft topic model servers

https://gerrit.wikimedia.org/r/902721

Change 902721 merged by Elukey:

[operations/deployment-charts@master] ml-services: update docker image for draft topic model servers

https://gerrit.wikimedia.org/r/902721

Change 902723 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update docker image for goodfaith model servers

https://gerrit.wikimedia.org/r/902723

Change 902723 merged by Elukey:

[operations/deployment-charts@master] ml-services: update docker image for goodfaith model servers

https://gerrit.wikimedia.org/r/902723

Finally we have something working, I've just tested ~20k events in changeprop staging, hitting the ml-staging-codfw's goodfaith backend, and I didn't see a validation error from EventGate.

Change 902725 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: enable (again) the first lift wing stream in changeprop

https://gerrit.wikimedia.org/r/902725

Change 902725 merged by Elukey:

[operations/deployment-charts@master] services: enable (again) the first lift wing stream in changeprop

https://gerrit.wikimedia.org/r/902725

Stream deployed, I see traffic!!

https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?orgId=1&var-cluster=codfw%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-drafttopic&var-backend=All&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99

Weird that it is only in ml-serve-codfw, not eqiad, I expected traffic in both DCs (since the page change topics are prefixed with eqiad|codfw and present in both DCs..). Will investigate :)

It is ok to see events only in Lift Wing codfw because, as expected, page_change emits events only in the DC where MediaWiki is accepting edits. My previous comment was a clear Friday PEBKAC :)
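
The topic naming behind this observation can be sketched as follows. The datacenter prefixes and the `rc1.mediawiki.page_change` stream name come from the discussion above; the helper names are hypothetical, and the idea that only the primary DC's topic receives events is a simplification of the behaviour just described.

```python
DATACENTERS = ("eqiad", "codfw")


def topics_for(stream: str) -> list:
    """Return the per-datacenter Kafka topic names for a stream."""
    return [f"{dc}.{stream}" for dc in DATACENTERS]


def active_topic(stream: str, primary_dc: str) -> str:
    """Return the only topic expected to receive events.

    Simplification: events are produced only in the DC accepting edits.
    """
    if primary_dc not in DATACENTERS:
        raise ValueError(f"unknown datacenter: {primary_dc}")
    return f"{primary_dc}.{stream}"


all_topics = topics_for("rc1.mediawiki.page_change")
busy_topic = active_topic("rc1.mediawiki.page_change", "codfw")
```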

Status:

  • draft topic stream working fine
  • no pressure on Lift Wing codfw, we manage it very well.
  • we are still using revscoring 2.x events in Lift Wing, we'll see if we want to upgrade to a different schema later on (it also depends on what the Search team needs etc..).

To complete the pipeline, we'll need to verify how/if events are ingested in hdfs/hive automatically.

Verified with Joseph, the data can be seen in hive -> event database -> mediawiki_revision_score_drafttopic table \o/
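
The relationship between the stream name and the Hive table name seen here can be sketched in Python. The exact normalization rule used by the ingestion pipeline is an assumption (every character outside letters, digits and underscores becomes an underscore), and `hive_table_name` is a hypothetical helper.

```python
import re


def hive_table_name(stream: str) -> str:
    """Map a stream name to a Hive-friendly table name."""
    # Assumption: every character outside [A-Za-z0-9_] becomes an underscore.
    return re.sub(r"[^A-Za-z0-9_]", "_", stream)


table = hive_table_name("mediawiki.revision_score_drafttopic")
```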

Next steps:

  • Verify if we need other revision-score streams, if not we are done :)

This is complete, I don't think that there are more streams to migrate over. The only nit to fix is that the page_change stream is still a release candidate, but DE will alert us when it is ready and we'll easily switch ChangeProp's config.