
Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score-<model>
Open, Needs Triage, Public

Description

Problem statement

The Machine Learning team inherited a big workflow to produce the mediawiki.revision-score stream, heavily based on ORES' architecture and capabilities. The current workflow is the following:

  • An edit happens, and a corresponding mediawiki.revision-create event is created.
  • ChangeProp catches the event, and calls the /precache endpoint in ORES with the event as payload.
  • ORES processes the special /precache call and returns a special response with multiple scores, one for each model configured in ORES for that wiki.
  • ChangeProp wraps the response from ORES (containing the scores) into a mediawiki.revision-score event, and sends it to EventGate.
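The steps above can be sketched as follows. All function names and event fields here are illustrative, not the actual ChangeProp/ORES implementation:

```python
# Sketch of the current ChangeProp -> ORES -> EventGate flow.
# Function names and event fields are illustrative only.

def precache(revision_create_event, wiki_models):
    """Simulate ORES' /precache endpoint: score the revision with every
    model configured for the event's wiki."""
    wiki = revision_create_event["database"]
    rev_id = revision_create_event["rev_id"]
    return {
        model: {"prediction": "stub", "rev_id": rev_id}
        for model in wiki_models.get(wiki, [])
    }

def to_revision_score_event(revision_create_event, scores):
    """Simulate ChangeProp wrapping the ORES response into a single
    mediawiki.revision-score event for EventGate."""
    return {
        "meta": {"stream": "mediawiki.revision-score"},
        "database": revision_create_event["database"],
        "rev_id": revision_create_event["rev_id"],
        "scores": scores,
    }

# Hypothetical per-wiki model configuration.
wiki_models = {"enwiki": ["damaging", "goodfaith", "articletopic"]}
event = {"database": "enwiki", "rev_id": 12345}
score_event = to_revision_score_event(event, precache(event, wiki_models))
```

Note how a single event ends up carrying the scores of every configured model, which is the coupling discussed below.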

Then, the mediawiki.revision-score events are available from multiple sources:

  • From HDFS, via the {event,event_sanitized}.mediawiki_revision_score Hive table.
  • From Eventstreams.
  • From the {eqiad,codfw}.mediawiki.revision-score topics in Kafka.
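To illustrate what consumers of these sources see today: a combined event bundles every model's score, so a consumer interested in a single model still receives (and must filter) all of them. A simplified, made-up event shape:

```python
# Simplified, illustrative shape of a combined mediawiki.revision-score
# event: one event carries the scores of every model configured for the wiki.
combined = {
    "database": "enwiki",
    "rev_id": 12345,
    "scores": {
        "damaging": {"prediction": ["false"],
                     "probability": {"false": 0.9, "true": 0.1}},
        "goodfaith": {"prediction": ["true"],
                      "probability": {"true": 0.95, "false": 0.05}},
    },
}

# A consumer that only cares about one model has to filter it out itself:
damaging_only = {k: v for k, v in combined.items() if k != "scores"}
damaging_only["scores"] = {"damaging": combined["scores"]["damaging"]}
```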

ORES is going to be replaced with Lift Wing, a new Kubernetes-based approach for serving ML models that we have been working on during the past couple of years. There are some differences between ORES and Lift Wing, but the biggest one is that (for the moment) we are not going to implement a score cache for Lift Wing, so we will not need a /precache-like endpoint. In ORES this meant writing heavily customized code to call multiple models from the same API endpoint, whereas in Lift Wing we followed a different approach: keep it simple and dedicate a separate endpoint to every model. The idea is to avoid entanglements and ease the deployment or deprecation of models, impacting as few existing workflows as possible.

Proposal

Lift Wing is able to generate mediawiki.revision-score events; we demonstrated the feature in T301878 by creating an ad-hoc test stream. The idea is to reduce the above workflow to something more streamlined:

  • ChangeProp (or Flink or Benthos or similar) listens for mediawiki.revision-create events.
  • Following simple logic, it decides which Lift Wing endpoints to call.
    • Every time an endpoint is called (so every time a model generates a score), a mediawiki.revision-score-<model-name> event is generated and sent to EventGate (directly by Lift Wing).
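The routing step could look roughly like this. The endpoint path and the per-wiki model configuration are assumptions for illustration, not the actual Lift Wing layout:

```python
# Sketch of the proposed per-model routing: one Lift Wing call and one
# per-model stream per score. Endpoint and stream names are assumptions.

def route(revision_create_event, wiki_models):
    """Decide which Lift Wing endpoints to call for an incoming
    mediawiki.revision-create event, and name the per-model stream each
    resulting score event would be sent to."""
    wiki = revision_create_event["database"]
    calls = []
    for model in wiki_models.get(wiki, []):
        calls.append({
            "endpoint": f"/v1/models/{model}:predict",  # illustrative path
            "stream": f"mediawiki.revision-score-{model}",
        })
    return calls
```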

Ad-hoc hacks like ores_update.js could be removed, simplifying the maintenance of other tools as well.

All the final consumption points (Kafka, HDFS/Hive, Eventstreams) would still be available, but they would of course expose different datasources (one for each model type). This would allow us to add/remove streams more easily, and it would allow people to selectively consume smaller sources of data focused on specific models.
Drawback: people would have more endpoints/datasources to check for a specific revision, but so far it seems that this shouldn't be a concern (we are not sure yet, which is why I opened this task :).

Ongoing/Related work

We are aware of T308017 and are following it closely, but at the same time we'd like to establish a timeline to deprecate ORES, and the revision-score stream is (at the moment) its biggest user.

What we are seeking

Comments and use cases about how the revision-score stream is used, and whether the above proposal could impact those uses negatively.

Event Timeline

To me, this approach makes a lot of sense. Anyone who needs something like mediawiki.revision-score-combined can just join all the relevant mediawiki.revision-score-<model-name> streams. One use case like that came to mind, WME, so I'm cc-ing @RBrounley_WMF.

The only possible problem would be if we need that -combined stream to be produced as quickly as possible in some cases. If that's important enough, we can always generate it in step 1 of your proposal above (with ChangeProp (or Flink or Benthos or similar)). So I think this is future-proof and makes a lot of sense. +1.
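The join mentioned above could look roughly like this: a toy in-memory join keyed on (database, rev_id). A real implementation would use Flink or similar, with windowing and timeouts; everything here is a sketch:

```python
from collections import defaultdict

def join_score_streams(events, expected_models):
    """Toy join of per-model mediawiki.revision-score-<model> events back
    into combined records, keyed on (database, rev_id). Emits a combined
    record once all expected models have reported for that revision."""
    pending = defaultdict(dict)
    combined = []
    for ev in events:
        key = (ev["database"], ev["rev_id"])
        pending[key][ev["model"]] = ev["score"]
        if set(pending[key]) == set(expected_models):
            combined.append({
                "database": key[0],
                "rev_id": key[1],
                "scores": pending.pop(key),
            })
    return combined
```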

+1 from me as well - I see only benefits in this move, except for the changes required of already existing consumers.

Tagging @Tgr as someone who might know more about Growth's consumption of the articletopic scores as described on Wikitech, which is a major Product consumer of this stream.

My general input: separate streams make much more sense to me (it's much more transparent / scalable, even if potentially more complicated to query). Thanks for doing this work!

Maybe out of scope, but I'm also aware of the ores_model and ores_classification tables on MariaDB (documentation). I assume this won't change, because it seems to be about ORES scores showing up in RecentChanges as opposed to this streaming pipeline, but I'm curious whether those tables are expected to remain supported or to be deprecated. It's very nice having a public history of model scores, but it's also a second set of data that needs maintenance, and frankly in the past I have found the tables very opaque as far as what is contained within them and how they are structured.

Hi Isaac! Thanks for your input :) The ORES extension use case is being discussed in T312518; in theory we could move the extension to the new Lift Wing endpoint, but we are wondering who is using the feature and whether it is worth supporting (or maybe changing its format, etc.). A lot of the traffic going through ORES is duplicated between the revision-score stream and the ORES extension; reducing the overlap would be great, but it may take time. If you want to follow up in T312518 please do, the more feedback the better!

Happy to provide details on that if needed, but AIUI this wouldn't change anything for users of the score information; it would just make the process of getting the information to where it needs to be more modular. In our case the place where it needs to be is the Elasticsearch index, so Search team members (who are already subscribed to this task) might be able to provide feedback on whether/how this change would affect them. We wouldn't be affected, and I don't know much about the current pipeline, so I don't have an opinion either way (other than the issue below).

Maybe out of scope, but I'm also aware of the ores_model and ores_classification tables on MariaDB (documentation).

Good point. These are needed for SQL joins for change lists (recent changes, user contributions, etc.). Granted, we probably shouldn't use SQL joins to construct those lists (see e.g. T307328: Scalability issues of recentchanges table), but that's a whole other conversation and a big effort to fix. So for now, keeping those DB tables functional is important.

IIRC currently they are created on demand, by MediaWiki querying the ORES API and storing the results. That workflow relies on the ORES API being cached (otherwise it would be very slow). Maybe it would make sense to flip it and push the information to MediaWiki as edits happen, instead of having MediaWiki pull it.
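The push flow described above could map an incoming score event to table rows roughly like this. The column names follow the ORES extension's documented ores_classification layout, but the helper, the class encoding, and the event shape are all simplified assumptions:

```python
# Sketch: turn a per-revision score event into rows for the ORES
# extension's ores_classification table. Illustrative only; the real
# table encodes classes as integers and has additional columns.

def event_to_classification_rows(event, model_ids):
    """Map a score event to one row per (model, class) probability.
    model_ids maps model names to the numeric ids used by ores_model."""
    rows = []
    for model, data in event["scores"].items():
        for cls, prob in data["probability"].items():
            rows.append({
                "oresc_rev": event["rev_id"],
                "oresc_model": model_ids[model],
                "oresc_class": cls,
                "oresc_probability": prob,
            })
    return rows
```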

A lot of traffic going through ORES is duplicate between the revision-score stream and the ORES extension, reducing the overlapping would be great but it may take time. If you want to follow up in T312518 please do, the more feedback the better!

Makes sense. And added my comment - thanks @elukey !

This convo (also discussed in T301878: Send score to eventgate when requested) is very relevant to T307959: [Shared Event Platform] Design and Implement POC Flink Service to Combine Existing Streams, Enrich and Output to New Topic, in that we'd like to standardize a process (or at least a stream/topic layout) for consuming streams, enriching them with extra data (like ORES scores), and producing new streams. One particularly relevant bit will be the stream/topic layout as suggested by @JAllemandou here. In that case, we are talking about page change streams with wiki content in them, which will make the events quite large, so we will likely not want to have all wiki projects in the same topics: someone wanting to consume changes for e.g. wiktionary would otherwise also consume changes for wikidata. For non-content streams this isn't a big deal, as the events are small enough, and filtering out unwanted events is fine. Anyway, more convos to have.

@Ottomata noted, will review the previous conversations! And if you need me in future ones please let me know :)

One thing that I didn't get - do you think that this task's proposal can be implemented, or is it clashing with the aforementioned work, so it's better to have more discussions? I am asking since I am trying to come up with a plan to deprecate ORES in favor of Lift Wing, and the revision-score use case is the biggest one to manage. Lemme know!

do you think that this task's proposal can be implemented

Ya absolutely! It can be done now with the current revision-score data model, or, if you are okay with waiting until we figure out page change schema, as discussed in https://phabricator.wikimedia.org/T301878#8008932, we can work together to make a nice unified data model for changes to pages, with a standardized way to add enriched info, like scores.

But also, I don't want to block, so proceed with the current data model too!

I am absolutely fine waiting for the new page change schema, but I'd like more info about the timeline of its rollout. Our goal is to deprecate ORES asap during the next months, so we'd prefer to avoid waiting too long (OOW hardware, old models, git-lfs, etc.). Lemme know :)

We put it in our current sprint to get a WIP 'test topic' version of it deployed, at least in beta, but maybe testwiki too. I've been traveling a lot recently and not had time to work on it. But I hope to really focus on it in the next month or two...so perhaps that is the timeline?

I'd hope to feel good about you basing a WIP revision score model on it by Oct 17.

Perfect! The ideal sunset time for ORES is during the next 6 months, so I think we can definitely work together on the new stream. Thanks!

FYI, we have deployed a rc0.mediawiki.page_change stream to group0 wikis! Example event here. It has the development/mediawiki/page/change schema. We put this in development/ as we wanted to indicate that it is still a WIP and subject to change.

Anyway, to extend this schema with scores, check out the development/mediawiki/page/change/current.yaml file. It $refs the /development/fragment/mediawiki/state/change/page/1.0.0 schema and then adds any extra information it needs, like the content_slots field.

You could do the same for a new /development/mediawiki/page_score_change/current.yaml schema (name to be bikeshedded :) ) like:

allOf:
  - $ref: /fragment/common/2.0.0#
  - $ref: /development/fragment/mediawiki/state/change/page/1.0.0
properties:
  revision:
    type: object
    properties:
      # your scores model here, whatever that might be
      scores:
        type: object
        properties:
          model_name: 
            type: string
          score:
            type: number
          prediction:
            type: string

If you really wanted to keep these based on revision entities instead of page entities, you can $ref the revision entity model and/or any of the other MediaWiki entity models we made. You could also put your scores model into its own reusable fragment schema, similar to how this content model has its own fragment schema.
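For concreteness, an event instance matching the schema sketch above might look like the following. All values are made up, and the stream and schema names are just the placeholders used in this thread:

```python
# Hypothetical event instance for the sketched page_score_change schema.
# Field values are illustrative only.
example_event = {
    "$schema": "/development/mediawiki/page_score_change/1.0.0",
    "meta": {"stream": "mediawiki.revision-score-articletopic"},
    "revision": {
        "rev_id": 12345,
        "scores": {
            "model_name": "articletopic",
            "score": 0.93,
            "prediction": "Culture.Media.Music",
        },
    },
}
```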

@Ottomata recognizing that this might be long past the time when you'd want this feedback but a question about an additional field:

Similar to is_redirect, we often use whether an article is a disambiguation / list page to determine how to handle it with ML models -- e.g., it's not intended behavior to run many models, like add-a-link or the topic model, on disambiguation / list pages. While I don't think list articles are easy to identify without making a call to Wikidata (I assume that's out of the question), disambiguation pages are tracked by MediaWiki -- e.g., https://en.wikipedia.org/w/api.php?action=query&titles=Albert&prop=pageprops&format=json&ppprop=disambiguation.
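For reference, that pageprops lookup can be built and parsed roughly like this. The helper names are hypothetical; only the API parameters are taken from the example URL above:

```python
from urllib.parse import urlencode

def disambiguation_query_url(title, host="en.wikipedia.org"):
    """Build the MediaWiki API URL that exposes the 'disambiguation'
    page prop, matching the example query above."""
    params = {"action": "query", "titles": title, "prop": "pageprops",
              "format": "json", "ppprop": "disambiguation"}
    return f"https://{host}/w/api.php?{urlencode(params)}"

def is_disambiguation(api_response):
    """True if any returned page carries the 'disambiguation' pageprop
    in the parsed JSON response."""
    pages = api_response.get("query", {}).get("pages", {})
    return any("disambiguation" in p.get("pageprops", {})
               for p in pages.values())
```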

What would be the process to consider whether this could be included as part of the page info in the event?