Problem statement
The Machine Learning team inherited a big workflow to produce the mediawiki.revision-score stream, heavily based on ORES' architecture and capabilities. The current workflow is the following:
- An edit happens, and a corresponding mediawiki.revision-create event is created.
- ChangeProp catches the event, and calls the /precache endpoint in ORES with the event as payload.
- ORES processes the special /precache call, and returns a special response with multiple scores. The scores come from the models that ORES has configured for that specific wiki.
- ChangeProp wraps the response from ORES (containing the scores) into a mediawiki.revision-score event, and sends it to EventGate.
Then, the mediawiki.revision-score events are available from multiple sources:
- From HDFS, via the {event,event_sanitized}.mediawiki_revision_score Hive table.
- From EventStreams.
- From the {eqiad,codfw}.mediawiki.revision-score topics in Kafka.
ORES is going to be replaced with Lift Wing, a new Kubernetes-based approach for serving ML models that we have been working on for the past couple of years. There are some differences between ORES and Lift Wing, but the biggest one is that (for the moment) we are not going to implement a score cache for Lift Wing, so we will not need a /precache-like endpoint. In ORES, the /precache endpoint required (heavily customized) code to call multiple models from the same API endpoint, whereas in Lift Wing we followed a different approach: keep it simple and dedicate a separate endpoint to every model. The idea is to avoid entanglements and ease the deployment or deprecation of models, while impacting as few existing workflows as possible.
Proposal
Lift Wing is able to generate mediawiki.revision-score events; we demonstrated the feature in T301878 by creating an ad-hoc test stream. The idea is to reduce the above workflow to something more streamlined:
- ChangeProp (or Flink or Benthos or similar) listens for mediawiki.revision-create events.
- Following some simple logic, it decides which Lift Wing endpoints to call.
- Every time an endpoint is called (so every time a model generates a score), a mediawiki.revision-score-<model-name> event is generated and sent to EventGate (directly by Lift Wing).
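The "simple logic" in the steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the per-wiki model mapping, the base URL, the endpoint path layout and the helper names are all hypothetical, and the `database` field is assumed to carry the wiki name in the revision-create event. Only the stream naming pattern (mediawiki.revision-score-<model-name>) comes from the proposal itself.

```python
from typing import Dict, List

# Hypothetical mapping from wiki name to the models enabled for it.
# In practice this would live in the configuration of ChangeProp
# (or Flink, Benthos, ...), not in code.
MODELS_BY_WIKI: Dict[str, List[str]] = {
    "enwiki": ["damaging", "goodfaith"],
    "itwiki": ["damaging"],
}

# Hypothetical base URL; the real Lift Wing endpoint layout may differ.
LIFT_WING_BASE = "https://liftwing.example.org/v1/models"


def endpoints_for(event: dict) -> List[str]:
    """Return the (hypothetical) Lift Wing endpoints to call for one
    mediawiki.revision-create event, one endpoint per model."""
    wiki = event.get("database", "")
    return [
        f"{LIFT_WING_BASE}/{wiki}-{model}:predict"
        for model in MODELS_BY_WIKI.get(wiki, [])
    ]


def stream_for(model: str) -> str:
    """Name of the per-model stream, following the pattern in the proposal."""
    return f"mediawiki.revision-score-{model}"
```

With this shape, adding or deprecating a model for a wiki is a one-line change in the mapping, and wikis without configured models simply produce no calls.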
Ad-hoc hacks like ores_update.js could be removed, simplifying the maintenance of other tools as well.
All the final consumption points (Kafka, HDFS/Hive, EventStreams) would still be available, but each would expose a separate data source for each model type. This would allow us to add/remove streams more easily, and it would allow people to subscribe only to the data sources for the models they care about.
Drawback: people would have more endpoints/data sources to check for a specific revision, but so far it seems that this shouldn't be a concern (we are not sure yet, which is why I opened the task :).
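To make the drawback concrete, a consumer that still wants all scores for a revision in one place would need to recombine the per-model events itself. A minimal sketch, assuming each per-model event carries `database`, `rev_id`, `model_name` and `score` fields (these field names are hypothetical, since the schema of the per-model events is not defined yet):

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

RevisionKey = Tuple[str, int]


def merge_scores(events: Iterable[dict]) -> Dict[RevisionKey, Dict[str, dict]]:
    """Group per-model score events by (wiki, revision id).

    The event field names used here are assumptions for illustration.
    """
    merged: Dict[RevisionKey, Dict[str, dict]] = defaultdict(dict)
    for event in events:
        key = (event["database"], event["rev_id"])
        # A later event for the same revision and model overwrites the earlier one.
        merged[key][event["model_name"]] = event["score"]
    return dict(merged)
```

A consumer interested in a single model would skip this step entirely and read only that model's stream.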
Ongoing/Related work
We are aware of T308017 and we are following it closely, but at the same time we'd like to establish a timeline to deprecate ORES, and the revision-score stream is (at the moment) its biggest user.
What we are seeking
Comments on how the revision-score stream is currently used, concrete use cases, and whether the above proposal could negatively impact them.