My understanding is that ORES will be called twice for any given revision (via an async job) to get the content scored and precached, working around the slowness of computing a score on demand.
This drives ORES usage up a lot, so we could think about enabling precaching only for the biggest wikis (enwiki, Wikidata, etc.). There would be some performance penalty for small wikis, but we'd reduce overall ORES usage considerably, and hopefully that would improve its overall performance and stability.
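As an illustrative sketch (not the actual ORES code, and with hypothetical names): precaching means an async job computes the score right after the edit, so a later API lookup becomes a fast cache hit instead of a slow model run.

```python
# Minimal sketch of the precache idea. `score_cache` stands in for the
# ORES Redis score cache; all names here are hypothetical.

score_cache = {}

def compute_score(wiki, rev_id):
    # In reality this runs the ML model and can take hundreds of ms.
    return {"damaging": {"score": {"prediction": False}}}

def precache(wiki, rev_id):
    """Called asynchronously on edit (e.g. via change-prop), so that
    later lookups are fast."""
    score_cache[(wiki, rev_id)] = compute_score(wiki, rev_id)

def get_score(wiki, rev_id):
    """API path: serve from cache if precached, else compute on demand."""
    key = (wiki, rev_id)
    if key not in score_cache:
        score_cache[key] = compute_score(wiki, rev_id)  # slow path
    return score_cache[key]

precache("enwiki", 12345)            # async job fires after the edit
result = get_score("enwiki", 12345)  # cache hit: no model run needed
```

Disabling precaching for a wiki simply means every `get_score` for that wiki takes the slow path, which is the performance penalty mentioned above.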
Event Timeline
I like the idea, but instead of the evaluation criterion being wiki size, we should look at popular vs. unpopular wikis in terms of API requests.
I just don't want to end up in a situation where we remove precaching on a small wiki whose community nonetheless relies on the speed precaching gives them.
Very interestingly, the pre-caching stuff is what powers https://stream.wikimedia.org/?doc#/streams/get_v2_stream_mediawiki_revision_score. The scores are sent to kafka and then exposed, so I am not sure if we can turn this off. It is also a good thing to keep in mind for Lift Wing.
Could there be a dedicated precache instance/cluster for ORES that didn't serve regular traffic?
Is the stream being used widely for all wikis? During one of my deployments I realized one of the models was getting information from the wrong wiki and no one noticed.
This stream is exposed publicly so may be used by the community for various purposes. It is also ingested into Hive and used by researchers and product analysts, but I don't have an understanding of how much or for what :)
This entire piece of infrastructure needs serious refactoring since there's a lot of duplication. Last discussion of this happened in T201868.
TL;DR, as far as I understand: there's a set of models enabled in ORES for various wikis. Some of the models are pre-cached in ORES Redis, populated via change-prop and double-processed in each datacenter, and that's what powers the revision-score stream. In addition, some other subset of ORES scores is stored in MySQL and populated via the job queue. The precached subset and the MySQL-stored subset of enabled scores are not the same. Change-prop and the job queue race each other on update, so in the worst-case scenario we calculate scores three times for each change (eqiad precache, codfw precache, eqiad job queue).
Finally solving this would not take a lot of time, but it would need some dedication from the ORES team, PET, and Analytics combined.
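The duplication described above can be sketched as three independent consumers all requesting a score for the same revision; with a shared check-and-set in front of the model, the three requests collapse into one computation. This is a hypothetical illustration, not the real change-prop or job queue code.

```python
# Three consumers (eqiad change-prop, codfw change-prop, eqiad job queue)
# race to score the same revision. Without shared deduplication the model
# would run three times; a shared cache check collapses it to one run.

compute_calls = 0
cache = {}

def compute_score(rev_id):
    global compute_calls
    compute_calls += 1  # count how often the expensive model actually runs
    return {"prediction": True}

def score_with_dedup(rev_id):
    # In a real deployment this check-and-set would need to be atomic and
    # shared across datacenters (e.g. something like Redis SETNX).
    if rev_id not in cache:
        cache[rev_id] = compute_score(rev_id)
    return cache[rev_id]

for consumer in ("eqiad-changeprop", "codfw-changeprop", "eqiad-jobqueue"):
    score_with_dedup(42)

# With the shared cache, the model ran once instead of three times.
```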
@Pchelolo the summary is great! I can certainly work from the ML/Analytics side when needed, so let me know if we can start scoping out the problem (maybe another task?).
One addition to what you listed above: the ML team is working on a new pipeline to serve models (called Lift Wing), all k8s-based (on Kubeflow). Our main idea was to leverage KNative to avoid loading models that aren't used at all (or to unload the ones accessed infrequently), but this use case of hitting the "score" API for every revision would of course make all that effort pointless :)
Two main things to work on in my opinion:
- Reducing the number of score API calls from two or three down to one would greatly cut traffic to the actual ORES infra.
- Deciding which score "streams" we'd like to publish long term, since we may not need to publish scores for all the wikis.
Let me know your thoughts and how I can help!
Reducing the number of score API calls from two or three down to one would greatly cut traffic to the actual ORES infra.
We could just stop precaching. I have never seen any data suggesting that precaching actually speeds things up for end users of the API. My gut feeling is that replacing pre-caching with regular caching (Varnish/ATS?) would not be detrimental in practice, but I have nothing to prove that either.
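The "regular caching" idea rests on the observation that a score for a given (wiki, revision, model) is immutable, so responses could carry long-lived cache headers and let the edge caches absorb repeat requests. The sketch below is illustrative; the header values are assumptions, not what ORES actually sends.

```python
# Hypothetical sketch: an ORES-like response with cache headers that
# Varnish/ATS could honor. Scores for a fixed revision never change,
# so they are safe to cache aggressively at the edge.

def score_response(wiki, rev_id, model):
    body = {
        wiki: {
            "scores": {
                str(rev_id): {model: {"score": {"prediction": False}}}
            }
        }
    }
    headers = {
        # Illustrative values: a day of edge caching for an immutable score.
        "Cache-Control": "public, max-age=86400",
        "Vary": "Accept-Encoding",
    }
    return headers, body

headers, body = score_response("enwiki", 12345, "damaging")
```

Under this model, repeat lookups for popular revisions would be served by the CDN without ever reaching the ORES cluster, which is what would replace the speedup precaching currently provides.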
If we do that, we won't need the change-prop job anymore, provided we move revision-score event creation to a more standard model using a hook and the EventBus extension.
more standard model using a hook and EventBus extension.
Meaning, EventBus would make a score request to ORES, and then submit the revision-score event?
No; right now scores are stored in MySQL as well (some subset of the enabled models). When we store them, we would call a hook and emit the event. Which scores to include in the event I don't know; right now it's some other subset of the enabled models. We can think about that and figure out the logic behind the different subsets when we are doing it.
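The hook-based flow described above can be sketched as follows. This is a hedged illustration in Python, not the real extension code (which would be PHP), and the event fields only loosely follow the shape of the public mediawiki revision-score stream.

```python
# Hypothetical sketch: once MediaWiki has stored the scores in MySQL,
# a hook handler builds a revision-score event and submits it to
# EventBus/EventGate. All names here are illustrative.

emitted_events = []  # stands in for the EventGate intake endpoint

def emit_event(event):
    emitted_events.append(event)

def on_revision_scored(wiki_domain, rev_id, scores):
    """Hypothetical hook handler, called after scores are persisted."""
    event = {
        "meta": {
            "stream": "mediawiki.revision-score",
            "domain": wiki_domain,
        },
        "rev_id": rev_id,
        # Which subset of enabled models to include is still an open
        # question, per the discussion above.
        "scores": scores,
    }
    emit_event(event)

on_revision_scored("en.wikipedia.org", 12345,
                   {"damaging": {"prediction": [False]}})
```

In this model the event is emitted exactly once, at store time, which removes the change-prop/job-queue race entirely.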