
[Discuss] Future ORES architecture
Open, LowPublic

Description

Some tasks have contained discussion about changes to the ORES architecture. ORES is currently a collection of web nodes that dispatch work to Celery workers. This architecture lets us handle several different use-cases with predictability.

  • Realtime scoring of edits/revisions/pages as they are created/saved. (patrollers/bots)
  • Historical scoring of edits long after they are saved. (research/patroller/organizer)
  • Batch processing of large amounts of edits/revisions/pages. (research/analytics)

Some issues have been raised about ORES' current architecture:

  • Redis SPOF: Redis is a single point of failure (SPOF)
  • MWAPI IO: IO via MediaWiki API (MWAPI) calls takes a non-negligible amount of time
  • Nonstandard API: ores.wikimedia.org is an independent API endpoint

Some proposals have been raised for improving the functionality of ORES.

  • Feature store: Feature stores are becoming common in modern ML services/systems. We should invest in one.
  • Stream architecture: We could decouple IO operations and CPU operations using Kafka or some other streaming architecture to move away from our worker pool/result store strategy.
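To make the feature-store idea concrete: at its simplest, a feature store is a cache of extracted feature values keyed by model version and revision, so repeated scoring requests skip the MWAPI round trip. A minimal in-memory sketch, with all names hypothetical rather than ORES internals:

```python
# Minimal feature-store sketch. All names here are hypothetical
# illustrations, not ORES internals.

class FeatureStore:
    """Caches extracted feature vectors keyed by (model, version, rev_id)."""

    def __init__(self):
        self._store = {}

    def get(self, model, version, rev_id):
        return self._store.get((model, version, rev_id))

    def put(self, model, version, rev_id, features):
        self._store[(model, version, rev_id)] = features


def score(store, model, version, rev_id, extract, predict):
    """Score a revision, consulting the feature store before doing IO."""
    features = store.get(model, version, rev_id)
    if features is None:
        features = extract(rev_id)  # the expensive MWAPI IO happens here
        store.put(model, version, rev_id, features)
    return predict(features)
```

A second request for the same revision and model version would never touch `extract` at all; a production feature store would of course be shared and persistent rather than in-process.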

Event Timeline

Halfak created this task. Jun 20 2019, 3:00 PM
Restricted Application added a subscriber: Aklapper. Jun 20 2019, 3:00 PM

We could address quite a lot of the MWAPI IO by having ChangeProp package data with its request to ORES. We already have the functionality in place to accept such data during a request to ORES. E.g. we might have ChangeProp send the text of the current revision and of the parent revision. This would be a lot of data to send, but we'd be fetching it from the MWAPI anyway. Having ChangeProp send it to us would save the time of an API lookup within ORES. But it might be that ChangeProp would have to do the lookup anyway. Do the events that ChangeProp receives contain text data, or would it somehow have more direct access to such data than ORES' web nodes?
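If ChangeProp did package revision text alongside its request, the payload might look something like the sketch below. The field names are hypothetical illustrations, not the actual ORES injection API:

```python
import json

# Hypothetical shape of a ChangeProp -> ORES precache request that
# carries revision text inline, so ORES can skip its own MWAPI lookup.
# Field names here are illustrative, not the real ORES injection API.

def build_precache_payload(rev_id, parent_id, rev_text, parent_text):
    return {
        "rev_id": rev_id,
        "models": ["damaging", "goodfaith"],
        "injected_data": {
            "revision.text": rev_text,
            "parent_revision.id": parent_id,
            "parent_revision.text": parent_text,
        },
    }

payload = build_precache_payload(987654, 987650, "New text...", "Old text...")
body = json.dumps(payload)  # what ChangeProp would POST to ORES
```

The open question in the comment above still applies: this only saves time if ChangeProp already has the text in hand, rather than having to fetch it from the MWAPI itself.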

From T166161:

But FWIW, we also use Redis for managing our queue of Celery workers. So even if we were able to drop the use of Redis as a score cache, we'd still have Redis as a SPOF for ORES until we can either (1) transition away from Redis for managing our workers, (2) implement a cluster-based Redis strategy, or (3) simplify ORES away from handling batch processing requests and thus not need Celery at all. (1) is blocked because Ops would rather we use Redis than RabbitMQ. (2) is currently under discussion. (3) would result in a severe performance hit -- unless we manage to decouple feature extraction and thus reduce the cost/offload the complexity of IO.

If batch IO were somehow less of a performance win, then it would very likely be reasonable to drop our worker queue entirely and have the web workers do the scoring themselves. This would dramatically simplify the ORES architecture.
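The decoupling idea could be sketched as two stages connected by queues standing in for Kafka topics: one stage does the IO-bound feature extraction, the other does the CPU-bound scoring. A rough stdlib sketch, with all names hypothetical:

```python
import queue
import threading

# Sketch of decoupled IO and CPU stages. The two Queues stand in for
# Kafka topics; in a real deployment each stage would scale as an
# independent consumer group. All names are hypothetical.

SENTINEL = None

def io_stage(edits_in, features_out, extract):
    """IO-bound: extract features for each incoming edit event."""
    while True:
        event = edits_in.get()
        if event is SENTINEL:
            features_out.put(SENTINEL)
            break
        features_out.put((event["rev_id"], extract(event["rev_id"])))

def cpu_stage(features_in, scores_out, predict):
    """CPU-bound: run the model over pre-extracted features."""
    while True:
        item = features_in.get()
        if item is SENTINEL:
            break
        rev_id, features = item
        scores_out.append((rev_id, predict(features)))

def run_pipeline(events, extract, predict):
    edits, features, scores = queue.Queue(), queue.Queue(), []
    t1 = threading.Thread(target=io_stage, args=(edits, features, extract))
    t2 = threading.Thread(target=cpu_stage, args=(features, scores, predict))
    t1.start(); t2.start()
    for event in events:
        edits.put(event)
    edits.put(SENTINEL)
    t1.join(); t2.join()
    return scores
```

The point of the shape is that the IO stage can be scaled (or moved into ChangeProp, or replaced by a feature store) without touching the scoring stage, which is what would let the web workers do scoring directly.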

Halfak added a subscriber: ACraze. Jul 23 2019, 7:25 PM

@ACraze, this task should be interesting. I want to talk to you about some of ORES' limitations at some point and we can work together on a roadmap that makes sense.

Halfak claimed this task. Jul 23 2019, 7:25 PM
Halfak triaged this task as Low priority.
Halfak moved this task from Untriaged to Epic on the Scoring-platform-team board.

@Ottomata & @akosiaris, @ACraze and I had a conversation about future ORES architecture that brought us toward Kafka and Faust. We also discussed KubeFlow as a more powerful strategy for re-thinking our ML infra. Would you two be willing to have a quick chat with us to help us get a sense of what investing in these types of technologies would mean for your work and how you envision the future of tech at Wikimedia?

Ottomata added subscribers: JAllemandou, Nuria. Edited Aug 6 2019, 3:29 PM

For sure! Faust, cool! Hadn't heard of that. Looks a bit like Kafka Streams, but for Python.
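For anyone who hasn't seen it: a Faust agent is essentially an async function consuming a Kafka-backed stream. A rough stdlib-asyncio approximation of that shape (hypothetical names, no real broker; a real Faust agent would be declared with a decorator on the app and backed by Kafka):

```python
import asyncio

# Rough approximation of Faust's agent model using only asyncio:
# an "agent" is an async function that consumes events from a stream
# and emits results. This is just the shape, not Faust itself.

async def edit_events(events):
    """Stand-in for a Kafka topic stream of edit events."""
    for event in events:
        yield event
        await asyncio.sleep(0)  # yield control, as a broker read would

async def scoring_agent(stream, predict):
    """Consume edit events and score each one."""
    results = []
    async for event in stream:
        results.append((event["rev_id"], predict(event)))
    return results

def run(events, predict):
    return asyncio.run(scoring_agent(edit_events(events), predict))
```

The appeal for ORES would be keeping the scoring logic as plain async Python while Kafka handles the queueing that Celery/Redis do today.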

Would be happy to talk. I know @Nuria has some thoughts about future ML pipeline stuff. This discussion clearly overlaps with the Stream Processing component of Modern Event Platform, which has implications for things like Daniel Kinzler's dependency tracking project.

Sooo, yeah, this is a biggie.

More ML Pipeline techs to research:

No decision has been made about what general purpose stream processing framework we want to use (for stuff other than ML too), but @JAllemandou and I are partial to Flink. Flink can run the same (or mostly the same) programs in batch or streaming mode, in Scala or Python, and has a streaming SQL component, which would make debugging and building streaming applications really pleasant.

I don't actually have much practical experience with stream processing frameworks, and even less with ML and ML pipeline frameworks, but for both general purpose stream processing and ML pipeline stuff, integration with Kubernetes does seem pretty key.

I'll be gone August 8–25, and I know Nuria is out too. Perhaps we can set up a time to talk in September? Joseph will be back then too.

Hey, just saw this. I am around now. I have some minor ML pipeline experience so I am not sure of how much help I would end up being, but I wouldn't mind discussing.

elukey added a subscriber: elukey. Sep 12 2019, 6:52 AM