Use RESTBase for ORES precaching
Open, Low, Public

Description

Currently ORES uses a per-DC Redis cluster for precaching analysis results, with no cross-DC replication. That means scores are calculated twice, once in each DC, which doubles the significant amount of work that ORES is doing.

Since Redis cross-DC replication is not easy, we could use Cassandra to store the ORES results and reuse its support for cross-DC replication. In that case we'd only ever calculate the scores in one DC, relying on Cassandra to automagically make the scores appear in the other DC.

The storage spec is pretty straightforward: store one JSON blob per revision. When ORES itself changes, all the stored content must be purged. For that we'd use the content-regeneration filter: the content type emitted by ORES would have the version embedded, like application/json; profile=https://mediawiki.org/wiki/Specs/ORES/v1.1.0, and the RESTBase spec would have a configuration property for the expected content type. When the expected version is higher than the stored version, the stored content would be rejected and the newer version requested from ORES on demand. That way, ORES changes wouldn't create massive regeneration jobs; cached versions would be regenerated on demand. This approach is well tested with other RESTBase content types. One small downside is that a RESTBase deploy is required whenever the expected version changes, but that's a minor inconvenience.
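To make the version check concrete, here is a minimal Python sketch of the comparison described above. It is not RESTBase's actual filter code: the function names `parse_profile_version` and `is_stale` are hypothetical, and the sketch assumes the version always appears as a trailing `/vMAJOR.MINOR.PATCH` segment of the profile URL, as in the example content type.

```python
import re

# Assumption: the version is the trailing /vX.Y.Z segment of the profile URL.
VERSION_RE = re.compile(r"profile=\S*/v(\d+)\.(\d+)\.(\d+)")


def parse_profile_version(content_type):
    """Return (major, minor, patch) from a content type, or None if absent."""
    match = VERSION_RE.search(content_type)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_stale(stored_content_type, expected_content_type):
    """True when stored content should be rejected and re-requested from ORES."""
    expected = parse_profile_version(expected_content_type)
    if expected is None:
        raise ValueError("expected content type carries no profile version")
    stored = parse_profile_version(stored_content_type)
    # Unversioned stored content is treated as stale (a choice of this sketch).
    # Tuples compare numerically, so v1.10.0 is correctly newer than v1.9.0.
    return stored is None or stored < expected
```

Comparing integer tuples rather than raw strings avoids the case where "1.10.0" would sort before "1.9.0".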

In order to do that we'd need to put ORES behind RESTBase, and there's a bunch of open questions here:

  1. In the API, ORES exposes the context property, while the RESTBase API is domain-centric. We could expose the ORES endpoint under the global domain and continue with the current API, or we could mangle the ORES API to better fit the RESTBase API layout and have something like /{domain}/v1/page/scores/... where the domain represents the context.
  2. Currently ORES is normally accessed via the ores.wikimedia.org domain. To avoid latency in proxying requests we could add a varnish-level redirect. Do we want that? Or, instead we could add the ores.wikimedia.org domain to RESTBase and expose ORES APIs completely separately from all the rest of the RESTBase APIs.
  3. Currently the ORES API version is v3. Do we want to expose all the versions? If so, how would that fit with the fact that the global REST API version is currently v1? How stable are the global ORES API versions? Do we expect to have a v4 soon, or is it unlikely at this point?

Event Timeline

Pchelolo created this task. May 23 2017, 5:36 PM
Restricted Application added a project: Scoring-platform-team. May 23 2017, 5:36 PM
Restricted Application added a subscriber: Aklapper.

@Halfak it sounds good to me, what do you think?

We'll still want our own internal caching for batch processing, I think. We get a substantial speedup for requests that follow the pattern /enwiki/?revids=<50 or so rev_ids>&models=damaging|goodfaith as opposed to 50 requests of /enwiki/<single revid>/?models=damaging|goodfaith.

I should note that when one requests 50 revids, if 35 of them are already cached, then scores will intelligently only be generated for the remaining 15. This actually gets down to the model level, so we could have a different number of scores cached for damaging and goodfaith and the system would fill in the blanks. This is often used by researchers doing historical analysis. I just saw a few new analyses at WikiCite 2017.
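The fill-in-the-blanks behaviour can be sketched roughly as follows. This is an illustration, not ORES's actual code: `fill_scores` and `score_batch` are hypothetical names, a plain dict stands in for the score cache, and the cache is keyed per (revision, model) pair so that damaging and goodfaith can be cached independently.

```python
def fill_scores(rev_ids, models, cache, score_batch):
    """Return {rev_id: {model: score}}, scoring only the cache misses.

    cache maps (rev_id, model) -> score.
    score_batch takes a list of (rev_id, model) pairs and returns a dict
    mapping each pair to its score; it is called at most once per request.
    """
    missing = [
        (rev_id, model)
        for rev_id in rev_ids
        for model in models
        if (rev_id, model) not in cache
    ]
    if missing:
        # One batched call for all misses, instead of one call per revision.
        cache.update(score_batch(missing))
    return {
        rev_id: {model: cache[(rev_id, model)] for model in models}
        for rev_id in rev_ids
    }
```

With 50 requested revisions of which 35 are already cached for a model, `score_batch` would receive only the 15 missing pairs for that model.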

This seems like a duplicate of T107196: Set up revscoring entry points in RESTBase when it comes to the API (putting ORES behind RESTBase).

For the precaching itself and batch requests, in the short term ORES could contact RESTBase about the scores instead of talking to Redis?

Halfak triaged this task as Normal priority. Jun 1 2017, 2:33 PM
Halfak moved this task from Untriaged to Ideas on the Scoring-platform-team board.
Halfak lowered the priority of this task from Normal to Low. Jul 20 2017, 3:27 PM
Pchelolo closed this task as Resolved. Aug 22 2018, 6:42 PM

Been done a long time ago.

Pchelolo reopened this task as Open. Aug 28 2018, 3:15 PM
Pchelolo edited projects, added Services (later); removed Services (done).

Ouch, it's not actually done. Not sure why I closed it.

awight removed a subscriber: awight. Mar 21 2019, 4:01 PM

This will reduce the load on the ORES cluster significantly by relieving it of httpd duty when serving cached results, and more results will be cached since Cassandra is disk-backed. Further from the action, it's also nice to eliminate the Redis machines and their upkeep.

Given the high proportion of effort that goes into calculating ORES scores for recent changes, I'm not sure whether this RESTBase ticket will initially accomplish anything more than swapping out the cache backend. But this change has the potential to completely decouple ORES from the front end and allow for architectural experiments like stream processing, or decoupling prediction from feature extraction. Because RESTBase can be fed from multiple pipelines (e.g. one system doing backfills and another following recent changes), it will allow the service to evolve.

[From the task description]

In order to do that we'd need to put ORES behind RESTBase, and there's a bunch of open questions here:

  1. In the API, ORES exposes the context property, while RESTBase API is domain-centric. We could expose the ORES endpoint under the global domain and continue with the current API or we could mangle the ORES api to better fit in RESTBase api layout and have something like /{domain}/v1/page/scores/... where the domain is representing the context

My +1 would be for domain-centric. This would be a good opportunity to bump the API version, to simplify the client changes.

ORES data is requested, generated, and refined by each wiki's community, so it's conceptually right that they would "host" the ORES models.

2 [.... ]Or, instead we could add the ores.wikimedia.org domain to RESTBase and expose ORES APIs completely separately from all the rest of the RESTBase APIs.

Now that's an interesting idea. It would make the migration extra cushy. FWIW, this is still compatible with the domain-centric suggestion above, we can expose old API versions on the ores.wikimedia.org domain and the new version on the project's API.

  3. Currently ORES version is v3 - do we want to expose all the versions? If so, how would it fit into the fact that currently the global REST API version is v1. How stable are the global ORES API versions? Do we expect to have a v4 soon or is it unlikely at this point?

I think we do want to expose all versions, but will defer to @Halfak. Whatever happens, let's tag metrics with API version and monitor on the ORES dashboard.

This seems like a duplicate of T107196: Set up revscoring entry points in RESTBase when it comes to the API (putting ORES behind RESTBase).

We could delete one or the other, but I'd say that the other task is very low-level, while this one approaches the problem at the right level for the initial stages of such a project. We need an "epic" at this point.

For the precaching itself and batch requests, in the short term ORES could contact RESTBase about the scores instead of talking to Redis?

Oh, a circular dependency ;-) Agreed that it's just a short-term workaround, but a good suggestion.

In the long-term, should we evaluate whether this is a functionality to generalize to all of RESTBase? Put another way, if some of the scores in a bulk request are already cached, there's no reason for RESTBase to have called ORES in the first place.

A plan for migration feels a bit like the chicken-or-egg dilemma, but it's probably safe to do it like this: 1) ORES keeps its own Redis, but lives behind RESTBase. We're keeping double the cache. 2) Migrate ORES to RESTBase; this can be safely rolled back. Congratulations, we avoided ever having a cold cache! 3) Eliminate Redis.
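One way to read those three steps is as a lookup order that changes per phase. The sketch below is only an interpretation: the phase names, `get_score`, and `compute` are hypothetical, and plain dicts stand in for Redis and for the RESTBase-backed store.

```python
PHASES = ("double_cache", "restbase_primary", "redis_removed")


def get_score(key, phase, redis_cache, restbase_store, compute):
    """Look up a score, consulting the two stores in a phase-dependent order."""
    if phase not in PHASES:
        raise ValueError("unknown phase: %r" % (phase,))

    if phase == "double_cache":
        # Step 1: ORES still answers from its own Redis; the RESTBase store
        # is filled as responses pass through, so it warms up in parallel.
        if key not in redis_cache:
            redis_cache[key] = compute(key)
        restbase_store[key] = redis_cache[key]
        return redis_cache[key]

    if phase == "restbase_primary":
        # Step 2: RESTBase storage is authoritative. Redis is still written
        # so that a rollback to step 1 does not start from a cold cache.
        if key not in restbase_store:
            score = redis_cache[key] if key in redis_cache else compute(key)
            restbase_store[key] = score
        redis_cache[key] = restbase_store[key]
        return restbase_store[key]

    # Step 3: Redis is gone; only the RESTBase store is consulted.
    if key not in restbase_store:
        restbase_store[key] = compute(key)
    return restbase_store[key]
```

During the first two phases both stores hold every score that was served, which is what makes the switch and the rollback safe.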

We certainly aren't struggling with serving cached results, so I don't think doing this will result in a "significant" decrease in load to the ORES cluster, but I agree that it would reduce some load and having scores stored indefinitely would be desirable -- especially if they could be stored with the model version.

In general, I think that any proposal to advance the ORES architecture (e.g. towards a streaming strategy or decoupling feature extraction) should be holistic, but of course steps can/should be incremental. This probably deserves its own task. As it's not something we're really considering right now (pending bringing @accraze up to speed), I'm not sure that's a productive direction to take this discussion.

But FWIW, we also use redis for managing our queue of celery workers. So even if we were able to drop the use of redis as a score cache, we'd still have redis as a SPOF for ORES until we can either (1) transition away from redis for managing our workers, (2) implement a cluster-based redis strategy, or (3) simplify ORES away from handling batch processing requests and thus not need celery. (1) is a blocker because Ops would rather us use redis vs. RabbitMQ. (2) is currently under discussion. (3) would result in a severe performance hit -- unless we manage to decouple feature extraction and thus reduce the cost/offload the complexity of IO.

So that said, I think the first step here would be to identify where ORES scores *should* be stored in RESTBase. Storing them along with page content would seem to make sense. RESTBase could also serve scores for individual revisions, but would not be suitable for any batch scoring jobs (AFAICT). So it seems to me that RESTBase can work as a client cache of ORES without turning over any scary rocks, but that some core ORES functionality should probably remain at ores.wikimedia.org until it makes sense to start completely overhauling how ORES works.

Given that ORES works pretty darn well for its core use-cases and how the team maintaining it is minimally staffed, it's hard to prioritize turning those scary rocks over right now, but a parallel discussion would be very valuable, so I've started T226193: [Discuss] Future ORES architecture to explore some of these ideas.

Joe added a subscriber: Joe. Edited Thu, Jun 20, 3:38 PM

Hi!

ORES is one of the few tools managing its own cache, which is actually a good thing.

SRE has been asking for a long time not to use RESTBase as both an API proxy and a storage layer at the same time. There is ongoing work on splitting RESTBase into API routing and storage components, so we should divide this ticket into two issues:

  1. Wiring ORES into our REST API (which I think might be desirable and should require no work on the ORES side), without RESTBase-level caching. Caching of fully-rendered results will happen at the edge (Varnish), and not internally.
  2. Substituting Redis with $service_in_front_of_cassandra as a cache storage system, but that would need some justification, as it would mean adding latency and cost to the caching system.