
Use RESTBase for ORES precaching
Open, Low · Public

Description

Currently ORES uses a per-DC Redis cluster for precaching analysis results, with no cross-DC replication. That means the scores are calculated twice, once in each DC, which doubles the significant amount of work that ORES is doing.

Since Redis cross-DC replication is not easy, we could use Cassandra to store the ORES results and reuse its support for cross-DC replication. In that case we'd only ever calculate the scores in one DC, relying on Cassandra to automagically make the scores appear in the other DC.

The storage spec is pretty straightforward: store one JSON blob per revision. When ORES itself changes, all the stored content must be purged. For that we'd use the content-regeneration filter: the content type emitted by ORES would have the version embedded, like application/json; profile=https://mediawiki.org/wiki/Specs/ORES/v1.1.0, and the RESTBase spec would have a configuration property with the expected content type. When the expected version is higher than the stored version, the stored content would be rejected and the newer version would be requested from ORES on demand. That way, ORES changes wouldn't create massive regeneration jobs; cached versions would be regenerated on demand. This approach is well tested with other RESTBase content types. One little downside is that a RESTBase deploy is required whenever the expected version changes, but that's a minor inconvenience.
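To make the version check concrete, here is a minimal Python sketch of the comparison described above. It is not RESTBase's actual filter code; the function names `profile_version` and `is_stale` are hypothetical, and it assumes the version is always the trailing `/vMAJOR.MINOR.PATCH` segment of the profile URL.

```python
import re

# Assumption: the profile parameter may or may not be quoted.
PROFILE_RE = re.compile(r'profile="?([^";\s]+)"?')
# Assumption: the version is the last path segment, e.g. ".../ORES/v1.1.0".
VERSION_RE = re.compile(r'/v(\d+)\.(\d+)\.(\d+)$')


def profile_version(content_type):
    """Return (major, minor, patch) from the profile URL, or None if absent."""
    profile = PROFILE_RE.search(content_type)
    if profile is None:
        return None
    version = VERSION_RE.search(profile.group(1))
    if version is None:
        return None
    return tuple(int(part) for part in version.groups())


def is_stale(stored_content_type, expected_content_type):
    """True when the stored blob should be rejected and re-requested."""
    expected = profile_version(expected_content_type)
    if expected is None:
        # No expected version configured: nothing to compare against.
        return False
    stored = profile_version(stored_content_type)
    if stored is None:
        # Unversioned stored content is treated as outdated (an assumption).
        return True
    # Tuples compare numerically, so 1.10.0 is newer than 1.9.0.
    return stored < expected
```

With this sketch, a blob stored as v1.1.0 would be rejected once the configured expected type names v1.2.0, and regenerated only when someone next requests it.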

In order to do that we'd need to put ORES behind RESTBase, and there's a bunch of open questions here:

  1. In the API, ORES exposes the context property, while the RESTBase API is domain-centric. We could expose the ORES endpoint under the global domain and continue with the current API, or we could mangle the ORES API to better fit the RESTBase API layout and have something like /{domain}/v1/page/scores/... where the domain represents the context.
  2. Currently ORES is normally accessed via the ores.wikimedia.org domain. To avoid latency in proxying requests we could add a Varnish-level redirect. Do we want that? Or we could instead add the ores.wikimedia.org domain to RESTBase and expose the ORES APIs completely separately from the rest of the RESTBase APIs.
  3. Currently the ORES API version is v3. Do we want to expose all the versions? If so, how would that fit with the fact that the global REST API version is currently v1? How stable are the global ORES API versions? Do we expect to have a v4 soon, or is that unlikely at this point?

Event Timeline

Pchelolo created this task. May 23 2017, 5:36 PM
Restricted Application added a project: Scoring-platform-team. May 23 2017, 5:36 PM
Restricted Application added a subscriber: Aklapper.

@Halfak it sounds good to me, what do you think?

We'll still want our own internal caching for batch processing, I think. We get a substantial speedup for requests that follow the pattern /enwiki/?revids=<50 or so rev_ids>&models=damaging|goodfaith as opposed to 50 requests of /enwiki/<single revid>/?models=damaging|goodfaith.

I should note that when one requests 50 revids, if 35 of them are already cached, then scores will intelligently only be generated for the remaining 15. This actually gets down to the model level, so we could have a different number of scores cached for damaging and goodfaith and the system would fill in the blanks. This is often used by researchers doing historical analysis. I just saw a few new analyses at WikiCite 2017.
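The fill-in-the-blanks behaviour can be sketched roughly as follows. This is an illustration only, not the ORES implementation: the plain dict stands in for the Redis cache, and `score_batch` and `score_fn` are hypothetical names.

```python
def score_batch(rev_ids, models, cache, score_fn):
    """Return {rev_id: {model: score}}, computing only what the cache lacks.

    cache maps (rev_id, model) -> score; a dict here, Redis in practice.
    score_fn(rev_id, model) is a hypothetical stand-in for real scoring.
    """
    # Cache lookups are per (revision, model) pair, so a revision that has
    # a cached "damaging" score but no "goodfaith" score only needs one
    # new score generated.
    missing = [
        (rev_id, model)
        for rev_id in rev_ids
        for model in models
        if (rev_id, model) not in cache
    ]
    for rev_id, model in missing:
        cache[(rev_id, model)] = score_fn(rev_id, model)
    return {
        rev_id: {model: cache[(rev_id, model)] for model in models}
        for rev_id in rev_ids
    }
```

A per-revision blob store would need some equivalent of this pair-level lookup to keep the batch speedup.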

This seems like a duplicate of T107196: Set up revscoring entry points in RESTBase when it comes to the API (putting ORES behind RESTBase).

For the precaching itself and batch requests, in the short term ORES could contact RESTBase about the scores instead of talking to Redis?

Halfak moved this task from Untriaged to Ideas on the Scoring-platform-team board. Jun 1 2017, 2:33 PM
Halfak triaged this task as Normal priority.
Halfak lowered the priority of this task from Normal to Low. Jul 20 2017, 3:27 PM
Pchelolo closed this task as Resolved. Aug 22 2018, 6:42 PM

Been done a long time ago.

Pchelolo reopened this task as Open. Aug 28 2018, 3:15 PM
Pchelolo edited projects, added Services (later); removed Services (done).

Ouch, it's not actually done. Not sure why I closed it.

awight removed a subscriber: awight. Mar 21 2019, 4:01 PM