
Create generalized "precache" endpoint for ORES
Closed, ResolvedPublic

Description

Problem

There's a configuration block in the change propagation rules ( https://github.com/wikimedia/mediawiki-services-change-propagation-deploy/blob/master/scap/templates/config.yaml.j2#L327 ) dedicated to exhaustively listing supported (wiki, model) pairs.

Presumably, that matrix is already duplicated in several places. It'll be easy to reuse it if we write the change propagation filter as a lightweight glue module.

Also: the precache GET parameter seems to have a side-effect, which is not very RESTful. We should replace that with a PUT? It doesn't seem like the endpoint needs to receive a list of models, either--the notification is of a new revision, and the choice of what actions to take is up to the backend. (Which is always, evaluate all enabled models on each new revision.)

Solution

Implement /precache endpoint in ORES that will be able to receive a request for all changes and will internally work out that (1) it has contexts relevant to the change and (2) which models should be precached for it.
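A minimal sketch of what that endpoint's decision logic might look like. The config structure and function name here are illustrative assumptions, not the actual ORES implementation:

```python
# Hypothetical (wiki -> models) mapping; in reality this lives in ORES' own config.
SUPPORTED_CONTEXTS = {
    "enwiki": ["damaging", "goodfaith", "wp10"],
    "cswiki": ["damaging", "goodfaith"],
}

def models_to_precache(event):
    """Given a revision-create event, return the models to precache,
    or None if ORES has no context relevant to the change (the endpoint
    would then respond with an error status rather than score anything)."""
    wiki = event.get("wiki")
    if wiki not in SUPPORTED_CONTEXTS:   # (1) no relevant context
        return None
    return SUPPORTED_CONTEXTS[wiki]      # (2) models to precache
```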

Event Timeline

awight renamed this task from Replace ORES precaching change propagation configuration as a code module to Rewrite ORES precaching change propagation configuration as a code module.Oct 20 2016, 9:42 AM
mobrovac subscribed.

Presumably, that matrix is already duplicated in several places. It'll be easy to reuse it if we write the change propagation filter as a lightweight glue module.

There is a PR to move it to code. But I fail to see how that would eliminate the list? ChangeProp still needs to know when to send a request, and how to formulate that request (wiki, model).

Also: the precache GET parameter seems to have a side-effect, which is not very RESTful. We should replace that with a PUT? It doesn't seem like the endpoint needs to receive a list of models, either--the notification is of a new revision, and the choice of what actions to take is up to the backend. (Which is always, evaluate all enabled models on each new revision.)

This looks like an ORES-only problem and not at all related to the above, so should go into its own task, IMHO.

I think it is important that the primary config for precaching (through the ORES module or Change Propagation) live in the ORES config repos:

I think it is important that the primary config for precaching (through the ORES module or Change Propagation) live in the ORES config repos:

Right now we don't support picking up configs/modules from other locations, and it's not quite clear how to achieve that.

  1. Previously we've discussed a model with dynamic subscriptions, where the service (ORES in this case) makes an HTTP request on startup, providing ChangeProp with the config for when and how it should be notified; Change-Prop then starts executing the dynamic rule. However, this approach has a whole bunch of issues: it's not clear how to change or remove a subscription, and ChangeProp could be really dangerous if misconfigured, so having clear control over what's being executed is important.
  2. Next option: ORES provides a REST endpoint giving out this information. In this case the exact precaching config would be controlled by ORES, while the code executing the updates would still live in the CP repo, but it could be made quite independent of ORES itself.
  3. Finally, the last option I see is to declare a standard interface for CP updater modules and fetch them as git submodules, or publish them on NPM.

Personally I like option 2 better than the others and think it's worth exploring. However, it's also important to keep in mind that the ORES module needs to be used in the ReviewStreams project, so if we're going to proceed with option 2, the interface should be generic enough to support that.

Thanks for taking a look at this!

Options (1) and (2) seem problematic to me; wouldn't it be nicer if Change-Prop behavior were static, so that its configuration is predictable by anyone reading the code? I'm especially worried about "really dangerous if misconfigured"; I don't even know how we would fail closed in either of these scenarios. It would be a maintenance pain to catch up with missed revisions if the configuration isn't loaded, and obviously a bad thing to have extra data sprayed all over the place.

On that last note however--what about sending *all* revision-create events to ORES? Is that a huge waste of resources, or does it break the division of responsibilities you're hoping for? At that point, though, we're just one step from implementing a custom Kafka consumer...

I was personally imagining something between options (2) and (3), where we have a custom CP module which reads from config that's deployed to the CP server statically, as a git submodule or elsewhere on the filesystem.

Thanks for taking a look at this!

Options (1) and (2) seem problematic to me; wouldn't it be nicer if Change-Prop behavior were static, so that its configuration is predictable by anyone reading the code? I'm especially worried about "really dangerous if misconfigured"; I don't even know how we would fail closed in either of these scenarios. It would be a maintenance pain to catch up with missed revisions if the configuration isn't loaded, and obviously a bad thing to have extra data sprayed all over the place.

That's exactly why we decided to avoid these approaches, although something like that was planned initially.

On that last note however--what about sending *all* revision-create events to ORES? Is that a huge waste of resources, or does it break the division of responsibilities you're hoping for?

Right now ~1/3 of revision-create events result in an ORES update; however, given that the rate is pretty low (only about 20-30 events per second), that ORES is expanding support to other languages/projects, and that the gap is shrinking, I think it's totally fine. However, right now we're specifying the models to precompute. It looks like ORES accepts requests without an explicit 'models' specification, like https://ores.wikimedia.org/v2/scores/cswiki/?revids=12345&precache=true, so I think this is totally doable. If ORES is not supported for the project, a 404 is returned and ChangeProp will just ignore it. I think it's a good idea and I can implement it.

@Ladsgroup @mobrovac What's your opinion?

We need to limit requests in some way. E.g. there's no good reason to precache scores for bot edits. Removing those cuts down our load quite dramatically. If we were to check that in ORES, it would slow down request processing considerably.

I should note as well that other models will have new precaching dynamics. E.g. when we get the Detox models implemented, we'll want to limit precaching of those edits to talk namespaces (1, 3, 5, etc.)
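Per-model precaching rules like that could be expressed as simple predicates over the incoming event; a sketch, where the model name and event field are assumptions for illustration:

```python
def is_talk_namespace(ns):
    """MediaWiki talk namespaces are the odd-numbered ones (1, 3, 5, ...)."""
    return ns >= 0 and ns % 2 == 1

def should_precache_detox(event):
    # Hypothetical rule: only precache Detox scores for talk-page edits.
    return is_talk_namespace(event.get("page_namespace", 0))
```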

We need to limit requests in some way. E.g. there's no good reason to precache scores for bot edits. Removing those cuts down our load quite dramatically. If we were to check that in ORES, it would slow down request processing considerably.

We can still filter out the bot edits. Right now we only send a request to ORES for a non-bot edit when we know that ORES supports that domain (project), which ends up in this giant config that doesn't belong in our repo. What's proposed is to notify about all non-bot edits for all projects, and ORES would choose the models it needs to precache based on its own config.

I should note as well that other models will have new precaching dynamics. E.g. when we get the Detox models implemented, we'll want to limit precaching of those edits to talk namespaces (1, 3, 5, etc.)

We could add a namespace query parameter?

Indeed. So, if we're going to have ORES figure out which requests to honor based on a set of parameters, I'm not sure why we don't just use the precached utility that we implemented to solve this problem before Change Propagation was available. It certainly seems like a simpler solution.

$ ores precached -h
Runs a pre-caching server against an ORES instance by listening to an RCStream
and submitting requests for scores to the web interface for each revision as it
happens.

If run by systemd, it uses watchdog to stay alive.

:Usage:
    precached -h | --help
    precached <stream-url> <ores-url> [--config=<path>] [--delay=<secs>]
                                      [--verbose]

:Options:
    -h --help        Prints this documentation
    <stream-url>     URL of an RCStream
    <ores-url>       URL of base ORES webserver
    --delay=<secs>   The delay between when an event is received and when the
                     request is sent to ORES. [default: 0]
    --config=<path>  The path to a yaml config file
                     [default: config]
    --verbose        Print debugging information

This utility reads from ORES' config (the same one that runs the system) which contains details about what should be precached.

@Halfak As I remember from our conversation with @Ladsgroup at Wikimania, there were several issues with it, including instability (it crashed), lack of retry support, lack of support for rate limiting of updates, etc. In general CP is more reliable, as it's backed by Kafka and used for many other use-cases. Also, the results obtained by ChangeProp from ORES will not be thrown away once the ReviewStreams project kicks in -- they will be used to create a modified revision-create stream with ORES data included.

Doesn't ORES already decide which requests to honour? If I make a precache request for an unsupported project it returns 404 - isn't that all we want?

ORES will respond to any request a user sends. You can literally score any edit ever with any model. You're right that we'll 404 for a model/wiki that doesn't exist, but as I mentioned earlier, we don't want to precache every revision for every model. If we're going to be filtering revisions ourselves, then that logic is captured well in the precached utility (which no longer suffers from the crashing issue, and retries are not a big concern to us).

Still, I don't mean to suggest that we'd do this, so much as to show how we're talking about doing a lot of logic on ORES' side -- enough that I'm starting to wonder why we don't just listen to the stream ourselves.

Still, I don't mean to suggest that we'd do this, so much as to show how we're talking about doing a lot of logic on ORES' side -- enough that I'm starting to wonder why we don't just listen to the stream ourselves.

Then we'd need to think of a better solution here :)

If I understand this task correctly, the ultimate goal here is to design a method whereby service owners would also own the rules for updating the service, while still preserving all the advantages of the static config that we have right now. This question is not specific to ORES; we need a mechanism like that for other services too.

listen to the stream ourselves

Yeah, @Pchelolo if you don't mind, would you be able to position the option of (4) writing our own Kafka consumer for the revision-create topic in proper context?

I'm totally on-board with the ideal you just described, though--anything we do here should be easy, and should also help make future service integrations easier.

Writing your own Kafka consumer with proper multi-DC handling, retry, rate limiting, metrics etc is fairly complex, compared to letting changeprop handle this. Especially considering that changeprop will need the scores soon anyway, as @Pchelolo mentioned.

You'll need some logic to determine which models to support for any given project somewhere. Placing this logic in ORES itself doesn't seem any more complex than any other place, and has the benefit of also being usable for random "give me the default models for project X" requests.

From IRC crash-scroll:

<halfak> awight, how come you didn't like having an API endpoint that provides config info to Change Propagation?

My biggest concern is that we would lose repeatability and visibility. A bug in the config endpoint could cause infinite badness or even crash CP itself. And we wouldn't have insight into where the badness came from, without some very verbose logging.

Code-generating the CP config with a compile-time script in the ORES repo is a decent workaround for the moment, that would at least DRY things up. The manual finagling involved is unfortunate, but light years ahead of synchronizing the config files by eyeball.
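That compile-time approach could look roughly like this, assuming a hypothetical shape for ORES' config (the real file is YAML and its field names may differ):

```python
import json

# Hypothetical excerpt of ORES' own scoring config.
ORES_CONFIG = {
    "scoring_contexts": {
        "enwiki": ["damaging", "goodfaith"],
        "fawiki": ["damaging"],
    }
}

def generate_cp_pairs(config):
    """Flatten the ORES config into the (wiki, model) matrix that the
    ChangeProp config template currently lists by hand."""
    return [
        {"wiki": wiki, "model": model}
        for wiki, models in sorted(config["scoring_contexts"].items())
        for model in models
    ]

print(json.dumps(generate_cp_pairs(ORES_CONFIG), indent=2))
```

A script like this, run at deploy time, would keep the two configs from drifting apart without requiring any runtime coupling between the services.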

Writing your own Kafka consumer with proper multi-DC handling, retry, rate limiting, metrics etc is fairly complex, compared to letting changeprop handle this. Especially considering that changeprop will need the scores soon anyway, as @Pchelolo mentioned.

Thanks for the explanation, I'm happy to avoid the Kafkaesque complexity.

I'm uncomfortable with planning anything to accommodate ReviewStreams. As I understand it, it's unclear how we'll convert "score calculated" back into a usable stream, and I don't know whether we can assume that these scores should be handled by CP code at all.

OK. I was thinking about this more last night. We can have a precaching endpoint that takes all revision-create events from ChangeProp as a raw JSON object. This endpoint would have logic baked in for how a revision create event would be handled by ORES.

E.g. send something like {"wiki": "enwiki", "revision": {"new": 123456}, "bot": ""} as data to https://ores.wikimedia.org/v2/scores/precache/ and ORES will internally work out that (1) it has models for 'enwiki' and (2) which models should be precached for this bot edit.
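Sketching that internal decision for the example event above; the per-wiki model lists and the bot-edit rule are illustrative assumptions:

```python
# Hypothetical per-wiki model lists, standing in for ORES' real config.
CONTEXTS = {"enwiki": ["damaging", "goodfaith", "wp10"]}

def handle_precache(event, contexts=CONTEXTS):
    """Work out which models to precache for a raw revision-create event."""
    models = contexts.get(event.get("wiki"), [])
    if not models:
        return []       # (1) no models for this wiki
    if event.get("bot"):
        return []       # assumed rule: skip bot edits
    return models       # (2) models to precache for this edit
```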

With this, we'd have the flexibility of doing whatever we could do with RCStream in the precached utility and we'll be using ChangeProp for

multi-DC handling, retry, rate limiting, metrics

Hmmm, ideas... not really. A couple of comments, maybe:

Writing your own Kafka consumer with proper multi-DC handling, retry, rate limiting, metrics etc is fairly complex, compared to letting changeprop handle this.

I'd disagree with this. I don't think the Kafka consumer part is complex at all. Retries, rate limiting, and metrics (assuming these all refer to the HTTP request) could be more complicated, and change-prop is built to deal with this.

Looks like you want to deploy and run scalable and reliable code (not just configuration) that consumes a stream and does X, which sounds like stream processing to me! :D We'd like to build a stream processing system in production sometime soon, but we don't have this yet.

Halfak triaged this task as Medium priority.Feb 16 2017, 3:58 PM
Halfak moved this task from Unsorted to New development on the Machine-Learning-Team board.

E.g. send something like {"wiki": "enwiki", "revision": {"new": 123456}, "bot": ""} as data to https://ores.wikimedia.org/v2/scores/precache/ and ORES will internally work out that (1) it has models for 'enwiki' and (2) which models should be precached for this bot edit.

That actually sounds like a good idea to me. Is this endpoint live already? Do you have any docs for it?

That endpoint does not exist. I'd been waiting on feedback about it being a good/terrible idea. :) I'll get it prioritized. Thanks @Pchelolo!

Halfak renamed this task from Rewrite ORES precaching change propagation configuration as a code module to Create generalized "precache" endpoint for ORES.Feb 17 2017, 5:39 PM
Halfak claimed this task.
Halfak updated the task description. (Show Details)

@Halfak One thing we need to clarify is what to return in case the wiki is not supported.

In MCS we've started to stick with a pattern to return 501. If that's OK with you, 501 is preferable for us. 404 is quite bad in this case since 404 can be returned for many other reasons, like URI misconfiguration, so if we configure 404 to be OK, we might miss some bug in configuration.

That sounds fine to me. Was thinking that we could return 200 if the request is successfully handled whether or not a scoring process was started. But if 501 would help metrics/logging or something else, I'm happy with that.

I'm thinking that we'll want a minimal response that describes the action that was performed. Something like:
{"precache": {"context": "enwiki", "revid": 12345, "scored": ["damaging", "goodfaith"], "cached": ["reverted", "wp10", "draftquality"]}}

The "scored" field would contain the list of models for which a score was generated, and the "cached" field would contain the list of models for which a score was already cached.
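A sketch of how that response might be assembled, given some cache lookup; the function name and the cache's key shape are assumptions:

```python
def build_precache_response(context, revid, models, cache):
    """Split the wiki's models into newly scored vs. already cached,
    matching the minimal response format described above."""
    cached = [m for m in models if (context, revid, m) in cache]
    scored = [m for m in models if (context, revid, m) not in cache]
    return {"precache": {"context": context, "revid": revid,
                         "scored": scored, "cached": cached}}
```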

@Halfak Hm... So, in the context of Edit-Review-Improvements-ReviewStream project we might want to reuse the response to integrate it into the new stream. Why wouldn't we just return the normal response for that? Basically treat a precaching request like a request for scores with 'all models'?

That makes a lot of sense and I'm happy to not have to define a new response type :)

Just quickly noting that it seems we still have a memory leak in ores precached (the utility that runs in labs because ChangeProp isn't there).

The work that moves the logic out of ores precached and into this new endpoint should consider that ongoing issue.

Just quickly noting that it seems we still have a memory leak in ores precached (the utility that runs in labs because ChangeProp isn't there).

We have a CP instance in the beta cluster in labs, and it's updating ORES on deployment-sca03. Are you talking about some other ORES instance?

Yeah. We have an experimental installation at ores.wmflabs.org where we deploy experimental models that aren't ready for production yet.

Is there a ChangeProp in labs that will allow us to track the production wikis?

Is there a ChangeProp in labs that will allow us to track the production wikis?

Nope, we don't have access to prod Kafka from Labs.

Boo. OK. No worries. Onward with ores precached for now.

If there's enough need, we could set up a Kafka cluster in labs that is mirrored from Prod. And/or we can expose any (non-PII) events in prod Kafka via EventStreams. Please comment at T149736 if you want that :)

The endpoint is being implemented in https://github.com/wiki-ai/ores/pull/189. Once this gets merged and deployed, we can move the change propagation config to this endpoint. (I have some concerns about adding support for the extension; let's do that later.)

In T159615, @mobrovac asked about this. This change is now deployed. We should be ready to receive changeprop requests at this endpoint.

Thanks @Halfak for the info!

I've just taken a look at the PR and am wondering why you opted to consume the whole event rather than just use the parts that you need. The contract with CP could still be "send me just the relevant events", e.g. no events generated by bots. In fact, I think we should keep it that way to minimise network traffic: if both sides know an event would be discarded, then there is no need to send it in the first place. Another concern that I have is that the endpoint receives the event as a GET parameter. With it being in JSON format, there will be a lot of escaping. In general, such things should be consumed as POST bodies.

We did that because more data is needed: for example, there is a model that should only precache when the page is created and not on the following edits, so using more data we can omit such cases. I agree we should still not send bot edits (and @Halfak agreed too), but let's have another sanity check. For now, you can send just the needed information in the data argument. As a long-term solution, I think we can switch to POST. If you agree, I'll make a Phab card and the other stuff.

I agree that the minimum that should be done here is to switch to POST-style requests. Would you be available to work on it soon? I think it'd be better to have the POST endpoint in place before we even start sending traffic to the /precache endpoint.
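A POST-style request could be built like this, using only the standard library; the URL here is just a placeholder, not the eventual endpoint:

```python
import json
import urllib.request

def build_precache_request(event, url="https://ores.example/v2/precache/"):
    """Send the raw event as a JSON POST body instead of an escaped
    GET parameter. The caller would pass the result to urlopen()."""
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```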

Yeah, just please make a Phab card (a subtask of this) and assign it to me. I'll get it done ASAP.

Change 411087 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/change-propagation/deploy@master] Simplify ORES precaching by using the new endpoint

https://gerrit.wikimedia.org/r/411087

Argh, attached to the wrong bug, please ignore.