
[Epic] Implement ORES service proxy in api.php
Closed, Resolved, Public

Description

This task is done when ORES scores can be accessed for arbitrary revisions through api.php.

Right now there are two ways to store edit scores in the database:
1. At edit time, if the extension is deployed and the ORES service is up at that moment.
2. Via the maintenance script, for a limited number of edits (5K at most).

But we need a way (e.g. an API endpoint) that can request scores from the ORES service and store them (probably batching requests if needed).
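
As a rough illustration (not a spec), the endpoint could do something like the following for a batch of revisions. The URL and response layout follow the public ORES v3 API; the function name, the 0.5 threshold and the row shape are invented for the example and do not reproduce the extension's real schema.

```php
<?php
// Rough sketch only: fetch 'damaging' scores for a batch of revisions from
// the ORES service, in the shape the proposed API endpoint might use before
// caching them.

function fetchDamagingScores( $wiki, array $revIds ) {
	// ORES supports batching: revision IDs are joined with '|'
	$url = "https://ores.wikimedia.org/v3/scores/$wiki/"
		. '?models=damaging&revids=' . implode( '|', $revIds );

	// A real API module would use MediaWiki's HTTP client with an explicit timeout.
	$data = json_decode( file_get_contents( $url ), true );

	$rows = [];
	foreach ( $revIds as $revId ) {
		$score = $data[$wiki]['scores'][$revId]['damaging']['score'] ?? null;
		if ( $score === null ) {
			// Deleted revision, ORES error, or timeout: leave it uncached.
			continue;
		}
		$rows[] = [
			'rev_id' => $revId,
			'probability' => $score['probability']['true'],
			'is_predicted' => $score['probability']['true'] > 0.5,
		];
	}
	// These rows would then be written to ores_classification (schema omitted here).
	return $rows;
}

// Example: fetchDamagingScores( 'enwiki', [ 123456, 123457 ] );
```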

Event Timeline

If I read the code correctly, Extension:ORES copies ORES scores to the ores_classification table, but only does it for revisions which are in recentchanges table, even with the maintenance script. So basically for the prop module you'd want to look up the given revisions in ores_classification, and if some of those are not found it makes a bunch of parallel requests to the REST API, right? And for the filter options, just hope the data is in the table, since doing otherwise would require an arbitrary amount of backfilling and then repeating the query, which is not realistically doable. For recent changes that would be OK-ish since normally the data is always present; for watchlist/contribs, it would require backfilling data for the whole revision table somehow. Or should edits that happened before the extension was deployed always get filtered out?

Hey @Tgr, thanks for the comment

If I read the code correctly, Extension:ORES copies ORES scores to the ores_classification table, but only does it for revisions which are in recentchanges table, even with the maintenance script.

Not correct, we do it for every revision after the deployment.

So basically for the prop module you'd want to look up the given revisions in ores_classification, and if some of those are not found it makes a bunch of parallel requests to the REST API, right?

ORES service supports batch requests too.
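
For reference, a single batched request against the public v3 endpoint looks roughly like this (revision IDs, and models, joined with |):

```
https://ores.wikimedia.org/v3/scores/enwiki/?models=damaging|goodfaith&revids=123456|123457|123458
```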

And for the filter options, just hope the data is in the table, since doing otherwise would require an arbitrary amount of backfilling and then repeating the query, which is not realistically doable. For recent changes that would be OK-ish since normally the data is always present; for watchlist/contribs, it would require backfilling data for the whole revision table somehow.

In the GUI, the plan is to run JavaScript on top of the page that, when a score is not in the db, makes requests to the service, shows the scores to the user, and stores them in the db. I don't know how to implement that in the API ATM.

Not correct, we do it for every revision after the deployment.

Well, it happens on RecentChange_save, so it won't happen for changes which do not create an RC entry (such as description page creation on new file upload). More importantly, it never gets backfilled for pre-deployment entries which already fell out of RC, even with the maintenance script.

In the GUI, the plan is to run JavaScript on top of the page that, when a score is not in the db, makes requests to the service, shows the scores to the user, and stores them in the db. I don't know how to implement that in the API ATM.

There is no reason the API couldn't do that in theory (and if you do that, there is no point in duplicating the functionality in JS, it can just call the API), although you might run into performance problems. Filters are more problematic: if you implement an ucshow=oresreview for list=usercontribs, how should it deal with revisions which are missing from ores_classification? ucshow=oresreview&uclimit=10 would be assumed to mean "give me the last 10 edits of this user which are above the ORES damaging threshold"; there is no way to tell how many edits you have to scan to do that. If the user made ten damaging edits, then a thousand non-damaging ones, and none of those have been cached in ores_classification, there is just no sane strategy that could result in the ten damaging edits being returned.
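
To make that concrete, a naive implementation would just join the contributions query against the score cache, along these lines. This is only a sketch: the threshold and the pre-actor-migration revision columns are illustrative, and the ores_classification column names here may not match the extension exactly.

```php
// Hypothetical ucshow=oresreview filter: join the contributions query against
// the cached scores and keep only revisions predicted as damaging.
$userName = 'ExampleUser'; // user whose contributions are being listed
$dbr = wfGetDB( DB_REPLICA );
$res = $dbr->select(
	[ 'revision', 'ores_classification' ],
	[ 'rev_id', 'rev_timestamp', 'oresc_probability' ],
	[
		'rev_user_text' => $userName, // pre-actor-migration schema, for brevity
		'oresc_probability > 0.7', // illustrative threshold
	],
	__METHOD__,
	[ 'ORDER BY' => 'rev_timestamp DESC', 'LIMIT' => 10 ],
	[ 'ores_classification' => [ 'INNER JOIN', 'oresc_rev = rev_id' ] ]
);
// The problem: any revision without an ores_classification row silently
// disappears through the INNER JOIN, so "the last 10 damaging edits" cannot
// be guaranteed, and there is no bounded amount of scanning or backfilling
// that would make the limit semantics correct.
```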

for watchlist/contribs

Note watchlist is just a filtered recentchanges, so of the things you mentioned, it's only contribs that would potentially be a problem.

Or should edits that happened before the extension was deployed always get filtered out?

As long as it's documented, I don't think that would necessarily be a bad restriction. It would depend on the details, e.g. does it affect only pre-ORES revisions or can it affect just-made revisions too before ORES gets a chance to score them?

In the GUI, the plan is to run JavaScript on top of the page that, when a score is not in the db, makes requests to the service, shows the scores to the user, and stores them in the db. I don't know how to implement that in the API ATM.

You'd have to block the API request while making the request to the service. Assuming this is ~1 request to the service per API request and isn't likely to take a very long time, that's not necessarily a bad thing. Or else return the result to the client without the scores, and schedule a job to do it in the background so it'll hopefully be there for next time.
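
For the deferred variant, the module could return whatever rows it has and queue the misses, roughly like this (the job type and parameters are hypothetical, not the extension's actual job class):

```php
// Hypothetical: queue a background fetch for revisions whose scores were
// missing from ores_classification, instead of blocking the API response.
$missingRevIds = [ 123456, 123457 ]; // revisions not found in the cache table
if ( $missingRevIds ) {
	JobQueueGroup::singleton()->lazyPush(
		new JobSpecification(
			'oresFetchScores', // hypothetical job type
			[ 'revids' => $missingRevIds ],
			[ 'removeDuplicates' => true ],
			Title::newMainPage() // jobs require a Title; it is not meaningful here
		)
	);
}
// The response is returned without those scores; a later request should find
// them cached once the job has run.
```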

You'd have to block the API request while making the request to the service. Assuming this is ~1 request to the service per API request and isn't likely to take a very long time, that's not necessarily a bad thing.

Requests to the ORES service can take a long time too, especially if the edit is big. Actually, around 0.5% of scoring jobs fail because of the timeout (15 seconds), and those scores are likely to be missing from the ores_classification table.

Or else return the result to the client without the scores, and schedule a job to do it in the background so it'll hopefully be there for next time.

Yeah, I thought about that, but it's not very nice telling the user "Here's the page, some scores are missing, please refresh to see them now".

Yeah, I thought about that, but it's not very nice telling the user "Here's the page, some scores are missing, please refresh to see them now".

Well, it's up to the client what it does when the API responds without scores. It might wait a few seconds and resend the query for those revisions instead of just giving up.

If it's okay with you, it's okay with me.

Getting back to the topic of backfilling, did you investigate the possibility of populating ores_classification for every single revision and find it unfeasible, or is that something you just did not have the resources to look into? (I imagine feasibility might differ between wikis.) How often do you add new models / model versions (which means new data needs to be added to all revisions)? Do existing scores ever change (e.g. due to a bugfix in the model code)?

Getting back to the topic of backfilling, did you investigate the possibility of populating ores_classification for every single revision and find it unfeasible, or is that something you just did not have the resources to look into? (I imagine feasibility might differ between wikis.)

It would put huge pressure on the service, probably causing disruption. Also, the data would be mostly useless (except for recent scores), and once a new model gets deployed in the service we would need to do it again for all revisions, which is a huge burden. We also have database storage issues (there are several phab cards; I can point you to them if you want)

How often do you add new models / model versions (which means new data needs to be added to all revisions)?

Once every two or three months, sometimes faster, sometimes slower.

Do existing scores ever change (e.g. due to a bugfix in the model code)?

For the sake of consistency of results we avoid it as much as possible, but I think we did it once or twice, when the difference between the old and new scores was negligible.

(there are several phab cards; I can point you to them if you want)

Thanks. Sounds like you got this covered, but it is always nice to have the relevant information linked :)

A lot of discussion happened in T123795: ORES extension tables should be able to be populated using maintenance scripts (the title is misleading). Some discussion of DB performance can be found in T124443: (Potential) ORES database schema improvements too. I'm guessing there is one other phab card as well, but I couldn't find it.

It looks like this is ready to be resolved. Is that right, @Anomie?