DRAFT: Use rate limiting for ORES Action API score retrieval
Closed, Resolved (Public)

Description

This task was created as an action item for @Tgr / @dr0ptp4kt at the 20 March 2017 ORES weekly meeting.

T157206: ORES Overloaded (particularly 2017-02-05 02:25-02:30) describes overloading of the ORES subsystem by a remote Python bot prior to a broader announcement of ORES Action API support, and T159753: Concerns about ores_classification table size on enwiki describes one of the consequences of that overloading.

This task is for defining a simple and reasonable strategy for rate limiting Action API calls for ORES and avoiding unnecessary storage in MariaDB, in order to restore ORES score retrieval to the Action API.

Is it possible to do the following?

  1. Allow scores to be returned in Action API responses provided the revisions have corresponding records in the recent changes table. Rely on the existing limits on the number of revisions.
  2. When revisions exist but are not in the recent changes table, don't allow more than X unavailable revision scores to be fetched at a time. Use API continuation in batches of only X revisions at a time for some small X, but don't store them upon fetch; instead, leave the decision on whether to store or otherwise cache them to the ORES backend.
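
The two steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the ORES extension's actual code: `get_scores`, `stored_scores`, `fetch_from_ores`, and `max_uncached` are hypothetical names, a plain dict stands in for scores already stored alongside recent changes, and scores are simplified to a single float per revision.

```python
from typing import Callable, Dict, List, Optional, Tuple


def get_scores(
    rev_ids: List[int],
    stored_scores: Dict[int, float],
    fetch_from_ores: Callable[[List[int]], Dict[int, float]],
    max_uncached: int = 2,
) -> Tuple[Dict[int, float], Optional[int]]:
    """Return (scores, continue_from).

    continue_from is the index in rev_ids at which the next request
    should resume, or None when every revision has been handled.
    """
    scores: Dict[int, float] = {}
    to_fetch: List[int] = []
    continue_from: Optional[int] = None

    for index, rev_id in enumerate(rev_ids):
        if rev_id in stored_scores:
            # Step 1: the score is already stored locally, so return it as-is.
            scores[rev_id] = stored_scores[rev_id]
        elif len(to_fetch) < max_uncached:
            # Step 2: allow only a small number (X) of backend lookups.
            to_fetch.append(rev_id)
        else:
            # Budget exhausted: stop here and hand back a continuation point.
            continue_from = index
            break

    if to_fetch:
        # Fetched scores are returned but deliberately not written back
        # to stored_scores; any caching is left to the backend.
        scores.update(fetch_from_ores(to_fetch))

    return scores, continue_from
```

A caller would repeat the request with `rev_ids[continue_from:]` until `continue_from` is `None`, which mirrors how API continuation walks through a result set in small batches.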

As a follow-on action to this task, we'd like to consider the possibility of storing additional scored model output in MariaDB (e.g., wp10). This raises the question of whether to normalize the response into MariaDB columns or to simplify assumptions and store the scored model output as a blob per revision. This additional model output would likely be shown or used while a user is reading an individual article, or when filtering a modestly sized result set (imagine subsorting the top X viewed articles or the top X geographically closest articles).

Event Timeline

dr0ptp4kt updated the task description.

Is it possible to do the following?

  1. Allow scores to be returned in Action API responses provided the revisions have corresponding records in the recent changes table. Rely on the existing limits on the number of revisions.
  2. When revisions exist but are not in the recent changes table, don't allow more than X unavailable revision scores to be fetched at a time. Use API continuation in batches of only X revisions at a time for some small X, but don't store them upon fetch; instead, leave the decision on whether to store or otherwise cache them to the ORES backend.

The API already has similar logic (except that currently the second alternative is to fetch a small amount of data, store it, return it, and then fetch and store more data via the job queue). But you can't generally guarantee returning the next X scores matching some filter if not all scores are in the local table, not even for X=1. So this is fine as long as the filtering options (such as "show damaging edits only") are not used.
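
To illustrate the filtering problem, here is a small sketch under assumed names (`find_matches`, `stored`, `fetch_one`, and `is_match` are hypothetical, not part of the ORES extension). It counts how many backend lookups are needed before a filter such as "damaging only" yields the requested number of results when scores are not stored locally.

```python
from typing import Callable, Dict, List, Tuple


def find_matches(
    rev_ids: List[int],
    stored: Dict[int, float],
    fetch_one: Callable[[int], float],
    is_match: Callable[[float], bool],
    wanted: int,
) -> Tuple[List[int], int]:
    """Return (matching revision ids, number of backend lookups used)."""
    matches: List[int] = []
    lookups = 0
    for rev_id in rev_ids:
        if len(matches) >= wanted:
            break
        if rev_id in stored:
            score = stored[rev_id]
        else:
            # The score is unknown locally, so the only way to evaluate
            # the filter is to ask the backend for it.
            score = fetch_one(rev_id)
            lookups += 1
        if is_match(score):
            matches.append(rev_id)
    return matches, lookups
```

If only the last of 100 unscored revisions passes the filter, a request for a single match (X=1) still costs 100 lookups, so a small per-request fetch budget cannot guarantee even one filtered result.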

I'm not sure there is much point in adding rate limiting to the Action API, though, when the same queries can be sent directly to ORES. It's simpler to do all limiting upstream.

Tgr claimed this task.

As a follow-on action to this task, we'd like to consider the possibility of storing additional scored model output in MariaDB (e.g., wp10). This raises the question of whether to normalize the response into MariaDB columns or to simplify assumptions and store the scored model output as a blob per revision.

Given that we already have a table to put them into, I don't think storing as blobs would simplify or solve anything. It would not reduce the space needed (JSON notation probably takes up more bytes than can be shaved off by having fewer fields), and it would be unsearchable (we might not need searchability, but the option that provides it for free is still preferable).

Other than that, I think everything discussed here is already covered by T159753: Concerns about ores_classification table size on enwiki, T163687: Re-enable ORES data in action API, and T137962: [Spec] Tracking and blocking specific IP/user-agent combinations, and has been implemented there (mainly in T159753), so let's resolve this.