
Citoid/Zotero: Create rate limiting configurable on a per site basis
Open, Needs Triage, Public

Description

We've been asked to institute a global rate limit of no more than 10 requests per minute for one of the sites.

Event Timeline

Mvolz updated the task description.

@akosiaris Do you know what the infrastructure for this might look like? Is this something we could do with url downloader? Or would this be built into citoid?

How would this get distributed amongst workers/ between servers?

Can/does the site in question communicate their enforced limit in an HTTP standardised and machine-readable way? For example, a 429 Status with Retry-After header, or (even better, to avoid hitting a failure first) a crawl limit in robots.txt.

At present, no. Could ask if they could. (Though I suppose they might want Google to crawl them faster than we hit them.)
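If the site ever does add the standardised signal discussed above (a 429 with a Retry-After header), it is cheap to consume. A minimal sketch in Python; the function name and structure are illustrative, not existing Citoid code:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(header_value, now=None):
    """Parse a Retry-After header, which per the HTTP spec is either
    delta-seconds (e.g. "120") or an HTTP-date."""
    now = now or datetime.now(timezone.utc)
    try:
        return max(0, int(header_value))      # delta-seconds form
    except ValueError:
        pass
    try:
        target = parsedate_to_datetime(header_value)
        return max(0, int((target - now).total_seconds()))
    except (TypeError, ValueError):
        return None                           # absent or unparseable
```

A client could then back off for that many seconds before retrying the domain.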

akosiaris renamed this task from Create rate limiting configurable on a per site basis to Citoid/Zotero: Create rate limiting configurable on a per site basis. Wed, Jun 12, 9:12 AM

This is a first. We've never had to implement something like that in the past. Historically, we did have some ingress rate-limiting functionality in RESTBase (it ended up abandonware), and we very recently added some rate-limiting functionality to the service mesh, but that's for internal requests only. Egress rate limiting from our infrastructure has never been implemented, to my knowledge at least.

So, we have no infrastructure for this right now. We might be able to reuse some parts of recent work, to provide something for storing the rate limiting configuration and the counting state, but this will definitely require Citoid modifications to implement the business logic for it.

Since no ready solution exists, this will require time and work from all sides to implement. What's your intended ETA @Mvolz? We'll need to investigate if and how we can help here.

@akosiaris This is just a technical exploration at the moment; there is no timeline or prioritisation yet. Knowing that this would be a first within our infrastructure is useful information, thanks.

@Mvolz @Krinkle I know that at least one of the properties throttling us is doing so with 403s and not 429s. They said as much.

Apologies if I'm asking too soon, or if this is already taken into consideration...

As I understand it, the demand and result of Citoid is directly part of a user experience: specifically, inserting a citation with VisualEditor while writing an article. If I recall correctly, the UI contains a progress bar and other information is presented in a "paused" style, such that end-users generally perceive that they're supposed to wait for this to complete before they can move on. AFAIK we also have no use for "eventually" getting the result from the external site if it is async and several minutes/hours later, since it is helping an edit where the editor still has to decide what to do with it. It's not a situation where the intent is final and we can process it later, I think?

Are we thus looking for a way to decide in code whether to make/cancel the request from Citoid? Or more for a way to make Citoid async process these eventually?

The former would presumably only need a way to share a cluster-wide semaphore. You count requests per domain against a threshold; once it is reached, Citoid effectively skips its main code and the editor then needs to reflect that inaction (e.g. Citoid unavailable, failed, try again later, whatever you decide). We have numerous mechanisms like this in MediaWiki that you could borrow or build on. We typically use Memcached for this (ADD/INCR/DECR). For example, MediaWiki's PingLimiter (rate limit for edits per 5 minutes, etc.) works this way, and will deny the action. This would not require new infrastructure and can be implemented within the Citoid service (or the MW REST API in front of that service), so long as it has access to an appropriate Memcached host (MW has a separate Memc cluster for security reasons).
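A minimal sketch of that Memcached ADD/INCR pattern, with an in-memory stub standing in for the cache. All names are hypothetical, and note that real Memcached makes ADD and INCR atomic across workers, which this stub does not:

```python
import time

class MemcacheStub:
    """In-memory stand-in for the Memcached ops a PingLimiter-style counter uses."""
    def __init__(self):
        self.data = {}

    def add(self, key, value, ttl):
        # ADD only succeeds if the key is absent; Memcached makes this atomic.
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def incr(self, key):
        if key not in self.data:
            return None
        self.data[key] += 1
        return self.data[key]

def allow_request(cache, domain, limit=10, window=60, now=None):
    """Fixed-window counter: allow at most `limit` requests per `window`
    seconds per external domain, cluster-wide if the cache is shared."""
    now = time.time() if now is None else now
    key = "citoid:ratelimit:%s:%d" % (domain, int(now // window))
    cache.add(key, 0, ttl=window)   # create this window's counter if missing
    return cache.incr(key) <= limit
```

When `allow_request` returns False, Citoid would skip the fetch and return the "try again later" style error described above.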

The latter would involve locking/delaying/distributing the work, on the assumption that per-domain concurrency is likely low enough that a lock with a short timeout will allow most work to succeed on the first try? This is riskier in terms of building up load, and requires more state/data persistence, but there might be other factors at play for you that could make this attractive. For example, if you think there's a way to retroactively fix up citations even if several minutes/hours have passed, the JobQueue might make more sense in that case.
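For the locking flavour, a per-process sketch of a short-timeout lock (names hypothetical; a cluster-wide version would instead need a shared store, e.g. a Memcached ADD with a TTL acting as the lock):

```python
import threading

# Hypothetical per-process registry of one lock per external domain.
domain_locks = {}
registry_lock = threading.Lock()

def with_domain_lock(domain, work, timeout=2.0):
    """Serialize work per external domain; give up after `timeout`
    seconds instead of queueing indefinitely."""
    with registry_lock:
        lock = domain_locks.setdefault(domain, threading.Lock())
    if not lock.acquire(timeout=timeout):
        return None  # caller surfaces "try again later" to the editor
    try:
        return work()
    finally:
        lock.release()
```

If per-domain concurrency really is low, most callers acquire the lock immediately; under contention the short timeout bounds how long an editor waits.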

Come to think of it, the first approach could be smoothed by involving a client-side retry (e.g. retry up to 2 times without needing to tell the editor; just pretend it's taking longer). I believe VisualEditor does this already in certain cases, to mask common failures that are easy to mitigate (e.g. an expired CSRF token).
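The masked-retry idea could look something like the following sketch (illustrative only; VisualEditor's actual retry logic lives in its JavaScript client, and the exception class here is a stand-in for whichever failures are deemed transient):

```python
import time

class TransientError(Exception):
    """Stand-in for failures worth masking, e.g. a rate-limit rejection."""

def fetch_with_retry(fetch, attempts=3, backoff=0.5):
    """Retry transient failures quietly before surfacing an error
    to the editor; the UI just appears to take a little longer."""
    last_exc = None
    for i in range(attempts):
        try:
            return fetch()
        except TransientError as exc:
            last_exc = exc
            time.sleep(backoff * (2 ** i))  # exponential backoff between tries
    raise last_exc
```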

Client-side measures are sadly not sufficient, as most of this excess traffic is probably coming from third-party users of the API on the back end.

The API is not part of the MW REST API; it's currently behind RESTBase and at some point will move to the new API gateway (T361576), so unfortunately MW infrastructure is not super useful here. The gateway presumably has caching infrastructure, though... somewhere? Which maybe could do the same thing?

Requests would look something like https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/https%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS2090123221001491

Could we set a Memcached deny for any URLs starting with https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/https%3A%2F%2Fwww.sciencedirect.com, https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/http%3A%2F%2Fwww.sciencedirect.com, https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/https%3A%2F%2Fsciencedirect.com, or https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/http%3A%2F%2Fsciencedirect.com?
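That deny check amounts to decoding the target URL out of the REST path and matching its domain against a configured list. A rough sketch, where `THROTTLED` and both helpers are hypothetical, and decoding edge cases (ports, double-encoding) are ignored:

```python
from urllib.parse import unquote, urlsplit

THROTTLED = {"sciencedirect.com"}  # hypothetical per-site config

def target_domain(rest_path):
    """Extract the cited site's domain from a
    /data/citation/{format}/{encoded-url} path."""
    encoded = rest_path.rsplit("/", 1)[-1]
    host = urlsplit(unquote(encoded)).hostname or ""
    # Normalise so www./http/https variants all match one entry.
    return host[4:] if host.startswith("www.") else host

def is_throttled(rest_path):
    return target_domain(rest_path) in THROTTLED
```

Matching on the decoded domain rather than on literal URL prefixes keeps the list to one entry per site instead of four.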

This actually does affect the transition, because the currently stalled migration re-enables using URLs as query params again, but those wouldn't be cacheable and so would prevent this "hack" → https://gerrit.wikimedia.org/r/c/mediawiki/services/citoid/+/907481. Something to consider (i.e. not re-enabling it!)

I don't know how caching works in this new gateway though!
