
Optimize large number of Citoid requests for coverage estimation research project
Closed, ResolvedPublic

Description

Web2Cit is a tool under development that will let users collaboratively and visually define translation (i.e., metadata extraction) rules to circumvent Zotero translation deficits (until proper JavaScript translators are written/fixed for each case). More info about the project here.

As described in the project's proposal, one of the Web2Cit-Research subproject's goals is to develop an automated script that compares how well Citoid does now versus how well it will do in the future, once Web2Cit has been running for some time.

To this aim, @Nidiah, @Gimenadelrioriande and Romina De León have been collecting URLs cited in Wikipedia featured articles from different languages. Given that they come from featured articles, we more or less assume that their metadata have been curated and are generally correct (we have extensively discussed the validity of this assumption within our team and also with our Advisory Board).

For the second part, @Nidiah is currently collecting Citoid responses for the extracted URLs (these will be compared against the corresponding "correct" metadata). She's doing this from a PAWS notebook, and she will use a cache to make sure that we don't ask twice for the same URL (in case it appears more than once in the pool of >450k citations extracted). However, she noticed that responses were relatively slow. We considered making parallel requests, but we don't want to overload the Citoid service.
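For reference, this is roughly the kind of cached lookup described above; a minimal sketch, assuming the requests library available in PAWS and the public /api/rest_v1/data/citation/ route (the cache file name and example URL are placeholders, and the actual notebook may differ):

```python
import shelve
import requests

# Public RESTBase Citoid route; "mediawiki" is one of the supported response formats.
CITOID_ENDPOINT = "https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/{query}"

def fetch_citoid(url, cache):
    """Return Citoid metadata for `url`, reusing the cached response if we already asked."""
    if url in cache:
        return cache[url]
    query = requests.utils.quote(url, safe="")
    response = requests.get(CITOID_ENDPOINT.format(query=query), timeout=60)
    result = response.json() if response.ok else None
    cache[url] = result  # cache failures too, so a repeated URL is never requested twice
    return result

# The shelve file persists across notebook restarts; "citoid_cache" is just an example name.
with shelve.open("citoid_cache") as cache:
    metadata = fetch_citoid("https://example.org/some-article", cache)
```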

@Mvolz, in your opinion what would be the best way (as fast as possible without disrupting the service) to do this?

Alternatively, we could set up a custom Citoid service (as long as we make sure that it runs the exact same code). But besides the extra work, I assume it would involve similar hardware and network resources anyway, because we would have to run it on a Wikimedia server. It also wouldn't benefit from Wikimedia's RESTBase caching capabilities.

Finally, given the large number of citations, we could also consider randomly sampling a smaller subset each time (how large?), assuming that it shouldn't change the results much.
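If we went that route, the sampling itself would be trivial; a minimal sketch, assuming the extracted URLs are available as a Python list (the sample size and seed below are only illustrative):

```python
import random

def sample_urls(urls, k=10_000, seed=42):
    """Draw a reproducible random sample of k distinct URLs from the full pool."""
    rng = random.Random(seed)  # fixed seed so the same subset can be drawn again later
    return rng.sample(urls, k)
```

A fixed seed would let us query the exact same subset again when we re-run the comparison after Web2Cit has been in use for a while.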

Event Timeline

I've been looking into Citoid API request rate limits.

We access the Citoid API through Wikimedia's RESTBase proxy. I found two 429 HyperSwitch errors for exceeded request rates: https://www.mediawiki.org/wiki/HyperSwitch/errors/rate_exceeded and https://www.mediawiki.org/wiki/HyperSwitch/errors/request_rate_exceeded

Here it says that there is a global limit of up to 200 requests per second, but that individual endpoints may have specific limits. However, the Citoid API documentation doesn't seem to say anything about it.

On the other hand, I found this thread where @Mvolz mentions a "1000/10s (100/s long term, with 1000 burst)" limit.

She also refers to how long requests take and timeouts, but I'm not sure what she means. How does time to response affect request rate limit? Say we make 1000 requests at t=0s of which only 500 have returned a response at t=10s, can we make another 1000-request batch now? Or do pending requests count against our request rate limit?

Unless @Mvolz disagrees, it sounds to me like we could parallelize requests, up to 1000 requests every 10s. That would mean it would take us around 4500s (1.25h) to send 450 batches of 1000 requests each (450k requests).

Even if pending requests still count against the rate limit, assuming a very bad scenario where all requests take 40s to respond, that would mean 18000s (5h) to send the 450 1k-request batches. That's still much much better than what it was taking us with one request at a time!!
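For what it's worth, this is roughly the pacing I have in mind; a minimal sketch, assuming aiohttp is available in PAWS. It conservatively waits for each batch to finish (the pessimistic scenario above) before opening the next 10-second window, and the user agent string is just a placeholder:

```python
import asyncio
import time
import urllib.parse

import aiohttp

CITOID_ENDPOINT = "https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/{query}"
BATCH_SIZE = 1000      # at most 1000 requests per batch ("1000 burst")
BATCH_INTERVAL = 10.0  # seconds between batch starts ("1000/10s")

async def fetch_one(session, url):
    """Request Citoid metadata for a single URL; return (url, status, payload-or-error)."""
    query = urllib.parse.quote(url, safe="")
    try:
        async with session.get(CITOID_ENDPOINT.format(query=query)) as resp:
            return url, resp.status, await resp.json(content_type=None)
    except aiohttp.ClientError as exc:
        return url, None, str(exc)

async def fetch_all(urls):
    results = []
    headers = {"User-Agent": "web2cit-research (placeholder contact info)"}  # placeholder UA
    async with aiohttp.ClientSession(headers=headers) as session:
        for start in range(0, len(urls), BATCH_SIZE):
            batch_started = time.monotonic()
            batch = urls[start:start + BATCH_SIZE]
            # Fire the whole batch in parallel and wait for it to complete.
            results += await asyncio.gather(*(fetch_one(session, u) for u in batch))
            # Sleep out whatever remains of the 10-second window before the next batch.
            remaining = BATCH_INTERVAL - (time.monotonic() - batch_started)
            if remaining > 0:
                await asyncio.sleep(remaining)
    return results

# results = asyncio.run(fetch_all(list_of_urls))
```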

One way or another, I think we should set a custom user agent for our requests, to help others diagnose possible problems our script may be causing: T302826.

> How does time to response affect request rate limit? Say we make 1000 requests at t=0s of which only 500 have returned a response at t=10s, can we make another 1000-request batch now? Or do pending requests count against our request rate limit?

Response time doesn't affect the rate limit.

> One way or another, I think we should set a custom user agent for our requests, to help others diagnose possible problems our script may be causing: T302826.

That seems like a good idea. One thing I should mention: please don't make requests for ISBNs, as we have to pay for those requests and have a limit - which I assume won't be a problem anyway, since you're only collecting URLs!

Thanks, @Mvolz! We will probably be doing this around the end of March. As mentioned, we will properly identify our requests with a custom user agent. Hopefully we won't cause any disruptions, but please let us know in case we do!

diegodlh claimed this task.