Determine appropriate API request limits for InternetArchiveBot
Closed, Declined (Public)

Description

InternetArchiveBot (https://meta.wikimedia.org/wiki/InternetArchiveBot) provides dead link fixing and reference enhancement services for over 100 Wikimedia wikis. As part of this work, it regularly inspects the wikitext source of articles to detect links, in-line references, and formatted citations. (In the future we will switch to Annotated HTML, but for now we use wikitext.)

InternetArchiveBot currently services a very large number of wikis, operating at a scale that very few other bots in the Wikimedia world match, and our vision is to operate on every Wikimedia wiki. Short of becoming a Wikimedia production service, this will require some amount of coordination between Wikimedia SRE and the bot operators at the Internet Archive. I am happy to manage the bot operator side of the relationship (though @Cyberpower678 should definitely be kept in the loop). I think with more proactive coordination there will be fewer surprises.

As part of this, I would like us to agree on an appropriate level of concurrent requests. The idea is to ensure InternetArchiveBot can effectively serve its many users while preventing our operations from having a destabilizing effect on yours. From there we can work out an action plan for implementing concurrency limits.
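For illustration only, here is a minimal sketch of what a client-side concurrency limit could look like. It is not InternetArchiveBot's actual code: `run_with_limit`, `fake_request`, and the limit of 4 are hypothetical names and values, and the "request" is a sleep rather than a real API call.

```python
import asyncio


async def run_with_limit(jobs, max_concurrent):
    """Run coroutine factories with at most max_concurrent in flight.

    Returns the results in submission order, plus the peak number of
    jobs that were running at the same time.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        # Wait here until one of the max_concurrent slots is free.
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_request(i):
    # Hypothetical stand-in for an API call; a real client would do I/O here.
    await asyncio.sleep(0.01)
    return i


if __name__ == "__main__":
    jobs = [lambda i=i: fake_request(i) for i in range(20)]
    results, peak = asyncio.run(run_with_limit(jobs, max_concurrent=4))
    print(f"completed {len(results)} requests, peak concurrency {peak}")
```

Whatever number is agreed on would replace the placeholder limit; the point of the sketch is only that the cap is enforced in one place on the client.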

Event Timeline

To give an idea of how many requests InternetArchiveBot was sending, the following screenshot shows the number of requests over a 24-hour period for all user agents that match [Bb]ot. It's somewhat concerning that IABot is above Googlebot, bingbot, and quite a few others.

iab_turnilo_redacted.png (1×3 px, 337 KB)
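As a rough sketch of the kind of filter described above, the following counts requests per user agent for agents matching [Bb]ot. The sample user-agent strings and the `count_bot_requests` helper are hypothetical; the real figures came from the dashboard in the screenshot, not from code like this.

```python
import re
from collections import Counter

# Same pattern as in the comment above: matches "Bot" or "bot" anywhere.
BOT_PATTERN = re.compile(r"[Bb]ot")


def count_bot_requests(user_agents):
    """Count requests per user agent, keeping only agents matching [Bb]ot.

    Returns (user_agent, count) pairs sorted from most to least frequent.
    """
    counts = Counter(ua for ua in user_agents if BOT_PATTERN.search(ua))
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical log sample: one user-agent string per request.
    sample = (
        ["IABot/2.0"] * 5
        + ["Googlebot/2.1"] * 3
        + ["bingbot/2.0"] * 2
        + ["Mozilla/5.0 (X11; Linux x86_64)"] * 4
    )
    for agent, count in count_bot_requests(sample):
        print(agent, count)
```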

That said, not all API requests are equal, and some are more expensive than others. I think T296258 (Audit API calls and assess batching strategies) is a precursor to this: it will help cut down on the number of requests in the first place and, more importantly, give us an estimate of how expensive each request is.
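To make "batching" concrete, here is a hedged sketch of one possible strategy: grouping page titles so that one Action API query fetches wikitext for many pages instead of one. Whether IABot does or should work this way is exactly what T296258 would need to establish; `batch_query_params` is a hypothetical helper, and the default batch size of 50 assumes the standard per-request title limit for accounts without higher API limits.

```python
def batch_query_params(titles, batch_size=50):
    """Yield one set of Action API query parameters per batch of titles.

    The API accepts multiple titles separated by "|", so N pages need
    roughly N / batch_size requests instead of N.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(titles), batch_size):
        chunk = titles[start:start + batch_size]
        yield {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "rvslots": "main",
            "titles": "|".join(chunk),
            "format": "json",
        }


if __name__ == "__main__":
    # Hypothetical page titles; no network request is made here.
    titles = [f"Example page {n}" for n in range(120)]
    batches = list(batch_query_params(titles))
    print(f"{len(titles)} pages -> {len(batches)} requests")
```

Fewer requests does not automatically mean less load, since a batched request can be more expensive to serve, which is why the per-request cost estimate matters as much as the count.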