Determine appropriate API request limits for InternetArchiveBot
Closed, Declined · Public

Assigned To

None

Authored By

	Harej
	Nov 27 2021, 10:35 PM

Description

InternetArchiveBot (https://meta.wikimedia.org/wiki/InternetArchiveBot) provides dead link fixing and reference enhancement services for over 100 Wikimedia wikis. As part of this work, it regularly inspects the wikitext source of articles to detect links, in-line references, and formatted citations. (In the future we will switch to Annotated HTML, but for now we use wikitext.)

InternetArchiveBot currently services a very large number of wikis, operating at a scale very few other bots in the Wikimedia world match, and our vision is to operate on every Wikimedia wiki. Short of becoming a Wikimedia production service, this will require some amount of coordination between Wikimedia SRE and the bot operators at the Internet Archive. I am happy to manage the bot operator side of the relationship (though @Cyberpower678 should definitely be kept in the loop). I think with more proactive coordination there will be fewer surprises.

As part of this, I would like us to work toward an agreement on what an appropriate level of concurrent requests would be. The idea is to ensure InternetArchiveBot can effectively serve its many users while preventing our operations from having a destabilizing effect on yours. From there we can figure out the action plan for implementing concurrency limits.
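For illustration only, here is a minimal sketch of what a client-side cap on concurrent requests could look like. It is not InternetArchiveBot's actual code: `ConcurrencyLimiter`, `fake_request`, `fetch_all`, and the limit of 3 are all hypothetical names and values chosen for the example, and the "request" is a short sleep rather than a real API call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyLimiter:
    """Caps how many calls run at once, regardless of worker count."""

    def __init__(self, max_concurrent):
        self._sem = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of simultaneous calls observed

    def run(self, fn, *args):
        with self._sem:  # blocks while max_concurrent calls are in flight
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return fn(*args)
            finally:
                with self._lock:
                    self._active -= 1


def fake_request(title):
    # Hypothetical stand-in for an API call; sleeps instead of using the network.
    time.sleep(0.01)
    return f"fetched {title}"


def fetch_all(titles, max_concurrent=3, workers=8):
    # More workers than permits, so the semaphore is what enforces the cap.
    limiter = ConcurrencyLimiter(max_concurrent)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda t: limiter.run(fake_request, t), titles))
    return results, limiter.peak
```

Whatever number is agreed on would replace the placeholder `max_concurrent` value; the `peak` counter is there only so the cap can be checked.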

Related Objects

		Status	Subtype	Assigned	Task
		Declined		None	T296577 Determine appropriate API request limits for InternetArchiveBot
		Declined		Harej	T296258 Audit API calls and assess batching strategies

Event Timeline

Harej created this task. Nov 27 2021, 10:35 PM

Restricted Application added a project: Internet-Archive. Nov 27 2021, 10:35 PM

Some context: https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud/20211122.txt

GeneralNotability subscribed. Nov 28 2021, 12:57 AM

Peachey88 subscribed. Nov 28 2021, 1:16 AM

Perryprog subscribed. Nov 28 2021, 2:17 AM

Harej moved this task from Backlog to Integrations on the Internet-Archive board. Nov 29 2021, 6:15 PM

Harej moved this task from Inbox to Backlog: Configuration and Deployment on the InternetArchiveBot board. Nov 29 2021, 6:51 PM

To give an idea of how many requests InternetArchiveBot was sending, the following screenshot shows the number of requests over a 24-hour period for all user-agents that match [Bb]ot. It's somewhat concerning that IABot is above Googlebot, bingbot, and quite a few others.

[Screenshot: iab_turnilo_redacted.png, 337 KB]

That said, not all API requests are equal, and some are more expensive than others. I think T296258: Audit API calls and assess batching strategies is a precursor to this: it will help cut down on the number of requests in the first place and, more importantly, give us an estimate of how expensive each request is.
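As a rough illustration of the batching idea, the sketch below groups page titles so that several pages' wikitext can be requested in one MediaWiki Action API query instead of one query per page. The helper names `batched` and `build_batched_queries` are hypothetical, the batch size of 50 assumes the standard per-request limit for multi-value parameters (accounts with higher API limits may use more), and nothing here reflects how InternetArchiveBot actually issues its requests. The code only builds parameter dictionaries; it sends nothing.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_batched_queries(titles, batch_size=50):
    """Build one query parameter dict per batch of titles.

    Titles are joined with '|', the Action API's multi-value separator.
    """
    return [
        {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "rvslots": "main",
            "format": "json",
            "titles": "|".join(chunk),
        }
        for chunk in batched(titles, batch_size)
    ]
```

With 120 titles and the assumed batch size, this yields three queries rather than 120, though each batched query is correspondingly heavier, which is why the per-request cost still needs to be measured.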

Legoktm added a subtask: T296258: Audit API calls and assess batching strategies. Dec 1 2021, 1:58 AM

Umherirrender mentioned this in T297214: Is the bot stuck again?. Dec 10 2021, 10:16 AM

MatthewVernon edited projects, added serviceops; removed SRE. Dec 14 2021, 4:06 PM

Cyberpower678 triaged this task as Medium priority. Dec 15 2021, 8:51 PM

Harej closed subtask T296258: Audit API calls and assess batching strategies as Declined. Feb 24 2022, 11:38 PM

Harej closed this task as Declined. Mar 9 2022, 7:18 PM