Proton has a queue system to limit the number of simultaneous render jobs, but according to https://wikitech.wikimedia.org/wiki/Incident_documentation/20190301-proton it did not work as intended.
The Proton endpoint is a queue manager: it does no work other than checking whether workers are available and assigning render jobs to them. If there are no free workers (more precisely, if the render queue is full), it returns an HTTP 503. A few dozen requests per second should not give it any trouble, and the number of simultaneously running renders should never exceed the queue limits.
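To make the intended behavior concrete, here is a minimal sketch of a bounded queue manager of this kind. It is not Proton's actual code: the class name `RenderQueue`, the parameters `max_workers` and `max_waiting`, and the use of 200 for "accepted" are all assumptions made for illustration.

```python
from collections import deque


class RenderQueue:
    """Toy model of a bounded render queue (hypothetical, not Proton's code)."""

    def __init__(self, max_workers, max_waiting):
        self.max_workers = max_workers  # assumed limit on concurrent renders
        self.max_waiting = max_waiting  # assumed limit on queued jobs
        self.running = set()
        self.waiting = deque()

    def submit(self, job_id):
        """Return an HTTP-like status: 200 if accepted, 503 if the queue is full."""
        if len(self.running) < self.max_workers:
            self.running.add(job_id)
            return 200
        if len(self.waiting) < self.max_waiting:
            self.waiting.append(job_id)
            return 200
        return 503

    def finish(self, job_id):
        """Free a worker slot and promote the oldest waiting job, if any."""
        self.running.discard(job_id)
        if self.waiting and len(self.running) < self.max_workers:
            self.running.add(self.waiting.popleft())


queue = RenderQueue(max_workers=2, max_waiting=1)
statuses = [queue.submit(job) for job in ("a", "b", "c", "d")]
```

With these limits the fourth request is rejected, and `len(queue.running)` never exceeds `max_workers`, which is the invariant the real queue is supposed to hold.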
The incident documentation says there were several dozen processes, some of them a month old. So maybe there is some situation (possibly triggered by high load?) in which a rendering process gets stuck or zombified and is evicted from the queue without actually being terminated?
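The following sketch illustrates that hypothesis only; it is not a diagnosis and not Proton's code. `FakeProcess`, `LeakyQueue`, the `kill_on_evict` flag, and `simulate` are hypothetical names. It assumes every job hangs and is then evicted (for example by a timeout), and compares the number of processes still alive when eviction does or does not terminate the process.

```python
class FakeProcess:
    """Stand-in for a render process (hypothetical)."""

    def __init__(self):
        self.alive = True

    def kill(self):
        self.alive = False


class LeakyQueue:
    """Queue whose eviction may free a slot without killing the process."""

    def __init__(self, max_workers, kill_on_evict):
        self.max_workers = max_workers
        self.kill_on_evict = kill_on_evict
        self.running = {}  # job_id -> FakeProcess currently tracked by the queue
        self.spawned = []  # every process ever started, tracked or not

    def submit(self, job_id):
        if len(self.running) >= self.max_workers:
            return 503
        proc = FakeProcess()
        self.running[job_id] = proc
        self.spawned.append(proc)
        return 200

    def evict(self, job_id):
        # The slot is always freed; the process only dies if kill_on_evict is set.
        proc = self.running.pop(job_id)
        if self.kill_on_evict:
            proc.kill()

    def live_processes(self):
        return sum(1 for p in self.spawned if p.alive)


def simulate(kill_on_evict, rounds=5, max_workers=2):
    """Fill the queue, evict every (hung) job, repeat; return live process count."""
    q = LeakyQueue(max_workers, kill_on_evict)
    for r in range(rounds):
        for i in range(max_workers):
            q.submit((r, i))
        for i in range(max_workers):
            q.evict((r, i))
    return q.live_processes()


leaked = simulate(kill_on_evict=False)
clean = simulate(kill_on_evict=True)
```

In this model the queue itself never holds more than `max_workers` jobs, yet with `kill_on_evict=False` the live process count grows by `max_workers` every round. That would be consistent with old processes piling up while the queue limits appear to be respected, but whether this is what actually happened is an open question.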