The service-checker script is an awesome way to monitor the health of individual endpoints, but I've recently discovered one little flaw. We have a monitoring request for the recently created onthisday endpoint, which is quite expensive to compute. The response is also not purged from Cassandra; instead we store it there with a TTL. The script requests the en.wiki content for Jan 01, and there are around 20 duplicate entries for it in Cassandra.
As I understand it, the content expires, then the checks on all the hosts fire at roughly the same time, so all of them see that the storage is empty and go to Mobile-Content-Service to generate the content. That's actually pretty bad: the number of parallel requests equals the number of nodes, and since generating this content is computationally expensive, the checker puts a burst of load on the service.
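
To make the suspected sequence concrete, here is a toy simulation of it. This is only a sketch of my understanding, not the real checker or storage code: the `simulate_checks` function, the in-memory dict standing in for Cassandra, and the barrier that forces every check to read before any of them writes are all assumptions made for illustration.

```python
import threading


def simulate_checks(num_hosts: int) -> int:
    """Return how many hosts regenerate the content after the TTL expires."""
    storage = {}      # hypothetical stand-in for the Cassandra-backed storage
    generated = []    # one entry per expensive regeneration
    lock = threading.Lock()
    barrier = threading.Barrier(num_hosts)

    def check(host_id: int) -> None:
        barrier.wait()                # all checks fire at the same time
        hit = "onthisday" in storage  # the entry has expired, so this is a miss
        barrier.wait()                # assumption: generation is slow enough
                                      # that nobody stores before all have read
        if not hit:
            with lock:
                generated.append(host_id)  # stands in for the expensive call
                storage["onthisday"] = "content"

    threads = [threading.Thread(target=check, args=(i,)) for i in range(num_hosts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(generated)


if __name__ == "__main__":
    print(simulate_checks(20))  # prints 20: one regeneration per host
```

Under these assumptions every host misses and regenerates, which would match the roughly 20 duplicate entries seen in Cassandra.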