Randomise the checker script delay
Closed, Declined (Public)

Description

The service-checker script is an awesome way to monitor the health of individual endpoints, but I've recently discovered one little flaw. We have a monitoring request for the recently created onthisday endpoint, which is quite expensive to calculate. The response is also not purged from Cassandra, so we store it there with a TTL. The script requests the en.wiki content for Jan 01, and there are around 20 duplicate entries for it in Cassandra.

As I understand it, the content expires, then the checks on all the hosts fire at roughly the same time, so all of them see that the storage is empty and go to Mobile-Content-Service to generate the content. That's actually pretty bad: the number of parallel requests equals the number of nodes, and since generating this content is computationally expensive, the checker puts a burst of load on the service.
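To illustrate what randomising the delay could look like, here is a minimal sketch. It is not the actual service-checker code: `run_check_with_jitter`, `max_jitter_s`, and the injectable `sleep` and `rng` parameters are hypothetical names chosen for this example. The idea is that each host waits a random amount of time before firing its check, so the first host to hit the empty storage has a chance to regenerate the content before the others arrive, assuming the jitter window is longer than the generation time.

```python
import random
import time


def run_check_with_jitter(check, max_jitter_s=30.0, sleep=time.sleep, rng=random):
    """Wait a random 0..max_jitter_s seconds, then run the check.

    `check` is any zero-argument callable that performs the endpoint check.
    `sleep` and `rng` are injectable so the behaviour can be tested
    without actually waiting (hypothetical parameters for this sketch).
    """
    if max_jitter_s < 0:
        raise ValueError("max_jitter_s must be non-negative")
    # Each host draws its own delay, so the checks no longer fire together.
    delay = rng.uniform(0.0, max_jitter_s)
    sleep(delay)
    return check()
```

This spreads the requests over the jitter window rather than guaranteeing a single regeneration, so some duplicate work is still possible if two hosts draw similar delays.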

Event Timeline

Randomising the script's delay (or even setting a fixed delay between nodes) is tricky because we have a lot of hosts plus the LVS check. I would propose instead to remove that x-ample from RESTBase and have it only in the mobile content service, given that RESTBase only hydrates the response. If the hydration doesn't work, we would notice it on the feed endpoint anyway.

@mobrovac That works too. I don't think we have this issue on other endpoints, since their content doesn't expire. I'll create a PR to get rid of monitoring for this endpoint in RESTBase as soon as @bearND adds monitoring for it on the MCS side.

On second thought, the load isn't big enough to worry about, and having monitoring is always good.