WDQS overloaded in codfw
Closed, Duplicate (Public)

Description

At around 9 AM UTC today (Sep 3), we started experiencing stability issues with WDQS, localized (at least for the moment) to one of our two datacenters. We haven't been able to pinpoint the cause yet. We suspect that someone is running a query that adversely affects Blazegraph, which has happened a few times in the past. Unfortunately, our usual tactics have not helped us find which query it is.

We are working on identifying the issue, but it is clear that it could bring the service down within a few hours, so we are also working on a quick workaround. Since we have observed that the issue only causes actual service failures about 2 hours after a restart, for now we are going to introduce a procedure that restarts servers randomly, so that the uptime of each is at most around 1 hour. Only one server should be restarted at any given time. Each restart will kill some in-flight queries, but the alternative is worse.
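
To illustrate the restart procedure described above, here is a rough sketch of how such a schedule could be computed. It is not the actual tooling: the function name `plan_restarts`, the host names, and the 120-second restart duration are all assumptions made for the example. The idea is to shuffle the servers once per half-hour cycle and give each restart its own time slot, which keeps every server's uptime under the 1-hour limit and ensures restarts never overlap.

```python
import random


def plan_restarts(servers, max_uptime_s=3600, restart_duration_s=120,
                  cycles=4, seed=None):
    """Return a list of (start_time_s, server) restart slots.

    Assumptions (not from the task): restarts take at most
    restart_duration_s, and time is measured from the start of the plan.
    """
    if not servers:
        raise ValueError("no servers given")
    rng = random.Random(seed)
    # Using a cycle of half the uptime limit bounds the gap between two
    # restarts of the same server by strictly less than max_uptime_s,
    # whatever order the shuffles produce.
    cycle_s = max_uptime_s / 2
    slot_s = cycle_s / len(servers)
    if slot_s < restart_duration_s:
        # Slots would overlap, so two servers could be down at once.
        raise ValueError("too many servers for one-at-a-time restarts")
    plan = []
    for cycle in range(cycles):
        order = list(servers)
        rng.shuffle(order)  # random order within each cycle
        for position, server in enumerate(order):
            plan.append((cycle * cycle_s + position * slot_s, server))
    return plan


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    for start, host in plan_restarts(["wdqs-a", "wdqs-b", "wdqs-c"], seed=0):
        print(f"t+{int(start):>5}s restart {host}")
```

The sketch only produces the schedule; the command that actually restarts a server is not specified here and would have to be supplied by whoever runs the procedure.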