
Investigate why elastic@codfw alerted during codfw row B switch upgrade
Closed, Resolved, Public

Description

Two pages were sent during the upgrade:

  1. search.svc.codfw.wmnet/LVS HTTP IPv4 is CRITICAL
  2. search.svc.codfw.wmnet/ElasticSearch health check for shards is CRITICAL

The second one is probably a consequence of the first, because the check failed on a read timeout rather than on the shard count:
ElasticSearch health check for shards on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch http://10.2.1.30:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.2.1.30, port=9200): Read timed out. (read timeout=4)

Looking at the cluster, the number of active shards was never at a critical level. The cluster went yellow and lost a few shards, but not enough to trigger the health check alert had the check been able to reach the cluster.
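
To illustrate the distinction above between "the check could not get an answer" and "the cluster is short on shards", here is a minimal sketch in Python. It is not the actual check used in production: the function names, the 90% threshold, and the choice to report a timeout as UNKNOWN are all assumptions made for illustration. Only the `/_cluster/health` endpoint, its `status` and `active_shards_percent_as_number` fields, and the 4 second read timeout come from Elasticsearch and the alert text.

```python
import json
import urllib.request


def classify_health(health, min_active_percent=90.0):
    """Map a _cluster/health response body to a Nagios-style state.

    min_active_percent is a hypothetical threshold, not the real one.
    """
    active = health["active_shards_percent_as_number"]
    if active < min_active_percent:
        return "CRITICAL"
    if health["status"] != "green":
        # Yellow with only a few shards missing: worth a warning, not a page.
        return "WARNING"
    return "OK"


def check(url, timeout=4):
    """Fetch cluster health and classify it.

    A timeout or connection error says nothing about shard counts,
    so this sketch reports it as UNKNOWN instead of CRITICAL.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_health(json.load(resp))
    except (OSError, ValueError, KeyError):
        # OSError covers URLError and socket timeouts;
        # ValueError covers a body that is not valid JSON;
        # KeyError covers a body missing the expected fields.
        return "UNKNOWN"
```

With this split, a yellow cluster that lost a few shards would produce a WARNING, and an unreachable endpoint would be reported separately from a real shard shortage. Whether an unreachable service endpoint should still page is a separate decision for the LVS check.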

These alerts were not expected during this kind of operation.

Event Timeline

debt triaged this task as Medium priority. (Jul 13 2017, 5:14 PM)
debt added a subscriber: debt.

Let's take a look at this.

The check on LVS might be a bit off. We'll need to look at this, as we might be sending information to servers that are down.

Mentioned in SAL (#wikimedia-operations) [2017-10-10T17:26:22Z] <gehel> shutting down and restarting elasticsearch on relforge1001 for testing - T170378

The check looks good. I was worried we were doing just a TCP check, which would not detect the case where nginx is working but elasticsearch is not.

Since we've only seen this issue once in forever, I'd propose we close it. There are plenty of ideas about what could have gone wrong, but not really any way to validate them after the fact.

Note that the most likely explanation (to me at least) is that the check done on the LVS endpoint ended up being routed to one of the servers that had just recovered network connectivity.

Thanks for the investigation, @Gehel !