Investigate why elastic@codfw alerted during codfw row B switch upgrade
Closed, Resolved · Public

Description

Two pages were sent during the upgrade:

  1. search.svc.codfw.wmnet/LVS HTTP IPv4 is CRITICAL
  2. search.svc.codfw.wmnet/ElasticSearch health check for shards is CRITICAL

The second one is probably a consequence of the first one, because the check failed on a read timeout rather than on the cluster state:
ElasticSearch health check for shards on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch http://10.2.1.30:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.2.1.30, port=9200): Read timed out. (read timeout=4)

Looking at the cluster, the number of active shards was never at a critical level. It went yellow and lost a few shards, but not enough to trigger the health check alert if the check had worked properly.
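To illustrate the distinction above, here is a minimal sketch of a check that reports a fetch failure separately from a real shard problem. It is not the actual Icinga check used in production. The function names, the 80% threshold and the choice to map fetch errors to UNKNOWN are all assumptions made for illustration; only the `/_cluster/health` fields `status` and `active_shards_percent_as_number` come from Elasticsearch itself.

```python
import json
import urllib.request

# Nagios-style exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_health(health, critical_below_percent=80.0):
    """Map a parsed /_cluster/health document to a status code.

    The threshold is a hypothetical value, not the production one.
    """
    percent = health["active_shards_percent_as_number"]
    if percent < critical_below_percent:
        return CRITICAL
    if health["status"] != "green":
        return WARNING
    return OK


def check(url, timeout=4):
    """Fetch cluster health and classify it.

    A timeout or connection error says nothing about the shards, so it is
    reported as UNKNOWN here instead of CRITICAL (an assumption of this sketch).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            health = json.load(resp)
    except (OSError, ValueError) as exc:
        return UNKNOWN, "error while fetching: %s" % exc
    return classify_health(health), "status=%s" % health["status"]
```

With this split, a yellow cluster that lost a few shards would yield WARNING, and a read timeout on the LVS endpoint would yield UNKNOWN rather than a shard CRITICAL.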

These alerts were not expected during this kind of operation.

dcausse created this task. Jul 12 2017, 9:27 AM
Restricted Application added a subscriber: Aklapper. Jul 12 2017, 9:27 AM
debt triaged this task as Normal priority.
debt added a subscriber: debt.

Let's take a look at this.

debt assigned this task to Gehel. Sep 19 2017, 5:46 PM

The check on LVS might be a bit off. We'll need to look at this, as we might be sending requests to servers that are down.

Mentioned in SAL (#wikimedia-operations) [2017-10-10T17:26:22Z] <gehel> shutting down and restarting elasticsearch on relforge1001 for testing - T170378

The check looks good. I was worried we were doing just a TCP check, which would not detect the case where nginx is working but elasticsearch is not.
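The difference between the two kinds of check can be sketched as follows. This is an illustration only, not the actual LVS/PyBal monitor configuration; `tcp_check` and `http_check` are hypothetical helper names, and the timeouts are arbitrary.

```python
import http.client
import socket


def tcp_check(host, port, timeout=2.0):
    """Succeeds as soon as something accepts the TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(host, port, path="/", timeout=2.0):
    """Succeeds only if a full HTTP request gets a 200 response.

    If a front-end proxy is up but the backend behind it is not, the proxy
    typically answers with an error status (or not at all), so this fails.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()
```

A listener that accepts connections but never answers passes `tcp_check` and fails `http_check`, which is the failure mode a TCP-only check would miss.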

Since we've only seen this issue once in forever, I'd propose we close it. There are plenty of ideas about what could have gone wrong, but not really any way to validate them after the fact.

Gehel added a comment. Oct 10 2017, 5:30 PM

Note that the most likely explanation (to me at least) is that the check done on the LVS endpoint ended up being routed to one of the servers that had just recovered network connectivity.

debt closed this task as Resolved. Oct 10 2017, 6:30 PM

Thanks for the investigation, @Gehel!