
Provide snapshot of http://tools-elastic-01.tools.eqiad.wmflabs logs
Closed, Resolved · Public

Description

Could someone provide a snapshot of the nginx and elasticsearch logs from http://tools-elastic-01.tools.eqiad.wmflabs?

I'm getting back an error "502 Bad Gateway" from Nginx when I'm querying the elasticsearch cluster. I'm making quite a few queries but I don't think I should be making more than a handful of concurrent connections.

I want to work out if I'm the one causing the error (and if so how I might need to throttle things so I don't). Or if the error is caused by something else.

Event Timeline

Can you provide some details on the tool(s) in question and what we are looking for? heads up @bd808

So the tool is wikifactmine-pipeline which auths with elasticsearch as 'tools.wikifactmine-pipeline'.

What it does is query one index, parse the response, and then index some new documents into a different index based on that response.
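Concretely, each unit of work is roughly the pair of requests below. This is an illustrative sketch only: the query body, the output index name 'wikifactmine-facts' and the PASSWORD placeholder are made up, and the tool actually goes through the official elasticsearch javascript client rather than curl.

# Read: search the source index (placeholder query)
$ curl -s -XPOST 'http://tools-elastic-01.tools.eqiad.wmflabs/wikifactmine-papers/_search' \
    -d '{"query": {"match_all": {}}, "size": 10}'
# Write: index a derived document into a second index, authenticating as the tool account
$ curl -s --user 'tools.wikifactmine-pipeline:PASSWORD' \
    -XPOST 'http://tools-elastic-01.tools.eqiad.wmflabs/wikifactmine-facts/fact' \
    -d '{"paper": "some-paper-id", "fact": "an extracted fact"}'

Each of those round trips is a separate HTTP request through the nginx proxy.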

For example, all of my connections through the official elasticsearch javascript client were closed (after the client appears to have retried) at around 10:25 UTC on Feb 15th 2017 with a '502 Bad Gateway' error from nginx.

I've discovered that I can read the elasticsearch logs by connecting to tools-elastic-01.tools.eqiad.wmflabs and looking in /var/log/elasticsearch.

These don't seem to contain any revelations, just that something (unknown) happened to the cluster health:

[2017-02-15 10:24:22,958][INFO ][cluster.routing.allocation] [tools-elastic-01] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[wikifactmine-papers][0]] ...]).
[2017-02-15 10:24:23,628][INFO ][cluster.routing.allocation] [tools-elastic-01] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[wikifactmine-papers][0], [wikifactmine-papers][0]] ...]).
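The same health status can also be read straight from the cluster health endpoint rather than the log files (a quick sketch, assuming reads through the proxy don't need credentials):

$ curl -s 'http://tools-elastic-01.tools.eqiad.wmflabs/_cluster/health?pretty'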

Unfortunately (or not), I don't have permission to read /var/log/nginx, which I guess may contain the original responses from ES with the errors in them.

Not really.

If I can get away with not having to run my own elastic cluster and can instead use the communal tool-labs one, that would be cool.

There are a lot of error log entries that look like:

2017/02/15 09:55:45 [crit] 714#0: *1437567 connect() to 127.0.0.1:9200 failed (99: Cannot assign requested address) while connecting to upstream, client: 10.68.17.205, server: , request: "POST /wikifactmine-papers/_search HTTP/1.1", upstream: "http://127.0.0.1:9200/wikifactmine-papers/_search", host: "tools-elastic-01.tools.eqiad.wmflabs"

The errors seem to come in clusters:

$ awk '{print $1, $2}' error.log|sort|uniq -c
      6 2017/02/15 09:55:45
     16 2017/02/15 10:08:46
      6 2017/02/15 10:25:29
$ awk '{print $1, $2}' error.log.1|sort|uniq -c
      7 2017/02/14 13:55:58
      8 2017/02/14 14:31:44

Looking at the time around the largest cluster in the access log:

$ awk '{print $4}' access.log|uniq -c|grep -E 10:08:[34]
    786 [15/Feb/2017:10:08:30
    713 [15/Feb/2017:10:08:31
    697 [15/Feb/2017:10:08:32
    723 [15/Feb/2017:10:08:33
    695 [15/Feb/2017:10:08:34
    670 [15/Feb/2017:10:08:35
    640 [15/Feb/2017:10:08:36
    806 [15/Feb/2017:10:08:37
    631 [15/Feb/2017:10:08:38
    651 [15/Feb/2017:10:08:39
    745 [15/Feb/2017:10:08:40
    594 [15/Feb/2017:10:08:41
    610 [15/Feb/2017:10:08:42
    545 [15/Feb/2017:10:08:43
    453 [15/Feb/2017:10:08:44
    481 [15/Feb/2017:10:08:45
    229 [15/Feb/2017:10:08:46
$ awk '{print $4}' access.log|uniq -c|grep -E 10:08:[34]|awk '{sum += $1} END {print sum}'
10669

This looks a bit like TCP source port exhaustion for the pool of 28,231 local ports available for Nginx to create a TCP connection to the Elasticsearch process on 127.0.0.1:9200.
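A rough way to check that theory on the host would be to compare the ephemeral port range with the number of sockets sitting in TIME_WAIT towards the upstream (a sketch; I haven't pasted the actual numbers here):

# Range of local ports available for outgoing connections to 127.0.0.1:9200
$ sysctl net.ipv4.ip_local_port_range
# Roughly how many of them are currently stuck in TIME_WAIT
$ ss -tan state time-wait '( dport = :9200 )' | wc -l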

Let's check to see who's making all those connections:

$ awk '{print $4, $3}' access.log|grep -E 10:08:[34]|awk '{print $2}'|sort|uniq -c
      1 stashbot
  10668 tools.wikifactmine-pipeline

So it looks like wikifactmine-pipeline is pretty badly behaved with respect to the number of individual HTTP connections it is making in a short period of time. There may be something we can do to tune Nginx's reverse-proxy socket handling so that sockets spend less time in TIME_WAIT, but the first thing to look into would be getting wikifactmine-pipeline to batch its inserts.
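For the batching, Elasticsearch's _bulk endpoint lets a client send many index operations in a single HTTP request instead of one request per document. A rough curl sketch (the output index name, document bodies and PASSWORD placeholder are invented; the official javascript client exposes the same API as a bulk call):

# Newline-delimited JSON: an action line followed by a source line per document;
# the body must end with a trailing newline.
$ curl -s --user 'tools.wikifactmine-pipeline:PASSWORD' \
    -XPOST 'http://tools-elastic-01.tools.eqiad.wmflabs/_bulk' --data-binary @- <<'EOF'
{"index": {"_index": "wikifactmine-facts", "_type": "fact"}}
{"paper": "some-paper-id", "fact": "first fact"}
{"index": {"_index": "wikifactmine-facts", "_type": "fact"}}
{"paper": "some-paper-id", "fact": "second fact"}
EOF

On the Nginx side, an upstream block with keepalive connections to 127.0.0.1:9200 would also cut down on the socket churn, but that is a config change rather than something the tool can do itself.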

h/t to https://ma.ttias.be/nginx-cannot-assign-requested-address-for-upstream/ for leading me to think about TCP port exhaustion as the possible culprit here.

Tarrow claimed this task.

I'm not sure what the solution to this problem is, or whether it might impact other people.

In the end I set up my own cluster under wikifactmine's labs project in an attempt to debug the problem, using the toollabs:elasticsearch puppet role. Surprisingly, I was unable to replicate the problem: even querying continuously for 12 hours, I wasn't able to exhaust the available sockets.
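A crude way to keep that kind of sustained query load going looks something like the loop below (an illustrative sketch rather than the exact test I ran; the hostname is a placeholder for the test cluster):

# Fire batches of 50 concurrent searches, forever; Ctrl-C to stop
$ while true; do
    for i in $(seq 1 50); do
      curl -s -o /dev/null -XPOST 'http://ELASTIC-TEST-HOST/wikifactmine-papers/_search' \
        -d '{"query": {"match_all": {}}}' &
    done
    wait
  done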

I guess it either comes down to the combined load from the multiple users of the tools cluster, or a configuration difference has slipped in somewhere. I'm not worried about solving this, but if it crops up again for someone else then we could look at comparing setups.