
Set up two-node elasticsearch cluster on relforge1001-1002
Closed, Resolved · Public

Description

These new servers are now racked and installed. They need the appropriate puppet configuration deployed to make them into a two-node elasticsearch cluster. Port 80 should be made accessible from the labs network (this may need a separate ticket for labs admins). Previously we put together the elasticsearch::proxy module in puppet, which exposes a limited portion of elasticsearch on port 80; port 9200 should not be available from labs.
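
A rough way to verify the intended exposure from a labs host, once the proxy is in place, is sketched below. This is not part of the deployed puppet code; the hostnames come from this task, while plain HTTP on port 80 and the standard library calls are assumptions.

import socket
import urllib.request

HOSTS = ["relforge1001.eqiad.wmnet", "relforge1002.eqiad.wmnet"]

def port_open(host, port, timeout=3):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host in HOSTS:
    # Port 80 should answer through the elasticsearch::proxy module...
    if port_open(host, 80):
        try:
            with urllib.request.urlopen("http://%s/" % host, timeout=3) as resp:
                print(host, "port 80 reachable, HTTP", resp.status)
        except OSError as exc:
            print(host, "port 80 open but HTTP request failed:", exc)
    else:
        print(host, "port 80 NOT reachable")
    # ...while port 9200 should stay closed to labs.
    print(host, "port 9200 blocked:", not port_open(host, 9200))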

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · Jun 7 2016, 10:21 PM
debt triaged this task as High priority. · Jun 16 2016, 10:23 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.

Change 299865 had a related patch set uploaded (by Gehel):
WIP - configure new relevance forge servers

https://gerrit.wikimedia.org/r/299865

Change 299865 merged by Gehel:
Configure new relevance forge servers

https://gerrit.wikimedia.org/r/299865

Change 300241 had a related patch set uploaded (by Gehel):
Adding rack information for new relforge servers

https://gerrit.wikimedia.org/r/300241

Change 300241 merged by Gehel:
Adding rack information for new relforge servers

https://gerrit.wikimedia.org/r/300241

Mentioned in SAL [2016-07-21T09:47:51Z] <gehel> reinstalling and configuring relforge1001/1002 - T137256

Gehel added a subscriber: Gehel. · Jul 21 2016, 2:27 PM

Change 300286 had a related patch set uploaded (by Gehel):
New partition scheme for relforge (elasticsearch) servers

https://gerrit.wikimedia.org/r/300286

Change 300286 merged by Gehel:
Changed partition scheme for relforge (elasticsearch) servers

https://gerrit.wikimedia.org/r/300286

Gehel added a comment. · Jul 21 2016, 8:36 PM

Deployment of plugins for elasticsearch is done by trebuchet, which requires connecting to redis on tin. This is not allowed by the current ferm rules.
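
A quick way to confirm that the ferm rules are what blocks this is a bare TCP connect to the redis port on the deployment server. A minimal sketch follows; the FQDN and the default redis port 6379 are assumptions, not values taken from this task.

import socket

# Assumed deployment host and default redis port; adjust to the actual setup.
DEPLOY_HOST = "tin.eqiad.wmnet"
REDIS_PORT = 6379

try:
    with socket.create_connection((DEPLOY_HOST, REDIS_PORT), timeout=5):
        print("redis port reachable - ferm rules allow the connection")
except OSError as exc:
    print("redis port NOT reachable (%s) - consistent with ferm blocking it" % exc)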

Installation is done, but elasticsearch master election is failing. The firewall seems to be open (at least port 9300 is open between relforge1001 and relforge1002).

Investigating...
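
To narrow this down, a small sketch like the one below can be run from relforge1001: it checks raw transport connectivity to relforge1002 on port 9300 and then asks the local node whether it currently sees a master. The peer endpoint and the /_cluster/state/master_node call are taken from the log below; everything else is assumed.

import json
import socket
import urllib.request

PEER = ("relforge1002.eqiad.wmnet", 9300)   # transport port, as seen in the log
LOCAL_API = "http://localhost:9200"          # HTTP API on the local node

# 1. Raw TCP reachability of the peer's transport port.
try:
    with socket.create_connection(PEER, timeout=3):
        print("TCP to %s:%d works" % PEER)
except OSError as exc:
    print("TCP to %s:%d FAILED: %s" % (PEER[0], PEER[1], exc))

# 2. Does the local node currently know a master?
try:
    with urllib.request.urlopen(LOCAL_API + "/_cluster/state/master_node", timeout=10) as resp:
        print(json.load(resp))
except OSError as exc:
    print("cluster state query failed:", exc)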

Log extract:

[2016-08-02 17:44:43,292][DEBUG][action.admin.cluster.state] [relforge1001] no known master node, scheduling a retry
[2016-08-02 17:45:05,237][WARN ][discovery.zen.ping.unicast] [relforge1001] failed to send ping to [{#zen_unicast_2#}{10.64.37.21}{relforge1002.eqiad.wmnet/10.64.37.21:9300}]
SendRequestTransportException[[][relforge1002.eqiad.wmnet/10.64.37.21:9300][internal:discovery/zen/unicast]]; nested: NodeNotConnectedException[[][relforge1002.eqiad.wmnet/10.64.37.21:9300] Node not connected];
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:340)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.sendPingRequestToNode(UnicastZenPing.java:440)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.sendPings(UnicastZenPing.java:426)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.ping(UnicastZenPing.java:240)
        at org.elasticsearch.discovery.zen.ping.ZenPingService.ping(ZenPingService.java:106)
        at org.elasticsearch.discovery.zen.ping.ZenPingService.pingAndWait(ZenPingService.java:84)
        at org.elasticsearch.discovery.zen.ZenDiscovery.findMaster(ZenDiscovery.java:886)
        at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:350)
        at org.elasticsearch.discovery.zen.ZenDiscovery.access$4800(ZenDiscovery.java:91)
        at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1237)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: NodeNotConnectedException[[][relforge1002.eqiad.wmnet/10.64.37.21:9300] Node not connected]
        at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:1132)
        at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:819)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:329)
        ... 12 more
[2016-08-02 17:45:13,294][DEBUG][action.admin.cluster.state] [relforge1001] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2016-08-02 17:45:13,296][WARN ][rest.suppressed          ] /_cluster/state/master_node Params: {metric=master_node}
MasterNotDiscoveredException[null]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$5.onTimeout(TransportMasterNodeAction.java:226)
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:236)
        at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:804)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
[2016-08-02 17:45:27,096][DEBUG][action.admin.cluster.health] [relforge1001] no known master node, scheduling a retry
[2016-08-02 17:45:31,120][DEBUG][action.admin.cluster.health] [relforge1001] no known master node, scheduling a retry
[2016-08-02 17:45:43,323][DEBUG][action.admin.cluster.state] [relforge1001] no known master node, scheduling a retry
Gehel added a comment. · Aug 3 2016, 8:45 AM

Elasticsearch is up and running and the cluster is green. I'll keep this task open a bit longer, as we are probably still missing a few minor tweaks to make everything work perfectly...
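
For reference, the kind of check used to confirm the green state is sketched below; it assumes it is run on one of the relforge hosts where port 9200 is reachable locally.

import json
import urllib.request

# Query cluster health on the local node; expect "green" once both nodes have joined.
with urllib.request.urlopen("http://localhost:9200/_cluster/health", timeout=10) as resp:
    health = json.load(resp)

print("status:", health["status"], "- nodes:", health["number_of_nodes"])
assert health["status"] == "green", "cluster not green yet"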

debt closed this task as Resolved. · Aug 16 2016, 10:07 PM