
Elastica warning about Retrying connection to search.svc.eqiad.wmnet
Closed, Resolved · Public

Description

Since August 31st, 07:00 UTC, there have been many log entries showing:

Warning: Retrying connection to search.svc.eqiad.wmnet after 2 attempts. [Called from Closure$ElasticaConnection::getClient

Logstash query for last 24 hours: https://logstash.wikimedia.org/goto/05a595608d300efb8b21007ea0318a49

Event Timeline

hashar created this task. Aug 31 2016, 10:04 PM
Restricted Application added a project: Discovery. Aug 31 2016, 10:04 PM
Restricted Application added a subscriber: Aklapper.

I'm able to reproduce this fairly regularly with nc, although I need to issue the request 20 to 50 times before I get a 'No route to host' error. I think the pooled connections on the MediaWiki servers are saving them from having more issues, as they reuse open, valid connections most of the time. I haven't come up with much more than that, unfortunately.
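
For anyone who would rather script the repeated probe than run nc by hand, here is a minimal Python sketch. It is not what was actually run for this task. The function name `probe` is made up for illustration, and the port is left as a parameter because the task does not state one.

```python
import socket
from collections import Counter


def probe(host, port, attempts=50, timeout=2.0):
    """Open `attempts` fresh TCP connections and tally the outcomes."""
    results = Counter()
    for _ in range(attempts):
        try:
            # A brand-new connection each time, so connection pooling
            # cannot hide an unreachable backend.
            with socket.create_connection((host, port), timeout=timeout):
                results["ok"] += 1
        except OSError as exc:
            # EHOSTUNREACH is reported as "No route to host".
            results[exc.strerror or type(exc).__name__] += 1
    return results
```

Calling something like `probe("search.svc.eqiad.wmnet", 9200)` (9200 being the Elasticsearch default port, an assumption here) returns a tally of successful connections versus each distinct error string, which makes an intermittent failure easy to spot.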

hashar added a subscriber: Gehel. Sep 1 2016, 8:37 AM

Erik mentioned elastic1028 had puppet disabled and elasticsearch disabled at 18:55 on August 31st.

It is showing in pybal https://config-master.wikimedia.org/pybal/eqiad/search as enabled though:

{ 'host': 'elastic1028.eqiad.wmnet', 'weight': 30, 'enabled': True }

Not sure whether pybal actively monitors Elasticsearch to auto pool/depool servers, or whether that is handled via conftool or similar.

hashar added a comment. Sep 1 2016, 8:40 AM

From SAL:

2016-08-31

18:54 <gehel>	shutting down elasticsearch on elastic1028 to prepare moving server - T143685
22:17 <gehel>	restarting elasticsearch on elastic1028
22:32 <gehel>	depooling elastic1047 from LVS
22:33 <gehel@palladium>	conftool action : set/pooled=no; selector: name=elastic1047.eqiad.wmnet
hashar triaged this task as Medium priority. Sep 1 2016, 8:41 AM
hashar added a comment. Sep 1 2016, 8:48 AM

The etcd config is at https://config-master.wikimedia.org/conftool/eqiad/search and shows:

{ 'host': 'elastic1047.eqiad.wmnet', 'weight':10, 'enabled': False }

So at least that server is depooled (or should be).
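
Since each per-host line in these files is a Python-style dict literal, the pool state can be parsed rather than read by eye. This is a rough sketch that assumes one literal per line, as in the excerpts above. `parse_pool` and `enabled_hosts` are hypothetical helper names, and the sample text is just the two lines quoted in this task.

```python
import ast


def parse_pool(text):
    """Parse one dict literal per non-empty line into {host: entry}."""
    pool = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # literal_eval only accepts literals, so no code is executed.
        entry = ast.literal_eval(line)
        pool[entry["host"]] = entry
    return pool


def enabled_hosts(pool):
    """Return the sorted names of hosts whose entry is enabled."""
    return sorted(host for host, entry in pool.items() if entry["enabled"])


sample = """
{ 'host': 'elastic1028.eqiad.wmnet', 'weight': 30, 'enabled': True }
{ 'host': 'elastic1047.eqiad.wmnet', 'weight':10, 'enabled': False }
"""
pool = parse_pool(sample)
print(enabled_hosts(pool))  # ['elastic1028.eqiad.wmnet']
```

Running the same parser over the text fetched from two different endpoints and diffing the results would show at a glance where they disagree.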

Gehel added a comment. Sep 1 2016, 9:21 AM

Erik mentioned elastic1028 had puppet disabled and elasticsearch disabled at 18:55 on August 31st.
It is showing in pybal https://config-master.wikimedia.org/pybal/eqiad/search as enabled though:

{ 'host': 'elastic1028.eqiad.wmnet', 'weight': 30, 'enabled': True }

Not sure whether pybal actively monitors Elasticsearch to auto pool/depool servers, or whether that is handled via conftool or similar.

I restarted elasticsearch on elastic1028; from SAL:

22:17 gehel: restarting elasticsearch on elastic1028

hashar added a comment. Sep 1 2016, 9:29 AM

Sorry, I was misleading. That link, https://config-master.wikimedia.org/pybal/eqiad/search, is the old pybal conf and hasn't been updated since September. It should probably be garbage collected.

The proper one has /conftool/ in its path: https://config-master.wikimedia.org/conftool/eqiad/search

Mentioned in SAL [2016-09-01T10:17:12Z] <gehel> repooled elastic104[456] - T144450

Gehel closed this task as Resolved. Sep 1 2016, 10:19 AM
Gehel claimed this task.

The issue was caused by PyBal not redoing DNS resolution. elastic104[4567] have new IPs now that they have moved to new racks. It was necessary to remove them from PyBal and re-add them a few minutes later:

gehel@palladium:~$ sudo -i confctl --quiet select name=elastic1046.eqiad.wmnet set/pooled=inactive
gehel@palladium:~$ sudo -i confctl --quiet select name=elastic1046.eqiad.wmnet set/pooled=yes
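
The root cause described above (an address resolved once and never refreshed) can be checked for with a small script that compares a remembered IP against a fresh lookup. This is only a sketch and says nothing about how PyBal stores addresses internally. `stale_hosts` is a hypothetical helper, and the IPs in the example are made-up documentation addresses (192.0.2.0/24), not the real ones.

```python
import socket


def stale_hosts(cached, resolve=socket.gethostbyname):
    """Return {host: (cached_ip, current_ip)} for hosts whose IP changed.

    `cached` maps hostname -> the IP remembered from an earlier lookup.
    `resolve` is injectable so the check can be exercised without DNS.
    """
    stale = {}
    for host, old_ip in cached.items():
        new_ip = resolve(host)
        if new_ip != old_ip:
            stale[host] = (old_ip, new_ip)
    return stale


# Made-up addresses, for illustration only.
cached = {"elastic1046.eqiad.wmnet": "192.0.2.10"}
fresh = {"elastic1046.eqiad.wmnet": "192.0.2.77"}
print(stale_hosts(cached, resolve=fresh.__getitem__))
```

Any host reported here is one where a long-running process holding the old address would keep sending traffic to the wrong place until it is made to resolve the name again, which is what the depool and repool cycle achieved.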

The spam in logstash is gone :] Well done ops!

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:11 PM