Maniphest T205672

Elasticsearch puppet config changes broke puppet in various instances
Closed, Resolved · Public

Authored By

	Krenair
	Sep 27 2018, 9:58 PM

Description

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441338/
Instances with known problems: deployment-elastic0[5-7], deployment-logstash2, tools-elastic-XX, keith-logstash

Details

Subject	Repo	Branch	Lines +/-
[deployment-prep] fix elastic config for deployment-logstash2	operations/puppet	production	+5 -1
tools: Update usage of ::elasticsearch (take 2)	operations/puppet	production	+37 -37
tools: Update usage of ::elasticsearch	operations/puppet	production	+6 -1

Related Objects

Mentioned In: T205863: Logstash in beta doesn't have any logs
Mentioned Here: T153468: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs
T205863: Logstash in beta doesn't have any logs

Event Timeline

Krenair created this task. Sep 27 2018, 9:58 PM

Restricted Application added a subscriber: Aklapper. Sep 27 2018, 9:58 PM

Krenair added subscribers: Gehel, EBernhardson. Sep 27 2018, 9:58 PM

Change 463385 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] tools: Update usage of ::elasticsearch

https://gerrit.wikimedia.org/r/463385

gerritbot added a project: Patch-For-Review. Sep 27 2018, 9:59 PM

Change 463385 merged by Andrew Bogott:
[operations/puppet@production] tools: Update usage of ::elasticsearch

https://gerrit.wikimedia.org/r/463385

Change 463386 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] tools: Update usage of ::elasticsearch (take 2)

https://gerrit.wikimedia.org/r/463386

Added this to the deployment-prep project hiera: profile::elasticsearch::base_data_dir: /srv/elasticsearch
Added the following hiera settings to each deployment-elastic* instance:

profile::elasticsearch::dc_settings:
  tls_port: 9243
  certificate_name: deployment-elastic07.deployment-prep.eqiad.wmflabs

elasticsearch_5@beta-search service on deployment-logstash2 is still failing shortly after starting.

Sep 27 23:04:22 deployment-logstash2 systemd[1]: Starting Elasticsearch (cluster beta-search)...
Sep 27 23:04:22 deployment-logstash2 systemd[1]: Started Elasticsearch (cluster beta-search).
Sep 27 23:04:30 deployment-logstash2 systemd[1]: elasticsearch_5@beta-search.service: main process exited, code=exited, status=1/FAILURE
Sep 27 23:04:30 deployment-logstash2 systemd[1]: Unit elasticsearch_5@beta-search.service entered failed state.

Change 463386 merged by Andrew Bogott:
[operations/puppet@production] tools: Update usage of ::elasticsearch (take 2)

https://gerrit.wikimedia.org/r/463386

debt moved this task from needs triage to watching / waiting on the Discovery-Search board. Oct 4 2018, 5:26 PM

Krenair added a project: Beta-Cluster-Infrastructure. Oct 22 2018, 5:53 PM

Krenair moved this task from To Triage to Puppet errors on the Beta-Cluster-Infrastructure board.

Problems seen on deployment-logstash2 so far:

It has the beta-search cluster configured alongside labs-logstash-eqiad, but I think only labs-logstash-eqiad should be present.
The labs-logstash-eqiad instance tries to load its data from the default path (java.nio.file.AccessDeniedException: /var/lib/elasticsearch/nodes) instead of /srv/elasticsearch/labs-logstash-eqiad/.
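One way to check the data path problem described above is to test each candidate directory as the service user. The following is only a minimal sketch: the function name `describe_data_dir` is hypothetical, and the two paths in the usage note are simply the ones quoted in the comment above, not values read from any Elasticsearch configuration.

```python
import os
import stat


def describe_data_dir(path):
    """Return a short status string for a candidate data directory.

    Hypothetical helper for illustration; it only inspects the filesystem
    and knows nothing about how Elasticsearch itself resolves its data path.
    """
    try:
        mode = os.stat(path).st_mode
    except (FileNotFoundError, NotADirectoryError):
        return "missing"
    except PermissionError:
        # A parent directory could not be traversed by this user.
        return "access denied"
    if not stat.S_ISDIR(mode):
        return "not a directory"
    # os.access checks the real uid/gid of the calling process, so the
    # answer is only meaningful when run as the service user.
    if os.access(path, os.R_OK | os.W_OK | os.X_OK):
        return "accessible"
    return "access denied"
```

Calling `describe_data_dir("/var/lib/elasticsearch/nodes")` and `describe_data_dir("/srv/elasticsearch/labs-logstash-eqiad")` as the user the service runs under would show which of the two locations that user can actually read and write. Run as root, the check will normally report "accessible" regardless of ownership.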

I'm seeing only the beta-search cluster configured by puppet on deployment-logstash2 at the moment:

● elasticsearch.service                                                       masked failed failed    elasticsearch.service
● elasticsearch_5@beta-search.service                                         loaded failed failed    Elasticsearch (cluster beta-search)

And beta-search is failing to start / crashlooping with:

[2018-10-24T07:50:05,874][ERROR][org.elasticsearch.bootstrap.Bootstrap] Exception
java.lang.IllegalArgumentException: unknown setting [ltr.caches.max_mem] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
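When a service is crashlooping, the same bootstrap error repeats in the log, so it can help to pull out just the distinct setting names that were rejected. The sketch below assumes the message format shown in the log line above ("unknown setting [name]"); other Elasticsearch versions may word it differently, and the function name `unknown_settings` is hypothetical.

```python
import re

# Pattern based only on the message format quoted in this task.
UNKNOWN_SETTING = re.compile(r"unknown setting \[([^\]]+)\]")


def unknown_settings(log_text):
    """Return the distinct rejected setting names, in first-seen order."""
    seen = []
    for name in UNKNOWN_SETTING.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen
```

Fed the log excerpt above, this returns a one-element list containing `ltr.caches.max_mem`, which is the setting to look for in the puppet-generated Elasticsearch configuration.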

fgiunchedi mentioned this in T205863: Logstash in beta doesn't have any logs. Oct 24 2018, 7:53 AM

fgiunchedi added a subscriber: Ladsgroup.

Change 469387 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [deployment-prep] fix elastic config for deployment-logstash2

https://gerrit.wikimedia.org/r/469387

dcausse claimed this task. Oct 24 2018, 2:53 PM

dcausse moved this task from watching / waiting to Current work on the Discovery-Search board.

dcausse edited projects, added Discovery-Search (Current work); removed Discovery-Search.

dcausse moved this task from Incoming to Waiting on the Discovery-Search (Current work) board.

Change 469387 merged by Filippo Giunchedi:
[operations/puppet@production] [deployment-prep] fix elastic config for deployment-logstash2

https://gerrit.wikimedia.org/r/469387

Patch merged, though ferm fails because of a known problem with ferm and @resolve for A+AAAA records when the latter are not present:

Oct 25 08:57:01 deployment-logstash2 systemd[1]: Starting LSB: ferm firewall configuration...
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: Starting Firewall: fermError in /etc/ferm/conf.d/10_prometheus_elasticsearch_exporter_9108 line 4:
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: deployment-prometheus01.deployment-prep.eqiad.wmflabs
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: )
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: , AAAA
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: )
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: <--
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: DNS query for 'deployment-prometheus01.deployment-prep.eqiad.wmflabs' failed: NXDOMAIN
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: (warning).
Oct 25 08:57:01 deployment-logstash2 systemd[1]: ferm.service: control process exited, code=exited status=255
Oct 25 08:57:01 deployment-logstash2 systemd[1]: Failed to start LSB: ferm firewall configuration.
Oct 25 08:57:01 deployment-logstash2 systemd[1]: Unit ferm.service entered failed state.
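To see quickly whether a host has both A and AAAA records before ferm tries to resolve it, a small check along these lines can help. This is a sketch under stated assumptions: it uses Python's `socket.getaddrinfo`, which goes through the system resolver (including /etc/hosts) rather than the Net::DNS library ferm uses, and it cannot tell NXDOMAIN apart from an empty NOERROR answer. The function name `address_families` and the injectable `resolve` parameter are hypothetical.

```python
import socket


def address_families(hostname, resolve=socket.getaddrinfo):
    """Return the addresses found for a hostname, keyed by record type.

    `resolve` defaults to socket.getaddrinfo and is injectable so the
    function can be exercised without network access.
    """
    result = {}
    for label, family in (("A", socket.AF_INET), ("AAAA", socket.AF_INET6)):
        try:
            infos = resolve(hostname, None, family)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the textual address.
            result[label] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            # Resolution failed for this family: report no addresses.
            result[label] = []
    return result
```

An empty "AAAA" list for `deployment-prometheus01.deployment-prep.eqiad.wmflabs` would be consistent with the ferm failure shown above, although the system resolver's view may differ from what ferm sees.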

dcausse moved this task from Waiting to Needs Reporting on the Discovery-Search (Current work) board. Oct 25 2018, 12:10 PM

Looks like logs in deployment-prep are back now (cc @Ladsgroup) though we still have the ferm issue (tracked in T153468: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs). As far as I'm concerned logstash in deployment-prep is fixed now. Not sure about the other instances mentioned in the task description?

I overlooked other instances, deployment-elastic0* seems fine, no clue about others and not sure I have access to them (re-assigning to Krenair so that it can confirm it's fixed everywhere)

In T205672#4694292, @dcausse wrote:

I overlooked other instances, deployment-elastic0* seems fine, no clue about others and not sure I have access to them (re-assigning to Krenair so that it can confirm it's fixed everywhere)

'it' can confirm it was fixed everywhere :)
btw, if you have access to any deployment-prep machine then you have access to all deployment-prep instances.

Sorry I totally forgot I wrote in the description about those tools- and keith- hosts - I don't have access to those. @Andrew @herron are we all good there?

All good on keith-logstash, that was just a temporary dev host.

Moving into a watching status for Discovery-Search

debt moved this task from needs triage to watching / waiting on the Discovery-Search board. Nov 2 2018, 9:58 PM

I'm going to go ahead and assume tools-elastic* is fine.
