Elasticsearch puppet config changes broke puppet in various instances
Open, Needs Triage · Public

Description

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441338/
Instances with known problems: deployment-elastic0[5-7], deployment-logstash2, tools-elastic-XX, keith-logstash

Krenair created this task. Sep 27 2018, 9:58 PM
Restricted Application added a subscriber: Aklapper. Sep 27 2018, 9:58 PM

Change 463385 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] tools: Update usage of ::elasticsearch

https://gerrit.wikimedia.org/r/463385

Change 463385 merged by Andrew Bogott:
[operations/puppet@production] tools: Update usage of ::elasticsearch

https://gerrit.wikimedia.org/r/463385

Change 463386 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] tools: Update usage of ::elasticsearch (take 2)

https://gerrit.wikimedia.org/r/463386

Krenair added a comment. Edited Sep 27 2018, 10:37 PM

Added this to the deployment-prep project hiera: profile::elasticsearch::base_data_dir: /srv/elasticsearch
Added the following hiera settings to each deployment-elastic* instance:

profile::elasticsearch::dc_settings:
  tls_port: 9243
  certificate_name: deployment-elastic07.deployment-prep.eqiad.wmflabs

elasticsearch_5@beta-search service on deployment-logstash2 is still failing shortly after starting.

Sep 27 23:04:22 deployment-logstash2 systemd[1]: Starting Elasticsearch (cluster beta-search)...
Sep 27 23:04:22 deployment-logstash2 systemd[1]: Started Elasticsearch (cluster beta-search).
Sep 27 23:04:30 deployment-logstash2 systemd[1]: elasticsearch_5@beta-search.service: main process exited, code=exited, status=1/FAILURE
Sep 27 23:04:30 deployment-logstash2 systemd[1]: Unit elasticsearch_5@beta-search.service entered failed state.

Change 463386 merged by Andrew Bogott:
[operations/puppet@production] tools: Update usage of ::elasticsearch (take 2)

https://gerrit.wikimedia.org/r/463386

dcausse added a subscriber: dcausse. Edited Oct 22 2018, 5:59 PM

problems seen on deployment-logstash2 so far:

  • it has the cluster beta-search configured alongside labs-logstash-eqiad but only labs-logstash-eqiad should be present I think.
  • the instance labs-logstash-eqiad wants to load its data using the default path (java.nio.file.AccessDeniedException: /var/lib/elasticsearch/nodes) instead of /srv/elasticsearch/labs-logstash-eqiad/

I'm seeing only beta-search cluster configured by puppet on deployment-logstash2 ATM:

● elasticsearch.service                                                       masked failed failed    elasticsearch.service
● elasticsearch_5@beta-search.service                                         loaded failed failed    Elasticsearch (cluster beta-search)

And beta-search is failing to start / crashlooping with:

[2018-10-24T07:50:05,874][ERROR][org.elasticsearch.bootstrap.Bootstrap] Exception
java.lang.IllegalArgumentException: unknown setting [ltr.caches.max_mem] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
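When a node crashloops like this, it can help to pull the rejected setting names out of the log so they can be compared against the plugins installed on the host. The following is a minimal sketch, not part of the original fix: the function name `unknown_settings` is hypothetical, and the pattern only assumes the "unknown setting [...]" wording seen in the exception above.

```python
import re

# Matches the bracketed name in messages such as
# "unknown setting [ltr.caches.max_mem] please check that ...".
UNKNOWN_SETTING_RE = re.compile(r"unknown setting \[([^\]]+)\]")


def unknown_settings(log_lines):
    """Return the distinct rejected setting names, in first-seen order."""
    seen = []
    for line in log_lines:
        for name in UNKNOWN_SETTING_RE.findall(line):
            if name not in seen:
                seen.append(name)
    return seen
```

Feeding it the lines of the Elasticsearch log for the failing cluster would list each offending setting once, even if the service has restarted many times.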

Change 469387 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [deployment-prep] fix elastic config for deployment-logstash2

https://gerrit.wikimedia.org/r/469387

dcausse claimed this task.
dcausse moved this task from Backlog to Waiting/Blocked on the Discovery-Search (Current work) board.

Change 469387 merged by Filippo Giunchedi:
[operations/puppet@production] [deployment-prep] fix elastic config for deployment-logstash2

https://gerrit.wikimedia.org/r/469387

Patch merged, though ferm fails because of a known problem with ferm and @resolve for A+AAAA records when the latter are not present:

Oct 25 08:57:01 deployment-logstash2 systemd[1]: Starting LSB: ferm firewall configuration...
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: Starting Firewall: fermError in /etc/ferm/conf.d/10_prometheus_elasticsearch_exporter_9108 line 4:
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: deployment-prometheus01.deployment-prep.eqiad.wmflabs
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: )
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: , AAAA
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: )
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: <--
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: DNS query for 'deployment-prometheus01.deployment-prep.eqiad.wmflabs' failed: NXDOMAIN
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: (warning).
Oct 25 08:57:01 deployment-logstash2 systemd[1]: ferm.service: control process exited, code=exited status=255
Oct 25 08:57:01 deployment-logstash2 systemd[1]: Failed to start LSB: ferm firewall configuration.
Oct 25 08:57:01 deployment-logstash2 systemd[1]: Unit ferm.service entered failed state.
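To see which record types a host actually resolves to before ferm trips over it, a quick check along these lines may be useful. This is a rough sketch under stated assumptions: `resolvable_families` is a hypothetical helper, it goes through the system resolver via `socket.getaddrinfo` rather than the Net::DNS Perl library that ferm uses, and so it only approximates what `@resolve` would see.

```python
import socket


def resolvable_families(hostname, resolver=socket.getaddrinfo):
    """Report whether hostname resolves to IPv4 (A) and IPv6 (AAAA) addresses.

    The resolver argument exists so the lookup can be replaced in tests;
    by default the system resolver is used.
    """
    result = {}
    for label, family in (("A", socket.AF_INET), ("AAAA", socket.AF_INET6)):
        try:
            # An empty result list is treated the same as a failed lookup.
            result[label] = bool(resolver(hostname, None, family))
        except socket.gaierror:
            result[label] = False
    return result
```

Calling `resolvable_families("deployment-prometheus01.deployment-prep.eqiad.wmflabs")` on the affected instance would show whether the missing AAAA record is the only gap or whether the A lookup fails as well.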

Looks like logs in deployment-prep are back now (cc @Ladsgroup) though we still have the ferm issue (tracked in T153468: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs). As far as I'm concerned logstash in deployment-prep is fixed now. Not sure about the other instances mentioned in the task description?

dcausse reassigned this task from dcausse to Krenair. Edited Oct 25 2018, 12:21 PM

I overlooked other instances, deployment-elastic0* seems fine, no clue about others and not sure I have access to them (re-assigning to Krenair so that it can confirm it's fixed everywhere)

Krenair closed this task as Resolved. Oct 26 2018, 5:02 PM

I overlooked other instances, deployment-elastic0* seems fine, no clue about others and not sure I have access to them (re-assigning to Krenair so that it can confirm it's fixed everywhere)

'it' can confirm it was fixed everywhere :)
btw, if you have access to any deployment-prep machine then you have access to all deployment-prep instances.

Krenair reopened this task as Open. Oct 26 2018, 5:05 PM
Krenair added subscribers: herron, Andrew.

Sorry, I totally forgot that I mentioned the tools- and keith- hosts in the description. I don't have access to those. @Andrew @herron are we all good there?

All good on keith-logstash, that was just a temporary dev host.

debt added a subscriber: debt.

Moving into a watching status for Discovery-Search