Elasticsearch puppet config changes broke puppet in various instances
Closed, Resolved (Public)

Description

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441338/
Instances with known problems: deployment-elastic0[5-7], deployment-logstash2, tools-elastic-XX, keith-logstash

Event Timeline

Change 463385 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] tools: Update usage of ::elasticsearch

https://gerrit.wikimedia.org/r/463385

Change 463385 merged by Andrew Bogott:
[operations/puppet@production] tools: Update usage of ::elasticsearch

https://gerrit.wikimedia.org/r/463385

Change 463386 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] tools: Update usage of ::elasticsearch (take 2)

https://gerrit.wikimedia.org/r/463386

Added this to the deployment-prep project hiera: profile::elasticsearch::base_data_dir: /srv/elasticsearch
Added the following hiera settings to each deployment-elastic*:

profile::elasticsearch::dc_settings:
  tls_port: 9243
  certificate_name: deployment-elastic07.deployment-prep.eqiad.wmflabs

elasticsearch_5@beta-search service on deployment-logstash2 is still failing shortly after starting.

Sep 27 23:04:22 deployment-logstash2 systemd[1]: Starting Elasticsearch (cluster beta-search)...
Sep 27 23:04:22 deployment-logstash2 systemd[1]: Started Elasticsearch (cluster beta-search).
Sep 27 23:04:30 deployment-logstash2 systemd[1]: elasticsearch_5@beta-search.service: main process exited, code=exited, status=1/FAILURE
Sep 27 23:04:30 deployment-logstash2 systemd[1]: Unit elasticsearch_5@beta-search.service entered failed state.
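
When several units are failing on one host, it can help to pull the failed unit names out of journal output like the lines above. The following is a minimal sketch, assuming the journal text has already been captured as a list of strings; the function name `failed_units` and the regular expression are illustrative and not part of any existing tooling on these hosts.

```python
import re

# Matches systemd journal lines of the form seen above, e.g.
# "... systemd[1]: Unit elasticsearch_5@beta-search.service entered failed state."
# Assumption: the wording is the one used by the systemd version on this host.
FAILED_UNIT = re.compile(r"systemd\[\d+\]: Unit (?P<unit>\S+) entered failed state\.")


def failed_units(journal_lines):
    """Return unit names reported as failed, in order of first appearance."""
    seen = []
    for line in journal_lines:
        match = FAILED_UNIT.search(line)
        if match and match.group("unit") not in seen:
            seen.append(match.group("unit"))
    return seen
```

Newer systemd releases word this message differently, so the pattern would need adjusting there.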

Change 463386 merged by Andrew Bogott:
[operations/puppet@production] tools: Update usage of ::elasticsearch (take 2)

https://gerrit.wikimedia.org/r/463386

Problems seen on deployment-logstash2 so far:

  • It has the beta-search cluster configured alongside labs-logstash-eqiad, but I think only labs-logstash-eqiad should be present.
  • The labs-logstash-eqiad instance tries to load its data from the default path (java.nio.file.AccessDeniedException: /var/lib/elasticsearch/nodes) instead of /srv/elasticsearch/labs-logstash-eqiad/

I'm seeing only the beta-search cluster configured by puppet on deployment-logstash2 at the moment:

● elasticsearch.service                                                       masked failed failed    elasticsearch.service
● elasticsearch_5@beta-search.service                                         loaded failed failed    Elasticsearch (cluster beta-search)

And beta-search is failing to start / crashlooping with:

[2018-10-24T07:50:05,874][ERROR][org.elasticsearch.bootstrap.Bootstrap] Exception
java.lang.IllegalArgumentException: unknown setting [ltr.caches.max_mem] please check that any required plugins are installed, or check the breaking changes documentation for removed settings

Change 469387 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [deployment-prep] fix elastic config for deployment-logstash2

https://gerrit.wikimedia.org/r/469387

Change 469387 merged by Filippo Giunchedi:
[operations/puppet@production] [deployment-prep] fix elastic config for deployment-logstash2

https://gerrit.wikimedia.org/r/469387

Patch merged, though ferm fails because of a known problem with ferm and @resolve for A+AAAA records when the latter are not present:

Oct 25 08:57:01 deployment-logstash2 systemd[1]: Starting LSB: ferm firewall configuration...
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: Starting Firewall: fermError in /etc/ferm/conf.d/10_prometheus_elasticsearch_exporter_9108 line 4:
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: deployment-prometheus01.deployment-prep.eqiad.wmflabs
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: )
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: , AAAA
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: )
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: <--
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: DNS query for 'deployment-prometheus01.deployment-prep.eqiad.wmflabs' failed: NXDOMAIN
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: (warning).
Oct 25 08:57:01 deployment-logstash2 systemd[1]: ferm.service: control process exited, code=exited status=255
Oct 25 08:57:01 deployment-logstash2 systemd[1]: Failed to start LSB: ferm firewall configuration.
Oct 25 08:57:01 deployment-logstash2 systemd[1]: Unit ferm.service entered failed state.

Looks like logs in deployment-prep are back now (cc @Ladsgroup) though we still have the ferm issue (tracked in T153468: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs). As far as I'm concerned logstash in deployment-prep is fixed now. Not sure about the other instances mentioned in the task description?
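
To see in advance which hosts would trip the ferm @resolve problem described above, one can check whether a hostname resolves for both A and AAAA. The following is a minimal sketch using Python's standard `socket.getaddrinfo`; the function name `address_families` and the injectable `resolver` parameter are illustrative assumptions. It goes through the system resolver (including /etc/hosts), so it only approximates what ferm's Net::DNS lookup does.

```python
import socket


def address_families(hostname, resolver=socket.getaddrinfo):
    """Return the record types ("A", "AAAA") for which hostname resolves.

    `resolver` defaults to socket.getaddrinfo and is injectable so the
    function can be exercised without network access.
    """
    found = set()
    for label, family in (("A", socket.AF_INET), ("AAAA", socket.AF_INET6)):
        try:
            if resolver(hostname, None, family):
                found.add(label)
        except socket.gaierror:
            # No usable answer for this address family.
            pass
    return found
```

A host for which this returns only `{"A"}` is a candidate for the failure shown in the ferm log, since the AAAA lookup has nothing to return.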

I overlooked other instances, deployment-elastic0* seems fine, no clue about others and not sure I have access to them (re-assigning to Krenair so that it can confirm it's fixed everywhere)

'it' can confirm it was fixed everywhere :)
btw, if you have access to any deployment-prep machine then you have access to all deployment-prep instances.

Krenair added subscribers: herron, Andrew.

Sorry I totally forgot I wrote in the description about those tools- and keith- hosts - I don't have access to those. @Andrew @herron are we all good there?

All good on keith-logstash, that was just a temporary dev host.

I'm going to go ahead and assume tools-elastic* is fine.