https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441338/
Instances with known problems: deployment-elastic0[5-7], deployment-logstash2, tools-elastic-XX, keith-logstash
Change 463385 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] tools: Update usage of ::elasticsearch
Change 463385 merged by Andrew Bogott:
[operations/puppet@production] tools: Update usage of ::elasticsearch
Change 463386 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] tools: Update usage of ::elasticsearch (take 2)
Added this to deployment-prep project hiera: profile::elasticsearch::base_data_dir: /srv/elasticsearch
Added the following hiera settings to each deployment-elastic* instance:
profile::elasticsearch::dc_settings:
  tls_port: 9243
  certificate_name: deployment-elastic07.deployment-prep.eqiad.wmflabs
elasticsearch_5@beta-search service on deployment-logstash2 is still failing shortly after starting.
Sep 27 23:04:22 deployment-logstash2 systemd[1]: Starting Elasticsearch (cluster beta-search)...
Sep 27 23:04:22 deployment-logstash2 systemd[1]: Started Elasticsearch (cluster beta-search).
Sep 27 23:04:30 deployment-logstash2 systemd[1]: elasticsearch_5@beta-search.service: main process exited, code=exited, status=1/FAILURE
Sep 27 23:04:30 deployment-logstash2 systemd[1]: Unit elasticsearch_5@beta-search.service entered failed state.
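For checking the other instances in the task description, journal output like the above can be scanned for failed units. This is a minimal Python sketch, not an existing tool: `failed_units` is a hypothetical helper, and it assumes systemd logs the exact "entered failed state" wording shown in this task.

```python
import re

# Assumption: matches the systemd message format seen in the log above.
FAILED_STATE = re.compile(r"Unit (?P<unit>\S+) entered failed state\.")


def failed_units(journal_lines):
    """Return unit names that entered the failed state, in first-seen order."""
    units = []
    for line in journal_lines:
        match = FAILED_STATE.search(line)
        if match and match.group("unit") not in units:
            units.append(match.group("unit"))
    return units


# Two lines copied from the journal output in this task.
sample = [
    "Sep 27 23:04:22 deployment-logstash2 systemd[1]: Started Elasticsearch (cluster beta-search).",
    "Sep 27 23:04:30 deployment-logstash2 systemd[1]: Unit elasticsearch_5@beta-search.service entered failed state.",
]
print(failed_units(sample))
```

A crashlooping service logs the same message repeatedly, so each unit is reported only once.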
Change 463386 merged by Andrew Bogott:
[operations/puppet@production] tools: Update usage of ::elasticsearch (take 2)
problems seen on deployment-logstash2 so far:
- it has the cluster beta-search configured alongside labs-logstash-eqiad but only labs-logstash-eqiad should be present I think.
- the instance labs-logstash-eqiad wants to load its data using the default path (java.nio.file.AccessDeniedException: /var/lib/elasticsearch/nodes) instead of /srv/elasticsearch/labs-logstash-eqiad/
I'm seeing only beta-search cluster configured by puppet on deployment-logstash2 ATM:
● elasticsearch.service masked failed failed elasticsearch.service
● elasticsearch_5@beta-search.service loaded failed failed Elasticsearch (cluster beta-search)
And beta-search is failing to start / crashlooping with:
[2018-10-24T07:50:05,874][ERROR][org.elasticsearch.bootstrap.Bootstrap] Exception
java.lang.IllegalArgumentException: unknown setting [ltr.caches.max_mem] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
Change 469387 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [deployment-prep] fix elastic config for deployment-logstash2
Change 469387 merged by Filippo Giunchedi:
[operations/puppet@production] [deployment-prep] fix elastic config for deployment-logstash2
Patch merged, though ferm fails because of a known problem with ferm and @resolve for A+AAAA records when the latter are not present:
Oct 25 08:57:01 deployment-logstash2 systemd[1]: Starting LSB: ferm firewall configuration...
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: Starting Firewall: fermError in /etc/ferm/conf.d/10_prometheus_elasticsearch_exporter_9108 line 4:
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: deployment-prometheus01.deployment-prep.eqiad.wmflabs
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: )
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: , AAAA
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: )
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: <--
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: DNS query for 'deployment-prometheus01.deployment-prep.eqiad.wmflabs' failed: NXDOMAIN
Oct 25 08:57:01 deployment-logstash2 ferm[20167]: (warning).
Oct 25 08:57:01 deployment-logstash2 systemd[1]: ferm.service: control process exited, code=exited status=255
Oct 25 08:57:01 deployment-logstash2 systemd[1]: Failed to start LSB: ferm firewall configuration.
Oct 25 08:57:01 deployment-logstash2 systemd[1]: Unit ferm.service entered failed state.
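To illustrate the failure mode: the lookup for the AAAA record fails and aborts the whole ferm run, even though the host only needs to resolve over IPv4. The sketch below shows the tolerant behaviour one would want instead, in Python. It is not how ferm or Net::DNS work internally; `resolve_tolerant` and its `lookup` parameter are hypothetical names used only for illustration.

```python
import socket


def resolve_tolerant(hostname, lookup=socket.getaddrinfo):
    """Collect A and AAAA addresses, skipping any family that fails to resolve."""
    addresses = []
    for family in (socket.AF_INET, socket.AF_INET6):
        try:
            results = lookup(hostname, None, family, socket.SOCK_STREAM)
        except socket.gaierror:
            # A missing record for one family is not fatal; try the next one.
            continue
        for _family, _type, _proto, _canonname, sockaddr in results:
            if sockaddr[0] not in addresses:
                addresses.append(sockaddr[0])
    return addresses
```

With this approach a host that has only A records yields its IPv4 addresses, and an empty list signals that nothing resolved at all, which the caller can then treat as a real error.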
Looks like logs in deployment-prep are back now (cc @Ladsgroup) though we still have the ferm issue (tracked in T153468: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs). As far as I'm concerned logstash in deployment-prep is fixed now. Not sure about the other instances mentioned in the task description?
I overlooked the other instances. deployment-elastic0* seem fine; I have no clue about the others and am not sure I have access to them (re-assigning to Krenair so that it can confirm it's fixed everywhere).
'it' can confirm it was fixed everywhere :)
btw, if you have access to any deployment-prep machine then you have access to all deployment-prep instances.