Page MenuHomePhabricator

/var/run/elasticsearch deleted by elasticsearch
Closed, ResolvedPublic3 Estimated Story Points

Description

I've been looking into bug affecting elastic2045 with @dcausse where Elastic failed to start since /var/run/elasticsearch isn't present.

The elasticsearch-oss Debian package declares /var/run/elasticsearch in the package (via .dirs), and also ships a tmpfiles config in /usr/lib/tmpfiles.d/elasticsearch.conf

Looking at "systemd-analyze critical-chain" systemd-tmpfiles-setup.service was correctly started and there are also no errors in the journal for it.

Running "systemd-tmpfiles --create" correctly brought up /var/run/elasticsearch.

The next time this error happens, we should check whether the lack of the run dir is specific to elasticsearch or whether other directories declared in /usr/lib/tmpfiles.d are also affected.

Event Timeline

We tracked it down to an Elasticsearch restart removing /var/run/elasticsearch, this is the log by auditd (which was installed in deployment-prep, which shows the same error):

type=PROCTITLE msg=audit(03/02/2021 10:54:03.250:74) : proctitle=/usr/bin/java -Xms2G -Xmx2G -XX:+UseNUMA -XX:+DisableExplicitGC -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfi
type=PATH msg=audit(03/02/2021 10:54:03.250:74) : item=1 name=/var/run/elasticsearch/beta-search.pid inode=4956182 dev=00:14 mode=file,644 ouid=elasticsearch ogid=elasticsearch rdev=00:00 nametype=CREATE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=PATH msg=audit(03/02/2021 10:54:03.250:74) : item=0 name=/var/run/elasticsearch/ inode=4944609 dev=00:14 mode=dir,755 ouid=elasticsearch ogid=elasticsearch rdev=00:00 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=CWD msg=audit(03/02/2021 10:54:03.250:74) : cwd=/usr/share/elasticsearch
type=SYSCALL msg=audit(03/02/2021 10:54:03.250:74) : arch=x86_64 syscall=open success=yes exit=71 a0=0x7f908854c930 a1=O_WRONLY|O_CREAT|O_TRUNC a2=0666 a3=0x7f90790183fb items=2 ppid=1 pid=17398 auid=unset uid=elasticsearch gid=elasticsearch euid=elasticsearch suid=elasticsearch fsuid=elasticsearch egid=elasticsearch sgid=elasticsearch fsgid=elasticsearch tty=(none) ses=unset comm=java exe=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java subj==unconfined key=(null)
----
type=PROCTITLE msg=audit(03/02/2021 10:54:06.922:76) : proctitle=/usr/bin/java -Xms2G -Xmx2G -XX:+UseNUMA -XX:+DisableExplicitGC -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfi
type=PATH msg=audit(03/02/2021 10:54:06.922:76) : item=1 name=/var/run/elasticsearch/beta-search.pid inode=4956182 dev=00:14 mode=file,644 ouid=elasticsearch ogid=elasticsearch rdev=00:00 nametype=DELETE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=PATH msg=audit(03/02/2021 10:54:06.922:76) : item=0 name=/var/run/elasticsearch/ inode=4944609 dev=00:14 mode=dir,755 ouid=elasticsearch ogid=elasticsearch rdev=00:00 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=CWD msg=audit(03/02/2021 10:54:06.922:76) : cwd=/usr/share/elasticsearch
type=SYSCALL msg=audit(03/02/2021 10:54:06.922:76) : arch=x86_64 syscall=unlink success=yes exit=0 a0=0x7f902c0008c0 a1=0x7f9069997800 a2=0x7f902c0008c0 a3=0x7f90795dc901 items=2 ppid=1 pid=17398 auid=unset uid=elasticsearch gid=elasticsearch euid=elasticsearch suid=elasticsearch fsuid=elasticsearch egid=elasticsearch sgid=elasticsearch fsgid=elasticsearch tty=(none) ses=unset comm=Thread-2 exe=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java subj==unconfined key=(null)
dcausse renamed this task from /var/run/elasticsearch not created after reboot to /var/run/elasticsearch deleted by elasticsearch.Mar 2 2021, 3:38 PM

Triaging to high as this can cause serious problems.
The cause seems to be in elastic itself but I could not spot the exact problem looking at the elastic code. We might want to workaround the issue by always running systemd-tmpfiles --create from the elasticsearch systemd unit to make sure the folder exists when it's needed.

We might want to workaround the issue by always running systemd-tmpfiles --create from the elasticsearch systemd unit to make sure the folder exists when it's needed.

We can add a ExecStartPre to the systemd unit which runs "systemd-tmpfiles --create /usr/lib/tmpfiles.d/elasticsearch.conf" (we can't run systemd-tmpfiles --create since most of the other conffiles need root and that'll cause a lot of logspam)
The elasticsearch class already uses systemd::unit, so this can simply passed via the override option.

Change 720667 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] elasticsearch: Force creation of tmp files before restart

https://gerrit.wikimedia.org/r/720667

Change 720667 merged by Ryan Kemper:

[operations/puppet@production] elasticsearch: Force creation of tmp files before restart

https://gerrit.wikimedia.org/r/720667

Change 721644 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elasticsearch: it's ExecStartPre, not ExecPreStart

https://gerrit.wikimedia.org/r/721644

Change 721644 merged by Ryan Kemper:

[operations/puppet@production] elasticsearch: it's ExecStartPre, not ExecPreStart

https://gerrit.wikimedia.org/r/721644

Mentioned in SAL (#wikimedia-operations) [2021-11-16T23:19:15Z] <ryankemper> T276198 ryankemper@cumin1001:~$ sudo cumin '*elastic*' 'sudo disable-puppet "Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/721644"' (done a few mins ago)

Mentioned in SAL (#wikimedia-operations) [2021-11-16T23:27:51Z] <ryankemper> T276198 ryankemper@elastic1049:~$ sudo run-puppet-agent --force; elasticsearch_6@production-search-eqiad.service didn't restart but it looks like there might be slightly wrong with the new ExecPreStart line => Executable path is not absolute, ignoring: systemd-tmpfiles --create /usr/lib/tmpfiles.d/elasticsearch.conf

Change 739375 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elasticsearch: use absolute path of binary

https://gerrit.wikimedia.org/r/739375

Change 739375 merged by Ryan Kemper:

[operations/puppet@production] elasticsearch: use absolute path of binary

https://gerrit.wikimedia.org/r/739375

Mentioned in SAL (#wikimedia-operations) [2021-11-16T23:59:38Z] <ryankemper> T276198 ryankemper@elastic1049:~$ sudo run-puppet-agent --force to test out https://gerrit.wikimedia.org/r/c/operations/puppet/+/739375

Mentioned in SAL (#wikimedia-operations) [2021-11-17T00:19:18Z] <ryankemper> T276198 ryankemper@cumin1001:~$ sudo cumin -b 3 '*elastic*' 'sudo run-puppet-agent --force' Change looks good (no complaints from systemd), rolling out to rest of fleet / reenabling puppet

RKemper added a subscriber: RKemper.

We can probably move this to Needs Reporting but I'll check with @dcausse and others in our Search team Wednesday meeting. For now I'll stick this in Waiting while we make sure that the /var/run/elasticsearch gets created when intended (we might even want to simulate nuking /var/run/elasticsearch after stopping the service on a single host)

Gehel moved this task from Needs Reporting to Incoming on the Discovery-Search (Current work) board.
Gehel added a subscriber: Gehel.

This same problem is happening again on deployment-elastic07. This needs to be further investigated.

After the upgrade from 6.5 to 6.8, the service wouldn't restart. Initial error (when ES's rundir is missing):

Feb 17 10:34:07 deployment-elastic07 elasticsearch[12605]: Exception in thread "main" org.elasticsearch.bootstrap.BootstrapException: java.nio.file.AccessDeniedException: /var/run/elasticsearch

I was able to work around this with the following steps:

  1. Replace /usr/share/elasticsearch/bin/elasticsearch startup script on deployment-elastic07 with known-good version from deployment-elastic05 . shasum of known-good script: 2a11d1b38f6712e4898a383bf68c7ed5937ba0a1

This got me to the next error:
Feb 18 19:30:05 deployment-elastic07 elasticsearch[20988]: Exception in thread "main" org.elasticsearch.bootstrap.BootstrapException: org.elasticsearch.cli.UserException: unable to create temporary keystore at [/etc/elasticsearch/beta-search/elasticsearch.keystore.tmp], please check filesystem permissions

Worked around this by setting ownership of /etc/elasticsearch/beta-search to elasticsearch:elasticsearch . After ES does its internal upgrade, it no longer needs to own this directory, so I reverted its permissions to root:root and verified that the service started cleanly afterwards.

Long-term, we need to look into the tmpfiles creation in the unit file` /lib/systemd/system/elasticsearch_6@.service` , specifically the line ExecStartPre=/bin/systemd-tmpfiles --create /usr/lib/tmpfiles.d/elasticsearch.conf . No evidence of this, but I suspect we may need to create a separate rundir for each Elastic instance (we typically run 2 or 3 on a single host).

Confirmed the following:

  • Known-good ES startup script (shasum:2a11d1b38f6712e4898a383bf68c7ed5937ba0a1) is from Elastic's 6.5.4 release
  • Known-bad ES startup script (shasum:96904152795633d9ae8a7afd7886a4f62a30ee1f) is from Elastic's 6.8.23 release

In conclusion, we aren't introducing these breaking changes.

Ran a diff against good and bad, the bad has the following inserted in 23-29:

if [[ -d "$ES_HOME/plugins/quota-aware-fs" ]]; then
  PLUGIN_JAR_PATH="$(find "$ES_HOME/plugins/quota-aware-fs" -name quota-aware-fs\*.jar)"
  ES_JAVA_OPTS="$ES_JAVA_OPTS -Djava.nio.file.spi.DefaultFileSystemProvider=org.elasticsearch.fs.quotaaware.QuotaAwareFileSystemProvider"
  ES_JAVA_OPTS="$ES_JAVA_OPTS -Xbootclasspath/a:$PLUGIN_JAR_PATH"
fi

Running more tests to verify that the startup script is indeed the problem.

Upon further testing, the startup script seems irrelevant. However, I have noticed a new problem: the service can't start after a reboot. HOWEVER, the service will start manually. Will continue investigation tomorrow...

At least two (and sometimes three) instances of ES are using the same rundir, /var/run/elasticsearch .

Because systemd works in parallel, a single failed instance will delete the rundir and cause the other services to fail. I haven't proven this definitively, but I confirmed that if the other services are disabled at boot, the single instance will start correctly.

The next question is: what to do about it? I would prefer to give each service its own rundir, so if one exits, it can't affect the others.

That sounds like a very promising lead, nice detective work! I think we can test following as a fix:

/var/run/elasticsearch gets shipped via /usr/lib/tmpfiles.d/elasticsearch-oss.conf (shipped as part of the elastic deb). We can however create instance-specific runtime directories with systemd::tmpfile in the puppetisation of the elastic instances, e.g.

systemd::tmpfile { 'elastic-foo'}:
  content => 'd /var/run/elastic-foo 0755 elasticsearch elasticsearch - -'
}

And I think then we only need to modify the "Environment=PID_DIR" configuration in the existing (already puppetised) systemd units.

Thanks Moritz!

I started writing a playbook to make the changes as you described above. I don't really know puppet yet, but I believe @Gehel is going to help me translate it on Monday. If you would like to join us feel free to reach out, I'm generally around ~ 1400 - 2300 UTC.

Change 766876 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: prevent rundir from deletion

https://gerrit.wikimedia.org/r/766876

Mentioned in SAL (#wikimedia-operations) [2022-03-01T22:32:37Z] <inflatador> T276198 disabling puppet on elastic1052.eqiad.wmnet to test failure condition (rebooting shortly)

Mentioned in SAL (#wikimedia-operations) [2022-03-01T22:37:10Z] <inflatador> T276198 rebooting elastic1052.eqiad.wmnet to test failure condition

Mentioned in SAL (#wikimedia-operations) [2022-03-01T22:50:59Z] <inflatador> T276198 reenabled puppet on elastic1052.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-03-02T21:53:23Z] <ryankemper> T276198 Disabled puppet across all of elastic*, cloudelastic*, and relforge* to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/766876/ on a single elastic host

Change 766876 merged by Bking:

[operations/puppet@production] elastic: prevent rundir from deletion

https://gerrit.wikimedia.org/r/766876

Mentioned in SAL (#wikimedia-operations) [2022-03-02T22:16:30Z] <ryankemper> T276198 Testing https://gerrit.wikimedia.org/r/c/operations/puppet/+/766876/ on elastic1052; elasticsearch service fails to start. It's expecting to find /etc/tmpfiles.d/elasticsearch-production-search-psi-eqiad.conf but the actual filename is elasticsearch-production-search-psi-eqiad-conf.conf. Not sure why that trailing -conf is there in the filename. It doesn't look like something systemd::tmpfile is doing.

Mentioned in SAL (#wikimedia-operations) [2022-03-02T22:21:04Z] <ryankemper> T276198 Downtimed elastic1052 for 2 hours while troubleshooting

Mentioned in SAL (#wikimedia-operations) [2022-03-02T22:16:30Z] <ryankemper> T276198 Testing https://gerrit.wikimedia.org/r/c/operations/puppet/+/766876/ on elastic1052; elasticsearch service fails to start. It's expecting to find /etc/tmpfiles.d/elasticsearch-production-search-psi-eqiad.conf but the actual filename is elasticsearch-production-search-psi-eqiad-conf.conf. Not sure why that trailing -conf is there in the filename. It doesn't look like something systemd::tmpfile is doing.

It _is_ added by systemd::tmpfile in fact, see https://github.com/wikimedia/puppet/blob/production/modules/systemd/manifests/tmpfile.pp#L28 Sorry for missing that in the review.

So I think we simply need to drop the .conf suffix from https://github.com/wikimedia/puppet/blob/production/modules/elasticsearch/manifests/instance.pp#L306 to make it work.

Mentioned in SAL (#wikimedia-operations) [2022-03-02T22:16:30Z] <ryankemper> T276198 Testing https://gerrit.wikimedia.org/r/c/operations/puppet/+/766876/ on elastic1052; elasticsearch service fails to start. It's expecting to find /etc/tmpfiles.d/elasticsearch-production-search-psi-eqiad.conf but the actual filename is elasticsearch-production-search-psi-eqiad-conf.conf. Not sure why that trailing -conf is there in the filename. It doesn't look like something systemd::tmpfile is doing.

It _is_ added by systemd::tmpfile in fact, see https://github.com/wikimedia/puppet/blob/production/modules/systemd/manifests/tmpfile.pp#L28 Sorry for missing that in the review.

So I think we simply need to drop the .conf suffix from https://github.com/wikimedia/puppet/blob/production/modules/elasticsearch/manifests/instance.pp#L306 to make it work.

Thanks @MoritzMuehlenhoff. Indeed I realized the same after glancing at PCC, but you beat me to it :P Below is the (now redundant) comment I was working on :D


If we'd inspected the PCC output of the most recent patchset a bit more deeply, we'd probably have caught this:

Systemd::Tmpfile[elasticsearch-production-search-eqiad.conf]
File[/etc/tmpfiles.d/elasticsearch-production-search-psi-eqiad-conf.conf]

The combination of https://github.com/wikimedia/puppet/blob/18beae82799fbc3fb4e9c46c1271df62aff1569e/modules/systemd/manifests/tmpfile.pp#L27-L28 w/
https://gerrit.wikimedia.org/r/c/operations/puppet/+/766876/7/modules/elasticsearch/manifests/instance.pp is what leads to this behavior.

Patch coming up soon.

Change 767600 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: fix filename of tmpfile

https://gerrit.wikimedia.org/r/767600

Change 767600 merged by Ryan Kemper:

[operations/puppet@production] elastic: fix filename of tmpfile

https://gerrit.wikimedia.org/r/767600

Change 767603 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: disable readahead script needs new fp

https://gerrit.wikimedia.org/r/767603

Change 767603 merged by Ryan Kemper:

[operations/puppet@production] elastic: disable readahead script needs new fp

https://gerrit.wikimedia.org/r/767603

Mentioned in SAL (#wikimedia-operations) [2022-03-02T23:25:15Z] <ryankemper> T276198 Re-enabled puppet across fleet: ryankemper@cumin1001:~$ sudo -E cumin 'R:Elasticsearch::instance' 'enable-puppet "deploy fix from T276198"'

Change 767607 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] opensearch: prevent rundir from deletion

https://gerrit.wikimedia.org/r/767607

Change 767607 merged by Cwhite:

[operations/puppet@production] opensearch: use separate rundir per instance

https://gerrit.wikimedia.org/r/767607