
/var/run/elasticsearch deleted by elasticsearch
Open, High, Public, 3 Estimated Story Points

Description

I've been looking with @dcausse into a bug affecting elastic2045, where Elasticsearch failed to start because /var/run/elasticsearch was not present.

The elasticsearch-oss Debian package declares /var/run/elasticsearch in the package (via .dirs) and also ships a tmpfiles configuration at /usr/lib/tmpfiles.d/elasticsearch.conf.

According to "systemd-analyze critical-chain", systemd-tmpfiles-setup.service started correctly, and the journal shows no errors for it.

Running "systemd-tmpfiles --create" manually recreated /var/run/elasticsearch as expected.

The next time this error happens, we should check whether the lack of the run dir is specific to elasticsearch or whether other directories declared in /usr/lib/tmpfiles.d are also affected.
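A minimal Python sketch of that check, assuming it is enough to look at `d`/`D` lines in /usr/lib/tmpfiles.d: it lists each declared directory that does not currently exist, together with the file that declares it. It deliberately ignores overrides in /etc/tmpfiles.d and /run/tmpfiles.d, skips paths containing specifiers such as %t, and does not handle quoted fields. The helper names are hypothetical, and systemd-tmpfiles itself remains the reference for how these files are interpreted.

```python
import glob
import os

# tmpfiles.d(5) line types that declare a directory to create:
# "d" creates it; "D" does the same, and its contents are also removed by --remove.
DIR_TYPES = {"d", "D"}


def declared_dirs(conf_text):
    """Return the paths declared by d/D lines in one tmpfiles.d file.

    Simplified: fields are split on whitespace (no quote handling) and
    modifiers such as "!" after the type letter are ignored.
    """
    dirs = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0][0] in DIR_TYPES:
            dirs.append(fields[1])
    return dirs


def missing_dirs(conf_dir="/usr/lib/tmpfiles.d"):
    """Map each declared-but-missing directory to the .conf file declaring it."""
    missing = {}
    for conf in sorted(glob.glob(os.path.join(conf_dir, "*.conf"))):
        with open(conf, encoding="utf-8", errors="replace") as fh:
            for path in declared_dirs(fh.read()):
                if "%" in path:
                    continue  # needs specifier expansion; not handled here
                if not os.path.isdir(path):
                    missing[path] = conf
    return missing


if __name__ == "__main__":
    for path, conf in sorted(missing_dirs().items()):
        print(f"missing: {path} (declared in {conf})")
```

If the output lists only /var/run/elasticsearch, the problem is specific to elasticsearch; if other run directories show up as well, it points at tmpfiles handling in general.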

Event Timeline

We tracked it down to an Elasticsearch restart removing /var/run/elasticsearch. This is the log from auditd (installed in deployment-prep, which shows the same error):

type=PROCTITLE msg=audit(03/02/2021 10:54:03.250:74) : proctitle=/usr/bin/java -Xms2G -Xmx2G -XX:+UseNUMA -XX:+DisableExplicitGC -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfi
type=PATH msg=audit(03/02/2021 10:54:03.250:74) : item=1 name=/var/run/elasticsearch/beta-search.pid inode=4956182 dev=00:14 mode=file,644 ouid=elasticsearch ogid=elasticsearch rdev=00:00 nametype=CREATE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=PATH msg=audit(03/02/2021 10:54:03.250:74) : item=0 name=/var/run/elasticsearch/ inode=4944609 dev=00:14 mode=dir,755 ouid=elasticsearch ogid=elasticsearch rdev=00:00 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=CWD msg=audit(03/02/2021 10:54:03.250:74) : cwd=/usr/share/elasticsearch
type=SYSCALL msg=audit(03/02/2021 10:54:03.250:74) : arch=x86_64 syscall=open success=yes exit=71 a0=0x7f908854c930 a1=O_WRONLY|O_CREAT|O_TRUNC a2=0666 a3=0x7f90790183fb items=2 ppid=1 pid=17398 auid=unset uid=elasticsearch gid=elasticsearch euid=elasticsearch suid=elasticsearch fsuid=elasticsearch egid=elasticsearch sgid=elasticsearch fsgid=elasticsearch tty=(none) ses=unset comm=java exe=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java subj==unconfined key=(null)
----
type=PROCTITLE msg=audit(03/02/2021 10:54:06.922:76) : proctitle=/usr/bin/java -Xms2G -Xmx2G -XX:+UseNUMA -XX:+DisableExplicitGC -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfi
type=PATH msg=audit(03/02/2021 10:54:06.922:76) : item=1 name=/var/run/elasticsearch/beta-search.pid inode=4956182 dev=00:14 mode=file,644 ouid=elasticsearch ogid=elasticsearch rdev=00:00 nametype=DELETE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=PATH msg=audit(03/02/2021 10:54:06.922:76) : item=0 name=/var/run/elasticsearch/ inode=4944609 dev=00:14 mode=dir,755 ouid=elasticsearch ogid=elasticsearch rdev=00:00 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=CWD msg=audit(03/02/2021 10:54:06.922:76) : cwd=/usr/share/elasticsearch
type=SYSCALL msg=audit(03/02/2021 10:54:06.922:76) : arch=x86_64 syscall=unlink success=yes exit=0 a0=0x7f902c0008c0 a1=0x7f9069997800 a2=0x7f902c0008c0 a3=0x7f90795dc901 items=2 ppid=1 pid=17398 auid=unset uid=elasticsearch gid=elasticsearch euid=elasticsearch suid=elasticsearch fsuid=elasticsearch egid=elasticsearch sgid=elasticsearch fsgid=elasticsearch tty=(none) ses=unset comm=Thread-2 exe=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java subj==unconfined key=(null)
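To pick these events out of a longer audit log, the records can be grouped by event (the timestamp and serial inside msg=audit(...)) and filtered for PATH items under /var/run/elasticsearch. The Python sketch below assumes the interpreted, one-record-per-line format shown above, with space-free values in PATH and SYSCALL records; `run_dir_events` is a hypothetical helper, not part of the audit tooling.

```python
import re

# One interpreted audit record per line, e.g.
# "type=PATH msg=audit(03/02/2021 10:54:03.250:74) : item=1 name=... nametype=CREATE"
RECORD_RE = re.compile(r"^type=(\w+) msg=audit\(([^)]*):(\d+)\) : (.*)$")


def parse_fields(body):
    """Split 'key=value' tokens; fine for PATH/SYSCALL, not for proctitle."""
    fields = {}
    for token in body.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def run_dir_events(lines, prefix="/var/run/elasticsearch/"):
    """Return (timestamp, syscall, comm, pid, paths) for each event whose
    PATH records touch something under `prefix`; paths is a list of
    (name, nametype) pairs in log order."""
    events = {}  # (timestamp, serial) -> {"paths": [...], "syscall": {...}}
    for line in lines:
        m = RECORD_RE.match(line.strip())
        if not m:
            continue  # "----" separators and unrelated lines
        rtype, ts, serial, body = m.groups()
        ev = events.setdefault((ts, serial), {"paths": [], "syscall": None})
        if rtype == "PATH":
            f = parse_fields(body)
            ev["paths"].append((f.get("name"), f.get("nametype")))
        elif rtype == "SYSCALL":
            ev["syscall"] = parse_fields(body)
    results = []
    for (ts, _serial), ev in events.items():
        hits = [(n, t) for n, t in ev["paths"] if n and n.startswith(prefix)]
        sc = ev["syscall"]
        if hits and sc:
            results.append((ts, sc.get("syscall"), sc.get("comm"), sc.get("pid"), hits))
    return results
```

Fed the excerpt above, it returns two events for pid 17398: an open() by comm=java with nametype=CREATE on beta-search.pid, and an unlink() by comm=Thread-2 with nametype=DELETE on the same file.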
dcausse renamed this task from "/var/run/elasticsearch not created after reboot" to "/var/run/elasticsearch deleted by elasticsearch". (Mar 2 2021, 3:38 PM)
dcausse triaged this task as High priority. (Mar 3 2021, 12:13 PM)

Triaging as High, since a missing /var/run/elasticsearch prevents Elasticsearch from starting.
The cause seems to be in Elasticsearch itself, but I could not spot the exact problem in the Elasticsearch code. We might want to work around the issue by always running "systemd-tmpfiles --create" from the elasticsearch systemd unit to make sure the directory exists when it's needed.

We can add an ExecStartPre to the systemd unit that runs "systemd-tmpfiles --create /usr/lib/tmpfiles.d/elasticsearch.conf". (We can't run a plain "systemd-tmpfiles --create", since most of the other config files need root, and that would cause a lot of log spam.)
The elasticsearch class already uses systemd::unit, so this can simply be passed via the override option.
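The drop-in could look roughly like the text rendered by this Python sketch. Assumptions: systemd-tmpfiles lives at /bin/systemd-tmpfiles (its usual Debian location), the unit runs as User=elasticsearch (the audit log shows uid=elasticsearch), and systemd is 231 or newer, so the "+" prefix is available to run just this command with full privileges. Without it, the command would run as the elasticsearch user, which cannot create directories directly under /run. The unit name and drop-in file name are hypothetical placeholders.

```python
from pathlib import Path

# Path from the task description.
TMPFILES_CONF = "/usr/lib/tmpfiles.d/elasticsearch.conf"


def render_dropin(conf=TMPFILES_CONF, tmpfiles_bin="/bin/systemd-tmpfiles"):
    """Return drop-in text that recreates the run dir before each start.

    The "+" prefix runs only this command with full privileges, even when
    the unit sets User=elasticsearch; passing a single config file makes
    systemd-tmpfiles apply only that file's entries.
    """
    return "[Service]\n" f"ExecStartPre=+{tmpfiles_bin} --create {conf}\n"


def write_dropin(unit="elasticsearch.service", root="/etc/systemd/system"):
    """Write the drop-in under <root>/<unit>.d/ (unit name is a placeholder)."""
    path = Path(root) / f"{unit}.d" / "tmpfiles.conf"  # hypothetical file name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_dropin(), encoding="utf-8")
    return path
```

A hand-written drop-in only takes effect after `systemctl daemon-reload`. In puppet, the same [Service] snippet is what would be passed through systemd::unit with the override option.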