Page MenuHomePhabricator

hhvm root:adm owned log files cause failures for logrotate
Closed, ResolvedPublic

Description

Logrotate on mw servers sometimes fails with errors like:

/etc/cron.daily/logrotate:
error: error opening /var/log/hhvm/error.log-20160624: Permission denied
run-parts: /etc/cron.daily/logrotate exited with return code 1

After a bit of digging with Joe:

  1. logfiles were owned by root:adm rather than www-data:www-data
  2. hhvm's logrotate settings contain rotate 90 (files older than 90 days are deleted)
  3. The root:adm logfiles' stat metadata suggest that they were created months ago during a known maintenance or that they are related to new reimaged hosts (the ones being re-installed with Debian Jessie).

I chowned the current wrong files across all the app/api servers but we'd need to put in place a permanent solution for new installs, like applying the HHVM Rsyslog config before installing the package.

Event Timeline

elukey lowered the priority of this task from Medium to Low.Sep 23 2016, 10:14 AM
elukey added a project: User-Elukey.

Another occurrence happened today for all the codfw hosts, seems related to the same file, but only aggregating by cluster. Examples for future investigation:

  • API
mw2151.codfw.wmnet:
    -rw-r----- 1 root adm 26068 Oct  5 00:04 /var/log/hhvm/error.log-20161004
  • Jobrunner /var/log/hhvm/error.log-20161108
mw2085.codfw.wmnet:
    -rw-r----- 1 root adm 7537474 Nov  8 11:38 /var/log/hhvm/error.log-20161108
  • Videoscaler - nothing
  • Imagescaler
mw2151.codfw.wmnet:
    -rw-r----- 1 root adm 26068 Oct  5 00:04 /var/log/hhvm/error.log-20161004
  • Appserver
mw2197.codfw.wmnet:
    -rw-r----- 1 root adm 25039 Oct  5 00:04 /var/log/hhvm/error.log-20161004

Didn't find anything in the SAL and the puppet logs are gone, we might need to re-check periodically (before getting the cronspam) to catch one occurrence and get more logs.

From /etc/rsyslog.conf:

#
# Set the default permissions for all log files.
#
$FileOwner root
$FileGroup adm
$FileCreateMode 0640
$DirCreateMode 0755
$Umask 0022

Meanwhile /etc/logrotate.d/hhvm:

# logrotate(8) config for HHVM
# This file is managed by Puppet

# HHVM's error log, managed by rsyslog.
/var/log/hhvm/error.log {
    su www-data www-data
    daily
    dateext
    dateyesterday
    rotate 90
    missingok
    create 0640 www-data www-data
    delaycompress
    sharedscripts
    postrotate
        reload rsyslog >/dev/null 2>&1 || true
    endscript
}

# Delete local stack trace logs after one week.
/var/log/hhvm/stacktrace.*.log.* {
    su www-data www-data
    rotate 0
    maxage 7
    missingok
}

Maybe a race condition happening once in a while between rsyslog and logrotate?

Occurred also on eqiad hosts. Examples:

  • jobrunner
mw1166.eqiad.wmnet:
    -rw-r----- 1 root adm 4.1M Oct 16 06:15 /var/log/hhvm/error.log-20161015
  • api appserver
mw1203.eqiad.wmnet:
    -rw-r----- 1 root adm 13M Oct  6 06:17 /var/log/hhvm/error.log-20161005
  • appserver
mw1186.eqiad.wmnet:
    -rw-r----- 1 root adm 777K Oct  7 05:44 /var/log/hhvm/error.log-20161006

The same day log observation was not true for these occurrences.

This is happening on tin too:

-rw-r----- 1 root adm 254 Nov 27 21:39 /var/log/hhvm/error.log-20161127

I've looked for other instances of the problem across the cluster and could only find the following:

tin.eqiad.wmnet: -rw-r----- 1 root     adm       254 Nov 27 21:39 /var/log/hhvm/error.log-20161127
mw2092.codfw.wmnet: -rw-r----- 1 root     adm       329220 Dec  2 00:41 /var/log/hhvm/error.log-20161201

Nothing in 2017.

I worked today on mw2092, it was put back in service around Dec 2 (https://phabricator.wikimedia.org/T151427) but it wasn't set up correctly. I think it was reimaged after a disk failure.

Mentioned in SAL (#wikimedia-operations) [2017-06-11T14:14:54Z] <elukey> executed cumin 'mw22[51-60].codfw.wmnet' 'find /var/log/hhvm/* -user root -exec chown www-data:www-data {} \;' to reduce cron-spam (new hosts added in March) - T146464

On mw2251 today:

root@mw2251:/var/log/hhvm# ls -lht
total 24M
-rw-r----- 1 www-data www-data    0 Jun 10 06:25 error.log
-rw-r----- 1 www-data www-data    0 Jun  9 06:25 error.log-20170609
[..]
-rw-r----- 1 root     adm       15K Mar 10 16:27 error.log-20170311

From the SAL mw22[51-60] were put in production on March 10th.

A possibile fix for this issue might to add FileCreateMode="0640" FileOwner="www-data" FileGroup="www-data" to the HHVM rsyslog config.

I suspect that until the first logrotate runs (triggering create 0640 www-data www-data) the perms set are the default ones by rsyslog, namely root:adm.

Change 358318 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hhvm: force rsyslog config to create log files with www-data perms

https://gerrit.wikimedia.org/r/358318

I was able to repro the issue on mw2251 simply doing the following:

rm /var/log/hhvm/error.log
systemctl stop nutcracker (this causes errors for HHVM)

With https://gerrit.wikimedia.org/r/358318 I am not able to repro anymore.

Change 358318 merged by Elukey:
[operations/puppet@production] hhvm: force rsyslog config to create log files with www-data perms

https://gerrit.wikimedia.org/r/358318

Change 358959 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] hhvm: do not stop rsyslog rules processing too early

https://gerrit.wikimedia.org/r/358959

Change 358959 merged by Filippo Giunchedi:
[operations/puppet@production] hhvm: do not stop rsyslog rules processing too early

https://gerrit.wikimedia.org/r/358959