Page MenuHomePhabricator

hhvm root:adm owned log files cause failures for logrotate
Closed, ResolvedPublic

Description

Logrotate on mw servers sometimes fails with errors like:

/etc/cron.daily/logrotate:
error: error opening /var/log/hhvm/error.log-20160624: Permission denied
run-parts: /etc/cron.daily/logrotate exited with return code 1

After a bit of digging with Joe:

  1. logfiles were owned by root:adm rather than www-data:www-data
  2. hhvm's logrotate settings contain rotate 90 (files older than 90 days are deleted)
  3. The root:adm logfiles' stat metadata suggest that they were created months ago during a known maintenance or that they are related to new reimaged hosts (the ones being re-installed with Debian Jessie).

I chowned the current wrong files across all the app/api servers but we'd need to put in place a permanent solution for new installs, like applying the HHVM Rsyslog config before installing the package.

Details

Related Gerrit Patches:

Event Timeline

elukey created this task.Sep 23 2016, 10:14 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 23 2016, 10:14 AM
elukey lowered the priority of this task from Medium to Low.Sep 23 2016, 10:14 AM
elukey removed elukey as the assignee of this task.Dec 14 2016, 11:06 AM
elukey added a project: User-Elukey.
elukey moved this task from Backlog to Ops Backlog on the User-Elukey board.Dec 14 2016, 11:08 AM
elukey added a comment.EditedJan 3 2017, 7:42 AM

Another occurrence happened today for all the codfw hosts, seems related to the same file, but only aggregating by cluster. Examples for future investigation:

  • API
mw2151.codfw.wmnet:
    -rw-r----- 1 root adm 26068 Oct  5 00:04 /var/log/hhvm/error.log-20161004
  • Jobrunner /var/log/hhvm/error.log-20161108
mw2085.codfw.wmnet:
    -rw-r----- 1 root adm 7537474 Nov  8 11:38 /var/log/hhvm/error.log-20161108
  • Videoscaler - nothing
  • Imagescaler
mw2151.codfw.wmnet:
    -rw-r----- 1 root adm 26068 Oct  5 00:04 /var/log/hhvm/error.log-20161004
  • Appserver
mw2197.codfw.wmnet:
    -rw-r----- 1 root adm 25039 Oct  5 00:04 /var/log/hhvm/error.log-20161004
elukey added a comment.Jan 3 2017, 8:39 AM

Didn't find anything in the SAL and the puppet logs are gone, we might need to re-check periodically (before getting the cronspam) to catch one occurrence and get more logs.

elukey added a comment.Jan 5 2017, 7:53 AM

From /etc/rsyslog.conf:

#
# Set the default permissions for all log files.
#
$FileOwner root
$FileGroup adm
$FileCreateMode 0640
$DirCreateMode 0755
$Umask 0022

Meanwhile /etc/logrotate.d/hhvm:

# logrotate(8) config for HHVM
# This file is managed by Puppet

# HHVM's error log, managed by rsyslog.
/var/log/hhvm/error.log {
    su www-data www-data
    daily
    dateext
    dateyesterday
    rotate 90
    missingok
    create 0640 www-data www-data
    delaycompress
    sharedscripts
    postrotate
        reload rsyslog >/dev/null 2>&1 || true
    endscript
}

# Delete local stack trace logs after one week.
/var/log/hhvm/stacktrace.*.log.* {
    su www-data www-data
    rotate 0
    maxage 7
    missingok
}

Maybe a race condition happening once in a while between rsyslog and logrotate?

elukey added a comment.Jan 5 2017, 8:04 AM

Occurred also on eqiad hosts. Examples:

  • jobrunner
mw1166.eqiad.wmnet:
    -rw-r----- 1 root adm 4.1M Oct 16 06:15 /var/log/hhvm/error.log-20161015
  • api appserver
mw1203.eqiad.wmnet:
    -rw-r----- 1 root adm 13M Oct  6 06:17 /var/log/hhvm/error.log-20161005
  • appserver
mw1186.eqiad.wmnet:
    -rw-r----- 1 root adm 777K Oct  7 05:44 /var/log/hhvm/error.log-20161006

The same day log observation was not true for these occurrences.

ema added a subscriber: ema.Feb 27 2017, 1:54 PM

This is happening on tin too:

-rw-r----- 1 root adm 254 Nov 27 21:39 /var/log/hhvm/error.log-20161127
ema added a comment.EditedFeb 27 2017, 2:08 PM

I've looked for other instances of the problem across the cluster and could only find the following:

tin.eqiad.wmnet: -rw-r----- 1 root     adm       254 Nov 27 21:39 /var/log/hhvm/error.log-20161127
mw2092.codfw.wmnet: -rw-r----- 1 root     adm       329220 Dec  2 00:41 /var/log/hhvm/error.log-20161201

Nothing in 2017.

I worked today on mw2092, it was put back in service around Dec 2 (https://phabricator.wikimedia.org/T151427) but it wasn't set up correctly. I think it was reimaged after a disk failure.

Mentioned in SAL (#wikimedia-operations) [2017-06-11T14:14:54Z] <elukey> executed cumin 'mw22[51-60].codfw.wmnet' 'find /var/log/hhvm/* -user root -exec chown www-data:www-data {} \;' to reduce cron-spam (new hosts added in March) - T146464

On mw2251 today:

root@mw2251:/var/log/hhvm# ls -lht
total 24M
-rw-r----- 1 www-data www-data    0 Jun 10 06:25 error.log
-rw-r----- 1 www-data www-data    0 Jun  9 06:25 error.log-20170609
[..]
-rw-r----- 1 root     adm       15K Mar 10 16:27 error.log-20170311

From the SAL mw22[51-60] were put in production on March 10th.

A possibile fix for this issue might to add FileCreateMode="0640" FileOwner="www-data" FileGroup="www-data" to the HHVM rsyslog config.

I suspect that until the first logrotate runs (triggering create 0640 www-data www-data) the perms set are the default ones by rsyslog, namely root:adm.

Change 358318 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hhvm: force rsyslog config to create log files with www-data perms

https://gerrit.wikimedia.org/r/358318

I was able to repro the issue on mw2251 simply doing the following:

rm /var/log/hhvm/error.log
systemctl stop nutcracker (this causes errors for HHVM)

With https://gerrit.wikimedia.org/r/358318 I am not able to repro anymore.

elukey moved this task from Ops Backlog to In Progress on the User-Elukey board.Jun 12 2017, 12:13 PM

Change 358318 merged by Elukey:
[operations/puppet@production] hhvm: force rsyslog config to create log files with www-data perms

https://gerrit.wikimedia.org/r/358318

elukey moved this task from In Progress to Done on the User-Elukey board.Jun 12 2017, 3:42 PM
elukey closed this task as Resolved.Jun 12 2017, 4:01 PM

Change 358959 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] hhvm: do not stop rsyslog rules processing too early

https://gerrit.wikimedia.org/r/358959

Change 358959 merged by Filippo Giunchedi:
[operations/puppet@production] hhvm: do not stop rsyslog rules processing too early

https://gerrit.wikimedia.org/r/358959