Page MenuHomePhabricator

error starting webservice
Closed, ResolvedPublic

Description

I'm experimenting with hosting my .net tool on tool-tabs server with mono. I had to change some of the params in my Web.config file and to make mono re-read the config, I was executing webservice restart command. After doing so for several times I suddenly got this error message in my ~/error.log:

Traceback (most recent call last):
  File "/usr/bin/webservice-runner", line 27, in <module>
    webservice.run(port)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/services/lighttpdwebservice.py", line 108, in run
    with open(config_path, 'w') as f:
IOError: [Errno 13] Permission denied: '/var/run/lighttpd/fountain'
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 138, in apport_excepthook
    os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o640), 'wb') as f:
OSError: [Errno 2] No such file or directory: '/var/crash/_usr_bin_webservice-runner.53098.crash'

Original exception was:
Traceback (most recent call last):
  File "/usr/bin/webservice-runner", line 27, in <module>
    webservice.run(port)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/services/lighttpdwebservice.py", line 108, in run
    with open(config_path, 'w') as f:
IOError: [Errno 13] Permission denied: '/var/run/lighttpd/fountain'

After that happened, I could no longer start the web server - I was just repeatedly getting the same above error message on every webservice start. Then, after about 10 minutes the problem resolved itself - I didn't even change any of my files.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
valhallasw claimed this task.
valhallasw subscribed.

The job is currently running on tools-webgrid-lighttpd-1402; the accounting log shows the errors are probably from tools-webgrid-lighttpd-1412. There is a file /var/run/lighttpd/fountain there, but I can open it (as tools.fountain, mode a) without issues.

The apport_excepthook error is odd, because apport should be disabled. At the same time, that isn't really a problem as the program has already crashed at that point.

I'm not entirely sure about the origin of the issue; it might just be that mono takes a long time to shut down, keeping /var/run/lighttpd/fountain (the lighttpd config file) in use. A restart on the same host then fails.

I have removed the /var/run/lighttpd/fountain on -1412 for now; that should resolve the issue. Please re-open this ticket if you see the issue again.

This also just happened for the ptwikis tool:

Traceback (most recent call last):
  File "/usr/bin/webservice-runner", line 27, in <module>
    webservice.run(port)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/services/lighttpdwebservice.py", line 108, in run
    with open(config_path, 'w') as f:
IOError: [Errno 13] Permission denied: '/var/run/lighttpd/ptwikis'
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 128, in apport_excepthook
    os.O_WRONLY|os.O_CREAT|os.O_EXCL, 0o640), 'w')
OSError: [Errno 2] No such file or directory: '/var/crash/_usr_bin_webservice-runner.51206.crash'

Original exception was:
Traceback (most recent call last):
  File "/usr/bin/webservice-runner", line 27, in <module>
    webservice.run(port)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/services/lighttpdwebservice.py", line 108, in run
    with open(config_path, 'w') as f:
IOError: [Errno 13] Permission denied: '/var/run/lighttpd/ptwikis'
2016-08-17 21:02:48: (server.c.1512) server stopped by UID = 0 PID = 13082

in this case on tools-webgrid-lighttpd-1207.eqiad.wmflabs...

`
valhallasw@tools-webgrid-lighttpd-1207:/var/run$ ls -l lighttpd
ls: cannot open directory lighttpd: Permission denied
valhallasw@tools-webgrid-lighttpd-1207:/var/run$ ls -ld lighttpd
drwxr-x--- 2 www-data www-data 40 Aug 17 08:45 lighttpd

wat.

valhallasw@tools-webgrid-lighttpd-1208:/var/run$ ls -ld lighttpd
drwxrwxrwt 2 www-data www-data 5140 Aug 17 19:32 lighttpd

tools-webgrid-lighttpd-1207 is also the host where puppet isn't running (T143191: Puppet not running on tools-webgrid-lighttpd-1207), which means that puppet doesn't fix this. Still, how did the permissions get changed back?

Forcing a puppet run indeed displays:

Notice: /Stage[main]/Toollabs::Node::Web::Lighttpd/File[/var/run/lighttpd]/mode: mode changed '0750' to '1777

maybe /var/run/lighttpd is not owned by a debian package, so that shouldn't change the mode:

valhallasw@tools-webgrid-lighttpd-1207:/var/run$ dpkg -S /var/run/lighttpd
dpkg-query: no path found matching pattern /var/run/lighttpd.

There is no 'mode changed' to be found in the puppet log.

I can't observe this on any other host:

valhallasw@tools-bastion-03:~/accountingtools$ for i in tools-webgrid-lighttpd-12{01..15}; do echo $i `ssh $i ls -ld /var/run/lighttpd`; done
tools-webgrid-lighttpd-1201 drwxrwxrwt 2 www-data www-data 5740 Aug 17 18:46 /var/run/lighttpd
tools-webgrid-lighttpd-1202 drwxrwxrwt 2 www-data www-data 40 Aug 16 16:02 /var/run/lighttpd
tools-webgrid-lighttpd-1203 drwxrwxrwt 2 www-data www-data 6580 Aug 17 21:16 /var/run/lighttpd
tools-webgrid-lighttpd-1204 drwxrwxrwt 2 www-data www-data 5100 Aug 17 16:00 /var/run/lighttpd
tools-webgrid-lighttpd-1205 drwxrwxrwt 2 www-data www-data 4300 Aug 17 21:02 /var/run/lighttpd
tools-webgrid-lighttpd-1206 drwxrwxrwt 2 www-data www-data 5840 Aug 17 21:16 /var/run/lighttpd
tools-webgrid-lighttpd-1207 drwxrwxrwt 2 www-data www-data 40 Aug 17 08:45 /var/run/lighttpd
tools-webgrid-lighttpd-1208 drwxrwxrwt 2 www-data www-data 5140 Aug 17 19:32 /var/run/lighttpd
tools-webgrid-lighttpd-1209 drwxrwxrwt 2 www-data www-data 5980 Aug 17 18:31 /var/run/lighttpd
tools-webgrid-lighttpd-1210 drwxrwxrwt 2 www-data www-data 5220 Aug 17 20:01 /var/run/lighttpd
`
valhallasw@tools-bastion-03:~/accountingtools$ for i in tools-webgrid-lighttpd-14{01..15}; do echo $i `ssh $i ls -ld /var/run/lighttpd`; done
tools-webgrid-lighttpd-1401 drwxrwxrwt 2 www-data www-data 6740 Aug 17 17:10 /var/run/lighttpd
tools-webgrid-lighttpd-1402 drwxrwxrwt 2 www-data www-data 1140 Aug 17 21:17 /var/run/lighttpd
tools-webgrid-lighttpd-1403 drwxrwxrwt 2 www-data www-data 5560 Aug 17 17:08 /var/run/lighttpd
tools-webgrid-lighttpd-1404 drwxrwxrwt 2 www-data www-data 5420 Aug 12 20:40 /var/run/lighttpd
tools-webgrid-lighttpd-1405 drwxrwxrwt 2 www-data www-data 5980 Aug 17 17:10 /var/run/lighttpd
tools-webgrid-lighttpd-1406 drwxrwxrwt 2 www-data www-data 6660 Aug 12 20:41 /var/run/lighttpd
tools-webgrid-lighttpd-1407 drwxrwxrwt 2 www-data www-data 4720 Aug 17 21:28 /var/run/lighttpd
tools-webgrid-lighttpd-1408 drwxrwxrwt 2 www-data www-data 2080 Aug 17 17:10 /var/run/lighttpd
tools-webgrid-lighttpd-1409 drwxrwxrwt 2 www-data www-data 5620 Aug 17 17:09 /var/run/lighttpd
tools-webgrid-lighttpd-1410 drwxrwxrwt 2 www-data www-data 5200 Aug 17 17:18 /var/run/lighttpd
tools-webgrid-lighttpd-1411 drwxrwxrwt 2 www-data www-data 11080 Aug 15 19:11 /var/run/lighttpd
tools-webgrid-lighttpd-1412 drwxrwxrwt 2 www-data www-data 3860 Aug 17 21:38 /var/run/lighttpd
tools-webgrid-lighttpd-1413 drwxrwxrwt 2 www-data www-data 10780 Aug 11 23:44 /var/run/lighttpd
tools-webgrid-lighttpd-1414 drwxrwxrwt 2 www-data www-data 12200 Aug 17 21:12 /var/run/lighttpd

Looking at puppet on other hosts shows that puppet has reset the mode on tools-webgrid-lighttpd-1202 as well. This was in at (11 aug 2016 @ 8:47pm (UTC)), while everything was still OK at (09 aug 2016 @ 5:34am (UTC)). However, puppet was apparently not running for a few days, so that doesn't really help.

Nothing shows up in other puppet logs, likely because they have been rotated away already. Bah.

Several -14XX hosts also have entries.

  • -1401: changed at 14 aug 2016 @ 6:40am (UTC), last OK was 14 aug 2016 @ 6:10am (UTC)
  • -1402: changed 14 aug 2016 @ 7:20am (UTC), last OK was 14 aug 2016 @ 6:50am (UTC)
  • -1403: changed 14 aug 2016 @ 7:00am (UTC), last OK was 14 aug 2016 @ 6:30am (UTC)
  • all other hosts except for 06 and 11 also have a mode change entry.

Although the times differ, the change is consistently during the first puppet run in the /var/log/puppet.log.3.gz file. Is logrotate somehow messing with the permissions?!

valhallasw triaged this task as Lowest priority.Aug 18 2016, 9:51 AM

Given that the webgrid will be moved to k8s soon-ish, and this is only an issue for short periods of time (approx. 15 minutes once a week or so), I think it's not worth the effort to dive into this much further. I'll keep the issue open, but marked as 'lowest' priority.

This is probably caused by /usr/lib/tmpfiles.d/lighttpd.tmpfile.conf on those hosts being:

d /var/run/lighttpd 0750 www-data www-data -

tmpfiles is a concept for systemd; cf. https://www.freedesktop.org/software/systemd/man/tmpfiles.d.html. On Precise systems, no systemd is running, but on Trusty ones. To resolve, probably something like:

if (os_version('ubuntu trusty')) {
    file { '/usr/lib/tmpfiles.d/lighttpd.tmpfile.conf':
    content => "d /var/run/lighttpd 1777 www-data www-data -\n",
    }
}

would need to be added to role::toollabs::node::web::lighttpd.

If that is the case it seems this should affect jessie as readily as trusty?

There are no webgrid hosts running Jessie, so this issue can only occur on Trusty instances.

Sure, I was thinking of webservices running in k8s land. We had some issues a week ago that would fit the failure model that included containerized services.

Sorry, I don't know enough (yet) about Kubernetes to answer that.

Change 323869 had a related patch set uploaded (by Andrew Bogott):
lighttpd nodes: customize lighttpd.tmpfile.conf

https://gerrit.wikimedia.org/r/323869

[…]
Although the times differ, the change is consistently during the first puppet run in the /var/log/puppet.log.3.gz file. Is logrotate somehow messing with the permissions?!

It turns out: Indirectly. /etc/logrotate.d/lighttpd has:

postrotate
     if [ -x /usr/sbin/invoke-rc.d ]; then \
        invoke-rc.d lighttpd reopen-logs > /dev/null 2>&1; \
     else \
        /etc/init.d/lighttpd reopen-logs > /dev/null 2>&1; \
     fi; \
endscript

/etc/init.d/lighttpd runs:

if [ "$1" != status ]; then
        # be sure there is a /var/run/lighttpd, even with tmpfs
        # The directory is defined as volatile and may thus be non-existing
        # after a boot (DPM §9.3.2)
        if ! dpkg-statoverride --list /var/run/lighttpd >/dev/null 2>&1; then
                install -d -o www-data -g www-data -m 0750 "/var/run/lighttpd"
        fi
fi

So in theory every week the directory's mode would be set to 0750. (I don't understand why it does not show up then weekly in puppet.log.)

dpkg-statoverride has its database at /var/lib/dpkg/statoverride:

puppet puppet 750 /var/log/puppet
puppet puppet 750 /var/lib/puppet
root ssl-cert 710 /etc/ssl/private
_lldpd adm 4750 /usr/sbin/lldpcli
root Debian-exim 640 /etc/exim4/passwd.client
root crontab 2755 /usr/bin/crontab

So the configuration for /var/run/lighttpd needs to be added to that (probably with an exec { command => '/usr/bin/dpkg-override …', unless => '/usr/bin/dpkg-statoverride --list …', } if there is no Puppet type for that).

Change 323869 abandoned by Andrew Bogott:
lighttpd nodes: customize lighttpd.tmpfile.conf

https://gerrit.wikimedia.org/r/323869

scfc reassigned this task from valhallasw to Andrew.

Sure, I was thinking of webservices running in k8s land. We had some issues a week ago that would fit the failure model that included containerized services.

I looked at a pod (?) of a k8s webservice with kubectl exec typoscan-264490057-a51ge, and AFAICS there is no systemd (or anything else) running in that container, so it would be unlikely in my opinion that something could change the permissions for /var/run/lighttpd after setup. But this should be verifiable in case of an error with kubectl exec typoscan-264490057-a51ge -- ls -dl /var/run/lighttpd.

I believe the issue here has been fixed by f0a3d985b86a67cfa262e4b834cec0d4189b60ec (the "Bug:" footer was set to TT142932 so @gerritbot did not report it here). If someone sees Puppet resetting the permissions of /var/run/lighttpd, please reopen.