Page MenuHomePhabricator

bacula-director not running on helium
Closed, ResolvedPublic

Description

bacula-director on helium showed as failed on Icinga (https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=helium&service=bacula+director+process).

I had made a config change (https://gerrit.wikimedia.org/r/#/c/344561/ , https://gerrit.wikimedia.org/r/#/c/344560/) but that itself seems fine, just that it triggered a service restart which then failed.

At first i saw this in syslog:

2824 Mar 23 23:46:20 helium puppet-agent[18271]: (/Stage[main]/Bacula::Director/Service[bacula-director]) Triggered 'refresh' from 1 events
2825 Mar 23 23:46:20 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:0200100D:system library:fopen:Permission denied
2826 Mar 23 23:46:20 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:20074002:BIO routines:FILE_CTRL:system lib
2827 Mar 23 23:46:20 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:140B0002:SSL routines:SSL_CTX_use_PrivateKey_file:system lib

Then i saw the same thing had happened before but many hours earlier, so separate from when i made any change:

 96 Mar 23 09:23:19 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:0200100D:system library:fopen:Permission denied
 97 Mar 23 09:23:19 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:20074002:BIO routines:FILE_CTRL:system lib
 98 Mar 23 09:23:19 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:140B0002:SSL routines:SSL_CTX_use_PrivateKey_file:system lib
 99 Mar 23 09:23:24 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:0200100D:system library:fopen:Permission denied
100 Mar 23 09:23:24 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:20074002:BIO routines:FILE_CTRL:system lib
101 Mar 23 09:23:24 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:140B0002:SSL routines:SSL_CTX_use_PrivateKey_file:system lib

So i looked why it can't load the private key and i found the situation like this:

root@helium:/var/lib/puppet/ssl/private_keys# ls -hals
total 24K
4.0K drwxr-x--- 2 puppet puppet 4.0K Mar 17 14:06 .
4.0K drwxrwx--x 7 puppet puppet 4.0K Mar 17 13:56 ..
4.0K -r-------- 1 bacula bacula 3.2K Mar 17 14:02 bacula-helium.eqiad.wmnet.pem
8.0K -rw-r--r-- 1 root   root   5.1K Mar 17 14:06 bacula-keypair-helium.eqiad.wmnet.pem
4.0K -rw-r----- 1 puppet puppet 3.2K Mar 17 13:56 helium.eqiad.wmnet.pem

Then i found in puppet code:

47     # Puppet manages the permissions of its private key file and they are too
48     # restrictive to allow any other user/group to read it. Copy it, keep it in
49     # sync and set the require permissions for bacula-dir to be able to read it
50     exec { 'bacula_cp_private_key':
51         command => "/bin/cp /var/lib/puppet/ssl/private_keys/${::fqdn}.pem \
52  /var/lib/puppet/ssl/private_keys/bacula-${::fqdn}.pem",
53         unless  => "/usr/bin/cmp /var/lib/puppet/ssl/private_keys/${::fqdn}.pem \
54  /var/lib/puppet/ssl/private_keys/bacula-${::fqdn}.pem",
55     }
56 
57     file { "/var/lib/puppet/ssl/private_keys/bacula-${::fqdn}.pem":
58         ensure  => present,
59         owner   => 'bacula',
60         group   => 'bacula',
61         mode    => '0400',
62         require => Exec['bacula_cp_private_key'],
63         notify  => Service['bacula-director'],
64     }

So i thought "aha, that is it" and fixed the permissions to reflect that and started it again.

chown bacula:bacula bacula-keypair-helium.eqiad.wmnet.pem
chown bacula:bacula helium.eqiad.wmnet.pem
chmod 400 *.pem

But the service STILL would not start. Now i found this in bacula.log

root@helium:/var/lib/puppet/ssl/private_keys# tail -f /var/log/bacula/bacula.log 
17-Mar 14:03 bacula-dir JobId 0: Fatal error: Could not open Catalog "MyCatalog", database "bacula".
17-Mar 14:03 bacula-dir JobId 0: Fatal error: mysql.c:210 Unable to connect to MySQL server.
Database=bacula User=bacula
MySQL connect failed either server not running or your authorization is incorrect.
17-Mar 14:03 bacula-dir ERROR TERMINATION
Please correct configuration file: /etc/bacula/bacula-dir.conf

i also tried the restart with the /usr/sbin/invoke-rc.d bacula-director reload as seen in puppet code as the restart command. (besides systemctl and /etc/init.d/)

and then we had an unrelated outage ... :p Do you know what's happening here?

Event Timeline

A puppet run changes /var/lib/puppet/ssl/private_keys/helium.eqiad.wmnet.pem back to be owned by puppet:puppet.

Scratch the part about "Unable to connect to MySQL server", that was from March 17th obviously.

After trying again i now see:

-- Subject: Unit bacula-director.service has finished start-up

and

 /etc/init.d/bacula-director status
● bacula-director.service - LSB: Start Bacula Director at boot time
   Loaded: loaded (/etc/init.d/bacula-director)
   Active: active (exited) since Thu 2017-03-23 09:26:20 UTC; 17h ago
  Process: 28558 ExecReload=/etc/init.d/bacula-director reload (code=exited, status=0/SUCCESS)

Icinga did not recover and there is no bacula-dir process running.

Ah it's puppet agent managing the permissions of it's directory more aggresively than previous versions (3.8 vs 3.4). The issue was triggered when the 2 changes above were merged. On a regular start, bacula-director drops privileges after reading all these files hence the first read is happening as root. Subsequent reloads however happen as bacula and hence the issue we are witnessing here (which also explains why we haven't seen it before).

The good thing is we now have a better construct for exposing those certificates than before, that is base::expose_puppet_certs. I 'll try and refactor the code to use these instead of the once-off constructs I had created back then. in the meantime, please refrain from merging changes that would cause a bacula configuration reload.

Change 344603 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] base::expose_puppet_certs: Also provide a keypair

https://gerrit.wikimedia.org/r/344603

Change 344606 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] Update bacula::client to use base::expose_puppet_certs

https://gerrit.wikimedia.org/r/344606

Change 344626 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] Update bacula::storage to use base::expose_puppet_certs

https://gerrit.wikimedia.org/r/344626

Change 344627 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] Update bacula::director to use base::expose_puppet_certs

https://gerrit.wikimedia.org/r/344627

Change 344603 merged by Alexandros Kosiaris:
[operations/puppet@production] base::expose_puppet_certs: Also provide a keypair

https://gerrit.wikimedia.org/r/344603

Change 344606 merged by Alexandros Kosiaris:
[operations/puppet@production] Update bacula::client to use base::expose_puppet_certs

https://gerrit.wikimedia.org/r/344606

Change 344626 merged by Alexandros Kosiaris:
[operations/puppet@production] Update bacula::storage to use base::expose_puppet_certs

https://gerrit.wikimedia.org/r/344626

Change 344627 merged by Alexandros Kosiaris:
[operations/puppet@production] Update bacula::director to use base::expose_puppet_certs

https://gerrit.wikimedia.org/r/344627

A long tail of commits and fixes later, the puppet modules are freak of that hideous keypair generation in the module and rather rely on base::expose_puppet_certs infrastructure making it way more robust.