bacula-director on helium showed as failed on Icinga (https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=helium&service=bacula+director+process).
I had made a config change (https://gerrit.wikimedia.org/r/#/c/344561/ , https://gerrit.wikimedia.org/r/#/c/344560/) but that itself seems fine, just that it triggered a service restart which then failed.
At first i saw this in syslog:
2824 Mar 23 23:46:20 helium puppet-agent[18271]: (/Stage[main]/Bacula::Director/Service[bacula-director]) Triggered 'refresh' from 1 events 2825 Mar 23 23:46:20 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:0200100D:system library:fopen:Permission denied 2826 Mar 23 23:46:20 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:20074002:BIO routines:FILE_CTRL:system lib 2827 Mar 23 23:46:20 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:140B0002:SSL routines:SSL_CTX_use_PrivateKey_file:system lib
Then i saw the same thing had happened before but many hours earlier, so separate from when i made any change:
96 Mar 23 09:23:19 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:0200100D:system library:fopen:Permission denied 97 Mar 23 09:23:19 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:20074002:BIO routines:FILE_CTRL:system lib 98 Mar 23 09:23:19 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:140B0002:SSL routines:SSL_CTX_use_PrivateKey_file:system lib 99 Mar 23 09:23:24 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:0200100D:system library:fopen:Permission denied 100 Mar 23 09:23:24 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:20074002:BIO routines:FILE_CTRL:system lib 101 Mar 23 09:23:24 helium bacula-dir: openssl.c:86 Error loading private key: ERR=error:140B0002:SSL routines:SSL_CTX_use_PrivateKey_file:system lib
So i looked why it can't load the private key and i found the situation like this:
root@helium:/var/lib/puppet/ssl/private_keys# ls -hals total 24K 4.0K drwxr-x--- 2 puppet puppet 4.0K Mar 17 14:06 . 4.0K drwxrwx--x 7 puppet puppet 4.0K Mar 17 13:56 .. 4.0K -r-------- 1 bacula bacula 3.2K Mar 17 14:02 bacula-helium.eqiad.wmnet.pem 8.0K -rw-r--r-- 1 root root 5.1K Mar 17 14:06 bacula-keypair-helium.eqiad.wmnet.pem 4.0K -rw-r----- 1 puppet puppet 3.2K Mar 17 13:56 helium.eqiad.wmnet.pem
Then i found in puppet code:
47 # Puppet manages the permissions of its private key file and they are too 48 # restrictive to allow any other user/group to read it. Copy it, keep it in 49 # sync and set the require permissions for bacula-dir to be able to read it 50 exec { 'bacula_cp_private_key': 51 command => "/bin/cp /var/lib/puppet/ssl/private_keys/${::fqdn}.pem \ 52 /var/lib/puppet/ssl/private_keys/bacula-${::fqdn}.pem", 53 unless => "/usr/bin/cmp /var/lib/puppet/ssl/private_keys/${::fqdn}.pem \ 54 /var/lib/puppet/ssl/private_keys/bacula-${::fqdn}.pem", 55 } 56 57 file { "/var/lib/puppet/ssl/private_keys/bacula-${::fqdn}.pem": 58 ensure => present, 59 owner => 'bacula', 60 group => 'bacula', 61 mode => '0400', 62 require => Exec['bacula_cp_private_key'], 63 notify => Service['bacula-director'], 64 }
So i thought "aha, that is it" and fixed the permissions to reflect that and started it again.
chown bacula:bacula bacula-keypair-helium.eqiad.wmnet.pem chown bacula:bacula helium.eqiad.wmnet.pem chmod 400 *.pem
But the service STILL would not start. Now i found this in bacula.log
root@helium:/var/lib/puppet/ssl/private_keys# tail -f /var/log/bacula/bacula.log 17-Mar 14:03 bacula-dir JobId 0: Fatal error: Could not open Catalog "MyCatalog", database "bacula". 17-Mar 14:03 bacula-dir JobId 0: Fatal error: mysql.c:210 Unable to connect to MySQL server. Database=bacula User=bacula MySQL connect failed either server not running or your authorization is incorrect. 17-Mar 14:03 bacula-dir ERROR TERMINATION Please correct configuration file: /etc/bacula/bacula-dir.conf
i also tried the restart with the /usr/sbin/invoke-rc.d bacula-director reload as seen in puppet code as the restart command. (besides systemctl and /etc/init.d/)
and then we had an unrelated outage ... :p Do you know what's happening here?