
[tools] toolserver.org cert is expiring in 2 days
Closed, Resolved (Public)

Description

This task is to check why the toolserver.org certificate was not renewed by acme-chief, and to fix it.
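
For reference, the remaining lifetime of a served certificate can be checked with a few lines of Python. This is a minimal sketch and not part of acme-chief: the function names are made up for illustration, and it assumes the host serves the certificate on port 443 with a chain that the local trust store still accepts (an already expired certificate would fail verification instead).

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return the number of days left before a certificate's notAfter time.

    not_after uses the format returned by SSLSocket.getpeercert(),
    for example "May  4 08:00:00 2022 GMT" (an illustrative value).
    now is a Unix timestamp and defaults to the current time.
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host, port=443, timeout=10.0):
    """Fetch the notAfter field of the certificate served by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Example use (needs network access):
#   remaining = days_until_expiry(fetch_not_after("toolserver.org"))
#   print(f"certificate expires in {remaining:.1f} days")
```

Keeping the date arithmetic separate from the network call makes it possible to check the calculation against a known timestamp without contacting the server.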

Event Timeline

dcaro changed the task status from Open to In Progress.
dcaro triaged this task as High priority.
dcaro moved this task from To refine to Doing on the User-dcaro board.

I see these errors in the logs on tools-acme-chief-01:

root@tools-acme-chief-01:/etc/acme-chief# systemctl status acme-chief-certs-sync
● acme-chief-certs-sync.service - Sync acme-chief certificates
   Loaded: loaded (/lib/systemd/system/acme-chief-certs-sync.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2022-05-02 08:32:15 UTC; 2min 0s ago
  Process: 12024 ExecStart=/usr/local/bin/acme-chief-certs-sync (code=exited, status=255/EXCEPTION)
 Main PID: 12024 (code=exited, status=255/EXCEPTION)

May 02 08:32:14 tools-acme-chief-01 systemd[1]: Started Sync acme-chief certificates.
May 02 08:32:14 tools-acme-chief-01 acme-chief-certs-sync[12024]: Could not create directory '/nonexistent/.ssh'.
May 02 08:32:15 tools-acme-chief-01 acme-chief-certs-sync[12024]: Connection closed by 172.16.0.18 port 22
May 02 08:32:15 tools-acme-chief-01 systemd[1]: acme-chief-certs-sync.service: Main process exited, code=exited, status=255/EXCEPTION
May 02 08:32:15 tools-acme-chief-01 acme-chief-certs-sync[12024]: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
May 02 08:32:15 tools-acme-chief-01 acme-chief-certs-sync[12024]: rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.3]
May 02 08:32:15 tools-acme-chief-01 systemd[1]: acme-chief-certs-sync.service: Failed with result 'exit-code'.

Checking the service file, it runs as a user that exists but whose home directory (/nonexistent) does not:

root@tools-acme-chief-01:/etc/acme-chief# cat /lib/systemd/system/acme-chief-certs-sync.service;
[Unit]
Description=Sync acme-chief certificates

[Service]
User=acme-chief
ExecStart=/usr/local/bin/acme-chief-certs-sync

root@tools-acme-chief-01:/etc/acme-chief# id acme-chief
uid=497(acme-chief) gid=497(acme-chief) groups=497(acme-chief)

root@tools-acme-chief-01:/etc/acme-chief# cd ~acme-chief
-bash: cd: /nonexistent: No such file or directory

looking

Mentioned in SAL (#wikimedia-cloud) [2022-05-02T08:54:11Z] <taavi> restart acme-chief.service T307333

Restarting the acme-chief service and re-running puppet on toolserver-proxy-01 did the trick, probably related to T273956.

Acme-chief logs before restarting:

taavi@tools-acme-chief-01:~ $ sudo journalctl -u acme-chief.service 
-- Logs begin at Sun 2022-05-01 17:27:25 UTC, end at Mon 2022-05-02 08:52:38 UTC. --
May 01 19:59:59 tools-acme-chief-01 acme-chief-backend[18675]: Refreshing live OCSP response for certificate toolse
May 01 19:59:59 tools-acme-chief-01 acme-chief-backend[18675]: live OCSP response refreshed successfully for toolse
May 01 19:59:59 tools-acme-chief-01 acme-chief-backend[18675]: Refreshing live OCSP response for certificate toolse
May 01 19:59:59 tools-acme-chief-01 acme-chief-backend[18675]: live OCSP response refreshed successfully for toolse

Change 788294 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/software/acme-chief@master] acme_chief: add log_level to the config

https://gerrit.wikimedia.org/r/788294

For some reason reload-acme-chief-backend.timer isn't being triggered on tools-acme-chief-01. That's not related to T273956.
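
One way to inspect whether a timer has ever fired is to read its properties with `systemctl show`. The sketch below is a hedged illustration, not something that was run for this task: the helper names are invented, and it assumes the installed systemd exposes the `LastTriggerUSec` and `NextElapseUSecRealtime` timer properties. A value such as `n/a` for the last trigger would suggest the timer has not elapsed since it was loaded.

```python
import subprocess


def parse_show_output(text):
    """Parse the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without an '=' separator
            props[key] = value
    return props


def timer_properties(unit):
    """Return selected properties of a timer unit via `systemctl show`."""
    result = subprocess.run(
        [
            "systemctl", "show", unit,
            "--property=ActiveState,LastTriggerUSec,NextElapseUSecRealtime",
        ],
        capture_output=True, text=True, check=True,
    )
    return parse_show_output(result.stdout)


# Example use on the affected host:
#   print(timer_properties("reload-acme-chief-backend.timer"))
```

`systemctl list-timers --all` gives a similar overview interactively.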

Change 788294 abandoned by David Caro:

[operations/software/acme-chief@master] acme_chief: add log_level to the config

Reason:

Not needed

https://gerrit.wikimedia.org/r/788294