[tools] toolserver.org cert is expiring in 2 days
Closed, Resolved | Public

Description

This task is to check why the toolserver.org certificate was not renewed by acme-chief, and to fix it.
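To confirm how close a certificate is to expiry, the remaining days can be computed from its `notAfter` field. This is a minimal sketch using only the Python standard library; `fetch_not_after` and `days_until_expiry` are hypothetical helper names, not part of acme-chief, and the default port 443 is an assumption.

```python
import socket
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect over TLS and return the served certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the number of days from `now` (epoch seconds) until `not_after`.

    `not_after` uses the format returned by getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    if now is None:
        now = time.time()
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / SECONDS_PER_DAY
```

Calling `days_until_expiry(fetch_not_after("toolserver.org"))` would report the remaining validity of whatever certificate the host serves at that moment; a negative result means it has already expired.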

Event Timeline

dcaro changed the task status from Open to In Progress. (May 2 2022, 8:30 AM)
dcaro triaged this task as High priority.
dcaro created this task.
dcaro moved this task from To refine to Doing on the User-dcaro board.

I see these errors in the logs on tools-acme-chief-01:

root@tools-acme-chief-01:/etc/acme-chief# systemctl status acme-chief-certs-sync
● acme-chief-certs-sync.service - Sync acme-chief certificates
   Loaded: loaded (/lib/systemd/system/acme-chief-certs-sync.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2022-05-02 08:32:15 UTC; 2min 0s ago
  Process: 12024 ExecStart=/usr/local/bin/acme-chief-certs-sync (code=exited, status=255/EXCEPTION)
 Main PID: 12024 (code=exited, status=255/EXCEPTION)

May 02 08:32:14 tools-acme-chief-01 systemd[1]: Started Sync acme-chief certificates.
May 02 08:32:14 tools-acme-chief-01 acme-chief-certs-sync[12024]: Could not create directory '/nonexistent/.ssh'.
May 02 08:32:15 tools-acme-chief-01 acme-chief-certs-sync[12024]: Connection closed by 172.16.0.18 port 22
May 02 08:32:15 tools-acme-chief-01 systemd[1]: acme-chief-certs-sync.service: Main process exited, code=exited, status=255/EXCEPTION
May 02 08:32:15 tools-acme-chief-01 acme-chief-certs-sync[12024]: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
May 02 08:32:15 tools-acme-chief-01 acme-chief-certs-sync[12024]: rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.3]
May 02 08:32:15 tools-acme-chief-01 systemd[1]: acme-chief-certs-sync.service: Failed with result 'exit-code'.

Checking the service file, it runs as a user that exists but has no home directory:

root@tools-acme-chief-01:/etc/acme-chief# cat /lib/systemd/system/acme-chief-certs-sync.service;
[Unit]
Description=Sync acme-chief certificates

[Service]
User=acme-chief
ExecStart=/usr/local/bin/acme-chief-certs-sync

root@tools-acme-chief-01:/etc/acme-chief# id acme-chief
uid=497(acme-chief) gid=497(acme-chief) groups=497(acme-chief)

root@tools-acme-chief-01:/etc/acme-chief# cd ~acme-chief
-bash: cd: /nonexistent: No such file or directory

Looking into it.

Mentioned in SAL (#wikimedia-cloud) [2022-05-02T08:54:11Z] <taavi> restart acme-chief.service T307333

Restarting the acme-chief service and re-running puppet on toolserver-proxy-01 did the trick. This is probably related to T273956.

Acme-chief logs before restarting:

taavi@tools-acme-chief-01:~ $ sudo journalctl -u acme-chief.service 
-- Logs begin at Sun 2022-05-01 17:27:25 UTC, end at Mon 2022-05-02 08:52:38 UTC. --
May 01 19:59:59 tools-acme-chief-01 acme-chief-backend[18675]: Refreshing live OCSP response for certificate toolse
May 01 19:59:59 tools-acme-chief-01 acme-chief-backend[18675]: live OCSP response refreshed successfully for toolse
May 01 19:59:59 tools-acme-chief-01 acme-chief-backend[18675]: Refreshing live OCSP response for certificate toolse
May 01 19:59:59 tools-acme-chief-01 acme-chief-backend[18675]: live OCSP response refreshed successfully for toolse

Change 788294 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/software/acme-chief@master] acme_chief: add log_level to the config

https://gerrit.wikimedia.org/r/788294

For some reason, reload-acme-chief-backend.timer isn't being triggered on tools-acme-chief-01. That's not related to T273956.
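One way to check whether a timer has ever fired is to read its properties with `systemctl show`. This is a minimal sketch, assuming it runs on the affected host; `parse_show_output` and `timer_properties` are hypothetical helper names, and the selection of properties is an assumption about what is useful here.

```python
import subprocess
from typing import Dict

# Properties systemd exposes for timer units: current state, last trigger
# time, and next scheduled wall-clock run.
TIMER_PROPERTIES = ("ActiveState", "LastTriggerUSec", "NextElapseUSecRealtime")


def parse_show_output(text: str) -> Dict[str, str]:
    """Parse the Key=Value lines printed by `systemctl show` into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that are not Key=Value pairs
            props[key.strip()] = value.strip()
    return props


def timer_properties(unit: str) -> Dict[str, str]:
    """Query systemd for the selected properties of a timer unit."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=" + ",".join(TIMER_PROPERTIES)],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_show_output(result.stdout)
```

Running `timer_properties("reload-acme-chief-backend.timer")` and finding an `ActiveState` other than `active`, or an empty or `n/a` `LastTriggerUSec`, would be consistent with the timer not being triggered.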

Change 788294 abandoned by David Caro:

[operations/software/acme-chief@master] acme_chief: add log_level to the config

Reason:

Not needed

https://gerrit.wikimedia.org/r/788294