Warning about /etc/acmecerts/unified contents during puppet run on deployment-cache-text08 & deployment-cache-upload08
Open, In Progress, Low | Public | BUG REPORT

Description

Seen during a forced puppet run on deployment-cache-text08:

Warning: The directory '/etc/acmecerts/unified' contains 1013 entries, which exceeds the default soft limit 1000 and may cause excessive resource consumption and degraded performance. To remove this warning set a value for `max_files` parameter or consider using an alternate method to manage large directory trees
root@deployment-cache-text08:/etc/acmecerts/unified# ls -lht
total 196K
lrwxrwxrwx 1 root haproxy   55 Jul 12 10:21 live -> /etc/acmecerts/unified/59a26a546bee479e80c66eab3976743e
drwxr-x--- 2 root haproxy 4.0K Jul 12 10:21 59a26a546bee479e80c66eab3976743e
lrwxrwxrwx 1 root haproxy   55 Jul 11 07:22 new -> /etc/acmecerts/unified/59a26a546bee479e80c66eab3976743e
drwxr-x--- 2 root haproxy 4.0K Jun 18 01:23 4a391354768646118582750128fbdd6c
drwxr-x--- 2 root haproxy 4.0K Jun  5 04:52 6b06f6db7ab34265862617dc224c0e57
drwxr-x--- 2 root haproxy 4.0K Jun  4 17:52 5deb0687ea234282bac26db1fea3d02b
drwxr-x--- 2 root haproxy 4.0K Jun  1 16:22 eb8d883249594d2c9e2b44a6c12e0dd9
drwxr-x--- 2 root haproxy 4.0K May 31 06:52 cbdcb021a109423c8a12f236369c0796
drwxr-x--- 2 root haproxy 4.0K Apr  1 06:21 d5250601537d419485e2ab87d0d16043
drwxr-x--- 2 root haproxy 4.0K Jan 31 23:52 5b923c1c47f54e7a90c6587f030cd627
drwxr-x--- 2 root haproxy 4.0K Dec  1  2024 a648a4caa4654f678ea3a4b684f5ff12
drwxr-x--- 2 root haproxy 4.0K Oct  2  2024 c72b5be99df74a429f7f5b2016869140
drwxr-x--- 2 root haproxy 4.0K Aug  4  2024 494ba7b3eee2491bb098cb1c3d514e97
drwxr-x--- 2 root haproxy 4.0K Jun  7  2024 206b7e3a881d475381791448b8513168
drwxr-x--- 2 root haproxy 4.0K Jun  2  2024 e6c588ed7590429d8e918b5368a16e33
drwxr-x--- 2 root haproxy 4.0K Jun  1  2024 38ee63b619284c0fb06a989d6a43f3a9
drwxr-x--- 2 root haproxy 4.0K May 27  2024 705af2df9de343d080f7580945dff7c1
drwxr-x--- 2 root haproxy 4.0K May  4  2024 025a2b8791254a168a7a271f6b68b925
drwxr-x--- 2 root haproxy 4.0K Mar  5  2024 370d921da1fe469eac2e25a7ffa69710
drwxr-x--- 2 root haproxy 4.0K Jan  5  2024 6be2f94d3593470cb55104e3fb710c6c
drwxr-x--- 2 root haproxy 4.0K Nov  7  2023 dbd87f29dbb746518d4d4bbb2ef881f6
drwxr-x--- 2 root haproxy 4.0K Aug 13  2023 4c49c4081f834ddb8712968b295983c0
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 fd2d69d1a7a149cebc73d00cfdd0ac2e
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 f51b6677b4dd4135a912d035cf291e76
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 f2c942538af0405793badc5f40c81362
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 e03102bc11c444f3a0f9a3ea86c743da
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 d73e8673a2684d3f889cb91f833f5f09
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ce8e4d22acee4b708e03904c1087c40c
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 c0b936e8b2294189bc6248bff3671842
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 bf5e612e77174be08859dcf617306e92
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 b5f2741ce23e4bf4b2ffab40c171650f
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 b5a4fe4892e54ede81b680247fa2b736
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 b3fc07a93a2442f4ad1227759811ed80
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 b33df9ca50c4483dabe54f49c14487c7
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ad3f236e3cb747bead1101ef4c5296fd
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 9e3fd170440d4191a02f8fce272aae4b
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 9a6fd2f5e4c94e239e9685e0b493b234
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 916fc10e496b4ecc95c27b91a783d4f2
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 7a504960c6094193b35777105719fa82
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 79154a7bf6b445cb9dce3d7fdddfdd12
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 75b10323938040dd9e76fb60af2ff797
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 6ea106733def4814a235074668cb9652
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 4dad0611440d4460a49cebb9f6b92c6d
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 4aff45196800449ba0678f736e610a4b
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 3099a6344aea4a7a97719536fdd1d625
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 306cce488a754178bb9ea1c762fbeffa
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 2f991dd5078f40aead36cd6d041a2311
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 1f27eb87bcc1489e857b8caf4803f628
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 1c321ac6da0c4103bf630165d91bca2c
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 0d829dc82393450fa34fabd837364efa
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 002fe8e1d5e6417d9fa5a70eee4b92e9
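
The listing above shows only about fifty top-level entries, yet the warning reports 1013. A likely explanation, assuming the directory is managed as a recursive file resource, is that Puppet counts every file in the tree rather than only the top level. The sketch below uses a throwaway directory with made-up names (`v1`, `v2`, `v3`) to show how the two counts diverge:

```shell
#!/bin/sh
# Scratch tree with hypothetical names; nothing under /etc is touched.
scratch=$(mktemp -d)
for version in v1 v2 v3; do
    mkdir -p "$scratch/unified/$version"
    touch "$scratch/unified/$version/rsa-2048.crt" "$scratch/unified/$version/rsa-2048.key"
done

# What a plain `ls` shows: top-level entries only.
top_level=$(find "$scratch/unified" -mindepth 1 -maxdepth 1 | wc -l)
# What a recursive walk sees: directories plus every file inside them.
recursive=$(find "$scratch/unified" -mindepth 1 | wc -l)

echo "top-level entries: $top_level"
echo "recursive entries: $recursive"
rm -rf "$scratch"
```

On the real host the equivalent check would be `find /etc/acmecerts/unified -mindepth 1 | wc -l`, though the exact way Puppet arrives at its number is not confirmed here.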

Event Timeline

I think that the directory linked to by /etc/acmecerts/unified/live is the only one that is actually used. Let's try getting rid of the pile of directories from 2023-06-22.

root@deployment-cache-text08:/etc/acmecerts/unified# find . -maxdepth 1 -type d -mtime +750 -exec ls -lhd {} \;
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./b5a4fe4892e54ede81b680247fa2b736
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./4dad0611440d4460a49cebb9f6b92c6d
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./6ea106733def4814a235074668cb9652
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./2f991dd5078f40aead36cd6d041a2311
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./e03102bc11c444f3a0f9a3ea86c743da
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./3099a6344aea4a7a97719536fdd1d625
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./1f27eb87bcc1489e857b8caf4803f628
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./306cce488a754178bb9ea1c762fbeffa
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./002fe8e1d5e6417d9fa5a70eee4b92e9
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./fd2d69d1a7a149cebc73d00cfdd0ac2e
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./ad3f236e3cb747bead1101ef4c5296fd
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./ce8e4d22acee4b708e03904c1087c40c
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./f51b6677b4dd4135a912d035cf291e76
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./d73e8673a2684d3f889cb91f833f5f09
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./916fc10e496b4ecc95c27b91a783d4f2
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./4aff45196800449ba0678f736e610a4b
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./9a6fd2f5e4c94e239e9685e0b493b234
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./0d829dc82393450fa34fabd837364efa
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./79154a7bf6b445cb9dce3d7fdddfdd12
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./b3fc07a93a2442f4ad1227759811ed80
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./9e3fd170440d4191a02f8fce272aae4b
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./c0b936e8b2294189bc6248bff3671842
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./1c321ac6da0c4103bf630165d91bca2c
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./bf5e612e77174be08859dcf617306e92
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./7a504960c6094193b35777105719fa82
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./b5f2741ce23e4bf4b2ffab40c171650f
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./75b10323938040dd9e76fb60af2ff797
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./f2c942538af0405793badc5f40c81362
drwxr-x--- 2 root haproxy 4.0K Jun 22  2023 ./b33df9ca50c4483dabe54f49c14487c7
root@deployment-cache-text08:/etc/acmecerts/unified# find . -maxdepth 1 -type d -mtime +750 -exec rm -r {} \;
root@deployment-cache-text08:/etc/acmecerts/unified# ls
025a2b8791254a168a7a271f6b68b925  6be2f94d3593470cb55104e3fb710c6c
206b7e3a881d475381791448b8513168  705af2df9de343d080f7580945dff7c1
370d921da1fe469eac2e25a7ffa69710  a648a4caa4654f678ea3a4b684f5ff12
38ee63b619284c0fb06a989d6a43f3a9  c72b5be99df74a429f7f5b2016869140
494ba7b3eee2491bb098cb1c3d514e97  cbdcb021a109423c8a12f236369c0796
4a391354768646118582750128fbdd6c  d5250601537d419485e2ab87d0d16043
4c49c4081f834ddb8712968b295983c0  dbd87f29dbb746518d4d4bbb2ef881f6
59a26a546bee479e80c66eab3976743e  e6c588ed7590429d8e918b5368a16e33
5b923c1c47f54e7a90c6587f030cd627  eb8d883249594d2c9e2b44a6c12e0dd9
5deb0687ea234282bac26db1fea3d02b  live
6b06f6db7ab34265862617dc224c0e57  new

The next puppet run put them all back, so whatever the fix is, that wasn't it.

bd808 renamed this task from Warning about /etc/acmecerts/unified contents during puppet run on deployment-cache-text08 to Warning about /etc/acmecerts/unified contents during puppet run on deployment-cache-text08 & deployment-cache-upload08. Jul 16 2025, 9:28 PM

You can clean old certs in the acme-chief instance, those should be stored in /var/lib/acme-chief/

Is this normal maintenance for an acme-chief instance? I was wondering if there maybe should be some systemd timer as part of the role that would take care of this cleanup of stale/expired certs. If this feels like it is just weirdness that arises from Beta Cluster changing things too often or keeping acme-chief instances longer than generally expected I can write a doc somewhere, do the needful, and move on.

yes, I'm gonna add a systemd timer to take care of that, but definitely not at 23:33 local time :) I replied to unblock you and I'll send a CR soon

Change #1170281 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] acme_chief: Remove certs older than 1 year

https://gerrit.wikimedia.org/r/1170281

Change #1170281 merged by Vgutierrez:

[operations/puppet@production] acme_chief: Remove certs older than 1 year

https://gerrit.wikimedia.org/r/1170281

Testing the new timer service out on deployment-acme-chief05:

bd808@deployment-acme-chief05:~$ sudo find /var/lib/acme-chief/certs/unified -maxdepth 1 -type d -mtime +366
/var/lib/acme-chief/certs/unified/b5a4fe4892e54ede81b680247fa2b736
/var/lib/acme-chief/certs/unified/494ba7b3eee2491bb098cb1c3d514e97
/var/lib/acme-chief/certs/unified/4dad0611440d4460a49cebb9f6b92c6d
/var/lib/acme-chief/certs/unified/6ea106733def4814a235074668cb9652
/var/lib/acme-chief/certs/unified/2f991dd5078f40aead36cd6d041a2311
/var/lib/acme-chief/certs/unified/e6c588ed7590429d8e918b5368a16e33
/var/lib/acme-chief/certs/unified/e03102bc11c444f3a0f9a3ea86c743da
/var/lib/acme-chief/certs/unified/3099a6344aea4a7a97719536fdd1d625
/var/lib/acme-chief/certs/unified/dbd87f29dbb746518d4d4bbb2ef881f6
/var/lib/acme-chief/certs/unified/1f27eb87bcc1489e857b8caf4803f628
/var/lib/acme-chief/certs/unified/306cce488a754178bb9ea1c762fbeffa
/var/lib/acme-chief/certs/unified/002fe8e1d5e6417d9fa5a70eee4b92e9
/var/lib/acme-chief/certs/unified/38ee63b619284c0fb06a989d6a43f3a9
/var/lib/acme-chief/certs/unified/fd2d69d1a7a149cebc73d00cfdd0ac2e
/var/lib/acme-chief/certs/unified/ad3f236e3cb747bead1101ef4c5296fd
/var/lib/acme-chief/certs/unified/ce8e4d22acee4b708e03904c1087c40c
/var/lib/acme-chief/certs/unified/6be2f94d3593470cb55104e3fb710c6c
/var/lib/acme-chief/certs/unified/f51b6677b4dd4135a912d035cf291e76
/var/lib/acme-chief/certs/unified/d73e8673a2684d3f889cb91f833f5f09
/var/lib/acme-chief/certs/unified/370d921da1fe469eac2e25a7ffa69710
/var/lib/acme-chief/certs/unified/4c49c4081f834ddb8712968b295983c0
/var/lib/acme-chief/certs/unified/916fc10e496b4ecc95c27b91a783d4f2
/var/lib/acme-chief/certs/unified/4aff45196800449ba0678f736e610a4b
/var/lib/acme-chief/certs/unified/206b7e3a881d475381791448b8513168
/var/lib/acme-chief/certs/unified/705af2df9de343d080f7580945dff7c1
/var/lib/acme-chief/certs/unified/9a6fd2f5e4c94e239e9685e0b493b234
/var/lib/acme-chief/certs/unified/0d829dc82393450fa34fabd837364efa
/var/lib/acme-chief/certs/unified/79154a7bf6b445cb9dce3d7fdddfdd12
/var/lib/acme-chief/certs/unified/b3fc07a93a2442f4ad1227759811ed80
/var/lib/acme-chief/certs/unified/9e3fd170440d4191a02f8fce272aae4b
/var/lib/acme-chief/certs/unified/025a2b8791254a168a7a271f6b68b925
/var/lib/acme-chief/certs/unified/c0b936e8b2294189bc6248bff3671842
/var/lib/acme-chief/certs/unified/1c321ac6da0c4103bf630165d91bca2c
/var/lib/acme-chief/certs/unified/bf5e612e77174be08859dcf617306e92
/var/lib/acme-chief/certs/unified/7a504960c6094193b35777105719fa82
/var/lib/acme-chief/certs/unified/b5f2741ce23e4bf4b2ffab40c171650f
/var/lib/acme-chief/certs/unified/75b10323938040dd9e76fb60af2ff797
/var/lib/acme-chief/certs/unified/f2c942538af0405793badc5f40c81362
/var/lib/acme-chief/certs/unified/b33df9ca50c4483dabe54f49c14487c7
bd808@deployment-acme-chief05:~$ sudo systemctl list-timers | grep -E 'clean-stale-certs|NEXT'
NEXT                        LEFT                LAST                        PASSED       UNIT                                            ACTIVATES
Fri 2025-08-01 00:00:00 UTC 2 weeks 0 days left n/a                         n/a          clean-stale-certs.timer                         clean-stale-certs.service
bd808@deployment-acme-chief05:~$ sudo systemctl start clean-stale-certs.service
Job for clean-stale-certs.service failed because the control process exited with error code.
See "systemctl status clean-stale-certs.service" and "journalctl -xe" for details.
bd808@deployment-acme-chief05:~$ sudo systemctl status clean-stale-certs.service --no-pager --full
● clean-stale-certs.service - clean certs older than 1 year
     Loaded: loaded (/lib/systemd/system/clean-stale-certs.service; static)
     Active: failed (Result: exit-code) since Thu 2025-07-17 15:53:38 UTC; 1min 30s ago
TriggeredBy: ● clean-stale-certs.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 402437 ExecStart=/usr/local/bin/systemd-timer-mail-wrapper --subject clean-stale-certs --mail-to root@deployment-acme-chief05.deployment-prep.eqiad1.wikimedia.cloud --only-on-error /usr/bin/find /var/lib/acme-chief/certs -mtime +365 -delete (code=exited, status=1/FAILURE)
   Main PID: 402437 (code=exited, status=1/FAILURE)
        CPU: 154ms

Jul 17 15:53:38 deployment-acme-chief05 systemd[1]: Starting clean certs older than 1 year...
Jul 17 15:53:38 deployment-acme-chief05 find[402437]: /usr/bin/find: cannot delete ‘/var/lib/acme-chief/certs/mx/90fb608a18b24ad884b6d934b815d23b’: Directory not empty
Jul 17 15:53:38 deployment-acme-chief05 find[402437]: /usr/bin/find: cannot delete ‘/var/lib/acme-chief/certs/unified/494ba7b3eee2491bb098cb1c3d514e97’: Directory not empty
Jul 17 15:53:38 deployment-acme-chief05 find[402437]: /usr/bin/find: cannot delete ‘/var/lib/acme-chief/certs’: Directory not empty
Jul 17 15:53:38 deployment-acme-chief05 systemd[1]: clean-stale-certs.service: Main process exited, code=exited, status=1/FAILURE
Jul 17 15:53:38 deployment-acme-chief05 systemd[1]: clean-stale-certs.service: Failed with result 'exit-code'.
Jul 17 15:53:38 deployment-acme-chief05 systemd[1]: Failed to start clean certs older than 1 year.
bd808@deployment-acme-chief05:~$ sudo find /var/lib/acme-chief/certs/unified -maxdepth 1 -type d -mtime +366
bd808@deployment-acme-chief05:~$ echo $?
0

It did the cleanup that was wanted, but also returned as failed because find issued the equivalent of rm $some_directory when there were still files in that directory. One possible fix for that is something like -depth -exec rm -rv {} + instead of -delete.
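
The failure mode described above can be reproduced in a scratch directory. This is a minimal sketch with hypothetical names (`version-dir`, `renewed.crt`): a directory whose own mtime is old but which still holds a recent file matches `-mtime +365`, and `-delete` then fails on it because it is not empty.

```shell
#!/bin/sh
# Scratch tree with hypothetical names; nothing under /var/lib is touched.
scratch=$(mktemp -d)
mkdir -p "$scratch/certs/version-dir"
touch "$scratch/certs/version-dir/renewed.crt"        # recent file, does not match -mtime +365
touch -t 202001010000 "$scratch/certs/version-dir"    # directory mtime set far in the past

# -delete processes directory contents first; the recent file is kept,
# so removing the stale-looking directory fails and find exits non-zero.
find "$scratch/certs" -mindepth 1 -mtime +365 -delete 2>/dev/null
single_pass_status=$?

if [ -d "$scratch/certs/version-dir" ]; then dir_survived=yes; else dir_survived=no; fi
echo "exit status: $single_pass_status, directory survived: $dir_survived"
rm -rf "$scratch"
```

The non-zero exit status is what makes the systemd unit report a failure even when every genuinely stale file was removed.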

I like the idea of using two passes instead of rm -r, so something like:

  • find /var/lib/acme-chief/certs -type f -mtime +365 -delete
  • find /var/lib/acme-chief/certs -type d -empty -delete

thanks for the debugging @bd808
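
The two passes listed above can be tried safely on a scratch tree first. In this sketch the directory and file names are hypothetical, and `-mindepth 1` on the second pass is an addition of this example (not part of the proposal) so that the top-level directory is never removed even if it ends up empty.

```shell
#!/bin/sh
# Scratch tree with hypothetical names; nothing under /var/lib is touched.
scratch=$(mktemp -d)
mkdir -p "$scratch/certs/stale" "$scratch/certs/mixed"
touch -t 202001010000 "$scratch/certs/stale/old.crt" "$scratch/certs/mixed/old.crt"
touch "$scratch/certs/mixed/new.crt"

# Pass 1: remove only files older than a year.
find "$scratch/certs" -type f -mtime +365 -delete
# Pass 2: remove whatever directories pass 1 left empty.
find "$scratch/certs" -mindepth 1 -type d -empty -delete
two_pass_status=$?

remaining=$(cd "$scratch/certs" && find . -mindepth 1 | sort)
echo "exit status: $two_pass_status"
echo "$remaining"
rm -rf "$scratch"
```

Because the second pass only ever targets empty directories, neither pass can hit the "Directory not empty" error seen with the single `-delete` run.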

I prefer the staggered approach over the blanket rm -r. I think they are functionally equivalent and so this may be subjective but it's more controlled.

Change #1170497 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] acme_chief: Delete empty directories after pruning expired certs

https://gerrit.wikimedia.org/r/1170497

Change #1170497 merged by Vgutierrez:

[operations/puppet@production] acme_chief: Delete empty directories after pruning expired certs

https://gerrit.wikimedia.org/r/1170497

Change #1174881 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] acme-chief: Move clean-stale-certs to file

https://gerrit.wikimedia.org/r/1174881

BCornwall changed the task status from Open to In Progress. Aug 1 2025, 1:44 AM
BCornwall claimed this task.
BCornwall triaged this task as Low priority.

Change #1174881 merged by BCornwall:

[operations/puppet@production] acme-chief: Move clean-stale-certs to file

https://gerrit.wikimedia.org/r/1174881

Mentioned in SAL (#wikimedia-operations) [2025-09-12T16:38:53Z] <brett> Manually running clean-stale-certs.service on acmechief2002 - T399419

Change #1187861 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Move SPDX identifier below shebang

https://gerrit.wikimedia.org/r/1187861

Change #1187861 merged by BCornwall:

[operations/puppet@production] acme-chief: Fixes for cert cleanup script

https://gerrit.wikimedia.org/r/1187861

The unit has run and cleanup has occurred - everything looking good, @bd808?

bd808@deployment-acme-chief05:~$ sudo systemctl status clean-stale-certs.service --no-pager --full
● clean-stale-certs.service - clean certs older than 1 year
     Loaded: loaded (/lib/systemd/system/clean-stale-certs.service; static)
     Active: failed (Result: exit-code) since Mon 2025-09-01 00:00:04 UTC; 1 weeks 4 days ago
TriggeredBy: ● clean-stale-certs.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
   Main PID: 3525660 (code=exited, status=1/FAILURE)
        CPU: 108ms

Sep 01 00:00:03 deployment-acme-chief05 systemd[1]: Starting clean certs older than 1 year...
Sep 01 00:00:03 deployment-acme-chief05 find[3525660]: /usr/bin/find: paths must precede expression: `&&'
Sep 01 00:00:04 deployment-acme-chief05 systemd[1]: clean-stale-certs.service: Main process exited, code=exited, status=1/FAILURE
Sep 01 00:00:04 deployment-acme-chief05 systemd[1]: clean-stale-certs.service: Failed with result 'exit-code'.
Sep 01 00:00:04 deployment-acme-chief05 systemd[1]: Failed to start clean certs older than 1 year.
bd808@deployment-acme-chief05:~$ sudo puppet agent -tv
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-acme-chief05.deployment-prep.eqiad1.wikimedia.cloud
Info: Applying configuration version '(0174fe839b) gitpuppet - trafficserver: Add missing REST Gateway for Beta Cluster'
Notice: Applied catalog in 8.54 seconds
bd808@deployment-acme-chief05:~$ sudo systemctl start clean-stale-certs.service
bd808@deployment-acme-chief05:~$ sudo systemctl status clean-stale-certs.service --no-pager --full
● clean-stale-certs.service - clean certs older than 1 year
     Loaded: loaded (/lib/systemd/system/clean-stale-certs.service; static)
     Active: inactive (dead) since Fri 2025-09-12 20:12:14 UTC; 5s ago
TriggeredBy: ● clean-stale-certs.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 157915 ExecStart=/usr/local/bin/systemd-timer-mail-wrapper --subject clean-stale-certs --mail-to root@deployment-acme-chief05.deployment-prep.eqiad1.wikimedia.cloud --only-on-error /usr/local/bin/clean-stale-acme-chief-certs (code=exited, status=0/SUCCESS)
   Main PID: 157915 (code=exited, status=0/SUCCESS)
        CPU: 79ms

Sep 12 20:12:14 deployment-acme-chief05 clean-stale-acme-chief-certs[157915]: /var/lib/acme-chief/certs/unified/c72b5be99df74a429f7f5b2016869140/rsa-2048.chained.crt.key
Sep 12 20:12:14 deployment-acme-chief05 clean-stale-acme-chief-certs[157915]: /var/lib/acme-chief/certs/unified/c72b5be99df74a429f7f5b2016869140/rsa-2048.alt.chained.crt.key
Sep 12 20:12:14 deployment-acme-chief05 clean-stale-acme-chief-certs[157915]: /var/lib/acme-chief/certs/unified/c72b5be99df74a429f7f5b2016869140/rsa-2048.key
Sep 12 20:12:14 deployment-acme-chief05 clean-stale-acme-chief-certs[157915]: /var/lib/acme-chief/certs/unified/c72b5be99df74a429f7f5b2016869140/rsa-2048.chained.crt
Sep 12 20:12:14 deployment-acme-chief05 clean-stale-acme-chief-certs[157915]: /var/lib/acme-chief/certs/unified/c72b5be99df74a429f7f5b2016869140/ec-prime256v1.alt.chained.crt.key
Sep 12 20:12:14 deployment-acme-chief05 clean-stale-acme-chief-certs[157915]: /var/lib/acme-chief/certs/unified/c72b5be99df74a429f7f5b2016869140/ec-prime256v1.crt.key
Sep 12 20:12:14 deployment-acme-chief05 clean-stale-acme-chief-certs[157915]: /var/lib/acme-chief/certs/mx/90fb608a18b24ad884b6d934b815d23b
Sep 12 20:12:14 deployment-acme-chief05 clean-stale-acme-chief-certs[157915]: /var/lib/acme-chief/certs/unified/494ba7b3eee2491bb098cb1c3d514e97
Sep 12 20:12:14 deployment-acme-chief05 systemd[1]: clean-stale-certs.service: Succeeded.
Sep 12 20:12:14 deployment-acme-chief05 systemd[1]: Finished clean certs older than 1 year.

The cleanup script and timer look good on deployment-acme-chief05.
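
For context on the Sep 01 failure above: the `paths must precede expression` message is what find prints when `&&` reaches it as a literal argument. That is presumably what happened when the two commands were chained directly in the unit, since systemd's ExecStart does not run the command line through a shell, though the unit file itself is not shown here. A minimal sketch with hypothetical scratch paths:

```shell
#!/bin/sh
# Scratch tree with hypothetical names; nothing under /var/lib is touched.
scratch=$(mktemp -d)
mkdir -p "$scratch/certs/stale"
touch -t 202001010000 "$scratch/certs/stale/old.crt"

# No shell parsing: '&&' is handed to find as an ordinary argument,
# so find rejects the whole command line and deletes nothing.
find "$scratch/certs" -type f -mtime +365 -delete '&&' \
    find "$scratch/certs" -mindepth 1 -type d -empty -delete 2>/dev/null
literal_status=$?

# With a shell doing the parsing, '&&' chains two separate find runs.
sh -c 'find "$1" -type f -mtime +365 -delete && find "$1" -mindepth 1 -type d -empty -delete' \
    sh "$scratch/certs"
shell_status=$?

leftover=$(find "$scratch/certs" -mindepth 1 | wc -l)
echo "literal '&&': $literal_status, via shell: $shell_status, entries left: $leftover"
rm -rf "$scratch"
```

Moving the commands into a script file, as the later log lines with `/usr/local/bin/clean-stale-acme-chief-certs` suggest was done, has the same effect as the `sh -c` form.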

On deployment-cache-upload08 the Puppet runs were still printing a Warning: The directory '/etc/acmecerts/unified' contains 1032 entries message until I manually cleaned the directory with sudo find /etc/acmecerts/unified -type f -mtime +365 -delete -print. After that purge and a subsequent puppet run to catch up with whatever might be put back, there are 190 files per sudo find /etc/acmecerts/unified -type f | wc -l.

Hm, it sounded like the directory should have been synced without you manually needing to delete them from /etc/.

I can see that the cleaner script is running on the acme-chief side:

bd808@deployment-acme-chief05:~$ sudo systemctl status clean-stale-certs.service --no-pager --full
● clean-stale-certs.service - clean certs older than 1 year
     Loaded: loaded (/lib/systemd/system/clean-stale-certs.service; static)
     Active: inactive (dead) since Sun 2026-03-01 00:00:00 UTC; 1 weeks 1 days ago
TriggeredBy: ● clean-stale-certs.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 3478827 ExecStart=/usr/local/bin/systemd-timer-mail-wrapper --subject clean-stale-certs --mail-to root@deployment-acme-chief05.deployment-prep.eqiad1.wikimedia.cloud --only-on-error /usr/local/bin/clean-stale-acme-chief-certs (code=exited, status=0/SUCCESS)
   Main PID: 3478827 (code=exited, status=0/SUCCESS)
        CPU: 80ms

Mar 01 00:00:00 deployment-acme-chief05 clean-stale-acme-chief-certs[3478827]: /var/lib/acme-chief/certs/unified/d5250601537d419485e2ab87d0d16043/rsa-2048.alt.chained.crt
Mar 01 00:00:00 deployment-acme-chief05 clean-stale-acme-chief-certs[3478827]: /var/lib/acme-chief/certs/unified/d5250601537d419485e2ab87d0d16043/rsa-2048.chain.crt
Mar 01 00:00:00 deployment-acme-chief05 clean-stale-acme-chief-certs[3478827]: /var/lib/acme-chief/certs/unified/d5250601537d419485e2ab87d0d16043/rsa-2048.chained.crt.key
Mar 01 00:00:00 deployment-acme-chief05 clean-stale-acme-chief-certs[3478827]: /var/lib/acme-chief/certs/unified/d5250601537d419485e2ab87d0d16043/rsa-2048.alt.chained.crt.key
Mar 01 00:00:00 deployment-acme-chief05 clean-stale-acme-chief-certs[3478827]: /var/lib/acme-chief/certs/unified/d5250601537d419485e2ab87d0d16043/rsa-2048.key
Mar 01 00:00:00 deployment-acme-chief05 clean-stale-acme-chief-certs[3478827]: /var/lib/acme-chief/certs/unified/d5250601537d419485e2ab87d0d16043/rsa-2048.chained.crt
Mar 01 00:00:00 deployment-acme-chief05 clean-stale-acme-chief-certs[3478827]: /var/lib/acme-chief/certs/unified/d5250601537d419485e2ab87d0d16043/ec-prime256v1.alt.chained.crt.key
Mar 01 00:00:00 deployment-acme-chief05 clean-stale-acme-chief-certs[3478827]: /var/lib/acme-chief/certs/unified/d5250601537d419485e2ab87d0d16043/ec-prime256v1.crt.key
Mar 01 00:00:00 deployment-acme-chief05 systemd[1]: clean-stale-certs.service: Succeeded.
Mar 01 00:00:00 deployment-acme-chief05 systemd[1]: Finished clean certs older than 1 year.

But on a client like deployment-cache-upload08 the certs are sticking around:

bd808@deployment-cache-upload08.deployment-prep.eqiad1:~$ sudo find /etc/acmecerts/unified -type f -mtime +365 | grep d5250601537d419485e2ab87d0d16043
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/ec-prime256v1.crt
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/ec-prime256v1.key
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/ec-prime256v1.chained.crt.key
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/rsa-2048.crt
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/rsa-2048.alt.chain.crt
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/ec-prime256v1.alt.chained.crt
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/ec-prime256v1.chained.crt
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/rsa-2048.crt.key
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/ec-prime256v1.alt.chain.crt
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/ec-prime256v1.chain.crt
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/rsa-2048.alt.chained.crt
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/rsa-2048.chain.crt
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/rsa-2048.chained.crt.key
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/rsa-2048.alt.chained.crt.key
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/rsa-2048.key
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/rsa-2048.chained.crt
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/ec-prime256v1.alt.chained.crt.key
/etc/acmecerts/unified/d5250601537d419485e2ab87d0d16043/ec-prime256v1.crt.key

In ops/puppet I see:

modules/acme_chief/manifests/cert.pp
$acmechief_host = lookup('acmechief_host')  # lint:ignore:wmf_styleguide
# lint:ignore:puppet_url_without_modules
file { "${certs_path}/${title}":
    ensure    => stdlib::ensure($ensure, 'directory'),
    owner     => 'root',
    group     => $key_group,
    mode      => '0640',
    recurse   => true,
    # TODO: remove hiera guard after sufficient testing in production
    purge     => lookup('acme_chief::purge_old_certs', { 'default_value' => false }), # lint:ignore:wmf_styleguide
    show_diff => false,
    backup    => false,
    source    => "puppet://${acmechief_host}/acmedata/${title}",
    force     => true,
}

I don't see acme_chief::purge_old_certs: true anywhere in the Beta Cluster specific config. It looks like it has been set in production only for role(alerting_host). @BCornwall is that the missing magic?
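
A check like the one described above can be done with a recursive grep over the hiera data. The sketch below builds a stand-in tree; the directory layout and file names (`hieradata/role/alerting_host.yaml`, `hieradata/cloud/deployment-prep.yaml`) are hypothetical and only illustrate the search, not the real repository layout.

```shell
#!/bin/sh
# Stand-in hiera tree with hypothetical file names.
scratch=$(mktemp -d)
mkdir -p "$scratch/hieradata/role" "$scratch/hieradata/cloud"
printf 'acme_chief::purge_old_certs: true\n' > "$scratch/hieradata/role/alerting_host.yaml"
printf 'profile::example::setting: 1\n' > "$scratch/hieradata/cloud/deployment-prep.yaml"

# List every file that sets the key; files that never mention it fall back
# to the 'default_value' => false in the lookup() call, so purge stays off.
matches=$(cd "$scratch" && grep -rl 'acme_chief::purge_old_certs' hieradata)
echo "$matches"
rm -rf "$scratch"
```

In a real checkout the same `grep -rl` would be run from the repository root against whatever directories hold the Beta Cluster configuration.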

@bd808 You would think so with such a key name! Instead, it looks like deployment-acme-chief05 is failing to sync the changes to hosts and has for some time!:

root@deployment-acme-chief05:/# systemctl status acme-chief-certs-sync.service
● acme-chief-certs-sync.service - Sync acme-chief certificates
     Loaded: loaded (/lib/systemd/system/acme-chief-certs-sync.service; static)
     Active: failed (Result: exit-code) since Tue 2026-03-10 18:00:04 UTC; 15min ago
TriggeredBy: ● acme-chief-certs-sync.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 4152815 ExecStart=/usr/local/bin/acme-chief-certs-sync (code=exited, status=255/EXCEPTION)
   Main PID: 4152815 (code=exited, status=255/EXCEPTION)
        CPU: 25ms

Mar 10 18:00:04 deployment-acme-chief05 systemd[1]: Starting Sync acme-chief certificates...
Mar 10 18:00:04 deployment-acme-chief05 acme-chief-certs-sync[4152830]: acme-chief@deployment-acme-chief06.deployment-prep.eqiad1.wikimedia.cloud: Permission denied (publickey).
Mar 10 18:00:04 deployment-acme-chief05 acme-chief-certs-sync[4152822]: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
Mar 10 18:00:04 deployment-acme-chief05 acme-chief-certs-sync[4152822]: rsync error: unexplained error (code 255) at io.c(228) [sender=3.2.3]
Mar 10 18:00:04 deployment-acme-chief05 systemd[1]: acme-chief-certs-sync.service: Main process exited, code=exited, status=255/EXCEPTION
Mar 10 18:00:04 deployment-acme-chief05 systemd[1]: acme-chief-certs-sync.service: Failed with result 'exit-code'.
Mar 10 18:00:04 deployment-acme-chief05 systemd[1]: Failed to start Sync acme-chief certificates.

Ah, that service doesn't handle certs to consumers, my apologies. Regardless, it appears that acme-chief was restarted without a keyholder arm. I've armed it and manually triggered acme-chief-certs-sync.service. I'll keep looking.

If I'm reading sources correctly, the certs are not automatically cleaned up on the client hosts, only the acme-chief host. The certs are downloaded via an API request when they're required, and will remain for the lifetime of the client. Is that correct, @Vgutierrez?
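
If that reading is right, the leftover version directories on a client are the ones that no longer exist on the acme-chief host. One way to list them is to compare sorted directory listings from both sides with `comm`. This is only a sketch: the two trees below are local stand-ins with made-up directory names, and in practice the server listing would have to be collected on the acme-chief host.

```shell
#!/bin/sh
# Local stand-ins for the server and client trees; names are hypothetical.
scratch=$(mktemp -d)
mkdir -p "$scratch/server/unified/aaa" \
         "$scratch/client/unified/aaa" "$scratch/client/unified/bbb"

# comm needs both inputs sorted the same way.
ls -1 "$scratch/server/unified" | LC_ALL=C sort > "$scratch/server.list"
ls -1 "$scratch/client/unified" | LC_ALL=C sort > "$scratch/client.list"

# -23 hides lines unique to the server and lines common to both,
# leaving only directories that exist on the client alone.
orphans=$(LC_ALL=C comm -23 "$scratch/client.list" "$scratch/server.list")
echo "$orphans"
rm -rf "$scratch"
```

The `live` and `new` symlinks would need to be excluded or checked separately before deleting anything based on such a list.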