As reported in T273956, acme-chief sometimes hangs and is unable to keep performing its duties. A watchdog mechanism would help detect this quickly and reliably.
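For illustration only, here is a minimal sketch of how a Python daemon can talk to a systemd watchdog. It is not the acme-chief implementation; the function names `watchdog_interval` and `notify` are hypothetical. It assumes the unit sets `WatchdogSec=`, in which case systemd exports `NOTIFY_SOCKET`, `WATCHDOG_USEC`, and optionally `WATCHDOG_PID` to the service, and expects a `WATCHDOG=1` datagram at least once per timeout.

```python
import os
import socket
from typing import Mapping, Optional


def watchdog_interval(environ: Mapping[str, str] = os.environ) -> Optional[float]:
    """Return a ping interval in seconds, or None if no watchdog is configured.

    Hypothetical helper: auto-detects the watchdog from the environment
    variables systemd sets when WatchdogSec= is present in the unit.
    """
    usec = environ.get("WATCHDOG_USEC")
    if not usec or not environ.get("NOTIFY_SOCKET"):
        return None
    pid = environ.get("WATCHDOG_PID")
    if pid and int(pid) != os.getpid():
        # The watchdog was requested for a different process.
        return None
    # Ping at half the timeout so a single slow iteration does not trip it.
    return int(usec) / 1_000_000 / 2


def notify(message: str, environ: Mapping[str, str] = os.environ) -> bool:
    """Send one sd_notify-style datagram; return False if no socket is set."""
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), path)
    return True
```

A daemon's main loop could call `notify("WATCHDOG=1")` once per iteration, no less often than `watchdog_interval()` seconds. If the loop hangs, the pings stop and systemd treats the unit as failed, which can then be restarted or alerted on.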
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | bd808 | T273956 acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP
Resolved | | Vgutierrez | T292619 Implement a watchdog mechanism on acme-chief
Event Timeline
Change 728379 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/software/acme-chief@master] acme_chief: implement file and systemd based watchdogs
Change 730016 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] acme_chief: Enable file and systemd watchdogs
Change 730703 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/software/acme-chief@master] Release 0.33
Change 728379 merged by jenkins-bot:
[operations/software/acme-chief@master] acme_chief: Add systemd based watchdog support
Change 730703 merged by jenkins-bot:
[operations/software/acme-chief@master] Release 0.33
Change 730711 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/software/acme-chief@debian] acme_chief: Add systemd based watchdog support
Change 730712 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/software/acme-chief@debian] Release 0.33
Change 730713 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/software/acme-chief@debian] debian: Add release 0.33 to the changelog
Change 730711 merged by jenkins-bot:
[operations/software/acme-chief@debian] acme_chief: Add systemd based watchdog support
Change 730712 merged by jenkins-bot:
[operations/software/acme-chief@debian] Release 0.33
Change 730713 merged by jenkins-bot:
[operations/software/acme-chief@debian] debian: Add release 0.33 to the changelog
Change 730749 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/software/acme-chief@master] acme_chief: auto-detect systemd watchdog
Change 730749 merged by Vgutierrez:
[operations/software/acme-chief@master] acme_chief: auto-detect systemd watchdog
Change 730977 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/software/acme-chief@master] Release 0.34
Change 730977 merged by jenkins-bot:
[operations/software/acme-chief@master] Release 0.34
Change 730979 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/software/acme-chief@debian] acme_chief: auto-detect systemd watchdog
Change 730980 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/software/acme-chief@debian] Release 0.34
Change 730981 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/software/acme-chief@debian] debian: Add release 0.34 to the changelog
Change 730979 merged by jenkins-bot:
[operations/software/acme-chief@debian] acme_chief: auto-detect systemd watchdog
Change 730980 merged by jenkins-bot:
[operations/software/acme-chief@debian] Release 0.34
Change 730981 merged by jenkins-bot:
[operations/software/acme-chief@debian] debian: Add release 0.34 to the changelog
Change 731018 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] acme_chief: Enable watchdog on acmechief-test1001
Mentioned in SAL (#wikimedia-operations) [2021-10-15T13:14:21Z] <vgutierrez> upload acme-chief 0.34 to apt.wikimedia.org (buster) - T292619
Change 730016 merged by Vgutierrez:
[operations/puppet@production] acme_chief: Support systemd watchdog
Change 731018 merged by Vgutierrez:
[operations/puppet@production] acme_chief: Enable watchdog on acmechief-test1001
Mentioned in SAL (#wikimedia-operations) [2021-10-15T13:21:17Z] <vgutierrez> updating acme-chief to version 0.34 on acmechief-test instances - T292619
Change 731101 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] systemd: Allow paging on a systemd::service failure
Change 731101 merged by Vgutierrez:
[operations/puppet@production] systemd: Allow paging on a systemd::service failure
Change 731335 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] acme_chief: Enable watchdog on production servers
Mentioned in SAL (#wikimedia-operations) [2021-10-18T09:39:09Z] <vgutierrez> updating acme-chief to version 0.34 on acmechief instances - T292619
Change 731335 merged by Vgutierrez:
[operations/puppet@production] acme_chief: Enable watchdog on production servers
Change 731343 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] acme_chief: Enable monitoring on systemd
Change 731343 merged by Vgutierrez:
[operations/puppet@production] acme_chief: Enable monitoring on systemd
Change 735297 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] acme_chief: Page on acme-chief unit failure
Change 735297 merged by Vgutierrez:
[operations/puppet@production] acme_chief: Page on acme-chief unit failure
Change 759439 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/puppet@production] P:acme_chief: set watchdog_sec default on cloud
Change 759439 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] P:acme_chief: set watchdog_sec default on cloud