Implement a watchdog mechanism on acme-chief
Closed, Resolved (Public)

Description

As reported in T273956, acme-chief sometimes hangs and is unable to keep performing its duties. A watchdog mechanism would help detect this quickly and reliably.
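
For illustration only, here is a minimal sketch of how a Python daemon can talk to a systemd watchdog. It is not taken from the acme-chief patches linked below. It assumes the unit sets `WatchdogSec=`, so that systemd exports `WATCHDOG_USEC` (and possibly `WATCHDOG_PID`) and `NOTIFY_SOCKET` to the service. The helper names `watchdog_interval` and `notify` are hypothetical.

```python
import os
import socket


def watchdog_interval(environ=os.environ, pid=None):
    """Return a suggested ping interval in seconds, or None if no watchdog applies.

    systemd exports WATCHDOG_USEC (timeout in microseconds) when WatchdogSec=
    is set on the unit. If WATCHDOG_PID is present, the watchdog only applies
    to that process.
    """
    usec = environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    expected_pid = environ.get("WATCHDOG_PID")
    current_pid = pid if pid is not None else os.getpid()
    if expected_pid and int(expected_pid) != current_pid:
        return None
    # Ping at half the timeout so one delayed iteration does not trip the watchdog.
    return int(usec) / 1_000_000 / 2


def notify(message, environ=os.environ):
    """Send a notification datagram to systemd; return False if no socket is set."""
    addr = environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), addr)
    return True
```

A main loop would call `notify("WATCHDOG=1")` once per healthy iteration, at least as often as `watchdog_interval()` suggests. If the process hangs and the pings stop, systemd marks the unit as failed once the timeout expires, which monitoring can then alert on.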

Event Timeline

Vgutierrez triaged this task as Medium priority. Oct 6 2021, 9:29 AM
Vgutierrez moved this task from Triage to TLS on the Traffic board.

Change 728379 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/software/acme-chief@master] acme_chief: implement file and systemd based watchdogs

https://gerrit.wikimedia.org/r/728379

Change 730016 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] acme_chief: Enable file and systemd watchdogs

https://gerrit.wikimedia.org/r/730016

Change 730703 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/software/acme-chief@master] Release 0.33

https://gerrit.wikimedia.org/r/730703

Change 728379 merged by jenkins-bot:

[operations/software/acme-chief@master] acme_chief: Add systemd based watchdog support

https://gerrit.wikimedia.org/r/728379

Change 730703 merged by jenkins-bot:

[operations/software/acme-chief@master] Release 0.33

https://gerrit.wikimedia.org/r/730703

Change 730711 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/software/acme-chief@debian] acme_chief: Add systemd based watchdog support

https://gerrit.wikimedia.org/r/730711

Change 730712 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/software/acme-chief@debian] Release 0.33

https://gerrit.wikimedia.org/r/730712

Change 730713 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/software/acme-chief@debian] debian: Add release 0.33 to the changelog

https://gerrit.wikimedia.org/r/730713

Change 730711 merged by jenkins-bot:

[operations/software/acme-chief@debian] acme_chief: Add systemd based watchdog support

https://gerrit.wikimedia.org/r/730711

Change 730712 merged by jenkins-bot:

[operations/software/acme-chief@debian] Release 0.33

https://gerrit.wikimedia.org/r/730712

Change 730713 merged by jenkins-bot:

[operations/software/acme-chief@debian] debian: Add release 0.33 to the changelog

https://gerrit.wikimedia.org/r/730713

Change 730749 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/software/acme-chief@master] acme_chief: auto-detect systemd watchdog

https://gerrit.wikimedia.org/r/730749

Change 730749 merged by Vgutierrez:

[operations/software/acme-chief@master] acme_chief: auto-detect systemd watchdog

https://gerrit.wikimedia.org/r/730749

Change 730977 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/software/acme-chief@master] Release 0.34

https://gerrit.wikimedia.org/r/730977

Change 730977 merged by jenkins-bot:

[operations/software/acme-chief@master] Release 0.34

https://gerrit.wikimedia.org/r/730977

Change 730979 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/software/acme-chief@debian] acme_chief: auto-detect systemd watchdog

https://gerrit.wikimedia.org/r/730979

Change 730980 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/software/acme-chief@debian] Release 0.34

https://gerrit.wikimedia.org/r/730980

Change 730981 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/software/acme-chief@debian] debian: Add release 0.34 to the changelog

https://gerrit.wikimedia.org/r/730981

Change 730979 merged by jenkins-bot:

[operations/software/acme-chief@debian] acme_chief: auto-detect systemd watchdog

https://gerrit.wikimedia.org/r/730979

Change 730980 merged by jenkins-bot:

[operations/software/acme-chief@debian] Release 0.34

https://gerrit.wikimedia.org/r/730980

Change 730981 merged by jenkins-bot:

[operations/software/acme-chief@debian] debian: Add release 0.34 to the changelog

https://gerrit.wikimedia.org/r/730981

Change 731018 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] acme_chief: Enable watchdog on acmechief-test1001

https://gerrit.wikimedia.org/r/731018

Mentioned in SAL (#wikimedia-operations) [2021-10-15T13:14:21Z] <vgutierrez> upload acme-chief 0.34 to apt.wikimedia.org (buster) - T292619

Change 730016 merged by Vgutierrez:

[operations/puppet@production] acme_chief: Support systemd watchdog

https://gerrit.wikimedia.org/r/730016

Change 731018 merged by Vgutierrez:

[operations/puppet@production] acme_chief: Enable watchdog on acmechief-test1001

https://gerrit.wikimedia.org/r/731018

Mentioned in SAL (#wikimedia-operations) [2021-10-15T13:21:17Z] <vgutierrez> updating acme-chief to version 0.34 on acmechief-test instances - T292619

Change 731101 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] systemd: Allow paging on a systemd::service failure

https://gerrit.wikimedia.org/r/731101

Change 731101 merged by Vgutierrez:

[operations/puppet@production] systemd: Allow paging on a systemd::service failure

https://gerrit.wikimedia.org/r/731101

Change 731335 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] acme_chief: Enable watchdog on production servers

https://gerrit.wikimedia.org/r/731335

Mentioned in SAL (#wikimedia-operations) [2021-10-18T09:39:09Z] <vgutierrez> updating acme-chief to version 0.34 on acmechief instances - T292619

Change 731335 merged by Vgutierrez:

[operations/puppet@production] acme_chief: Enable watchdog on production servers

https://gerrit.wikimedia.org/r/731335

Change 731343 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] acme_chief: Enable monitoring on systemd

https://gerrit.wikimedia.org/r/731343

Change 731343 merged by Vgutierrez:

[operations/puppet@production] acme_chief: Enable monitoring on systemd

https://gerrit.wikimedia.org/r/731343

Vgutierrez claimed this task.

Change 735297 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] acme_chief: Page on acme-chief unit failure

https://gerrit.wikimedia.org/r/735297

Change 735297 merged by Vgutierrez:

[operations/puppet@production] acme_chief: Page on acme-chief unit failure

https://gerrit.wikimedia.org/r/735297

Change 759439 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:acme_chief: set watchdog_sec default on cloud

https://gerrit.wikimedia.org/r/759439

Change 759439 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] P:acme_chief: set watchdog_sec default on cloud

https://gerrit.wikimedia.org/r/759439