
Various debmonitor-client systemdtimer errors starting April 21st
Closed, Resolved, Public

Description

Starting on April 21st there has been an increase in cron messages from SYSTEMDTIMER, for various reasons. From a cursory look, the two main categories of errors are (1) TLS issues and (2) a missing dependency on the Python library providing JSONDecodeError.

To: root@cloudvirt1040.eqiad.wmnet
ERROR:debmonitor:Failed to execute DebMonitor CLI: [('PEM routines', 'get_name', 'no start line'), ('SSL routines', 'use_certificate_chain_file', 'PEM lib')]

To: root@restbase1016.eqiad.wmnet
ERROR:debmonitor:Failed to execute DebMonitor CLI: [SSL] PEM lib (_ssl.c:2947)

To: root@cloudvirt1032.eqiad.wmnet
ERROR:debmonitor:Failed to execute DebMonitor CLI: [('PEM routines', 'get_name', 'no start line'), ('SSL routines', 'use_certificate_chain_file', 'PEM lib')]

To: root@db2103.codfw.wmnet
ERROR:debmonitor:Failed to execute DebMonitor CLI: [SSL] PEM lib (_ssl.c:2947)

To: root@logstash2006.codfw.wmnet
ERROR:debmonitor:Failed to execute DebMonitor CLI: [SSL] PEM lib (_ssl.c:2947)

To: root@ms-be2049.codfw.wmnet
ERROR:debmonitor:Failed to execute DebMonitor CLI: [SSL] PEM lib (_ssl.c:2947)

To: root@ms-be1030.eqiad.wmnet
ERROR:debmonitor:Failed to execute DebMonitor CLI: unsupported operand type(s) for -=: 'Retry' and 'int'

To: root@mwlog2001.codfw.wmnet
ImportError: cannot import name 'JSONDecodeError'

To: root@conf2001.codfw.wmnet
ImportError: cannot import name 'JSONDecodeError'

To: root@conf2003.codfw.wmnet
ImportError: cannot import name 'JSONDecodeError'

To: root@mwlog1001.eqiad.wmnet 
ImportError: cannot import name 'JSONDecodeError'

To: root@db1124.eqiad.wmnet
ERROR:debmonitor:Failed to execute DebMonitor CLI: [SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:720)

To: root@sretest1002.eqiad.wmnet
ERROR:debmonitor:Failed to execute DebMonitor CLI: HTTPSConnectionPool(host='debmonitor.discovery.wmnet', port=443): Max retries exceeded with url: /hosts/sretest1002.eqiad.wmnet/update (Caused by SSLError(SSLError(1, '[SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:2622)')))

Event Timeline

The "ImportError: cannot import name 'JSONDecodeError'" errors are from our five remaining jessie hosts. There was a patch by @jbond to address this, which is in 0.2.9, but apparently there is still more to fix. Those few hosts will vanish soon anyway.
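For context, json.JSONDecodeError only exists in Python 3.5 and later, and jessie ships Python 3.4, so an unconditional import fails there. The following is a minimal sketch of a common compatibility pattern, assuming the name is imported from the standard json module; it is not the patch that went into 0.2.9, and parse_response is a hypothetical helper used only for illustration.

```python
import json

try:
    # Available in Python 3.5 and later.
    from json import JSONDecodeError
except ImportError:
    # Older interpreters raise a plain ValueError from json.loads(),
    # and JSONDecodeError is a ValueError subclass where it exists.
    JSONDecodeError = ValueError


def parse_response(body):
    """Hypothetical helper: return decoded JSON, or None if body is not valid JSON."""
    try:
        return json.loads(body)
    except JSONDecodeError:
        return None
```

Because JSONDecodeError subclasses ValueError on newer interpreters, the same except clause behaves consistently on both old and new hosts.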

The "unsupported operand type(s) for -=: 'Retry' and 'int'" errors should all be fixed; the logged errors should predate the rollout of the fixed version (although some hosts are still at 0.2.8, which I'll upgrade now).

Mentioned in SAL (#wikimedia-operations) [2021-04-26T08:28:56Z] <moritzm> update debmonitor to 0.2.9 on remaining hosts T281090

To confirm, I have just pushed out 0.2.9, which should fix the JSONDecodeError and 'Retry' and 'int' issues.

The expiry for sretest1002 was valid, as it was missing puppet certs and thus unable to renew its pki cert. (I will be making the expiry times a bit more liberal today.)

I'll check on the other issues.
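As an illustration of the expiry case, here is a rough sketch of how remaining certificate lifetime could be computed from a notAfter string in the format Python's ssl module reports. The function name and the idea of checking this ahead of time are assumptions for illustration, not part of debmonitor-client or the pki setup.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Return days left before a certificate expires (negative if already expired).

    not_after uses the "%b %d %H:%M:%S %Y GMT" format, for example the
    'notAfter' field returned by SSLSocket.getpeercert().
    """
    if now is None:
        now = time.time()
    expiry = ssl.cert_time_to_seconds(not_after)
    return (expiry - now) / SECONDS_PER_DAY
```

A host that cannot renew its certificate, as sretest1002 could not, would show this value dropping to zero and then going negative.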

Failed to execute DebMonitor CLI: [SSL] PEM lib (_ssl.c:2947)
[('PEM routines', 'get_name', 'no start line'), ('SSL routines', 'use_certificate_chain_file', 'PEM lib')]

These changes all happened at 12:10 PM on 21/04/2021. I was doing lots of work on the infrastructure that day, so I'm going to count this as a temporary failure and see if it comes up again.
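OpenSSL reports "no start line" when the file it is asked to load contains no "-----BEGIN ...-----" line, which can happen if a certificate file is empty or only partially written at the moment it is read. Whether that is what happened here is not confirmed in this task. A minimal sketch of a pre-flight check, with a hypothetical function name:

```python
import re

# Matches a PEM header line such as "-----BEGIN CERTIFICATE-----".
PEM_BEGIN = re.compile(r"^-----BEGIN [A-Z0-9 ]+-----\s*$", re.MULTILINE)


def has_pem_start_line(text):
    """Return True if text contains at least one PEM BEGIN line."""
    return PEM_BEGIN.search(text) is not None
```

Running this against the contents of the client certificate file before invoking the CLI would distinguish an empty or truncated file from other TLS failures.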

Change 682540 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] systemd::timer::job: update the program name of systemd-timer-mail-wrapper

https://gerrit.wikimedia.org/r/682540

akosiaris triaged this task as Medium priority. Apr 26 2021, 9:59 AM

Change 682540 merged by Jbond:

[operations/puppet@production] systemd::timer::job: update the program name of systemd-timer-mail-wrapper

https://gerrit.wikimedia.org/r/682540

Change 682596 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] P:pki::multirootca: increase default expire policy

https://gerrit.wikimedia.org/r/682596

Change 682596 merged by Jbond:

[operations/puppet@production] P:pki::multirootca: increase default expire policy

https://gerrit.wikimedia.org/r/682596

jbond claimed this task.

I have now increased the default expiry of certs, deployed the newest version of debmonitor-client and fixed systemd service logging. I'm going to optimistically assume this has fixed all issues and mark this resolved, but please reopen if more issues arise.